Gather and Scatter for FabArrays? #4284

uschille · 2025-01-04T15:58:26Z

I have a loop with loop-carried dependencies that also reads values from other boxes, so I am trying to run a single serial loop over the whole domain instead of using ParallelFor, to avoid race conditions. I think what I need is the equivalent of MPI_Gather and MPI_Scatter for a FabArray. My initial attempt was something like this:

void borkbork(const Box& domain, FabArray<BaseFab<GpuComplex<Real>>>& kspace_noise) {

  BaseFab<GpuComplex<Real>> kspace_noise_whole(domain, ndof);

  // Gather the whole array
  kspace_noise.copyTo(kspace_noise_whole, 0, 0, ndof);

  for (int i=0; i<kspace_noise.nComp(); ++i)
    for (int kz=0; kz<domain.length(2); ++kz)
      for (int ky=0; ky<domain.length(1); ++ky)
        for (int kx=0; kx<domain.length(0); ++kx)
          {/*modify values in kspace_noise_whole here*/}

  // Scatter kspace_noise_whole back to kspace_noise
  for (MFIter mfi(kspace_noise); mfi.isValid(); ++mfi) {
    kspace_noise.get(mfi).copy(kspace_noise_whole, 0, 0, ndof);
  }

}

When compiling with CUDA, the call to copy in the last for loop gives me an error: no instance of overloaded function "amrex::BaseFab<T>::copy [with T=amrex::GpuComplex<amrex::Real>]" matches the argument list, which makes me think that I'm doing something wrong.

Is this use of copyTo and copy viable, or do I need to specify a global BoxArray and DistributionMapping explicitly and use ParallelCopy?
Is there a recommended way to "gather" a FabArray, do some global modifications on it, and then "scatter" it back? The answer in MPI Related Question #2224 suggested storing a 1D vector and using multiple non-owning Fabs, but that's from three years ago and perhaps other features are available now?

Answered by WeiqunZhang

Jan 7, 2025

WeiqunZhang · 2025-01-07T02:21:48Z

Maintainer

I assume only one process needs to do the work. Right? You could do something like this.

BoxArray ba(domain);
DistributionMapping dm{Vector<int>{0}};
FabArray<BaseFab<GpuComplex<Real>>> fa_whole(ba, dm, ndof, 0);
fa_whole.ParallelCopy(kspace_noise);
// Work on fa_whole
kspace_noise.ParallelCopy(fa_whole);

You will need to use managed memory.

I am also curious. What kind of data dependencies do you have?

(Re: the error. Changing copy to template copy<RunOn::Device> (or RunOn::Host) should fix the error. When compiling with CUDA, we force the user to choose where Fab-level functions run, which helps to eliminate a lot of bugs caused by synchronization issues.)

uschille · 2025-01-07T13:16:58Z

Author

Thank you @WeiqunZhang, that points me in the right direction.

(Re: the error. Changing copy to template copy<RunOn::Device> (or RunOn::Host) should fix the error. When compiling with CUDA, we force the user to choose where Fab-level functions run, which helps to eliminate a lot of bugs caused by synchronization issues.)

Ah yes, I found that in the documentation, sorry I missed it earlier.

I am also curious. What kind of data dependencies do you have?

My for loop essentially looks like this:

  for (int i=0; i<kspace_noise.nComp(); ++i) {
    for (int kz=0; kz<domain.length(2); ++kz) {
      for (int ky=0; ky<domain.length(1); ++ky) {
        for (int kx=0; kx<domain.length(0); ++kx) {
          int kxloc = (kx == 0) ? 0 : domain.length(0) - kx;
          int kyloc = (ky == 0) ? 0 : domain.length(1) - ky;
          int kzloc = (kz == 0) ? 0 : domain.length(2) - kz;
          if (    ( (kz > domain.length(2)/2) )
                || ( (ky > domain.length(1)/2) && (kz == kzloc) )
                || ( (kx > domain.length(0)/2) && (ky == kyloc) && (kz == kzloc)) ) {
            xi(kx,ky,kz,i).m_real =  xi(kxloc,kyloc,kzloc,i).real();
            xi(kx,ky,kz,i).m_imag = -xi(kxloc,kyloc,kzloc,i).imag();
          }
        }
      }
    }
  }

The background is that we construct colored noise in Fourier space that needs to satisfy complex-conjugate symmetries. I am actually trying to eliminate the dependencies, since later on we pass only half the domain to FFT::R2C.backward(). There are still some points in the plane that contains k=(0,0,0) that I'm not sure about, and I'll have to dig into the FFTW documentation to figure out how to handle those. In any case, I'd like to do some regression tests, and your answer is very helpful for that. Btw, thanks for FFT::R2C - it's extremely useful for us!
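The selection rule in the loop above can be tried out without AMReX. The following is a minimal standalone sketch, not the code from this thread: Grid3D, mirror, enforce_conjugate_symmetry, and is_conjugate_symmetric are hypothetical names, and a single component stored in a flat std::vector stands in for the gathered array. It applies the same condition as the loop and then checks that every point that is not its own mirror satisfies xi(k) = conj(xi(-k)).

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Hypothetical stand-in for the gathered array: one component, x fastest.
struct Grid3D {
    int nx, ny, nz;
    std::vector<Complex> data;

    Grid3D(int nx_, int ny_, int nz_)
        : nx(nx_), ny(ny_), nz(nz_),
          data(static_cast<std::size_t>(nx_) * ny_ * nz_)
    {
        assert(nx_ > 0 && ny_ > 0 && nz_ > 0);
    }

    std::size_t index(int kx, int ky, int kz) const {
        return (static_cast<std::size_t>(kz) * ny + ky) * nx + kx;
    }
    Complex&       operator()(int kx, int ky, int kz)       { return data[index(kx, ky, kz)]; }
    const Complex& operator()(int kx, int ky, int kz) const { return data[index(kx, ky, kz)]; }
};

// Index of -k on a periodic grid of length n (0 maps to itself).
inline int mirror(int k, int n) { return (k == 0) ? 0 : n - k; }

// Overwrite every point selected by the rule in the loop above with the
// complex conjugate of its mirror point. A selected point never has a
// selected mirror, so the values that are read are never overwritten.
inline void enforce_conjugate_symmetry(Grid3D& xi) {
    for (int kz = 0; kz < xi.nz; ++kz) {
        for (int ky = 0; ky < xi.ny; ++ky) {
            for (int kx = 0; kx < xi.nx; ++kx) {
                const int mx = mirror(kx, xi.nx);
                const int my = mirror(ky, xi.ny);
                const int mz = mirror(kz, xi.nz);
                const bool selected =
                       (kz > xi.nz / 2)
                    || (ky > xi.ny / 2 && kz == mz)
                    || (kx > xi.nx / 2 && ky == my && kz == mz);
                if (selected) { xi(kx, ky, kz) = std::conj(xi(mx, my, mz)); }
            }
        }
    }
}

// True if xi(k) == conj(xi(-k)) for every k that is not its own mirror.
// Self-mirrored points are skipped because the rule leaves them untouched.
inline bool is_conjugate_symmetric(const Grid3D& xi) {
    for (int kz = 0; kz < xi.nz; ++kz) {
        for (int ky = 0; ky < xi.ny; ++ky) {
            for (int kx = 0; kx < xi.nx; ++kx) {
                const int mx = mirror(kx, xi.nx);
                const int my = mirror(ky, xi.ny);
                const int mz = mirror(kz, xi.nz);
                if (kx == mx && ky == my && kz == mz) { continue; }
                if (xi(kx, ky, kz) != std::conj(xi(mx, my, mz))) { return false; }
            }
        }
    }
    return true;
}
```

Points that are their own mirror (each index equal to 0 or to half of an even extent) are left untouched here, as in the loop above; for a real-valued inverse transform their imaginary parts would have to vanish.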

WeiqunZhang Jan 7, 2025
Maintainer

Then the FFT call is

    FFT::R2C<Real,FFT::Direction::backward> fft(domain);
    fft.backward(cmf, rmf);

uschille Jan 7, 2025
Author

That's brilliant, thank you very much.

If we do 1D domain decomposition in x-direction, we can handle the symmetry in the y-z plane locally.

Yes, the 1D domain decomposition is the key. And I could then use rmf_other_dm.Redistribute(rmf, scomp, dcomp, nghost) if I needed to add rmf to other MFs with a different distribution mapping, correct?

(Thinking about it, I could use ParallelAdd and I can probably use 1D domain decomposition for my other MFs as well and use MultiFab::Add)
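To illustrate why a 1D decomposition in x keeps the remaining symmetry local, here is a standalone sketch under two assumptions that are not stated in the thread: the half spectrum has nx/2+1 points in x, and each y-z plane is stored contiguously with y fastest. Under those assumptions only the planes kx = 0 and, for even nx, kx = nx/2 are their own mirror in x, and each of them can be fixed using data from that plane alone. All function names are hypothetical, and no AMReX types are involved.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// x-indices within the half spectrum 0..nx/2 that map to themselves under
// kx -> -kx. Only these planes still need conjugate symmetry in (ky, kz).
inline std::vector<int> self_mirrored_x_planes(int nx) {
    assert(nx > 0);
    std::vector<int> planes{0};
    if (nx % 2 == 0) { planes.push_back(nx / 2); }
    return planes;
}

// Index of -k on a periodic grid of length n (0 maps to itself).
inline int mirror_index(int k, int n) { return (k == 0) ? 0 : n - k; }

// Enforce plane(ky,kz) == conj(plane(-ky,-kz)) on a single y-z plane.
// Only entries of this one plane are read or written.
inline void enforce_plane_symmetry(std::vector<Complex>& plane, int ny, int nz) {
    assert(ny > 0 && nz > 0);
    assert(plane.size() == static_cast<std::size_t>(ny) * static_cast<std::size_t>(nz));
    for (int kz = 0; kz < nz; ++kz) {
        for (int ky = 0; ky < ny; ++ky) {
            const int my = mirror_index(ky, ny);
            const int mz = mirror_index(kz, nz);
            if ((kz > nz / 2) || (ky > ny / 2 && kz == mz)) {
                plane[static_cast<std::size_t>(kz) * ny + ky] =
                    std::conj(plane[static_cast<std::size_t>(mz) * ny + my]);
            }
        }
    }
}

// True if every entry that is not its own mirror equals the conjugate of
// its mirror entry.
inline bool plane_is_symmetric(const std::vector<Complex>& plane, int ny, int nz) {
    for (int kz = 0; kz < nz; ++kz) {
        for (int ky = 0; ky < ny; ++ky) {
            const int my = mirror_index(ky, ny);
            const int mz = mirror_index(kz, nz);
            if (ky == my && kz == mz) { continue; }
            const Complex a = plane[static_cast<std::size_t>(kz) * ny + ky];
            const Complex b = plane[static_cast<std::size_t>(mz) * ny + my];
            if (a != std::conj(b)) { return false; }
        }
    }
    return true;
}
```

With slabs in x, a rank that owns plane kx = 0 (or kx = nx/2) would call enforce_plane_symmetry on that plane only; all other planes of the half spectrum carry no constraint of this kind.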

uschille Jan 7, 2025
Author

In fact, it seems I can pass MFs with different distribution mappings to fft.backward()?

WeiqunZhang Jan 7, 2025
Maintainer

That's right. The DistributionMappings for the input and output do not need to be the same. The BoxArrays can be chopped up in whatever way you like, and there is no constraint on the number of boxes per process.

uschille Jan 7, 2025
Author

Sweet, I like FFT::R2C even more now 😃
