
Difference between in_place and out_of_place #261

Open · 17113325 opened this issue Nov 4, 2024 · 5 comments

@17113325 commented Nov 4, 2024

I found that there are two test patterns, one for in_place and one for out_of_place. What is the difference between the two? I also noticed that an offset has to be added when using in_place. Why is that necessary?

```cpp
TESTCHECK(args->collTest->runColl(
    (void*)(in_place ? recvBuff + args->sendInplaceOffset*rank : sendBuff),
    (void*)(in_place ? recvBuff + args->recvInplaceOffset*rank : recvBuff),
    count, type, op, root, args->comms[i], args->streams[i]));
```
@17113325 (Author) commented Nov 5, 2024

I'm sorry, I still don't understand why the offset is added for the in_place operation. Could you give me a more detailed explanation? I didn't find the exact reason in the docs.

For example, when I was running the scatter test, I noticed that if in_place didn't add the offset, I would get different results, so I was wondering why the in_place operation needs it.

Looking forward to your reply, thanks!

@kiskra-nvidia (Member) commented:

The offset depends on the collective used; see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html. Broadcast, Reduce, and AllReduce operate on complete buffers for both input and output, so no offset is used. AllGather's input on each rank is smaller than the complete (output) buffer, and ReduceScatter's output on each rank is smaller than the complete (input) buffer. When you want NCCL to operate in-place (without having to allocate separate buffers for input and output), the address of the "sub-buffer" (input for AllGather, output for ReduceScatter) needs to be at a specific offset within the complete buffer for the algorithm to work correctly (the offset will be different for each rank).
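
For illustration, here is a minimal sketch of an in-place AllGather under that rule; the buffer size and variable names are illustrative assumptions, not from the thread:

```cpp
// In-place AllGather sketch: each rank's input occupies its own
// rank-sized slice of the full output buffer.
// Assumes comm, stream, rank, and nranks are already initialized.
size_t sendcount = 1024;  // elements contributed per rank (illustrative)
float* recvbuff;          // full buffer: nranks * sendcount floats
cudaMalloc((void**)&recvbuff, nranks * sendcount * sizeof(float));

// NCCL treats the call as in-place because
// sendbuff == recvbuff + rank * sendcount.
const float* sendbuff = recvbuff + rank * sendcount;
ncclAllGather(sendbuff, recvbuff, sendcount, ncclFloat, comm, stream);
```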

@17113325 (Author) commented Nov 5, 2024

Do you mean that when a collective doesn't run on the full buffer, the offset must be added for it to run correctly, and that omitting the offset results in a performance degradation? Can you explain this in detail using Scatter as an example? For the Scatter test case, the offset calculation is as follows:

```cpp
void ScatterGetCollByteCount(size_t *sendcount, size_t *recvcount, size_t *paramcount,
                             size_t *sendInplaceOffset, size_t *recvInplaceOffset,
                             size_t count, int nranks) {
  *sendcount = (count/nranks)*nranks;  // root sends one chunk per rank
  *recvcount = count/nranks;           // each rank receives one chunk
  *sendInplaceOffset = 0;
  *recvInplaceOffset = count/nranks;   // in-place recv pointer advances one chunk per rank
  *paramcount = count/nranks;
}
```
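
To make the numbers concrete, here is a hypothetical worked example (the counts are illustrative, not from the thread) showing how these offsets feed the runColl() call quoted in the first post:

```cpp
// Illustrative numbers: count = 4096 floats, nranks = 4.
size_t count = 4096;
int nranks = 4;
size_t sendcount, recvcount, paramcount, sendInplaceOffset, recvInplaceOffset;
ScatterGetCollByteCount(&sendcount, &recvcount, &paramcount,
                        &sendInplaceOffset, &recvInplaceOffset, count, nranks);
// Results: sendcount = 4096, recvcount = 1024, recvInplaceOffset = 1024.
// In the runColl() call from the first post, rank r's in-place receive
// pointer is then recvBuff + recvInplaceOffset*r == recvBuff + 1024*r,
// i.e., each rank's chunk sits at its "natural" slot in the full buffer.
```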

@kiskra-nvidia (Member) commented:

Let me explain one more time. "in place" is a special case that may require additional considerations to work correctly. What makes it special is that the input buffer and the output buffer overlap (depending on the collective operation, they are either the same or one is a part of the other).

Every collective will work just fine "out of place", i.e., when the input and output buffers are completely separate memory regions; there are no offsets or anything else of that sort to consider then. Implementation-wise "out of place" is generally simpler/more flexible; e.g., a collective operation is free to use the output buffer during its execution as a temporary storage for partial results, since its old content is irrelevant/discarded anyway.

Because with "in place" the input and output overlap, the collective operation may need to be implemented differently, making sure that for the overlapping region it reads the input data before overwriting it with any output. In particular, NCCL requires that, if the buffers overlap, they do so in a very predictable, "natural" way.

For example, imagine that you have an array of 2048 floats, but only the first 1024 contain a valid input. Now consider the following calls:

```cpp
// Out-of-place
ncclAllReduce(&buffer[0], &buffer[1024], 1024, ncclFloat, ncclSum, comm, stream);

// In-place
ncclAllReduce(&buffer[0], &buffer[0], 1024, ncclFloat, ncclSum, comm, stream);

// ???
ncclAllReduce(&buffer[0], &buffer[512], 1024, ncclFloat, ncclSum, comm, stream);
```

The first call is out-of-place: the input and output buffers are non-overlapping.

Because for the AllReduce operation the size of the input and output are the same, NCCL requires that, for in-place operation, the overlap is complete: the address of the input and output buffers must be the same. That's what the second call does.

Now look at the third one: here there's a partial overlap -- the second half of the input buffer is also the first half of the output buffer. That's not supported -- you are likely to get a corrupted result.

You asked about Scatter. Truth be told, it's not the best example, because for this collective only one rank contains the input while all the other ranks are output-only. In fact, NCCL doesn't even have a native Scatter operation -- nccl-tests implement it using ncclSend/ncclRecv, so all bets are off on what works and what doesn't. NCCL does have a somewhat similar ReduceScatter, however, so I'm going to use its picture (borrowed from https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html#reducescatter):

[Figure: ReduceScatter in-place buffer layout, from the NCCL collectives documentation]

(for Scatter, the input would be on only one rank -- the root -- but the output is the same for both operations)

So for the out-of-place variant, those out0, out1, ..., outN buffers could be anywhere in memory, so long as they don't overlap any part of the input buffer. For in-place, however, the output buffer argument on each rank needs to point to the specific place in the input buffer shown in the above picture. E.g., if the input buffer is 4096 floats and there are 4 ranks, the output buffer argument on rank 0 needs to point at element 0, on rank 1 at element 1024, on rank 2 at element 2048, and on rank 3 at element 3072. NCCL will then recognize it as a valid in-place operation. Any other overlap is unsupported and could cause data corruption.
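
A minimal sketch of that in-place ReduceScatter layout, using the same numbers (4096 floats, 4 ranks); comm, stream, and rank are assumed to be initialized elsewhere:

```cpp
// In-place ReduceScatter sketch: rank r's output chunk is the r-th
// recvcount-sized slice of the full input buffer.
size_t recvcount = 1024;  // 4096 / 4 elements per rank
float* sendbuff;          // full 4096-float input buffer
cudaMalloc((void**)&sendbuff, 4 * recvcount * sizeof(float));

// NCCL recognizes the call as in-place because
// recvbuff == sendbuff + rank * recvcount, i.e., rank 0 points at
// element 0, rank 1 at 1024, rank 2 at 2048, rank 3 at 3072.
float* recvbuff = sendbuff + rank * recvcount;
ncclReduceScatter(sendbuff, recvbuff, recvcount, ncclFloat, ncclSum, comm, stream);
```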
