fix: nvls all reduce correction factor #239

Open · wants to merge 1 commit into master

Conversation

@OrenLeung (Contributor) commented Jul 24, 2024

I was running single-server H100 (8x H100 SXM) nccl-tests and saw a Bus BW of 480 GB/s even though the line rate is 450 GB/s. I was confused and looked further into how bus BW is calculated, and it seems like it is calculated incorrectly for in-network reduction algos.

According to #212 (comment), the actual correction factor should be bus_bw = algo_bw * (n-1)/(n+1) instead of bus_bw = algo_bw * 2(n-1)/n.

This PR is probably not mergeable since NCCL_ALGO can be auto-picked or set in /etc/nccl.conf, and there doesn't seem to be an API for seeing which algo NCCL has chosen. Correction factors for CollnetDirect and CollnetChain on the IB network probably need to be updated too.

But I just wanted to put it here in case anyone else in the community is confused about how bus BW could be ~106% of the peak theoretical line rate.
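
For anyone skimming, here is a minimal standalone sketch (not the actual all_reduce.cu code) of what the factor change amounts to. The 274.3 GB/s algorithm bandwidth is a hypothetical number chosen so that the current factor reproduces the ~480 GB/s reading above; see also the maintainer's correction to the proposed factor further down the thread.

```c
#include <stdio.h>

int main(void) {
    int n = 8;            /* ranks: 8x H100 SXM in one server */
    double algbw = 274.3; /* GB/s, hypothetical algorithm bandwidth */

    /* Factor nccl-tests currently applies to allreduce (ring-style accounting). */
    double busbw_old = algbw * 2.0 * (n - 1) / n;           /* 2(n-1)/n  -> ~480 GB/s */

    /* Factor proposed in this PR for NVLS (in-switch reduction). */
    double busbw_nvls = algbw * (double)(n - 1) / (n + 1);  /* (n-1)/(n+1) -> ~213 GB/s */

    printf("current: %.1f GB/s, proposed: %.1f GB/s\n", busbw_old, busbw_nvls);
    return 0;
}
```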

Command

NCCL_ALGO=NVLS ./build/all_reduce_perf -b 8K -e 8G -f 2 -g 8

Before

[screenshot: nccl-tests all_reduce_perf output before the change]

After

[screenshot: nccl-tests all_reduce_perf output after the change]

Factor vs number of ranks

[plot: correction factor vs number of ranks]

NVLS read/write

[slide: NVLS read/write traffic]

@sjeaugey (Member) commented Sep 2, 2024

Sorry, my comment was incorrect. I fixed it. It's algobw = busbw * n / (n+1).
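
A rough back-of-the-envelope with the corrected relation and the numbers from the PR description, assuming the ~480 GB/s reading was produced with the 2(n-1)/n factor:

```c
#include <stdio.h>

int main(void) {
    int n = 8;
    double busbw_old = 480.0;                        /* GB/s, as reported with the 2(n-1)/n factor */
    double algbw = busbw_old * n / (2.0 * (n - 1));  /* undo the old factor: ~274 GB/s */

    /* Corrected relation: algobw = busbw * n/(n+1)  =>  busbw = algbw * (n+1)/n */
    double busbw_nvls = algbw * (n + 1.0) / n;       /* ~309 GB/s, below the 450 GB/s line rate */

    printf("algbw = %.1f GB/s, corrected busbw = %.1f GB/s\n", algbw, busbw_nvls);
    return 0;
}
```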

@sjeaugey (Member) commented Sep 2, 2024

Also note that the slide above is incorrect as well. It should read N-1 reads / N-1 writes in the left column, for a total of 2(N-1) sends and 2(N-1) receives.

@OrenLeung (Contributor, Author) commented

> Sorry, my comment was incorrect. I fixed it. It's algobw = busbw * n / (n+1).

@sjeaugey thanks for the clarification. For NVLSTree, what would the correction factor be?

@sjeaugey (Member) commented Sep 3, 2024

Well, this is where things get complicated. NVLSTree uses NVLS intra-node, but we use Tree inter-node. Tree is near-bandwidth-optimal: it exchanges 2×size instead of 2×(n-1)/n×size, except on 2 nodes, in which case it only exchanges size. So now we have a mix of intra-node NVLS and inter-node Tree, and the performance will depend on whichever is the bottleneck. On 2 nodes it will be NVLS; on 4+ nodes it will be Tree.
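
As a quick illustration of "near-bandwidth-optimal", here is a sketch that only uses the per-rank inter-node volumes stated above (Tree: 2×size, or 1×size on 2 nodes; ring: 2×(n-1)/n×size); the Tree overhead relative to a ring shrinks as the node count grows:

```c
#include <stdio.h>

int main(void) {
    for (int nodes = 2; nodes <= 32; nodes *= 2) {
        double ring = 2.0 * (nodes - 1) / nodes;  /* bandwidth-optimal volume: 2(n-1)/n x size */
        double tree = (nodes == 2) ? 1.0 : 2.0;   /* Tree volume: 1x size on 2 nodes, else 2x size */
        printf("%2d nodes: ring %.3fx  tree %.3fx  (tree overhead %.1f%%)\n",
               nodes, ring, tree, 100.0 * (tree / ring - 1.0));
    }
    return 0;
}
```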

And things can get worse. In the case of intra-node + inter-node, part of the intra-node traffic may be lightened because the inter-node part plays the role of one of the intra-node steps, meaning things are going to be really complicated to compute. That's the case for rings, which are limited to 370 GB/s intra-node, but when combining intra- and inter-node we only perform 7/8 of the steps, hence the network becomes the bottleneck at 395 GB/s. [That being said, as we limit ourselves to 16 SMs to limit SM and memory usage, we won't reach that peak BW with default settings.]

That's why trying to track how much bandwidth is going through each NVLink, PCI link, or network port is a very complex task, and not something you can easily reflect in a benchmark.

The notion of BusBW, as we defined it, is a theoretical correction factor based on what's needed to communicate between ranks using point-to-point transfers. With point-to-point communication, it gives a constant target as we scale instead of degrading, similarly to the broadcast operation, which always had a natural notion of bandwidth.

But except in simple cases like rings on a flat homogeneous topology, it does not really reflect the "bus" bandwidth (which isn't surprising given there are many different buses). It may reflect some mix of speeds of the different buses, and in the case of accelerators like SHARP it doesn't mean much anymore, since the algo bandwidth is now what should be constant at scale. But when we combine SHARP with non-SHARP, if the non-SHARP part becomes the bottleneck, then it may make sense again.

So you can consider the "Bus BW" as another bandwidth computation, with a correction factor that makes more sense in some cases. We can still compare the BusBW of ring vs NVLS to see how much faster one is versus the other. When NVLS gets 480 GB/s BusBW on 8 GPUs, it means that you would need 480 GB/s of NVLink bandwidth to get the same performance with a Ring or Alltoall algorithm.

Hope that explains what the goal of the "BusBW" is, and why we don't try to improve NCCL perf tests to reflect the real bandwidth of all buses.
