Enhance Multi-Node NCCL Testing with Torch C10D Gloo Framework #243
This patch introduces support for running multi-process, multi-node NCCL tests using the Torch c10d Gloo distributed framework.

Previously, running multi-node NCCL tests required MPI, which relies on SSH or Kubexec (in Kubernetes) to access worker nodes. This setup posed deployment and security challenges due to the need for maintaining SSH keys or Kubexec RBAC policies.

With the introduction of C10D Gloo, worker nodes now communicate with the master node over TCP transport. This simplifies the process, making it similar to running multi-node PyTorch training jobs. Users only need to set the following environment variables to start the test:

- MASTER_ADDR
- RANK
- WORLD_SIZE

>> Dependencies

PyTorch C++ APIs and libraries are required. Download LibTorch with the following commands:

```
cd /tmp/
wget https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip
unzip libtorch-shared-with-deps-latest.zip
sudo mv libtorch /usr/local/
```

>> Build instructions

To build the NCCL test binaries supporting both MPI and C10D Gloo, use:

```
MPI=1 GLOO=1 make
```

>> Usage

>>>> Run a Single 8-GPU Node NCCL Test:

1. Set environment variables:

```
export NCCL_TOPO_FILE=<topo_file_location>
export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
```

2. Execute the test:

```
#!/bin/bash
for i in {0..7}; do
  MASTER_ADDR=localhost RANK=$i WORLD_SIZE=8 ./all_reduce_perf -b1G -e2G -f2 -t1 -g1 &
done
wait
```

>>>> Run a Two-Node NCCL Test:

Node 1:

1. Set environment variables:

```
export NCCL_TOPO_FILE=<topo_file_location>
export MASTER_ADDR=<master_node_ip_address>
export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
```

2. Execute the test:

```
RANK=0 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8
```

Node 2:

1. Set environment variables:

```
export NCCL_TOPO_FILE=<topo_file_location>
export MASTER_ADDR=<master_node_ip_address>
export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
```

2. Execute the test:

```
RANK=1 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8
```
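The description says a run is driven entirely by `MASTER_ADDR`, `RANK`, and `WORLD_SIZE`. As a rough illustration of what validating those variables up front could look like, here is a minimal sketch using only the C++ standard library. The names `RendezvousConfig`, `parseNonNegativeInt`, and `loadRendezvousConfig` are hypothetical and are not taken from the patch; the actual code in `src/common.cu` may read and check these values differently.

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical holder for the three settings named in the PR description.
struct RendezvousConfig {
  std::string masterAddr;
  int rank;
  int worldSize;
};

// Parses a non-negative integer and rejects trailing characters.
inline int parseNonNegativeInt(const std::string& text, const char* name) {
  std::size_t consumed = 0;
  int value = 0;
  try {
    value = std::stoi(text, &consumed);
  } catch (const std::exception&) {
    throw std::invalid_argument(std::string(name) + " is not an integer: " + text);
  }
  if (consumed != text.size() || value < 0) {
    throw std::invalid_argument(std::string(name) + " must be a non-negative integer: " + text);
  }
  return value;
}

// Reads MASTER_ADDR, RANK and WORLD_SIZE from the environment.
// Assumption: all three are mandatory and RANK must be below WORLD_SIZE.
inline RendezvousConfig loadRendezvousConfig() {
  const char* addr = std::getenv("MASTER_ADDR");
  const char* rank = std::getenv("RANK");
  const char* world = std::getenv("WORLD_SIZE");
  if (addr == nullptr || rank == nullptr || world == nullptr) {
    throw std::runtime_error("MASTER_ADDR, RANK and WORLD_SIZE must all be set");
  }
  RendezvousConfig cfg;
  cfg.masterAddr = addr;
  cfg.rank = parseNonNegativeInt(rank, "RANK");
  cfg.worldSize = parseNonNegativeInt(world, "WORLD_SIZE");
  if (cfg.worldSize < 1 || cfg.rank >= cfg.worldSize) {
    throw std::out_of_range("RANK must be smaller than WORLD_SIZE");
  }
  return cfg;
}
```

Failing fast on a missing or inconsistent variable gives a clearer error than letting a rank hang while waiting for peers that never join.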
- NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11
- CXXFLAGS := -std=c++11
+ NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++17
+ CXXFLAGS := -std=c++17
I don't think we can force all users to move to c++17 just for this feature.
I agree. I can make the C++17 requirement conditional so that it applies only to GLOO builds.
#ifdef MPI_SUPPORT
MPI_Barrier(MPI_COMM_WORLD);
#endif
if (!use_c10d_gloo) {
I don't understand why we need a boolean and these new if statements.
We normally build separate binaries: a standalone build for single-node runs and an MPI=1 build for multi-node runs.
I expected that we would build standalone, MPI=1, and GLOO=1 binaries.
This boolean ensures that only one transport is selected at run time if a user builds a single binary with both MPI=1 and GLOO=1.
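To make the run-time selection concrete, here is a small sketch of how a binary built with both transports could settle on exactly one. The thread does not say how `use_c10d_gloo` is actually set, so the rule below (prefer Gloo when it is compiled in and `MASTER_ADDR` is present, otherwise fall back to MPI) is purely an assumption, and `Transport`, `BuildFeatures`, and `selectTransport` are hypothetical names.

```cpp
// Hypothetical set of bootstrap transports a test binary might support.
enum class Transport { None, Mpi, Gloo };

// Hypothetical description of what was enabled at compile time
// (for example via MPI=1 and GLOO=1 on the make command line).
struct BuildFeatures {
  bool mpi;
  bool gloo;
};

// Picks exactly one transport.
// Assumed rule, not confirmed by the patch: Gloo wins when it is compiled in
// and MASTER_ADDR is set; otherwise MPI is used if compiled in; otherwise the
// binary runs standalone.
inline Transport selectTransport(const BuildFeatures& build, bool masterAddrSet) {
  if (build.gloo && masterAddrSet) {
    return Transport::Gloo;
  }
  if (build.mpi) {
    return Transport::Mpi;
  }
  return Transport::None;
}
```

Centralizing the decision in one function keeps call sites such as the barrier guard to a single comparison, which may address part of the reviewer's concern about scattered `if` statements.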
src/common.cu
Outdated
auto options = c10d::ProcessGroupGloo::Options::create();
// Create Gloo device that binds to any interface.
::gloo::transport::tcp::attr tcp_attr;
tcp_attr.iface = "eth0";
Why is the interface name hardcoded?
Good catch. I will fix it to be configurable by an env variable. Thanks.
Use "GLOO_INTERFACE" env to specify the network interface.
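A minimal sketch of how the interface name could be read from `GLOO_INTERFACE` instead of being hardcoded is shown below. The helper name `glooInterfaceOrDefault` is hypothetical, and falling back to a caller-supplied default when the variable is unset or empty is an assumption; the follow-up commit may handle the unset case differently.

```cpp
#include <cstdlib>
#include <string>

// Returns the interface named by GLOO_INTERFACE, or `fallback` when the
// variable is unset or empty. The fallback behavior is an assumption.
inline std::string glooInterfaceOrDefault(const std::string& fallback) {
  const char* value = std::getenv("GLOO_INTERFACE");
  if (value == nullptr || *value == '\0') {
    return fallback;
  }
  return std::string(value);
}
```

With a helper like this, the assignment in the diff above could become `tcp_attr.iface = glooInterfaceOrDefault("eth0");`, which keeps the previous behavior on hosts that do not set the variable.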