Hi, I've encountered an msccl issue when running the allreduce test with the latest nccl/nccl-tests/msccl repos.
// msccl install step
git clone https://github.com/microsoft/msccl.git
cd msccl/
make -j src.build
cd ../
// nccl-tests install step
git clone https://github.com/nvidia/nccl-tests.git
cd nccl-tests/
make MPI=1 NCCL_HOME=../msccl/build/ -j
cd ../
// msccl-tools install step
git clone https://github.com/microsoft/msccl-tools.git
cd msccl-tools/
pip install .
cd ../
python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml
cd ../
// allreduce test
mpirun -np 8 -x LD_LIBRARY_PATH=msccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x MSCCL_XML_FILES=test.xml -x NCCL_ALGO=MSCCL,RING,TREE nccl-tests/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0
// Error
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[14987,1],0]
Exit code: 127
Can you help me figure out this issue?

Check out an older nccl-tests version, or define the macro NCCL_VERSION_CODE to a value lower than 21900 (that is, below NCCL 2.19.0) when building nccl-tests.
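Before trying either workaround, it may help to confirm the mismatch the error message suggests. The following is a minimal sketch, not a verified procedure. It assumes the header and library live at `msccl/build/include/nccl.h` and `msccl/build/lib/libnccl.so` (paths inferred from the build steps in the post, and they may differ on your machine). The helper names `header_version_code` and `exports_symbol` are hypothetical, and GNU `nm` is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helper: print the NCCL_VERSION_CODE value defined in a given nccl.h.
header_version_code() {
  awk '$1 == "#define" && $2 == "NCCL_VERSION_CODE" { print $3; exit }' "$1"
}

# Hypothetical helper: succeed if the shared library exports the given symbol.
# Relies on GNU nm; a missing file or missing nm counts as "not exported".
exports_symbol() {
  nm -D --defined-only "$1" 2>/dev/null | awk '{ print $NF }' | grep -Eq "^$2(@|\$)"
}

# Assumed locations, taken from the build steps in the post; override as needed.
NCCL_HEADER="${NCCL_HEADER:-msccl/build/include/nccl.h}"
NCCL_LIB="${NCCL_LIB:-msccl/build/lib/libnccl.so}"

if [ -f "$NCCL_HEADER" ]; then
  code=$(header_version_code "$NCCL_HEADER")
  echo "header reports NCCL_VERSION_CODE=${code:-unknown}"
  # 21900 is the threshold mentioned in the reply.
  if [ -n "$code" ] && [ "$code" -lt 21900 ]; then
    echo "header is below the 21900 threshold"
  fi
else
  echo "header not found: $NCCL_HEADER"
fi

if exports_symbol "$NCCL_LIB" ncclCommRegister; then
  echo "library exports ncclCommRegister"
else
  echo "library does not export ncclCommRegister (or could not be read)"
fi
```

If the library does not export the symbol while the header used at compile time reports 21900 or higher, that would point to nccl-tests having been compiled against a different NCCL header than the library it loads at run time.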