Undefined symbol: ncclCommRegister #64

Open
MC952-arch opened this issue Oct 21, 2024 · 1 comment

Comments

@MC952-arch

Hi, I've run into an MSCCL issue when running the allreduce test with the latest msccl, nccl-tests, and msccl-tools repos.

// msccl install step
git clone https://github.com/microsoft/msccl.git
cd msccl/
make -j src.build
cd ../

// nccl-tests install step
git clone https://github.com/nvidia/nccl-tests.git
cd nccl-tests/
make MPI=1 NCCL_HOME=../msccl/build/ -j
cd ../

// msccl-tools install step
git clone https://github.com/microsoft/msccl-tools.git
cd msccl-tools/
pip install .
cd ../
python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml

// allreduce test
mpirun -np 8 -x LD_LIBRARY_PATH=msccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x MSCCL_XML_FILES=test.xml -x NCCL_ALGO=MSCCL,RING,TREE nccl-tests/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0

// Error
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclCommRegister

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[14987,1],0]
Exit code: 127

Can you help me figure out this issue?
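
A "symbol lookup error ... undefined symbol" at run time means the dynamic loader could not find the symbol in any shared library the process loaded. One way to narrow this down is to check whether the library being picked up actually exports `ncclCommRegister`. Below is a minimal sketch using Python's `ctypes`; the helper name `exports_symbol` is hypothetical, and the path `msccl/build/lib/libnccl.so` is an assumption based on the `LD_LIBRARY_PATH` in the mpirun command above.

```python
import ctypes


def exports_symbol(lib_path, symbol):
    """Return True if `symbol` can be resolved through the library at lib_path.

    Passing lib_path=None inspects the symbols already visible in the
    current process instead of loading a new library.
    """
    # Raises OSError if the library (or one of its dependencies) cannot be loaded.
    lib = ctypes.CDLL(lib_path)
    # Attribute access makes ctypes look the name up with dlsym; a missing
    # symbol surfaces as AttributeError, which hasattr turns into False.
    # Note that dlsym also searches the library's own dependencies.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Assumed location, taken from the LD_LIBRARY_PATH in the mpirun command.
    path = "msccl/build/lib/libnccl.so"
    try:
        found = exports_symbol(path, "ncclCommRegister")
        print(path, "exports ncclCommRegister:", found)
    except OSError as err:
        print("could not load", path, "->", err)
```

If this reports that the symbol is missing, the test binary was built expecting a newer NCCL API than the library loaded at run time provides.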

@MoFHeka

MoFHeka commented Nov 12, 2024

Check out an older nccl-tests version, or define the macro NCCL_VERSION_CODE to a value lower than 21900 (the encoding of NCCL 2.19.0).
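
For context on the number 21900: NCCL packs its version into a single integer, and 21900 corresponds to 2.19.0. The suggestion presumably works because nccl-tests only references `ncclCommRegister` when the NCCL header it is compiled against reports at least that version. Here is a small sketch of the encoding, mirroring the `NCCL_VERSION(X, Y, Z)` macro in `nccl.h` as I understand it; the Python names `nccl_version_code` and `REGISTER_THRESHOLD` are hypothetical and only for illustration.

```python
def nccl_version_code(major, minor, patch):
    """Encode a version the way the NCCL_VERSION(X, Y, Z) macro does."""
    if major <= 2 and minor <= 8:
        # Legacy encoding used for versions up to 2.8.x.
        return major * 1000 + minor * 100 + patch
    return major * 10000 + minor * 100 + patch


# The threshold mentioned in this thread, i.e. NCCL 2.19.0.
REGISTER_THRESHOLD = nccl_version_code(2, 19, 0)

if __name__ == "__main__":
    for version in [(2, 8, 3), (2, 12, 12), (2, 19, 0)]:
        code = nccl_version_code(*version)
        print(version, "->", code, "below threshold:", code < REGISTER_THRESHOLD)
```

Comparing the `NCCL_VERSION_CODE` in the `nccl.h` that nccl-tests was built against with this threshold shows which side of the check a given build falls on.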
