
Error distributed run #4

Open

snash4 opened this issue Jun 30, 2020 · 3 comments
snash4 commented Jun 30, 2020

Hi,
Thanks for the easy-to-follow tutorial on distributed processing.
I followed your example and it works fine on a single multi-GPU system, but when I run it on multiple nodes with 2 GPUs each I get the following error at runtime.

```
Traceback (most recent call last):
File "conv_dist.py", line 117, in
main()
File "conv_dist.py", line 51, in main
mp.spawn(train, nprocs=args.gpus, args=(args,), join=True)
File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/work/codebase/torch_dist/conv_dist.py", line 74, in train
model = DDP(model, device_ids=[gpu])
File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 285, in init
self.broadcast_bucket_size)
File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914838379/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
```
I'm not able to figure out the cause of this error.
Please help, thanks.
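For reference, conv_dist.py follows the usual mp.spawn + DDP pattern from the tutorial. A minimal sketch of the multi-node setup (simplified: a stand-in `nn.Linear` model and my own `NODE_RANK`/`WORLD_SIZE` environment variables instead of the tutorial's argparse flags) looks like this:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(gpu, node_rank, gpus_per_node, world_size):
    rank = node_rank * gpus_per_node + gpu       # global rank of this process
    dist.init_process_group(
        backend="nccl",          # NCCL backend, as in the tutorial
        init_method="env://",    # expects MASTER_ADDR / MASTER_PORT to be exported
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(gpu)
    model = nn.Linear(10, 10).cuda(gpu)          # stand-in for the tutorial's ConvNet
    # This is the step the traceback points at: DDP broadcasts the initial
    # parameters over NCCL, which is where the "unhandled system error" surfaces.
    model = DDP(model, device_ids=[gpu])
    dist.destroy_process_group()

if __name__ == "__main__":
    gpus_per_node = torch.cuda.device_count()
    node_rank = int(os.environ.get("NODE_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", gpus_per_node))
    mp.spawn(train, nprocs=gpus_per_node,
             args=(node_rank, gpus_per_node, world_size))
```

Each node launches this with MASTER_ADDR, MASTER_PORT, NODE_RANK and WORLD_SIZE set, one process per GPU via mp.spawn.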
snash4 changed the title from "Error" to "Error distributed run" on Jun 30, 2020
@vperekadan commented
Setting NCCL_SOCKET_IFNAME solved this issue for me.
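A minimal way to set it from the training script, assuming "eth0" is the interface that connects your nodes (the name is machine-specific):

```python
import os

# NCCL_SOCKET_IFNAME must be set before init_process_group() is called, so that
# NCCL binds to the interface that actually routes between the nodes.
# "eth0" is only an example; substitute the interface name on your own machines.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```

Exporting the variable in the shell before launching the script works just as well.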

@JingchaoZhang commented

> Setting NCCL_SOCKET_IFNAME solved this issue for me.

What value did you set it to?

@vperekadan commented

> > Setting NCCL_SOCKET_IFNAME solved this issue for me.
>
> What value did you set it to?

To my machine's network interface name.
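In case it helps others: the value is just the name of the NIC that connects the nodes. An illustrative way to list the candidates from Python (or use `ip addr` in the shell):

```python
import socket

# List this node's network interfaces (Unix only) and pick the one that carries
# traffic between the training nodes, e.g. eth0, ens3, or ib0 on an InfiniBand setup.
for index, name in socket.if_nameindex():
    print(index, name)
```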
