
Error distributed run #4

Open

snash4 opened this issue Jun 30, 2020 · 3 comments
snash4 commented Jun 30, 2020

Hi,
Thanks for the easy-to-follow tutorial on distributed processing.
I followed your example and it works fine on a single multi-GPU system, but when I run it on multiple nodes with 2 GPUs each I get the following error at runtime.

```
Traceback (most recent call last):
File "conv_dist.py", line 117, in
main()
File "conv_dist.py", line 51, in main
mp.spawn(train, nprocs=args.gpus, args=(args,), join=True)
File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/work/codebase/torch_dist/conv_dist.py", line 74, in train
model = DDP(model, device_ids=[gpu])
File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 285, in init
self.broadcast_bucket_size)
File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914838379/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
```
I'm not able to figure out the cause of this error.
Please help, thanks.
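For reference, conv_dist.py follows the usual mp.spawn + DDP pattern from the tutorial. A minimal sketch of the multi-node setup (simplified: a stand-in `nn.Linear` model and my own `NODE_RANK`/`WORLD_SIZE` environment variables instead of the tutorial's argparse flags) looks like this:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(gpu, node_rank, gpus_per_node, world_size):
    rank = node_rank * gpus_per_node + gpu       # global rank of this process
    dist.init_process_group(
        backend="nccl",          # NCCL backend, as in the tutorial
        init_method="env://",    # expects MASTER_ADDR / MASTER_PORT to be exported
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(gpu)
    model = nn.Linear(10, 10).cuda(gpu)          # stand-in for the tutorial's ConvNet
    # This is the step the traceback points at: DDP broadcasts the initial
    # parameters over NCCL, which is where the "unhandled system error" surfaces.
    model = DDP(model, device_ids=[gpu])
    dist.destroy_process_group()

if __name__ == "__main__":
    gpus_per_node = torch.cuda.device_count()
    node_rank = int(os.environ.get("NODE_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", gpus_per_node))
    mp.spawn(train, nprocs=gpus_per_node,
             args=(node_rank, gpus_per_node, world_size))
```

Each node launches this with MASTER_ADDR, MASTER_PORT, NODE_RANK and WORLD_SIZE set, one process per GPU via mp.spawn.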
snash4 changed the title from "Error" to "Error distributed run" on Jun 30, 2020
@vperekadan commented
Setting NCCL_SOCKET_IFNAME solved this issue for me.
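A minimal way to set it from the training script, assuming "eth0" is the interface that connects your nodes (the name is machine-specific):

```python
import os

# NCCL_SOCKET_IFNAME must be set before init_process_group() is called, so that
# NCCL binds to the interface that actually routes between the nodes.
# "eth0" is only an example; substitute the interface name on your own machines.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```

Exporting the variable in the shell before launching the script works just as well.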

@JingchaoZhang commented

> Setting NCCL_SOCKET_IFNAME solved this issue for me.

What value did you set it to?

@vperekadan commented

> > Setting NCCL_SOCKET_IFNAME solved this issue for me.
>
> What value did you set it to?

To my machine's network interface name.
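In case it helps others: the value is just the name of the NIC that connects the nodes. An illustrative way to list the candidates from Python (or use `ip addr` in the shell):

```python
import socket

# List this node's network interfaces (Unix only) and pick the one that carries
# traffic between the training nodes, e.g. eth0, ens3, or ib0 on an InfiniBand setup.
for index, name in socket.if_nameindex():
    print(index, name)
```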
