Hi,
Thanks for the easy-to-follow tutorial on distributed processing.
I followed your example and it works fine on a single multi-GPU system, but when I run it on multiple nodes with 2 GPUs each I get an error at runtime.
```
Traceback (most recent call last):
  File "conv_dist.py", line 117, in <module>
    main()
  File "conv_dist.py", line 51, in main
    mp.spawn(train, nprocs=args.gpus, args=(args,), join=True)
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/dine2/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/work/codebase/torch_dist/conv_dist.py", line 74, in train
    model = DDP(model, device_ids=[gpu])
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 285, in __init__
    self.broadcast_bucket_size)
  File "/dine2/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 496, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1591914838379/work/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
```
I'm not able to figure out the cause of the error.
Please help, thanks.
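
For reference, the relevant parts of conv_dist.py follow the tutorial's pattern. Here is a minimal sketch of that setup (argument names like args.nr, the placeholder model, and the MASTER_ADDR/MASTER_PORT values are illustrative, not exact copies of my script):

```python
import os
from argparse import Namespace

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def train(gpu, args):
    # Global rank = node index * GPUs per node + local GPU index
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(
        backend='nccl',
        init_method='env://',        # reads MASTER_ADDR / MASTER_PORT
        world_size=args.world_size,  # nodes * GPUs per node
        rank=rank,
    )
    torch.cuda.set_device(gpu)
    model = nn.Conv2d(3, 16, 3).cuda(gpu)  # placeholder model
    # This is the line that raises the NCCL error on multi-node runs.
    # (Setting NCCL_DEBUG=INFO in the environment makes NCCL log why
    # the broadcast fails, e.g. interface or connectivity problems.)
    model = DDP(model, device_ids=[gpu])
    # ... training loop ...
    dist.destroy_process_group()


def main():
    # Illustrative launch values: 2 nodes with 2 GPUs each; NODE_RANK is
    # 0 on the master node and 1 on the other. MASTER_ADDR/MASTER_PORT
    # must point at node 0 and be reachable from every node.
    args = Namespace(nodes=2, gpus=2, nr=int(os.environ.get('NODE_RANK', '0')))
    args.world_size = args.nodes * args.gpus
    os.environ.setdefault('MASTER_ADDR', '10.0.0.1')
    os.environ.setdefault('MASTER_PORT', '8888')
    # One process per local GPU; each process joins the same process group.
    mp.spawn(train, nprocs=args.gpus, args=(args,), join=True)


if __name__ == '__main__':
    main()
```

This same script is launched on both nodes, differing only in the node rank, which is what the tutorial describes for multi-node runs.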