How to use PCIe peer-to-peer or NVLink between two containers that each have an isolated GPU #10070

Open
linxiaochou opened this issue Aug 20, 2024 · 3 comments

Comments

@linxiaochou

linxiaochou commented Aug 20, 2024

I am a new user of UCX. I have a setup where two containers each use a different GPU. On the host, the two GPUs can communicate via PCIe P2P or NVLink, but from inside the containers they cannot.

I am looking for a way to solve this problem.

See the "NVLink and Docker/Kubernetes" section of the ucx-py readthedocs documentation: "In order to use NVLink when running in containers using Docker and/or Kubernetes the processes must share an IPC namespace for NVLink to work correctly."

Can UCX solve this problem? If so, how?
Any assistance would be greatly appreciated.

@rakhmets
Contributor

Please try sharing the PID namespace between the containers. For example, add the following option to the command that starts the first container:

--name docker_1

and add this option to the command that starts the second container:

--pid=container:docker_1

The two containers will then share a PID namespace.
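One way to confirm that the options took effect is to compare namespace identifiers from inside each container. The following is a minimal sketch, not part of UCX or Docker: the helper names `namespace_ids` and `shared_namespaces` are hypothetical, and the script only reads the standard Linux `/proc/<pid>/ns/` links. Matching identifiers show that the namespaces are shared; they do not by themselves guarantee that CUDA IPC will succeed.

```python
import os


def namespace_ids(pid="self", kinds=("pid", "ipc")):
    """Return {kind: identifier} for a process, e.g. {'pid': 'pid:[4026531836]'}.

    The identifier is the target of the /proc/<pid>/ns/<kind> symlink.
    A value of None means the link could not be read (for example on a
    non-Linux system or without permission).
    """
    ids = {}
    for kind in kinds:
        try:
            ids[kind] = os.readlink(f"/proc/{pid}/ns/{kind}")
        except OSError:
            ids[kind] = None
    return ids


def shared_namespaces(a, b):
    """Compare two results of namespace_ids(); True where both sides match."""
    return {kind: a[kind] is not None and a[kind] == b.get(kind) for kind in a}


if __name__ == "__main__":
    # Run this in each container and compare the printed values by eye,
    # or copy one result into the other container and call shared_namespaces().
    print(namespace_ids())
```

If the `pid` entries differ between the two containers, the `--pid=container:...` option was not applied; if the `ipc` entries differ, the containers do not share an IPC namespace.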

@linxiaochou
Author

@rakhmets Thank you for your reply and suggestions.

I tried your suggestion as follows.
The first container:
docker run --name master -it --rm --gpus device=0 --network bridge --ipc host -v $(pwd):/data --entrypoint /bin/bash nvcr.io/nvidia/pytorch:24.01-py3
The second container:
docker run -it --rm --gpus device=1 --network bridge --ipc host --pid 'container:master' -v $(pwd):/data --entrypoint /bin/bash nvcr.io/nvidia/pytorch:24.01-py3

Each container uses a different GPU. The topology reported by nvidia-smi topo -m is:

      GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  NIC2  NIC3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV12  SYS   SYS   PIX   PIX   SYS   SYS   0-15,32-47    0              N/A
GPU1  NV12  X     SYS   SYS   PIX   PIX   SYS   SYS   0-15,32-47    0              N/A
GPU2  SYS   SYS   X     NV12  SYS   SYS   PIX   PIX   16-31,48-63   1              N/A
GPU3  SYS   SYS   NV12  X     SYS   SYS   PIX   PIX   16-31,48-63   1              N/A
NIC0  PIX   PIX   SYS   SYS   X     PIX   SYS   SYS
NIC1  PIX   PIX   SYS   SYS   PIX   X     SYS   SYS
NIC2  SYS   SYS   PIX   PIX   SYS   SYS   X     PIX
NIC3  SYS   SYS   PIX   PIX   SYS   SYS   PIX   X
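In this matrix, an NV12 entry means the two devices are connected through a bonded set of 12 NVLinks, while SYS means the path crosses PCIe and the interconnect between NUMA nodes. So GPU0 and GPU1 (the two GPUs given to the containers) are NVLink peers. As a rough sketch, the matrix can be read programmatically like this; `parse_topo_matrix` and `is_nvlink` are hypothetical helper names, and the parser assumes the whitespace-separated layout shown above.

```python
import re

# Device names as they appear in the matrix header and row labels.
DEVICE = re.compile(r"^(GPU|NIC)\d+$")

SAMPLE = """\
      GPU0  GPU1  GPU2  GPU3  NIC0  NIC1  NIC2  NIC3  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV12  SYS   SYS   PIX   PIX   SYS   SYS   0-15,32-47    0              N/A
GPU1  NV12  X     SYS   SYS   PIX   PIX   SYS   SYS   0-15,32-47    0              N/A
GPU2  SYS   SYS   X     NV12  SYS   SYS   PIX   PIX   16-31,48-63   1              N/A
GPU3  SYS   SYS   NV12  X     SYS   SYS   PIX   PIX   16-31,48-63   1              N/A
"""


def parse_topo_matrix(text):
    """Map (row_device, column_device) to the link type, e.g. 'NV12' or 'SYS'."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # Keep only header tokens that name a device; this skips the
    # multi-word "CPU Affinity" style columns at the end.
    columns = [tok for tok in rows[0] if DEVICE.match(tok)]
    links = {}
    for tokens in rows[1:]:
        if not DEVICE.match(tokens[0]):
            continue  # not a device row (e.g. a legend line)
        for col, cell in zip(columns, tokens[1:1 + len(columns)]):
            links[(tokens[0], col)] = cell
    return links


def is_nvlink(link):
    """True for entries such as 'NV12' (a bonded set of NVLinks)."""
    return link.startswith("NV")


if __name__ == "__main__":
    links = parse_topo_matrix(SAMPLE)
    print("GPU0-GPU1:", links[("GPU0", "GPU1")], is_nvlink(links[("GPU0", "GPU1")]))
    print("GPU0-GPU2:", links[("GPU0", "GPU2")], is_nvlink(links[("GPU0", "GPU2")]))
```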

I then ran the following commands in the containers.
The first container:
torchrun --nnodes 2 --nproc_per_node 1 --node_rank 0 --master_addr 172.17.0.2 --master_port 29400 multinode.py
The second container:
torchrun --nnodes 2 --nproc_per_node 1 --node_rank 1 --master_addr 172.17.0.2 --master_port 29400 multinode.py

As a result, the first container reported an error with the following output:

[1724241372.462397] [2a292d2c18cc:984 :0] tl_cuda_cache.c:231 UCC ERROR ipc-cache: failed to open ipc mem handle. addr:0x7fe456000000 len:16777216 err:1
Traceback (most recent call last):
  File "/data/multinode.py", line 141, in <module>
    main(args.save_every, args.total_epochs, args.batch_size)
  File "/data/multinode.py", line 128, in main
    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
  File "/data/multinode.py", line 65, in __init__
    self.model = DDP(self.model, device_ids=[self.local_rank])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 783, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 264, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2024-08-21 11:56:17,385] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 984) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
.
.
.
Root Cause (first observed failure):
[0]:
time : 2024-08-21_11:56:17
host : 2a292d2c18cc
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 984)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

The second container hangs with no output.

My understanding is that UCC is a communication library built on top of UCX; please correct me if that is wrong. I then looked at the code where the UCC error is raised, and it uses the CUDA IPC interface.

Does this interface require that both GPUs be visible to each process, rather than split across containers?
To test this, I mounted both GPUs into both containers using the --gpus parameter, so that each container sees the same two GPUs.
This time it appeared to work: both containers produced output. However, nvidia-smi showed that GPU0 was used by both containers, while GPU1 was not used at all.

So I would like to ask whether this error is caused by UCC. If so, could you please give an example that uses UCX?
Looking forward to your reply and suggestions.

@rakhmets
Contributor

Yes, UCC is a communication library that provides interfaces for collective operations. UCC uses UCX as one of its possible transports for point-to-point communication.
I guess the two processes in different containers end up on the same device because each process takes the first device available on the system. For example, you can set CUDA_VISIBLE_DEVICES=0 in one container and CUDA_VISIBLE_DEVICES=1 in the other to force the use of different devices.
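The same effect can be had from inside the training script instead of the shell. The sketch below is an assumption-laden illustration, not a UCC or UCX API: `pin_gpu_by_rank` is a hypothetical helper, it assumes one worker per container, that both containers see both GPUs with the same indexing, and that the launcher (torchrun) has set the RANK environment variable. It has to run before CUDA is initialized in the process, that is, before the first CUDA call.

```python
import os


def pin_gpu_by_rank(env=os.environ, rank_var="RANK"):
    """Restrict this process to the GPU whose index equals its global rank.

    Sets CUDA_VISIBLE_DEVICES in `env` and returns the value that was set.
    If the rank variable is missing, rank 0 is assumed.
    """
    rank = int(env.get(rank_var, "0"))
    env["CUDA_VISIBLE_DEVICES"] = str(rank)
    return env["CUDA_VISIBLE_DEVICES"]


if __name__ == "__main__":
    # Call this at the very top of the script, before any CUDA work starts.
    print("CUDA_VISIBLE_DEVICES =", pin_gpu_by_rank())
```

With CUDA_VISIBLE_DEVICES set to a single index, the process sees that GPU as device 0, so code that selects the device by local rank (0 in both containers here) should keep working while each container lands on a different physical GPU.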
