BW test on V100 4 GPUS is not matched with InfiniBand EDR (Connect-X4) #251

Open
javak87 opened this issue Sep 12, 2024 · 1 comment

javak87 commented Sep 12, 2024

Dear developers,
I recently ran the NCCL tests on two nodes, each with the following hardware:
2× InfiniBand EDR (ConnectX-4)
4× NVIDIA V100 GPU, 16 GB HBM

To the best of my knowledge, the NCCL tests measure bandwidth per direction. With two EDR links per node (2 × 100 Gb/s), I would therefore expect at most 25 GB/s. However, I am getting 41.55 GB/s, which is significantly higher than the theoretical bandwidth.

Here is the topology matrix:

        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV2     NV2     NV2     PIX     SYS     0-19,40-59      0               N/A
GPU1    NV2      X      NV2     NV2     PIX     SYS     0-19,40-59      0               N/A
GPU2    NV2     NV2      X      NV2     SYS     PIX     20-39,60-79     1               N/A
GPU3    NV2     NV2     NV2      X      SYS     PIX     20-39,60-79     1               N/A
NIC0    PIX     PIX     SYS     SYS      X      SYS
NIC1    SYS     SYS     PIX     PIX     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

Here is the SLURM job that I submitted:

#SBATCH -p gpu
#SBATCH --time=00:05:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

srun --mpi=pspmix apptainer exec \
    --nv \
    --env NCCL_TOPO_DUMP_FILE=./projects/nccl/topology.xml \
    --env CUDA_VISIBLE_DEVICES=0,1,2,3 \
    "./images/nccl_eval.sif" \
    ./nccl-tests/build/all_reduce_perf \
    -b 8 -e 8G -f 2 -g 1 -t 1

Here is the output of this test:

=============
== PyTorch ==
=============

NVIDIA Release 24.06 (build 96418707)
PyTorch Version 2.4.0a0+f70bd71
Container image Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2024 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.5 driver version 555.42.02 with kernel driver version 550.54.15.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

mpirun (Open MPI) 4.1.7a1

Report bugs to http://www.open-mpi.org/community/help/
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  19455 on  000 device  0 [0x60] Tesla V100-SXM2-16GB
#  Rank  1 Group  0 Pid  19408 on  000 device  1 [0x61] Tesla V100-SXM2-16GB
#  Rank  2 Group  0 Pid  19525 on  000 device  2 [0x88] Tesla V100-SXM2-16GB
#  Rank  3 Group  0 Pid  19489 on  000 device  3 [0x89] Tesla V100-SXM2-16GB
#  Rank  4 Group  0 Pid  25306 on  003 device  0 [0x60] Tesla V100-SXM2-16GB
#  Rank  5 Group  0 Pid  25349 on  003 device  1 [0x61] Tesla V100-SXM2-16GB
#  Rank  6 Group  0 Pid  25419 on  003 device  2 [0x88] Tesla V100-SXM2-16GB
#  Rank  7 Group  0 Pid  25383 on  003 device  3 [0x89] Tesla V100-SXM2-16GB
#
# Reducing maxBytes to 5284866730 due to memory limitation
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    25.37    0.00    0.00      0    24.43    0.00    0.00      0
          16             4     float     sum      -1    25.12    0.00    0.00      0    24.20    0.00    0.00      0
          32             8     float     sum      -1    25.06    0.00    0.00      0    24.33    0.00    0.00      0
          64            16     float     sum      -1    25.44    0.00    0.00      0    24.18    0.00    0.00      0
         128            32     float     sum      -1    24.90    0.01    0.01      0    24.97    0.01    0.01      0
         256            64     float     sum      -1    25.29    0.01    0.02      0    24.80    0.01    0.02      0
         512           128     float     sum      -1    26.07    0.02    0.03      0    25.20    0.02    0.04      0
        1024           256     float     sum      -1    26.37    0.04    0.07      0    26.69    0.04    0.07      0
        2048           512     float     sum      -1    28.39    0.07    0.13      0    27.79    0.07    0.13      0
        4096          1024     float     sum      -1    30.08    0.14    0.24      0    29.44    0.14    0.24      0
        8192          2048     float     sum      -1    31.55    0.26    0.45      0    30.79    0.27    0.47      0
       16384          4096     float     sum      -1    32.94    0.50    0.87      0    31.32    0.52    0.92      0
       32768          8192     float     sum      -1    38.32    0.86    1.50      0    36.61    0.90    1.57      0
       65536         16384     float     sum      -1    58.20    1.13    1.97      0    55.77    1.18    2.06      0
      131072         32768     float     sum      -1    66.68    1.97    3.44      0    65.17    2.01    3.52      0
      262144         65536     float     sum      -1    67.53    3.88    6.79      0    65.42    4.01    7.01      0
      524288        131072     float     sum      -1    91.23    5.75   10.06      0    90.23    5.81   10.17      0
     1048576        262144     float     sum      -1    133.4    7.86   13.76      0    132.6    7.91   13.84      0
     2097152        524288     float     sum      -1    224.6    9.34   16.34      0    225.4    9.30   16.28      0
     4194304       1048576     float     sum      -1    311.2   13.48   23.59      0    310.7   13.50   23.62      0
     8388608       2097152     float     sum      -1    574.1   14.61   25.57      0    572.9   14.64   25.62      0
    16777216       4194304     float     sum      -1   1096.5   15.30   26.78      0   1093.8   15.34   26.84      0
    33554432       8388608     float     sum      -1   1996.6   16.81   29.41      0   1992.7   16.84   29.47      0
    67108864      16777216     float     sum      -1   3799.6   17.66   30.91      0   3810.4   17.61   30.82      0
   134217728      33554432     float     sum      -1   6996.0   19.18   33.57      0   6996.8   19.18   33.57      0
   268435456      67108864     float     sum      -1    14191   18.92   33.10      0    14208   18.89   33.06      0
   536870912     134217728     float     sum      -1    22975   23.37   40.89      0    22949   23.39   40.94      0
  1073741824     268435456     float     sum      -1    45482   23.61   41.31      0    45530   23.58   41.27      0
  2147483648     536870912     float     sum      -1    90596   23.70   41.48      0    90636   23.69   41.46      0
  4294967296    1073741824     float     sum      -1   180880   23.74   41.55      0   180971   23.73   41.53      0
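
The two bandwidth columns in the output above measure different things, and only one of them is directly comparable to the link rate. The sketch below recomputes the last out-of-place row, assuming the all-reduce convention documented for nccl-tests (busbw = algbw × 2(n−1)/n) and a raw EDR rate of 100 Gb/s per port with no protocol overhead. The function names are illustrative and not part of nccl-tests.

```python
# Figures taken from the last out-of-place row of the all_reduce_perf output.
SIZE_BYTES = 4294967296
TIME_US = 180880
N_RANKS = 8  # 2 nodes x 4 GPUs

# Assumption: EDR InfiniBand runs at 100 Gb/s per port, two NICs per node.
EDR_GBIT_PER_S = 100
NICS_PER_NODE = 2


def link_bandwidth_gbytes(gbit_per_s: float, n_links: int) -> float:
    """Aggregate raw link rate in GB/s (8 bits per byte, overhead ignored)."""
    return gbit_per_s * n_links / 8


def algbw_gbytes(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth: message size divided by elapsed time, in GB/s."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2*(n-1)/n."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


network = link_bandwidth_gbytes(EDR_GBIT_PER_S, NICS_PER_NODE)
algbw = algbw_gbytes(SIZE_BYTES, TIME_US)
busbw = allreduce_busbw(algbw, N_RANKS)

print(f"network limit: {network:.2f} GB/s")  # 25.00
print(f"algbw:         {algbw:.2f} GB/s")    # 23.74
print(f"busbw:         {busbw:.2f} GB/s")    # 41.55
```

Under these assumptions, the 41.55 GB/s figure is the busbw column, which carries the 1.75× factor for 8 ranks, while the algbw of 23.74 GB/s stays below the 25 GB/s link rate.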

I have also attached the topology of the machine:
V100_topology.txt

I would appreciate it if you could add some comments on my findings and help me understand this discrepancy.
Thanks

javak87 changed the title from "Infiniband BW is not matched with V100" to "BW test on V100 4 GPUS is not matched with InfiniBand EDR (Connect-X4)" on Sep 12, 2024
sjeaugey (Member) commented Jan 6, 2025

Two nodes is a special case where the Tree algorithm can achieve higher bandwidth (as I have explained in many other issues). If you run on 4 nodes, you should be back to the 24 GB/s you would expect. On two nodes, you can also force the Ring algorithm to get back to 24 GB/s.
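
The reply does not say how to force the Ring algorithm. One common way is NCCL's `NCCL_ALGO` environment variable, which is assumed here. The sketch below rebuilds the command from the job script in the original post and optionally adds `--env NCCL_ALGO=Ring`; the helper name `build_srun_command` is hypothetical, and the script only prints the command rather than running it.

```python
import shlex


def build_srun_command(nccl_algo=None):
    """Return the srun/apptainer argv from the job script.

    If nccl_algo is given (e.g. "Ring"), pass it into the container
    as NCCL_ALGO so NCCL is restricted to that algorithm.
    """
    cmd = [
        "srun", "--mpi=pspmix", "apptainer", "exec", "--nv",
        "--env", "NCCL_TOPO_DUMP_FILE=./projects/nccl/topology.xml",
        "--env", "CUDA_VISIBLE_DEVICES=0,1,2,3",
    ]
    if nccl_algo is not None:
        # Assumption: NCCL_ALGO is the intended mechanism for forcing Ring.
        cmd += ["--env", f"NCCL_ALGO={nccl_algo}"]
    cmd += [
        "./images/nccl_eval.sif",
        "./nccl-tests/build/all_reduce_perf",
        "-b", "8", "-e", "8G", "-f", "2", "-g", "1", "-t", "1",
    ]
    return cmd


ring_cmd = build_srun_command("Ring")
print(shlex.join(ring_cmd))
```

Pasting the printed line into the SLURM script in place of the original `srun` call should give a Ring-only run to compare against the default results.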
