DADT is a decentralized, asynchronous distributed training framework for PyTorch. DADT uses an asynchronous AllReduce algorithm to exchange gradients, so it combines the advantages of AllReduce and of asynchronous training. It draws on three papers: "Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training", "Asynchronous Decentralized Parallel Stochastic Gradient Descent", and "Exascale Deep Learning for Scientific Inverse Problems".
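The idea of overlapping gradient exchange with computation can be illustrated with a toy simulation. The sketch below is not DADT's implementation. It is a pure-Python model, written under the assumption that each step applies the averaged gradients from the previous step (one step of staleness) while the current gradients are "in flight". The names `allreduce_mean` and `train` are made up for this illustration.

```python
def allreduce_mean(values):
    """Stand-in for AllReduce: every rank receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train(targets, steps=200, lr=0.1):
    """Each rank r minimizes 0.5 * (w - targets[r]) ** 2 with stale averaged gradients."""
    weights = [0.0] * len(targets)  # one scalar weight per simulated rank
    in_flight = None                # reduction started in the previous step
    for _ in range(steps):
        # Local gradient of 0.5 * (w - t) ** 2 is (w - t).
        local_grads = [w - t for w, t in zip(weights, targets)]
        # Apply the result of the previous step's reduction, if there is one.
        if in_flight is not None:
            weights = [w - lr * g for w, g in zip(weights, in_flight)]
        # "Launch" the reduction of this step's gradients; it is consumed next step.
        in_flight = allreduce_mean(local_grads)
    return weights


if __name__ == "__main__":
    print(train([1.0, 3.0]))
```

In this toy model all ranks converge to the mean of their targets even though every update uses gradients that are one step old, which is the property that lets communication run in the background.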
Performance test compared with Horovod, using ResNet101 and ResNet50.
| Configuration | Horovod Throughput | DADT Throughput | Increase |
| --- | --- | --- | --- |
| ResNet101, P40, 4 GPUs | 213 | 301.8 | 41.6% |
| ResNet50, P40, 4 GPUs | 118.5 | 144 | 21.5% |
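For reference, the "Increase" column is the relative throughput gain over Horovod. A minimal sketch of the arithmetic, with `percent_increase` as a helper name chosen only for this example:

```python
def percent_increase(baseline, value):
    """Relative gain of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Throughput pairs taken from the table: (Horovod, DADT).
    for name, horovod, dadt_tp in [("ResNet101", 213, 301.8), ("ResNet50", 118.5, 144)]:
        print(name, percent_increase(horovod, dadt_tp))
```

The ResNet101 value works out to roughly 41.69%, so the table's 41.6% appears to be truncated rather than rounded.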
DADT runs on Linux.
- step 1: install CUDA and NCCL.
- step 2: install PyTorch.
- step 3: install MPI (Open MPI, Intel MPI, etc.).
- step 4: clone the project. Open file
to be True
if you want to build for GPU. Use the commands below to build and install DADT.
# set be True for build for GPU
# install command
python install
DADT depends on PyTorch and must be embedded in PyTorch code. Below is an MNIST example.
# import DADT's PyTorch bindings
import dadt.pytorch as dadt
def main(unused_argv):
    # Initialize DADT at the beginning.
    # cycle_duration_ms: a background thread performs AllReduce periodically; cycle_duration_ms controls the interval between cycles, in milliseconds.
    # broad_cast_executor: at startup, rank 0 broadcasts the weights to the other ranks; broad_cast_executor selects which executor performs the broadcast.
    #   'mpi': use MPI to broadcast
    #   'nccl': use CUDA NCCL to broadcast
    #   'mpicuda': use MPI to broadcast; the MPI build must be [GPU-aware](
    #   When training on CPU, you must choose 'mpi'.
    # all_reduce_executor: like broad_cast_executor, this flag selects which executor performs AllReduce.
    # all_reduce_buffer_size: size of the AllReduce fusion buffer; the default is 64MB, and 0 disables the buffer.
    # group_buffer_size: used to collect isolated tensors into a group.
    # Wrap the optimizer.
    optimizer = optim.Adadelta(model.parameters(),
    distribte_optimizer = dadt.DistributedOptimizer(optimizer=optimizer)
    # Train as in normal PyTorch.
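The `all_reduce_buffer_size` option described in the comments above refers to a fusion buffer: several small gradient tensors are packed together so that one AllReduce call covers many tensors. The sketch below shows one plausible grouping policy. It is an assumption made for illustration, not DADT's actual logic, and `plan_fusion_buckets` is a hypothetical helper.

```python
def plan_fusion_buckets(tensor_sizes, buffer_size):
    """Group consecutive tensors so each group's total size fits in buffer_size bytes.

    tensor_sizes: list of (name, size_in_bytes) pairs, in reduction order.
    buffer_size == 0 disables fusion, so every tensor is reduced on its own.
    A tensor larger than the buffer also gets its own group.
    """
    buckets, current, used = [], [], 0
    for name, size in tensor_sizes:
        if buffer_size == 0 or size > buffer_size:
            # Flush whatever is pending, then reduce this tensor alone.
            if current:
                buckets.append(current)
                current, used = [], 0
            buckets.append([name])
            continue
        if used + size > buffer_size:
            # The tensor does not fit into the current group; start a new one.
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets


if __name__ == "__main__":
    sizes = [("a", 30), ("b", 30), ("c", 50), ("d", 200), ("e", 10)]
    print(plan_fusion_buckets(sizes, 64))
    print(plan_fusion_buckets(sizes, 0))
```

A larger buffer means fewer, bigger AllReduce calls, at the cost of more staging memory; setting the size to 0 trades that memory for one call per tensor.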
Below is the run command using MPI; it creates 2 ranks on the same machine for training.
mpirun -np 2 python
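Each process started by `mpirun` is one rank. As a rough illustration that does not use DADT's own API, a script can tell which rank it is from environment variables set by the launcher: Open MPI exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE`, while MPICH-style launchers such as Intel MPI export `PMI_RANK` and `PMI_SIZE`. The helper name `detect_rank` is hypothetical.

```python
import os


def detect_rank(env=None):
    """Return (rank, world_size) from launcher environment variables.

    Falls back to (0, 1) when no known variables are set, e.g. when the
    script is started directly without mpirun.
    """
    env = os.environ if env is None else env
    for rank_key, size_key in (
        ("OMPI_COMM_WORLD_RANK", "OMPI_COMM_WORLD_SIZE"),  # Open MPI
        ("PMI_RANK", "PMI_SIZE"),                          # MPICH / Intel MPI
    ):
        if rank_key in env and size_key in env:
            return int(env[rank_key]), int(env[size_key])
    return 0, 1


if __name__ == "__main__":
    rank, world_size = detect_rank()
    print(f"rank {rank} of {world_size}")
```

This kind of check is handy for restricting logging or checkpoint writing to rank 0.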