- Set up the Docker environment (optional)
- docker pull nvidia/cuda:11.1.1-cudnn8-devel-ubuntu20.04
- docker run --name hipress_upgrade --gpus all --network=host --ipc=host --security-opt seccomp=unconfined --storage-opt size=50G -v /home/mark/sparse_adam:/workspace -v /data:/data -it nvidia/cuda:11.1.1-cudnn8-devel-ubuntu20.04 /bin/bash
- docker run --name hipress_upgrade --gpus all --network=host --ipc=host --security-opt seccomp=unconfined --device=/dev/infiniband/uverbs0 -v /data:/data -it hipress_upgrade_image /bin/bash
- docker run --name hipress_upgrade --gpus all --network=host --ipc=host --security-opt seccomp=unconfined --device=/dev/infiniband/uverbs0 --shm-size=1g --ulimit memlock=-1 -v /data:/data -it hipress_image /bin/bash
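The docker run variants above differ mainly in whether the InfiniBand device is passed through. As a rough sketch (not part of hipress), the helper below prints a docker run command and adds the InfiniBand-related flags from the last variant only when the device node exists. The function name build_docker_run and its parameters are hypothetical; the flags themselves are copied from the commands above.

```shell
# build_docker_run IMAGE [IB_DEVICE]
# Hypothetical helper: prints (does not run) a docker run command.
build_docker_run() {
    image="$1"
    ib_dev="${2:-/dev/infiniband/uverbs0}"
    cmd="docker run --name hipress_upgrade --gpus all --network=host --ipc=host"
    cmd="$cmd --security-opt seccomp=unconfined"
    if [ -e "$ib_dev" ]; then
        # Device node present: add the InfiniBand flags used in the last variant above.
        cmd="$cmd --device=$ib_dev --shm-size=1g --ulimit memlock=-1"
    fi
    cmd="$cmd -v /data:/data -it $image /bin/bash"
    printf '%s\n' "$cmd"
}
```

For example, build_docker_run hipress_image prints the command so it can be reviewed before being run by hand.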
- Install basic software
- apt update
- apt install python3 python3-pip cmake openmpi-bin
- During the installation above you are prompted to choose a time zone; enter 6 and then 70
- pip install numpy==1.20.3 scipy opencv-python
- apt install python3-opencv -y
- ln -s /usr/bin/python3 /usr/bin/python
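When building inside a container it can be convenient to avoid the interactive time zone prompt altogether. The sketch below uses the common approach of presetting the zone and running apt-get with DEBIAN_FRONTEND=noninteractive. It assumes that the selections 6 and 70 correspond to Asia/Shanghai; pass a different zone if that is not the case. The function name is hypothetical, and it must be run as root.

```shell
# install_base_noninteractive [TIMEZONE]
# Hypothetical helper; the default zone is an assumption (see above).
install_base_noninteractive() {
    tz="${1:-Asia/Shanghai}"
    # Preselect the time zone so tzdata has nothing to ask.
    ln -fs "/usr/share/zoneinfo/$tz" /etc/localtime
    echo "$tz" > /etc/timezone
    apt-get update
    # Same package list as above, plus tzdata, installed without prompts.
    DEBIAN_FRONTEND=noninteractive apt-get install -y \
        tzdata python3 python3-pip cmake openmpi-bin
}
```

The function only defines the steps; call install_base_noninteractive explicitly to run them.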
- Install the Horovod dependencies:
- pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
- Download the hipress code
- git clone https://github.com/mark14wu/hipress.git
- cd hipress
- git submodule init deps/torch-hipress-extension src/CCO && git submodule update
- cd src/CCO
- git submodule init && git submodule update
- Build the hipress-torch extension (optional)
- cd hipress/deps/torch-hipress-extension
- export HOROVOD_WITH_NCCL=1 HOROVOD_NCCL_HOME=/usr/local/nccl/
- bash install.sh
- Edit hipress/src/CaSync/install.sh: set export HOROVOD_WITH_PYTORCH=1 and remove export HOROVOD_WITHOUT_PYTORCH=1 (if present)
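The manual edit of install.sh can also be scripted. The following is a minimal sketch, assuming each HOROVOD_WITH_*/HOROVOD_WITHOUT_* export sits on its own line in install.sh (this has not been checked against the actual file). The function name set_horovod_framework is hypothetical. It removes any existing export for the given framework and adds the enabling export near the top of the file, so it takes effect before the build commands.

```shell
# set_horovod_framework FILE NAME   (NAME is e.g. PYTORCH or MXNET)
# Hypothetical helper; assumes one export per line.
set_horovod_framework() {
    file="$1"
    name="$2"
    tmp="$file.tmp.$$"
    awk -v line="export HOROVOD_WITH_${name}=1" \
        -v pat="^[ \t]*export[ \t]+HOROVOD_(WITH|WITHOUT)_${name}=" '
        # Put the new export right after a shebang, or else first in the file.
        NR == 1 && /^#!/ { print; print line; added = 1; next }
        NR == 1          { print line; added = 1 }
        # Drop old WITH_/WITHOUT_ exports for this framework.
        $0 ~ pat         { next }
                         { print }
        END              { if (!added) print line }
    ' "$file" > "$tmp" || return 1
    # Write back through cat so the original file mode is preserved.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Usage would be set_horovod_framework hipress/src/CaSync/install.sh PYTORCH, or the same call with MXNET for the MXNet build.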
- Build the hipress-mxnet extension (optional)
- apt install libopenblas-dev libopencv-dev
- cd deps/mxnet-1.9.0
- mkdir build
- cd build
- cmake ..
- make -j
- cd ../python
- pip install -e .
- Edit hipress/src/CaSync/install.sh: set export HOROVOD_WITH_MXNET=1 and remove export HOROVOD_WITHOUT_MXNET=1 (if present)
- Build hipress itself
- cd hipress/src/CaSync
- bash install.sh
- From a directory outside hipress/src/CaSync, run python -c "import horovod.torch"; if it prints nothing, the installation succeeded
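The import check can be wrapped in a small helper. This is only a sketch: the function name check_import is hypothetical, and it calls python3 directly rather than relying on the python symlink created earlier. It runs the import from a scratch directory, so the source tree cannot shadow the installed package, and treats "exit status 0 and no output" as success, matching the criterion above.

```shell
# check_import MODULE
# Hypothetical helper: succeeds only if the import is silent and exits 0.
check_import() {
    module="$1"
    # Run outside the source tree (here: the temp directory).
    out=$(cd "${TMPDIR:-/tmp}" && python3 -c "import $module" 2>&1)
    status=$?
    if [ "$status" -eq 0 ] && [ -z "$out" ]; then
        echo "OK: $module imports cleanly"
        return 0
    fi
    echo "FAILED: $module" >&2
    # Show whatever Python printed to help with debugging.
    [ -n "$out" ] && printf '%s\n' "$out" >&2
    return 1
}
```

For the PyTorch build this would be check_import horovod.torch. For the MXNet build, horovod.mxnet is the corresponding Horovod module.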
- Start testing!
https://github.com/mark14wu/hipress-examples
Located in the hipress-example/powersgd/hipress_mxnet folder
Install dependencies: pip install gluoncv gluoncv2
Model code: hipress-example/powersgd/hipress_mxnet/hipress_mxnet.py
Scripts for 1-8 nodes: hipress-example/powersgd/hipress_mxnet/10Gbps_VGG/
In hipress-example/powersgd/hipress_pytorch
Model code: hipress-example/powersgd/hipress_pytorch/hipress_pytorch.py
Scripts for 1-8 nodes: hipress-example/powersgd/hipress_pytorch/10Gbps_VGG_BS16/ and hipress-example/powersgd/hipress_pytorch/10Gbps_VGG_BS32/
Model code: hipress-example/powersgd/hipress_pytorch/hipress_pytorch_lstm.py
Scripts for 1-8 nodes: hipress-example/powersgd/hipress_pytorch/10Gbps_LSTM_BS80/
Model code: hipress_pytorch_ugatit/main.py
Scripts for 1-8 nodes: hipress-example/powersgd/hipress_pytorch/10Gbps_UGATIT/
In hipress-example/powersgd/torchddp
Model code: hipress-example/powersgd/torchddp/torchddp_vgg.py
Scripts for 1-8 nodes: hipress-example/powersgd/torchddp/10Gbps_VGG_BS16/
Model code: hipress-example/powersgd/torchddp/torchddp_lstm.py
Scripts for 1-8 nodes: hipress-example/powersgd/torchddp/10Gbps_LSTM_BS80/
Model code: hipress-example/powersgd/torchddp/torchddp_ugatit/main.py
Scripts for 1-8 nodes: hipress-example/powersgd/torchddp/10Gbps_UGATIT/
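Before launching anything it may help to confirm that the checkout contains the files and script directories listed above. The sketch below simply tests each path for existence. The function names are hypothetical, and the path list is copied from the listing above, leaving out hipress_pytorch_ugatit/main.py because its full location is not given.

```shell
# example_paths: the paths listed above, relative to the examples checkout.
example_paths() {
    cat <<'EOF'
powersgd/hipress_mxnet/hipress_mxnet.py
powersgd/hipress_mxnet/10Gbps_VGG
powersgd/hipress_pytorch/hipress_pytorch.py
powersgd/hipress_pytorch/10Gbps_VGG_BS16
powersgd/hipress_pytorch/10Gbps_VGG_BS32
powersgd/hipress_pytorch/hipress_pytorch_lstm.py
powersgd/hipress_pytorch/10Gbps_LSTM_BS80
powersgd/hipress_pytorch/10Gbps_UGATIT
powersgd/torchddp/torchddp_vgg.py
powersgd/torchddp/10Gbps_VGG_BS16
powersgd/torchddp/torchddp_lstm.py
powersgd/torchddp/10Gbps_LSTM_BS80
powersgd/torchddp/torchddp_ugatit/main.py
powersgd/torchddp/10Gbps_UGATIT
EOF
}

# check_example_layout ROOT
# Hypothetical helper: prints every missing path, fails if any is missing.
check_example_layout() {
    root="$1"
    missing=0
    for rel in $(example_paths); do
        if [ ! -e "$root/$rel" ]; then
            echo "missing: $root/$rel"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

Pass whichever directory holds the checkout, for example check_example_layout hipress-example.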
- The VGG scripts require tensorboardX and tqdm
- pip install tensorboardX tqdm
- You may need to install ssh: apt install ssh
- nano /etc/ssh/sshd_config and change Port 22 to the port you need (e.g. 1958)
- nano ~/.ssh/config and add the following:
- Host *
- Port 1958
- Set up authorized_keys (append id_rsa.pub to it)
- service ssh restart
- Passwordless ssh login should now work
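The ssh steps above can be sketched as two small helpers, one for the server port and one for the client config and authorized_keys. Both function names are hypothetical. The sketch assumes sshd_config already contains a Port line (possibly commented out) and leaves an existing ~/.ssh/config untouched rather than overwriting it.

```shell
# set_sshd_port FILE PORT
# Hypothetical helper: rewrite an active or commented-out Port line.
set_sshd_port() {
    file="$1"
    port="$2"
    tmp="$file.tmp.$$"
    sed -E "s/^#?[[:space:]]*Port[[:space:]]+[0-9]+.*/Port $port/" "$file" > "$tmp" || return 1
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# setup_ssh_client HOME_DIR PORT PUBKEY_FILE
# Hypothetical helper: write the client config and authorize the public key.
setup_ssh_client() {
    home_dir="$1"
    port="$2"
    pubkey="$3"
    ssh_dir="$home_dir/.ssh"
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
    if [ -e "$ssh_dir/config" ]; then
        echo "$ssh_dir/config already exists; edit it by hand" >&2
    else
        # Use the custom port for every host, as in the config above.
        printf 'Host *\n    Port %s\n' "$port" > "$ssh_dir/config"
    fi
    # Append the key only if it is not already authorized.
    touch "$ssh_dir/authorized_keys"
    grep -qxF -e "$(cat "$pubkey")" "$ssh_dir/authorized_keys" \
        || cat "$pubkey" >> "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/config" "$ssh_dir/authorized_keys"
}
```

A typical call sequence would be set_sshd_port /etc/ssh/sshd_config 1958, then setup_ssh_client "$HOME" 1958 ~/.ssh/id_rsa.pub, followed by service ssh restart.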