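+If the failure is caused by one of the issues above, restart Orion Server after fixing it so that the change takes effect.
+For Orion Client installation and configuration issues, see the [Using a local Docker container](container.md) chapter of the quick start again. For firewall issues, see the [corresponding section of this appendix](#firewall).
+# Scenario 1: Using Local GPU Resources from a Docker Container
+* A single workstation with one NVIDIA GTX 1080Ti GPU (11 GB of GPU memory)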
+* Ubuntu 16.04 LTS
+* Docker CE 18.09
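+After deploying the Orion vGPU software, we start Jupyter Notebook in an ordinary Docker container (no physical GPU is passed through into the container, and `nvidia-docker` is not required), run TensorFlow 1.12, and use Orion vGPU resources to train and run inference with the pix2pix model.
+* Orion Server has been installed as described in [Orion Server installation and deployment](README.md#server)
+* Orion Controller has been installed and started as described in [Orion Controller installation and deployment](README.md#controller)
+## **Configuring and Starting Orion Server**
+As described in [Orion Server service configuration](README.md#server-config), two kinds of properties must be set in `/etc/orion/server.conf`:
+* The data path on which Orion Server accepts Client connections, that is, `bind_addr`
+* The mode in which Orion Server runs
+### Orion Server Data Path
+The `bind_addr` property is the data path that Orion Server accepts; the Client must be able to reach this address. For a local container, the simplest approach is to pass `--net host` to `docker run` so that the container shares the host's network environment. In that case the default `bind_addr` value `` can be used as is.
+This convenience comes at the cost of network isolation between the container and the host operating system. Interested readers can refer to the last section of this scenario, [Using a separate Docker subnet](#docker-native), to use Orion vGPU resources in a container without `--net host`.
+### Orion Server Mode
+### Starting Orion Server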
+ listen_port = 9960
+ bind_addr =
+ enable_shm = "true"
+ enable_rdma = "false"
+ enable_kvm = "false"
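+The settings above can be checked from a script before the service is restarted. The following is a minimal sketch, assuming the file holds one `key = value` pair per line as in the excerpt; the helper name `conf_get` is hypothetical and not part of the Orion tooling.

```shell
# conf_get KEY FILE: print the value of KEY from a "key = value" style file.
# Hypothetical helper, not part of Orion; surrounding quotes are stripped.
conf_get() {
    awk -F= -v key="$1" '
        {
            k = $1
            gsub(/[ \t]/, "", k)                 # drop indentation around the key
            if (k == key) {
                v = $2
                gsub(/^[ \t"]+|[ \t"]+$/, "", v) # trim blanks and quotes
                print v
                exit
            }
        }' "$2"
}

# Example (run on the Orion Server host):
#   conf_get enable_shm /etc/orion/server.conf
```

+Only the first match is printed, so a key that is repeated later in the file is ignored by this sketch.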
+After changing the configuration file, start or restart the Orion Server service:
+systemctl restart oriond
+We can now use the `orion-check` tool to check the status of the Orion vGPU software:
+orion-check runtime server
+# (omit output)
+Checking Orion Server status ...
+Orion Server is running with Linux user : root
+Orion Server is running with command line : /usr/bin/oriond
+Enable SHM [Yes]
+Enable RDMA [No]
+Enable Local QEMU-KVM with SHM [No]
+Binding IP Address :
+Listening Port : 9960
+Testing the Orion Server network ...
+Orion Server can be reached through
+# (omit output)
+Orion Controller addrress is set as in configuration file. Using this address to diagnose Orion Controller
+Address is reached.
+Orion Controller Version Infomation : api_version=0.1,data_version=0.1
+There are 4 vGPU under managered by Orion Controller. 4 vGPU are free now.
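+Normally each physical GPU is virtualized into 4 Orion vGPUs, and all of them should be available.
+## **Creating Shared Memory for Communication**
+So that Orion Server and the Client application in the container can use shared memory to speed up data transfer, we create a shared memory region before starting the container and later mount it into the container with the `-v` option of `docker run`.
+The command above creates a `128MB` shared memory region `/dev/shm/orionsock0` under `/dev/shm`; you can verify it with `ls /dev/shm/`.
+* If a `/dev/shm/orionsock` shared memory region that has already been used is deleted or overwritten, the Orion Server service must be restarted;
+* If several containers run at the same time, each container must mount its own `/dev/shm/orionsock` shared memory region. These regions can be created individually with `orion-shm -i `.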
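+The one-region-per-container rule can be sketched as a small helper that builds the `-v` argument for container number N. This is only an illustration: the function name is hypothetical, and the naming scheme `/dev/shm/orionsock<N>` is an assumption extrapolated from `/dev/shm/orionsock0`.

```shell
# orion_shm_mount_arg N: print the `docker run -v` argument for container N.
# Assumption: region N lives at /dev/shm/orionsock<N>, mirroring orionsock0.
orion_shm_mount_arg() {
    printf '%s\n' "-v /dev/shm/orionsock$1:/dev/shm/orionsock$1:rw"
}

# Example: two containers, each with its own region.
#   docker run ... $(orion_shm_mount_arg 0) ...
#   docker run ... $(orion_shm_mount_arg 1) ...
```

+Mounting a single file per container, rather than the whole `/dev/shm` directory, keeps the regions of different containers separate.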
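+## **Starting the Orion Client Container**
+### Getting a Container Image with the Orion Client Runtime
+We provide Docker images that have the Orion Client runtime configured and the official, unmodified TensorFlow 1.12 preinstalled. Taking the Python 3.5 version as an example: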
+docker pull virtaitech/orion-client:tf1.12-py3
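+### Orion Client Parameters
+The following environment variables must be set for the Orion Client container:
+* `ORION_CONTROLLER=:9123`: the network address to which Orion Client sends RESTful API requests when it asks Orion Controller for Orion vGPU resources. In this scenario the container shares the host network, so `controller_ip` can simply be set to ``.
+* `ORION_VGPU`: the number of Orion vGPUs requested by each process in the container. By default, the total number of vGPUs that can be requested is at most 4 times the number of physical GPUs. In this example we set `ORION_VGPU=1`.
+* `ORION_GMEM`: the amount of GPU memory (in MB) available to each requested Orion vGPU. Since we use a single GTX 1080Ti with at most 11 GB of GPU memory, we set `ORION_GMEM=10500`.
+### Starting the Container
+As described above, when starting the container with `docker run` we use `-e` to set the environment variables introduced in the previous section and `-v` to mount the `/dev/shm/orionsock0` shared memory region created earlier into the container's `/dev/shm` directory. To make it easy to run Jupyter Notebook and train the `pix2pix` example model provided by TensorFlow, we assume that the TensorFlow repository has already been cloned into the directory where `docker run` is executed, using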
+git clone https://github.com/tensorflow/tensorflow.git
+docker run -it --rm \
+ -v /dev/shm/orionsock0:/dev/shm/orionsock0:rw \
+ -v $(pwd)/tensorflow:/root/tensorflow \
+ --net host \
+ -e ORION_VGPU=1 \
+ -e ORION_GMEM=10500 \
+ virtaitech/orion-client:tf1.12-py3
+Inside the container, `ls /dev | grep nvidia` confirms that no NVIDIA GPU device is mounted.
+Before running Jupyter Notebook, we can use `orion-check` to verify that the Orion Client container can communicate with Orion Controller:
+# From inside Orion Client container
+orion-check runtime client
+# (omit output)
+Environment variable ORION_CONTROLLER is set as Using this address to diagnose Orion Controller.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 4 vGPU under managered by Orion Controller. 4 vGPU are free now.
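+This output shows that applications inside the Orion Client container can request resources from Orion Controller. Otherwise, first check the state of Orion Controller as described in [Orion Controller installation and deployment](README.md#controller), and then check that `ORION_CONTROLLER=:9123` is set correctly as described above.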
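+A quick format check on `ORION_CONTROLLER` can rule out the simplest mistakes before digging further. The sketch below only tests that the value has the shape `<host>:<port>`; the function name is hypothetical, and it does not contact Orion Controller.

```shell
# is_host_port VALUE: succeed if VALUE looks like "<host>:<port>"
# with a non-empty host and a purely numeric port.
is_host_port() {
    case $1 in
        *:*) ;;
        *) return 1 ;;
    esac
    host=${1%:*}      # everything before the last colon
    port=${1##*:}     # everything after the last colon
    [ -n "$host" ] || return 1
    case $port in
        ''|*[!0-9]*) return 1 ;;
    esac
    return 0
}

# Example, inside the container:
#   is_host_port "$ORION_CONTROLLER" || echo "ORION_CONTROLLER is malformed"
```

+A value that passes this check can still point at the wrong machine, so `orion-check runtime client` remains the authoritative test.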
+## **Running Jupyter Notebook**
+# From inside Orion Client container
+cd tensorflow/tensorflow/contrib/eager/python/examples/
+jupyter notebook --no-browser --allow-root
+To access the notebook, open this file in a browser:
+ file:///root/.local/share/jupyter/runtime/nbserver-26-open.html
+Or copy and paste one of these URLs:
+ http://localhost:8888/?token=
+ssh -Nf -L 8888:localhost:8888
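+Then enter the address in a local browser to open Jupyter Notebook.
+## **Training and Inference with the pix2pix Model in TensorFlow 1.12 Eager Execution Mode**
+As the figure shows, the process that actually uses the physical GPU is the Orion Server process `oriond`, not the Python script running the TensorFlow training job in the container. This is because applications in the container use Orion vGPU resources, and access to the physical GPU is handled entirely by Orion Server.
+## (Optional) Using a Separate Docker Subnet
+This section shows how to set the parameters when the container uses a separate Docker subnet. We assume the reader has already completed the whole workflow described earlier in this chapter for using Orion vGPU with a container started with `--net host`, so this section lists only the differences.
+### Orion Server Data Path
+The key to setting the data path `bind_addr` is making sure that Orion Client can exchange data with Orion Server through this address. In this scenario, applications in the container can by default reach only the Docker subnet, so `bind_addr` must be set to the gateway of the Docker subnet.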
+docker0 Link encap:Ethernet HWaddr 02:42:46:9f:27:13
+ inet addr: Bcast: Mask:
+ inet6 addr: fe80::42:46ff:fe9f:2713/64 Scope:Link
+ RX packets:416541 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:652846 errors:0 dropped:0 overruns:0 carrier:0
+ collisions:0 txqueuelen:0
+ RX bytes:24865042 (24.8 MB) TX bytes:3116550526 (3.1 GB)
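+The gateway address can also be pulled out of this output by a script. The sketch below assumes the older net-tools `ifconfig` layout shown above (`inet addr:<address>`); the function name is hypothetical, and the address used in the example comment is only a commonly seen Docker default, not a value taken from this setup.

```shell
# inet_addr_from_ifconfig: read `ifconfig <iface>` output on stdin and
# print the first IPv4 address found after "inet addr:".
inet_addr_from_ifconfig() {
    sed -n 's/.*inet addr:\([0-9.][0-9.]*\).*/\1/p' | head -n 1
}

# Example (on the host):
#   ifconfig docker0 | inet_addr_from_ifconfig    # e.g. 172.17.0.1
```

+Newer `ifconfig` versions print `inet <address>` without `addr:`, in which case the pattern has to be adapted.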
+### Orion Server Configuration Example
+ listen_port = 9960
+ bind_addr =
+ enable_shm = "true"
+ enable_rdma = "false"
+ enable_kvm = "false"
+systemctl restart oriond
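+### Orion Client Parameters
+Here we set the Orion Controller address to the Docker subnet gateway address, `ORION_CONTROLLER=`, so that resource requests sent by applications in the container reach Orion Controller.
+### Running the Container
+In addition to the changes above, port `8888` must be published with `-p 8888:8888` so that Jupyter Notebook can be opened in a browser outside the container. The final command for starting the container is: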
+docker run -it --rm \
+ -v /dev/shm/orionsock0:/dev/shm/orionsock0:rw \
+ -v $(pwd)/tensorflow:/root/tensorflow \
+ -p 8888:8888 \
+ -e ORION_VGPU=1 \
+ -e ORION_GMEM=10500 \
+ virtaitech/orion-client:tf1.12-py3
+As before, we check from inside the container that resources can be requested from Orion Controller:
+# From inside Orion Client container
+orion-check runtime client
+# (omit output)
+Environment variable ORION_CONTROLLER is set as Using this address to diagnose Orion Controller.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 4 vGPU under managered by Orion Controller. 4 vGPU are free now.
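+This output shows that applications inside the Orion Client container can request resources from Orion Controller. Otherwise, first check the state of Orion Controller as described in [Orion Controller installation and deployment](README.md#controller), and then check that `ORION_CONTROLLER=:9123` is set correctly as described above.
+## **Running Jupyter Notebook**
+We assume the tensorflow folder has been mounted into the container. In this setup Jupyter Notebook must be started with an explicit `--ip=`,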
+cd tensorflow/tensorflow/contrib/eager/python/examples/
+jupyter notebook --ip= --no-browser --allow-root
+ssh -Nf -L 8888:localhost:8888
+so that Jupyter Notebook can finally be opened in a browser at `localhost:8888/?token=`.
\ No newline at end of file
+# Scenario 2: Using Local GPU Resources from a KVM Virtual Machine
+* A single server with two NVIDIA Tesla V100 cards, each with 16 GB of GPU memory
+* Ubuntu Server 16.04 LTS
+* Docker CE 18.09
+* libvirt 1.3.1
+* QEMU 2.5.0
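+We use a virtual machine named `ubuntu-client0`, running Ubuntu 16.04, as the Orion Client. No physical GPU of the host is passed through to this VM, and neither the NVIDIA driver nor any CUDA component is installed in it. We installed the required Python 3 libraries and the GPU build of TensorFlow 1.12: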
+# From inside VM
+sudo apt install python3-dev python3-pip
+sudo pip3 install tensorflow-gpu==1.12.0
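+Because the VM cannot access a GPU and has no NVIDIA software stack, TensorFlow cannot be used at this point. Once the Orion vGPU software is configured, TensorFlow in the VM can use Orion vGPUs for model training and inference.
+After deploying the Orion vGPU software, we run the official TensorFlow CIFAR10_Estimator example in `ubuntu-client0`, using two Orion vGPUs (each on a different physical GPU) for model training and inference.
+* Orion Server has been installed as described in [Orion Server installation and deployment](README.md#server)
+* Orion Controller has been installed and started as described in [Orion Controller installation and deployment](README.md#controller)
+## **Configuring and Starting Orion Server**
+Before starting the Orion Server service, we edit the configuration file to set the data path and enable Orion Server's KVM support.
+### Data Path
+The `bind_addr` property is the data path that Orion Server accepts; the Client must be able to reach this address. For a KVM virtual machine, set it to the gateway address of the VM's network.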
+# From host OS
+sudo virsh domifaddr ubuntu-client0
+ Name MAC address Protocol Address
+ vnet1 52:54:00:04:82:10 ipv4
+ Name MAC address Protocol Address
+ vnet0 52:54:00:04:82:10 ipv4
+ vnet1 52:54:00:c5:43:10 ipv4
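+### Orion Server Mode
+In this scenario we again use local shared memory to speed up data transfer, so we set `enable_shm=true` and `enable_rdma=false`. In addition, we explicitly enable the Orion vGPU software's support for KVM virtual machines by setting `enable_kvm=true`.
+### Orion Server Configuration Example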
+ listen_port = 9960
+ bind_addr =
+ enable_shm = "true"
+ enable_rdma = "false"
+ enable_kvm = "true"
+### Starting Orion Server
+We restart Orion Server so that the new configuration takes effect, and use `orion-check` to confirm that Orion Server and Orion Controller can talk to each other:
+# From host OS
+sudo systemctl restart oriond
+sudo orion-check runtime server
``` bash
+Searching NVIDIA GPU ...
+CUDA driver 418.67
+418.67 is installed.
+2 NVIDIA GPUs are found :
+ 0 : Tesla V100-PCIE-16GB
+ 1 : Tesla V100-PCIE-16GB
+Checking NVIDIA MPS ...
+Checking Orion Server status ...
+Orion Server is running with Linux user : root
+Orion Server is running with command line : /usr/bin/oriond
+Enable SHM [Yes]
+Enable RDMA [No]
+Enable Local QEMU-KVM with SHM [Yes]
+Binding IP Address :
+Listening Port : 9960
+Testing the Orion Server network ...
+Orion Server can be reached through
+Checking Orion Controller status ...
+[Info] Orion Controller setting may be different in different SHELL.
+[Info] Environment variable ORION_CONTROLLER has the first priority.
+Orion Controller addrress is set as in configuration file. Using this address to diagnose Orion Controller
+Address is reached.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
```
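+As shown, the Orion Server node has two Tesla V100 cards, which Orion Controller has virtualized into 8 Orion vGPUs in total.
+## Installing the Orion Client Runtime in the VM
+### Installing to the Default Path
+In the VM, run the Orion Client installer: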
+# From inside VM
+sudo ./install-client
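+Since no installation path was specified, the installer asks whether to install the Orion Client runtime to the default path `/usr/lib/orion`. After the user confirms, the installer uses the `ldconfig` mechanism to add the Orion Client runtime to the operating system's dynamic library search path.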
+Orion client environment will be installed to /usr/lib/orion
+Do you want to continue [n/y] ?y
+Configuration file is generated to /etc/orion/client.conf
+Please edit the "controller_addr" setting and make it point to the controller address in your environment.
+Orion vGPU client environment has been installed in /usr/lib/orion
+To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
+export LD_LIBRARY_PATH=/usr/lib/orion:$LD_LIBRARY_PATH
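+Because the installer has already configured the search path, the `export LD_LIBRARY_PATH=:$LD_LIBRARY_PATH` suggested on screen is not required.
+### (Optional) Installing to a Custom Path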
+# From inside VM
+sudo mkdir -p $INSTALLATION_PATH
+sudo ./install-client -d $INSTALLATION_PATH
+In this case the installer installs the Orion Client runtime directly to the user-specified path `INSTALLATION_PATH=/orion` and prints the following messages:
+Configuration file is generated to /etc/orion/client.conf
+Please edit the "controller_addr" setting and make it point to the controller address in your environment.
+Orion vGPU client environment has been installed in /orion
+To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
+Before running an application in a terminal, make sure the Orion Client runtime is on the operating system's dynamic library search path:
+# From current working terminal inside VM
+export LD_LIBRARY_PATH=/usr/local/orion:$LD_LIBRARY_PATH
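+Note that this command only affects the current terminal. For convenience, append the statement to the end of `~/.bashrc` and run `source ~/.bashrc`; after that it no longer has to be set again each time you log in to the VM.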
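+Appending to `~/.bashrc` by hand tends to leave duplicate lines when the step is repeated. A minimal sketch of an idempotent append follows; the helper name `append_once` is hypothetical, and the example path simply reuses the one from the `export` command above.

```shell
# append_once LINE FILE: add LINE to FILE unless an identical line is
# already present. Assumes FILE, if it exists, ends with a newline.
append_once() {
    touch "$2"
    grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Example:
#   append_once 'export LD_LIBRARY_PATH=/usr/local/orion:$LD_LIBRARY_PATH' ~/.bashrc
#   . ~/.bashrc
```

+The single quotes in the example keep `$LD_LIBRARY_PATH` unexpanded, so the variable is resolved each time the shell starts rather than once at append time.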
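+## Orion Client Parameters
+As described in [Using a Docker container](container.md), the Orion Client must send requests for Orion vGPU resources to Orion Controller. In the container environment we set the Orion Controller address with the `ORION_CONTROLLER=:9123` environment variable when starting the container. For a KVM virtual machine, we can edit `/etc/orion/client.conf` instead.
+Since Orion Controller listens on `` on the host, we simply set `controller_addr` to the gateway address of the VM subnet: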
+ controller_addr =
+# From inside VM
+orion-check runtime client
+If the Orion Client VM can connect to Orion Controller, the output is:
+# (omit output)
+Orion Controller addrress is set as in configuration file. Using this address to diagnose Orion Controller
+Address is reached.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
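+## Running the Official TF CIFAR10_Estimator Example
+Before running the application, we use environment variables to specify the number of Orion vGPUs and the amount of GPU memory the application requests from Orion Controller: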
+export ORION_VGPU=2
+export ORION_GMEM=12000
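+Each of our Tesla V100 cards has 16 GB of GPU memory, so if `ORION_GMEM` is set below 8 GB, the two Orion vGPUs are scheduled onto the same physical GPU. Here we set the Orion vGPU memory to 12000 MB, so the two Orion vGPUs are scheduled onto two different physical GPUs, which lets us demonstrate dual-GPU training.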
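+The capacity reasoning in this paragraph can be written down as a one-line check. This is a simplified model, not Orion's actual scheduler: it assumes placement depends only on whether the requested memory fits, and it takes 16 GB as 16384 MB; the function name is hypothetical.

```shell
# fits_on_one_gpu PHYS_MB GMEM_MB COUNT: succeed if COUNT vGPUs of
# GMEM_MB each fit into a single physical GPU with PHYS_MB of memory.
fits_on_one_gpu() {
    [ $(( $2 * $3 )) -le "$1" ]
}

# Example:
#   fits_on_one_gpu 16384 12000 2 || echo "two 12000 MB vGPUs need two physical GPUs"
```

+With `ORION_GMEM=12000` the check fails for two vGPUs on one card, which matches the placement on two physical GPUs described above.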
+Next we use the official TensorFlow CIFAR10 Estimator example to demonstrate model training and inference:
+First, `git clone` the official TF models repo:
+# From inside VM
+git clone https://github.com/tensorflow/models
+Then enter the CIFAR10 Estimator folder
+cd models/tutorials/image/cifar10_estimator/
+mkdir data
+python3 generate_cifar10_tfrecords.py --data-dir ./data
+user@ubuntu-client0:~/models/tutorials/image/cifar10_estimator/data$ ls
+cifar-10-batches-py cifar-10-python.tar.gz eval.tfrecords train.tfrecords validation.tfrecords
+Next we train the model on two Orion vGPUs, with a batch_size of 128 per Orion vGPU and 256 in total:
+python3 cifar10_main.py \
+ --data-dir=${PWD}/data \
+ --job-dir=/tmp/cifar10 \
+ --variable-strategy=GPU \
+ --num-gpus=2 \
+ --train-steps=10000 \
+ --train-batch-size=256 \
+ --learning-rate=0.1
+VirtaiTech Resource. Build-cuda-7675815-20190624_081551
+2019-06-25 15:43:43.493814: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2019-06-25 15:43:43.493882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
+name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
+pciBusID: 0000:00:09.0
+totalMemory: 11.72GiB freeMemory: 11.72GiB
+2019-06-25 15:43:43.604945: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2019-06-25 15:43:43.605002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
+name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
+pciBusID: 0000:00:09.0
+totalMemory: 11.72GiB freeMemory: 11.72GiB
+2019-06-25 15:43:43.606527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
+2019-06-25 15:43:43.606568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
+2019-06-25 15:43:43.606577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
+2019-06-25 15:43:43.606582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
+2019-06-25 15:43:43.606589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
+2019-06-25 15:43:43.606657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 11400 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
+2019-06-25 15:43:43.607202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 11400 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
+# (omit output)
+INFO:tensorflow:global_step/sec: 14.2797
+INFO:tensorflow:loss = 0.48649728, step = 9900 (7.003 sec)
+INFO:tensorflow:learning_rate = 0.1, loss = 0.48649728 (7.003 sec)
+INFO:tensorflow:Average examples/sec: 3639.99 (4009.58), step = 9900
+INFO:tensorflow:Average examples/sec: 3640.07 (3717.26), step = 9910
+INFO:tensorflow:Average examples/sec: 3640.09 (3655.01), step = 9920
+INFO:tensorflow:Average examples/sec: 3640.31 (3873.63), step = 9930
+INFO:tensorflow:Average examples/sec: 3640.45 (3788.08), step = 9940
+INFO:tensorflow:Average examples/sec: 3640.79 (4017.58), step = 9950
+INFO:tensorflow:Average examples/sec: 3641.19 (4089.74), step = 9960
+INFO:tensorflow:Average examples/sec: 3641.23 (3679.08), step = 9970
+INFO:tensorflow:Average examples/sec: 3641.43 (3847.37), step = 9980
+INFO:tensorflow:Average examples/sec: 3641.4 (3615.53), step = 9990
+INFO:tensorflow:Saving checkpoints for 10000 into /tmp/cifar10/model.ckpt.
+INFO:tensorflow:Loss for final step: 0.46667284.
+# (omit output)
+INFO:tensorflow:Evaluation [100/100]
+INFO:tensorflow:Finished evaluation at 2019-06-25-08:06:14
+INFO:tensorflow:Saving dict for global step 10000: accuracy = 0.7628, global_step = 10000, loss = 1.0683168
+INFO:tensorflow:Saving 'checkpoint_path' summary for global step 10000: /tmp/cifar10/model.ckpt-10000
+2019-06-25 16:06:15 [INFO] Client exits with allocation ID fda38164-711b-4809-9984-2759a3a2165b
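+* When the application starts, the Orion Client runtime prints the log line `VirtaiTech Resource. Build-cuda-xxx`. This line shows that the application has loaded the Orion Client runtime.
+* When the application exits, the Orion Client runtime prints the log line `Client exits with allocation ID xxx`. This line shows that the application obtained Orion vGPU resources from Orion Controller during its lifetime and released them on exit.
+* At startup TensorFlow detected two GPUs, each with 11.72 GiB of memory (corresponding to our setting `ORION_GMEM=12000`)
+* Access to the physical GPUs is handled entirely by the Orion Server process `oriond`
+* The two Orion vGPUs were scheduled onto two physical GPUs
+* We limited the GPU memory used by each Orion vGPU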
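+The two log lines described above can be checked automatically when the application output is saved to a file. The sketch below is an illustration only: the function name and the three status words it prints are hypothetical, while the matched strings are the ones quoted in this section.

```shell
# orion_client_log_status LOGFILE: classify a saved application log.
#   runtime-not-loaded  no "VirtaiTech Resource. Build-" banner was printed
#   vgpu-allocated      the client exited with an allocation ID
#   no-allocation       banner present, but no allocation ID on exit
orion_client_log_status() {
    if ! grep -q 'VirtaiTech Resource\. Build-' "$1"; then
        echo "runtime-not-loaded"
    elif grep -q 'Client exits with allocation ID' "$1"; then
        echo "vgpu-allocated"
    else
        echo "no-allocation"
    fi
}

# Example:
#   python3 cifar10_main.py ... 2>&1 | tee run.log
#   orion_client_log_status run.log
```

+A `no-allocation` result can also mean that the application is still running, so the check is most useful after the process has exited.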
\ No newline at end of file
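+# Scenario 3: Using Remote GPU Resources from a Node without a GPU
+* Two servers: one (`server1`) has two NVIDIA Tesla V100 cards with 16 GB of GPU memory each; the other (`server0`) has no physical GPU
+* Both servers have a Mellanox ConnectX-5 25Gb NIC and are connected through an optical switch
+* MLNX_OFED_LINUX04.5.2 driver and user-space libraries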
+* Ubuntu Server 16.04 LTS
+We deploy both Orion Server and Orion Controller on `server1`, which has the physical GPUs, and use `server0` as the Orion Client. On `server0` we installed the required Python 3 libraries and the GPU build of TensorFlow 1.12:
+# On server0 (which has no GPU)
+sudo apt install python3-dev python3-pip
+sudo pip3 install tensorflow-gpu==1.12.0
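+`server0` has no physical GPU and no NVIDIA software stack, so TensorFlow cannot use GPU acceleration for training. Once the Orion vGPU software is deployed, TensorFlow on `server0` can accelerate model training through Orion vGPUs. We run the TensorFlow official benchmark in two settings: randomly generated data and the real Imagenet data.
+* Orion Server has been installed on `server1` as described in [Orion Server installation and deployment](README.md#server)
+* Orion Controller has been installed and started on `server1` as described in [Orion Controller installation and deployment](README.md#controller)
+## Orion Server Configuration
+Before starting the Orion Server service on `server1`, we edit the configuration file to set the data path so that the Orion Client on `server0` can connect to Orion Server. We also turn off the default shared-memory mode and enable the RDMA channel.
+### Data Path
+Because we use RDMA to speed up data exchange, `bind_addr` should be an address that the Orion Client can reach on the RDMA network segment. In this environment the two servers are on the RDMA segment `192.168.25.xxx`, so we set `bind_addr` to the RDMA-segment address `` of `server1`, where Orion Server runs.
+### Orion Server Mode
+We use RDMA to speed up data transfer, so we set `enable_shm=false` and `enable_rdma=true`. In addition, our Orion Client is not a local KVM virtual machine, so we set `enable_kvm=false`.
+### Orion Server Configuration Example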
+ listen_port = 9960
+ bind_addr =
+ enable_shm = "false"
+ enable_rdma = "true"
+ enable_kvm = "false"
+### Starting Orion Server
+We restart Orion Server so that the new configuration takes effect, and check its status:
+# From server1
+sudo systemctl restart oriond
+sudo orion-check runtime server
+Searching NVIDIA GPU ...
+CUDA driver 418.67
+418.67 is installed.
+2 NVIDIA GPUs are found :
+ 0 : Tesla V100-PCIE-16GB
+ 1 : Tesla V100-PCIE-16GB
+Checking NVIDIA MPS ...
+Checking Orion Server status ...
+Orion Server is running with Linux user : root
+Orion Server is running with command line : /usr/bin/oriond
+Enable SHM [No]
+Enable RDMA [Yes]
+Enable Local QEMU-KVM with SHM [No]
+Binding IP Address :
+Listening Port : 9960
+Testing the Orion Server network ...
+Orion Server can be reached through
+Checking Orion Controller status ...
+[Info] Orion Controller setting may be different in different SHELL.
+[Info] Environment variable ORION_CONTROLLER has the first priority.
+Orion Controller addrress is set as in configuration file. Using this address to diagnose Orion Controller
+Address is reached.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
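+This shows that Orion Controller has virtualized the two physical GPUs into 8 Orion vGPUs in total, all of which are currently available.
+## Installing the Orion Client Runtime
+### Installing to the Default Path
+On `server0`, run the Orion Client installer: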
+# From inside server0
+sudo ./install-client
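+Since no installation path was specified, the installer asks whether to install the Orion Client runtime to the default path `/usr/lib/orion`. After the user confirms, the installer uses the `ldconfig` mechanism to add the Orion Client runtime to the operating system's dynamic library search path.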
+Orion client environment will be installed to /usr/lib/orion
+Do you want to continue [n/y] ?y
+Configuration file is generated to /etc/orion/client.conf
+Please edit the "controller_addr" setting and make it point to the controller address in your environment.
+Orion vGPU client environment has been installed in /usr/lib/orion
+To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
+export LD_LIBRARY_PATH=/usr/lib/orion:$LD_LIBRARY_PATH
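+Because the installer has already configured the search path, the `export LD_LIBRARY_PATH=:$LD_LIBRARY_PATH` suggested on screen is not required.
+### (Optional) Installing to a Custom Path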
+# From inside server0
+sudo mkdir -p $INSTALLATION_PATH
+sudo ./install-client -d $INSTALLATION_PATH
+In this case the installer installs the Orion Client runtime directly to the user-specified path `INSTALLATION_PATH=/orion` and prints the following messages:
+Configuration file is generated to /etc/orion/client.conf
+Please edit the "controller_addr" setting and make it point to the controller address in your environment.
+Orion vGPU client environment has been installed in /orion
+To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
+Before running an application in a terminal, make sure the Orion Client runtime is on the operating system's dynamic library search path:
+# From current working terminal inside server0
+export LD_LIBRARY_PATH=/usr/local/orion:$LD_LIBRARY_PATH
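+Note that this command only affects the current terminal. For convenience, append the statement to the end of `~/.bashrc` and run `source ~/.bashrc`; after that it no longer has to be set again when logging in to `server0` as the current user.
+## Orion Client Parameters
+The Orion Client must send requests for Orion vGPU resources to Orion Controller. We can edit `/etc/orion/client.conf` to configure this.
+Since Orion Controller listens on `` on `server1`, `controller_addr` can be set to any IP address of `server1` that is reachable from `server0`. We could again use the RDMA-segment address ``, or a TCP address `` of `server1`. We use the latter as the example.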
+ controller_addr =
+# From inside server0
+orion-check runtime client
+If the Orion Client `server0` can connect to Orion Controller, the output is:
+# (omit output)
+Orion Controller addrress is set as in configuration file. Using this address to diagnose Orion Controller
+Address is reached.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
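+## Running the TF Official Benchmark
+Before running the application on `server0`, we use environment variables to specify the number of Orion vGPUs and the amount of GPU memory the application requests from Orion Controller: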
+export ORION_VGPU=2
+export ORION_GMEM=15500
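+Each of our Tesla V100 cards has 16 GB of GPU memory, so if `ORION_GMEM` is set below 8 GB, the two Orion vGPUs are scheduled onto the same physical GPU. Here we set the Orion vGPU memory to 15500 MB, so the two Orion vGPUs are scheduled onto two different physical GPUs, which lets us demonstrate dual-GPU training.
+First, clone the TF official benchmark repo: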
+# From inside server0
+git clone --branch=cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks.git
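+The TF official benchmark supports two modes: randomly generated data, or the Imagenet dataset converted to TFRecord format. We cover both.
+### Using Randomly Generated Data (Synthetic Data)
+The following command trains the inception_v3 model on two Orion vGPUs, with batch_size=128 per vGPU and a total batch_size of 256.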
+python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
+ --data_name=imagenet \
+ --model=inception3 \
+ --optimizer=rmsprop \
+ --num_batches=500 \
+ --num_gpus=2 \
+ --batch_size=128
+VirtaiTech Resource. Build-cuda-7675815-20190624_081551
+2019-06-25 19:55:37.099719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
+name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
+pciBusID: 0000:d9:00.0
+totalMemory: 15.14GiB freeMemory: 15.14GiB
+2019-06-25 19:55:37.218239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
+name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
+pciBusID: 0000:d9:00.0
+totalMemory: 15.14GiB freeMemory: 15.14GiB
+2019-06-25 19:55:37.222562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
+2019-06-25 19:55:37.222765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
+2019-06-25 19:55:37.222795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
+2019-06-25 19:55:37.222815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
+2019-06-25 19:55:37.222831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
+2019-06-25 19:55:37.222994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14725 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d9:00.0, compute capability: 7.0)
+2019-06-25 19:55:37.225850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14725 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d9:00.0, compute capability: 7.0)
+TensorFlow: 1.12
+Model: inception3
+Dataset: imagenet (synthetic)
+Mode: BenchmarkMode.TRAIN
+SingleSess: False
+Batch size: 256 global
+ 128.0 per device
+Num batches: 500
+Num epochs: 0.10
+Devices: ['/gpu:0', '/gpu:1']
+Data format: NCHW
+Optimizer: rmsprop
+Variables: parameter_server
+Generating training model
+# (omit output)
+Running warm up
+Done warm up
+Step Img/sec total_loss
+1 images/sec: 442.6 +/- 0.0 (jitter = 0.0) 7.416
+# (omit output)
+490 images/sec: 434.6 +/- 1.2 (jitter = 19.5) 7.384
+500 images/sec: 435.1 +/- 1.2 (jitter = 19.9) 7.378
+total images/sec: 435.00
+2019-06-25 20:01:06 [INFO] Client exits with allocation ID b928be93-0b40-4252-b6b4-291ca4c99462
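+* When the application starts, the Orion Client runtime prints the log line `VirtaiTech Resource. Build-cuda-xxx`. This line shows that the application has loaded the Orion Client runtime.
+* When the application exits, the Orion Client runtime prints the log line `Client exits with allocation ID xxx`. This line shows that the application obtained Orion vGPU resources from Orion Controller during its lifetime and released them on exit.
+* At startup TensorFlow detected two GPUs, each with 15.14 GiB of memory (corresponding to our setting `ORION_GMEM=15500`)
+* Access to the physical GPUs is handled entirely by the Orion Server process `oriond`
+* The two Orion vGPUs were scheduled onto two physical GPUs
+* We limited the GPU memory used by each Orion vGPU
+### (Optional) Using the Imagenet Dataset in TFRecord Format
+The following command trains the inception_v3 model on the real Imagenet dataset with the `rmsprop` optimizer on two Orion vGPUs, with batch_size=128 per vGPU and a total batch_size of 256. We train for 5 full epochs over the entire training set, and checkpoints written during training are stored in the `./train_dir` directory.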
+python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
+ --data_dir=$IMAGENET_DIR \
+ --data_name=imagenet \
+ --print_training_accuracy=True \
+ --train_dir=./train_dir \
+ --save_model_steps=1000 \
+ --eval_during_training_every_n_steps=5000 \
+ --save_summaries_steps=1000 \
+ --summary_verbosity=3 \
+ --model=inception3 \
+ --optimizer=rmsprop \
+ --num_epochs=5 \
+ --num_gpus=2 \
+ --batch_size=128
+VirtaiTech Resource. Build-cuda-7675815-20190624_081551
+# (omit output)
+TensorFlow: 1.12
+Model: inception3
+Dataset: imagenet
+Mode: BenchmarkMode.TRAIN_AND_EVAL
+SingleSess: False
+Batch size: 256 global
+ 128.0 per device
+Num batches: 25022
+Num epochs: 5.00
+Devices: ['/gpu:0', '/gpu:1']
+Data format: NCHW
+Optimizer: rmsprop
+Variables: parameter_server
+Generating training model
+# (omit output)
+Running warm up
+Done warm up
+Step Img/sec total_loss top_1_accuracy top_5_accuracy
+1 images/sec: 318.7 +/- 0.0 (jitter = 0.0) 7.415 0.004 0.004
+10 images/sec: 357.5 +/- 7.6 (jitter = 15.8) 7.364 0.000 0.000
+# (omit output)
+1490 images/sec: 370.6 +/- 0.3 (jitter = 7.9) 6.736 0.008 0.043
+1500 images/sec: 370.6 +/- 0.3 (jitter = 7.9) 6.654 0.020 0.070
+# (omit output)
+24990 images/sec: 368.8 +/- 0.1 (jitter = 7.8) 3.411 0.352 0.613
+25000 images/sec: 368.8 +/- 0.1 (jitter = 7.8) 3.493 0.344 0.629
+Running evaluation at global_step 25010
+# (omit output)
+Accuracy @ 1 = 0.3692 Accuracy @ 5 = 0.6404 [249856 examples]
+2019-06-26 01:49:26 [INFO] Client exits with allocation ID a2062e12-8199-4515-a8ec-59dfe9723b4d
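+The `Num batches: 25022` value in the log can be reproduced with integer arithmetic. The sketch below assumes the ImageNet-1k training set size of 1,281,167 images and rounds down to whole steps; that the benchmark computes the value exactly this way is inferred from the matching number, not taken from its source code.

```shell
# Training set size (ImageNet-1k) and the batch sizes used in the command above.
IMAGENET_TRAIN_IMAGES=1281167
PER_VGPU_BATCH=128
NUM_VGPUS=2
GLOBAL_BATCH=$(( PER_VGPU_BATCH * NUM_VGPUS ))

# num_batches EPOCHS: whole training steps needed to cover EPOCHS epochs.
num_batches() {
    echo $(( IMAGENET_TRAIN_IMAGES * $1 / GLOBAL_BATCH ))
}

num_batches 5    # prints 25022
```

+At roughly 369 images/sec, as reported in the log, 25022 steps of 256 images take close to five hours, which is consistent with the timestamps of the run.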
+# From inside server0
+tensorboard --logdir ./train_dir
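+## Appendix: Confirming That the Orion Platform Runs in RDMA Mode
+If the RDMA driver on the Orion Server or Orion Client side does not work properly, or if Orion Server's `bind_addr` is not set to an address on the RDMA segment, the Orion vGPU software automatically switches to TCP mode for data transfer, and performance drops noticeably. We therefore check the Orion Server log to confirm that RDMA mode is active.
+On `server1`, Orion Server writes its logs to `/var/log/orion/session`. Change to this directory and use `ll -rt` to find the most recent log file: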
+# From inside server1
+cd /var/log/orion/session
+ll -rt
``` bash
+3686900184038864: 2019-06-25 20:24:52 [INFO] Resource successfully confirmed with controller
+3686900190516972: 2019-06-25 20:24:52 [INFO] Resource successfully confirmed with controller
+3686900190589780: 2019-06-25 20:24:52 [INFO] Final virtual gpu list : 0:0:15500,0:1:15500, begin to initialize CUDA device manager
+3686904167685066: 2019-06-25 20:24:54 [INFO] Registered v-GPU 0 on p-GPU 0.
+3686904167700354: 2019-06-25 20:24:54 [INFO] Registered v-GPU 0 on p-GPU 1.
+3686904167738402: 2019-06-25 20:24:54 [INFO] Architecture initialization is done. Resource is confirmed.
+3686904331240842: 2019-06-25 20:24:54 [INFO] Client supports RDMA mode, then server runs in RDMA mode.
+3686904345514000: 2019-06-25 20:24:54 [INFO] Launching workers ...
```
+[INFO] Client supports TCP mode, then server also falls back to TCP mode.
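+If this line appears in the log instead, the Orion platform has fallen back to TCP mode; check the RDMA environment and the Orion Server data path settings.
+If both `enable_shm` and `enable_rdma` are `false` in the configuration file when Orion Server starts, the Orion vGPU software works in TCP mode by default, and the Orion Server log does not contain the `Client supports XXX mode...` line.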
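+The log check described in this appendix can be scripted. The following sketch looks for the two messages quoted above in a single session log; the function name and the printed labels are hypothetical.

```shell
# orion_transport_mode LOGFILE: report which transport a session log shows.
# Prints RDMA, TCP, or unknown (neither message present, e.g. when both
# enable_shm and enable_rdma are false).
orion_transport_mode() {
    if grep -q 'server runs in RDMA mode' "$1"; then
        echo RDMA
    elif grep -q 'falls back to TCP mode' "$1"; then
        echo TCP
    else
        echo unknown
    fi
}

# Example (on server1), using the newest file in the session log directory:
#   latest=$(ls -t /var/log/orion/session | head -n 1)
#   orion_transport_mode "/var/log/orion/session/$latest"
```

+If a file happened to contain both messages, this sketch would report `RDMA`, so it is best applied to the log of one session at a time.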
\ No newline at end of file
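+# Frequently Asked Questions
+This chapter answers questions users may run into while reading and following the Quick Start. For a more complete list of common questions, see the [relevant part of the user guide](../Orion-User-Guide.md#常见问题).
+## Common Installation and Deployment Problems
+## Runtime Failures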
+VirtaiTech=>with/without allocation id
+with allocation id: fail to connect to server
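+* The CUDA environment on the GPU node is misconfigured (or was installed from `deb` packages)
+* The `CUDA_HOME` environment variable was not specified during installation
+* Orion Controller cannot connect to the `etcd` service already present in the system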
+* [INFO] Client exits without allocation ID.
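+* The Orion Server bind address is wrong
+* ORION_CONTROLLER is set incorrectly on the Orion Client (or client.conf is wrong)
+* The ORION_VGPU environment variable is not set on the Orion Client
+* The Orion Client cannot communicate with Orion Controller and Orion Server because of firewall settings
+* SHM is not mounted inside the container
+* Several containers use the same SHM
+* The whole /dev/shm directory was mounted into the container
+* SHM was deleted by mistake and the server was not restarted
+* Controller was killed, and the server was not restarted after Controller came back
+* `/etc/orion/server.conf` was modified but `server` was not restarted
+## Orion Client Status Check
+## **Resource Allocation Errors**
+## **GPU Memory Quota**
+## **MPS Errors**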
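+# Docker Images
+We provide several images with the Orion Client Runtime plus TensorFlow or PyTorch installed. In these images:
+* TensorFlow 1.12 is installed directly from the `pip` index
+* PyTorch 1.1.0 is built directly from the official source code
+* The operating system inside every image is `Ubuntu 16.04`
+* Some images also include the `MLNX_OFED 4.5.1` RDMA driver
+The Dockerfiles in this repo correspond to the official [Docker Hub Registry](https://hub.docker.com/r/virtaitech/orion-client) of the Orion vGPU software.
+* The `install-client` installer
+* The MLNX_OFED 4.5- driver
+* The wheel package built from the PyTorch source code
+must be placed in the path by the user before `docker build` can succeed.
+## [TensorFlow 1.12 Base Image](./client-tf1.12-base)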
+docker pull virtaitech/orion-client:tf1.12-base
+This image installs the official TensorFlow with `pip3 install tensorflow-gpu==1.12` and then installs the Orion Client runtime with the `install-client` installer.
+## [TensorFlow 1.12 with MLNX Driver, Python 3.5](./client-tf1.12-py3)
+docker pull virtaitech/orion-client:tf1.12-py3
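+This image installs the official TensorFlow with `pip3 install tensorflow-gpu==1.12` and then installs the Orion Client runtime with the `install-client` installer.
+In addition, the `MLNX_OFED 4.5.1` RDMA driver is installed. If a Mellanox RDMA device is passed through into the container, users can follow the [Using remote GPU resources over RDMA](./quick-start/remote_rdma.md) chapter of the quick-start documentation to use remote GPU resources from inside the container.
+For demonstration purposes, Jupyter Notebook and some Python packages are installed as well.
+## [TensorFlow 1.12 with MLNX Driver, Python 2.7](./client-tf1.12-py2)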
+docker pull virtaitech/orion-client:tf1.12-py2
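+This image installs the official TensorFlow with `pip install tensorflow-gpu==1.12` and then installs the Orion Client runtime with the `install-client` installer.
+In addition, the `MLNX_OFED 4.5.1` RDMA driver is installed. If a Mellanox RDMA device is passed through into the container, users can follow the [Using remote GPU resources over RDMA](./quick-start/remote_rdma.md) chapter of the quick-start documentation to use remote GPU resources from inside the container.
+This image also includes some Python packages so that users can run the [TensorFlow Object Detection](https://github.com/tensorflow/models/tree/master/research/object_detection) models and the other [official Models](https://github.com/tensorflow/models).
+## [PyTorch 1.1.0, Python 3.5](./client-pytorch-1.1.0-py3)
+The wheel package that PyTorch publishes on the `pip` index is built with many components, some of which this version of the Orion vGPU software does not support yet, so we built the 1.1.0 wheel from the PyTorch source code. We did not modify the source code in any way; we only changed the build options.
+We also built torchvision 0.3.0 from source with the default build options and packaged it into the image. Some Python packages are installed as well, so that users can run the official PyTorch examples directly in the image: https://github.com/pytorch/examples
+Finally, we installed the Orion Client runtime with the `install-client` installer.
+The steps we used to build PyTorch 1.1.0 and TorchVision 0.3.0 and to install the Orion Client Runtime are described in [PyTorch 1.1.0 Python 3.5 image](./client-pytorch-1.1.0-py3) for reference.
+### Notes
+Because the PyTorch DataLoader communicates over IPC, the container must be started with `--shm-size=8G` so that DataLoader works properly. The same applies to a native environment.
+* PyTorch cannot yet use remote GPU resources over an RDMA network
+* For multi-GPU training, GLOO must be used as the backend instead of the default NCCL
+[One of our technical blog posts](../blogposts/use-pytorch.md) describes how to train the Resnet50 model on the Imagenet dataset with PyTorch using several Orion vGPUs.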
\ No newline at end of file
+FROM ubuntu:16.04
+MAINTAINER zoumao@virtaitech.com
+RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
+RUN apt update -y &&\
+ apt install -y libcurl4-openssl-dev &&\
+ apt install -y python3-dev python3-pip &&\
+ apt install -y git wget curl bc net-tools &&\
+ apt install -y lsb-core &&\
+ apt install -y libjpeg-dev zlib1g-dev libopenmpi-dev libomp-dev &&\
+ apt clean
+# Setup pip source
+COPY pip.conf /etc/
+WORKDIR /root
+# Install PyTorch, torchvision and other python packages
+COPY torch-1.1.0-cp35-cp35m-linux_x86_64.whl .
+RUN pip3 install torch-1.1.0-cp35-cp35m-linux_x86_64.whl && rm torch-1.1.0-cp35-cp35m-linux_x86_64.whl
+COPY requirement.txt .
+RUN pip3 install -r requirement.txt && rm requirement.txt
+COPY torchvision /usr/local/lib/python3.5/dist-packages/torchvision
+# Install Orion Client runtime
+ENV CUDA_HOME=/usr/local/cuda-9.0
+RUN mkdir -p $CUDA_HOME && mkdir -p $CUDA_HOME/lib64
+COPY install-client .
+RUN chmod +x install-client && ./install-client -d $CUDA_HOME/lib64 -q && rm install-client
+RUN ln -sf $CUDA_HOME/lib64/liborion.so $CUDA_HOME/lib64/libnvToolsExt.so.1 &&\
+ ln -sf $CUDA_HOME/lib64/liborion.so $CUDA_HOME/lib64/libnccl.so.2
+# Set the num of Orion vGPU each process requests from Orion Controller
+WORKDIR /root
+CMD ["/bin/bash"]
@@ -0,0 +1,124 @@
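+# PyTorch 1.1.0 Python 3.5 image
+## Notes
+Because PyTorch DataLoader communicates over IPC, the container must be started with `--shm-size=8G` so that DataLoader works properly. The same applies in a native environment.
+* We do not yet support PyTorch using remote GPU resources over an RDMA network
+* For multi-GPU training, use GLOO as the backend instead of the default NCCL
+In [one of our technical blog posts](../../blogposts/use-pytorch.md), we describe how to train a ResNet50 model on the ImageNet dataset with PyTorch using multiple Orion vGPUs.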
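The `--shm-size` note above could look like the following in practice. This is only a minimal sketch: the image tag `orion-client:pytorch-1.1.0-py3` and the helper name `orion_pytorch_run_cmd` are assumptions made for illustration and do not come from this repository. The helper prints the command instead of running it, so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) a docker command that starts a
# PyTorch client container with a shared-memory segment large enough
# for DataLoader worker processes.
orion_pytorch_run_cmd() {
    local image="${1:-orion-client:pytorch-1.1.0-py3}"  # assumed image tag
    local shm_size="${2:-8G}"                           # value from the note above
    printf 'docker run -it --rm --shm-size=%s %s /bin/bash\n' "$shm_size" "$image"
}

orion_pytorch_run_cmd
```

Passing a different image name or size as the first and second argument overrides the defaults, for example `orion_pytorch_run_cmd my-image 16G`.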
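+## Building the Python 3.5 version of PyTorch 1.1.0 from source
+We use an Ubuntu 16.04 environment as an example.
+First, `git clone` the repository together with its third-party dependencies: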
+git clone --recursive https://github.com/pytorch/pytorch
+cd pytorch
+git checkout v1.1.0 # check out the v1.1.0 release
+git submodule sync
+git submodule update --init --recursive
+apt install python3-dev python3-pip cmake g++ \
+ libopenmpi-dev libomp-dev libjpeg-dev zlib1g-dev
+pip3 install numpy pillow
+export NO_TEST=1
+export NO_FBGEMM=1
+export NO_MIOPEN=1
+export NO_MKLDNN=1
+export NO_NNPACK=1
+export NO_QNNPACK=1
+export TORCH_CUDA_ARCH_LIST="3.5;6.0;6.1;7.0"
+cd pytorch
+python3 setup.py bdist_wheel
+## 从源码编译TorchVision 0.3.0
+As of 2019/06/29, the latest TorchVision release, 0.3.0, matches PyTorch 1.1.0. After building PyTorch from source, TorchVision also needs to be rebuilt.
+git clone https://github.com/pytorch/vision.git
+cd vision
+git checkout v0.3.0
+python3 setup.py build
+ls build/lib.linux-x86_64-3.5/torchvision
+_C.cpython-35m-x86_64-linux-gnu.so __init__.py ops utils.py
+datasets models transforms version.py
+When building the Docker image, simply copy this directory into the Python 3.5 dist-packages path inside the container:
+COPY torchvision /usr/local/lib/python3.5/dist-packages/torchvision
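For the `COPY torchvision ...` line to work, the built package has to sit next to the Dockerfile first. The sketch below shows one way to stage it. The function name `stage_torchvision` and its two arguments are hypothetical, and it assumes the `build/lib.<platform>-<pyver>/torchvision` layout shown in the `ls` output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the torchvision package produced by
# `python3 setup.py build` into a Docker build context directory.
stage_torchvision() {
    local vision_src="$1"   # path to the cloned vision repository
    local context_dir="$2"  # directory that holds the Dockerfile
    local built
    # setup.py places the package under build/lib.<platform>-<pyver>/
    built=$(find "$vision_src/build" -maxdepth 2 -type d -name torchvision 2>/dev/null | head -n 1)
    if [ -z "$built" ]; then
        echo "torchvision build output not found under $vision_src/build" >&2
        return 1
    fi
    # Replace any stale copy so the context always holds the latest build
    rm -rf "$context_dir/torchvision"
    cp -r "$built" "$context_dir/torchvision"
}
```

A call such as `stage_torchvision ./vision ./client-pytorch-1.1.0-py3` would then leave a `torchvision` directory in the build context.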
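+## Final steps
+Before running `docker build`, the user needs to place the `install-client` package, the PyTorch wheel, and the TorchVision directory obtained in the two steps above in the directory that contains the Dockerfile.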
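A small pre-flight check can catch a missing file before `docker build` fails halfway through. The sketch below is an illustration, not part of the repository: the function name `check_build_context` is hypothetical, and the file list is taken from the `COPY` lines of the PyTorch Dockerfile in this change.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: verify that every path the PyTorch
# Dockerfile COPYs is present in the build context directory.
check_build_context() {
    local context_dir="${1:-.}"
    local missing=0
    local item
    for item in pip.conf requirement.txt install-client torchvision \
                torch-1.1.0-cp35-cp35m-linux_x86_64.whl; do
        if [ ! -e "$context_dir/$item" ]; then
            echo "missing: $item" >&2
            missing=1
        fi
    done
    # Non-zero exit status when at least one item is absent
    return $missing
}
```

It can be chained in front of the build, for example `check_build_context . && docker build -t <your-tag> .`, where the tag is whatever name the user chooses.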
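+## Appendix: installing the Orion Client runtime
+Here we explain the Dockerfile steps that install the Orion Client Runtime in more detail.
+PyTorch, once built with CMake, has a fixed RPATH. If PyTorch was built with `CUDA_HOME=/usr/local/cuda-9.0`, the Orion Client runtime inside the container must be installed under that same path for PyTorch to be able to use Orion vGPUs.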
+ENV CUDA_HOME=/usr/local/cuda-9.0
+RUN mkdir -p $CUDA_HOME && mkdir -p $CUDA_HOME/lib64
+COPY install-client .
+RUN chmod +x install-client && ./install-client -d $CUDA_HOME/lib64 -q && rm install-client
+RUN ln -sf $CUDA_HOME/lib64/liborion.so $CUDA_HOME/lib64/libnvToolsExt.so.1 &&\
+ ln -sf $CUDA_HOME/lib64/liborion.so $CUDA_HOME/lib64/libnccl.so.2
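After the image is built, it can be useful to confirm that the two library names linked in the Dockerfile snippet above really point at `liborion.so`. The following is a minimal sketch under that assumption. The function name `check_orion_links` is hypothetical, and it relies on GNU `readlink -f`.

```shell
#!/usr/bin/env bash
# Hypothetical check: confirm that the two library names the Dockerfile
# links (libnvToolsExt.so.1 and libnccl.so.2) resolve to liborion.so
# inside the given lib64 directory.
check_orion_links() {
    local lib_dir="$1"
    local name
    if [ ! -f "$lib_dir/liborion.so" ]; then
        echo "liborion.so not found in $lib_dir" >&2
        return 1
    fi
    for name in libnvToolsExt.so.1 libnccl.so.2; do
        # Compare fully resolved paths so relative links are handled too
        if [ "$(readlink -f "$lib_dir/$name")" != "$(readlink -f "$lib_dir/liborion.so")" ]; then
            echo "$name does not resolve to liborion.so" >&2
            return 1
        fi
    done
    echo "all links resolve to liborion.so"
}
```

Inside the container this would be called as `check_orion_links "$CUDA_HOME/lib64"`.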
+FROM ubuntu:16.04
+MAINTAINER zoumao@virtaitech.com
+RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
+RUN apt update -y &&\
+ apt install -y libcurl4-openssl-dev &&\
+ apt install -y python3-dev python3-pip &&\
+ apt install -y git wget curl bc net-tools &&\
+ apt install -y lsb-core &&\
+ apt install -y vim &&\
+ apt clean
+# Configurate pip source
+COPY pip.conf /etc/
+# Install TensorFlow 1.12 GPU version
+RUN pip3 install tensorflow-gpu==1.12.0
+# Install Python packages
+COPY requirements.txt .
+RUN pip3 install -r requirements.txt && rm requirements.txt
+WORKDIR /root
+# TensorFlow official benchmark
+RUN git clone --branch=cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks.git
+# Install Orion Client runtime
+COPY install-client .
+RUN chmod +x install-client && ./install-client -q && rm install-client
+# Set default ORION_VGPU for each process requesting vgpu resources
+WORKDIR /root
+CMD ["/bin/bash"]
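+# Building the image
+The user only needs to place the `install-client` package in the directory that contains the Dockerfile, and can then build the image with the `docker build` command.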
\ No newline at end of file
+set -e
+cd `dirname $0`
+docker build -t orion-client:tf1.12-py3 .
+FROM ubuntu:16.04
+MAINTAINER zoumao@virtaitech.com
+RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
+RUN apt update -y &&\
+ apt install -y libcurl4-openssl-dev &&\
+ apt install -y python-dev python-pip &&\
+ apt install -y git wget curl bc net-tools &&\
+ apt install -y lsb-core &&\
+ apt install -y vim &&\
+ apt install -y python-tk &&\
+ apt clean
+# Install RDMA driver
+RUN tar xvf MLNX_OFED_LINUX-4.5- &&\
+ cd MLNX_OFED_LINUX-4.5- &&\
+ ./mlnxofedinstall --user-space-only --without-fw-update --all --force -q &&\
+ cd /tmp && rm -rf *
+# Configurate pip source
+COPY pip.conf /etc/
+# Install TensorFlow 1.12 GPU version
+RUN pip install tensorflow-gpu==1.12.0
+# Install Python packages
+COPY requirements.txt .
+RUN pip install -r requirements.txt && rm requirements.txt
+WORKDIR /root
+# Install Orion Client runtime
+COPY install-client .
+RUN chmod +x install-client && ./install-client -q && rm install-client
+# Set default ORION_VGPU for each process requesting vgpu resources
+WORKDIR /root
+CMD ["/bin/bash"]
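+# Building the image
+Then, the user needs to download the MLNX_OFED 4.5- driver from the Mellanox website:
+Finally, the user can build the image with the `docker build` command.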
\ No newline at end of file
+set -e
+cd `dirname $0`
+docker build -t orion-client:tf1.12-py2 .
diff --git a/dockerfiles/client-tf1.12-py2/requirements.txt b/dockerfiles/client-tf1.12-py2/requirements.txt
+FROM ubuntu:16.04
+MAINTAINER zoumao@virtaitech.com
+RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
+RUN apt update -y &&\
+ apt install -y libcurl4-openssl-dev &&\
+ apt install -y python3-dev python3-pip &&\
+ apt install -y git wget curl bc net-tools &&\
+ apt install -y lsb-core &&\
+ apt install -y vim &&\
+ apt clean
+# Install RDMA driver
+RUN tar xvf MLNX_OFED_LINUX-4.5- &&\
+ cd MLNX_OFED_LINUX-4.5- &&\
+ ./mlnxofedinstall --user-space-only --without-fw-update --all --force -q &&\
+ cd /tmp && rm -rf *
+# Configurate pip source
+COPY pip.conf /etc/
+# Install TensorFlow 1.12 GPU version
+RUN pip3 install tensorflow-gpu==1.12.0
+# Install Python packages
+COPY requirements.txt .
+RUN pip3 install -r requirements.txt && rm requirements.txt
+WORKDIR /root
+# Install Orion Client runtime
+COPY install-client .
+RUN chmod +x install-client && ./install-client -q && rm install-client
+# Set default ORION_VGPU for each process requesting vgpu resources
+WORKDIR /root
+CMD ["/bin/bash"]
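+# Building the image
+Then, the user needs to download the MLNX_OFED 4.5- driver from the Mellanox website:
+Finally, the user can build the image with the `docker build` command.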
\ No newline at end of file
+set -e
+cd `dirname $0`
+docker build -t orion-client:tf1.12-py3 .
+cd `dirname $0`
+function print_help {
+    echo "Usage: install-server.sh [-h|-d [target path]]"
+    echo "    -d    installed target path. Default /usr/bin"
+    echo "    -h    print this help"
+}
+# Default install location, as documented in the help text
+install_path=/usr/bin
+while getopts "d:h" opt
+do
+    case $opt in
+        d) install_path=$OPTARG;;
+        h)
+            print_help
+            exit 0;;
+        ?)
+            print_help
+            exit 1;;
+    esac
+done
+if [ ! -f oriond ]; then
+    echo "Cannot find binary oriond. Please check your install package."
+    exit 1
+fi
+if [ ! -f orion-check ]; then
+    echo "Cannot find binary orion-check. Please check your install package."
+    exit 1
+fi
+if [ ! -f orion-shm ]; then
+    echo "Cannot find binary orion-shm. Please check your install package."
+    exit 1
+fi
+if [ "$(id -u)" != "0" ]; then
+    echo "Error. Root privilege is required to install Orion Server."
+    exit 1
+fi
+if systemctl status oriond > /dev/null 2>&1; then
+    systemctl stop oriond
+fi
+mkdir -p /var/log/orion
+chmod 777 /var/log/orion
+cp oriond orion-check orion-shm $install_path
+chmod 755 $install_path/oriond
+chmod 755 $install_path/orion-check
+chmod 755 $install_path/orion-shm
+if which virsh > /dev/null 2>&1; then
+    if virsh capabilities | grep -F "apparmor" > /dev/null 2>&1; then
+        armor_qemu_file=/etc/apparmor.d/abstractions/libvirt-qemu
+        if [ -f $armor_qemu_file ]; then
+            if grep -F "orionsock*" $armor_qemu_file > /dev/null; then
+                :
+            else
+                sed -i '/^\s*\/[{]*dev\>.*\/shm\>\s*r,.*/a \ \ \/dev\/shm\/orionsock* rw,' $armor_qemu_file
+                systemctl reload apparmor.service
+            fi
+        fi
+    fi
+fi
+if [ -f orion.conf.template ]; then
+    mkdir -p /etc/orion
+    cp orion.conf.template /etc/orion/server.conf
+    chmod 755 /etc/orion
+    chmod 644 /etc/orion/server.conf
+    echo "orion.conf.template is copied to /etc/orion/server.conf as Orion Server configuration file."
+fi
+echo "Orion Server is successfully installed to $install_path"
+cat > /etc/systemd/system/oriond.service << EOF
+Description=Orion Server Daemon Service
+systemctl daemon-reload > /dev/null 2>&1
+systemctl enable oriond > /dev/null 2>&1
+echo "Orion Server is registered as system service."
+echo "Using following commands to interact with Orion Server :"
+echo -e "\n\tsystemctl start oriond # start oriond daemon"
+echo -e "\tsystemctl status oriond # print oriond daemon status and screen output"
+echo -e "\tsystemctl stop oriond # stop oriond daemon"
+echo -e "\tjournalctl -u oriond # print oriond stdout"
+echo -e "\nBefore launching Orion Server, please change settings in /etc/orion/server.conf according to your environment.\n"
+74872f49042cf6570b821e1cba7c892e ./oriond
+93eded304497cff77b4f3ef013108e16 ./install-server.sh
+69d642f58f3c793c885111188409792b ./orion-shm
+93fd51ccba8a56c695d58ecb7a83c6e2 ./orion.conf.template
+c5974ac53ff4b55e64e3023a26dd91d3 ./orion-controller
+2a593887296675fd80a3a2d8fb190dcb ./install-client
+5b2745b18ca619eebcce4c86d5258081 ./orion-check
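A checksum list like the one above can be verified with `md5sum --check` from GNU coreutils. The sketch below is an illustration only: the helper name `verify_package` and the list file name passed to it are assumptions, since the file name of the list is not shown in this change. The list is expected in the format that `md5sum` itself writes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify the files of an unpacked install package
# against a checksum list in md5sum output format.
verify_package() {
    local package_dir="$1"
    local list_file="$2"   # path relative to package_dir (assumed name)
    # Run in a subshell so the caller's working directory is unchanged;
    # --quiet prints only files that fail verification.
    ( cd "$package_dir" && md5sum --check --quiet "$list_file" )
}
```

A non-zero exit status indicates that at least one file is missing or does not match its recorded checksum.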
+cd `dirname $0`
+function print_help {
+ echo "
+ Orion Health Check Tool
+ v0.1
+ install
+ server Check health for installing Orion Server
+ client Check health for installing Orion Client
+ controller Check health for installing Orion Controller
+ all Check health for installing all Orion components
+ runtime
+ server Diagnose the status for Orion Server running
+ client Diagnose the status for Orion Client running
+ orion-check install server
+ orion-check install client
+ orion-check runtime server
+ "
+}
+while getopts "h" opt
+do
+    case $opt in
+        h)
+            print_help
+            exit 0;;
+        ?)
+            print_help
+            exit 1;;
+    esac
+done
+if [ -z "$1" -o -z "$2" ]; then
+    echo "Invalid usage for Orion Health Check Tool."
+    print_help
+    exit 1
+fi
+function check_os {
+    KERNEL_VERSION=$(uname -r)
+    if [ ! -f /etc/os-release ]; then
+        echo -e "\nOS information : unknown OS"
+        echo -e " : Kernel $KERNEL_VERSION"
+        return 0
+    fi
+    OS_NAME=$(cat /etc/os-release | grep -w NAME | awk -F '"' '{print $2}' | awk '{print $1}')
+    OS_VERSION=$(cat /etc/os-release | grep -w VERSION_ID | awk -F '"' '{print $2}')
+    echo -e "\nOS information : $OS_NAME $OS_VERSION"
+    echo -e " : Kernel $KERNEL_VERSION"
+    # Quote the variables so an empty value does not break the test
+    if [ "$OS_NAME" == "CentOS" ]; then
+        if [ "$OS_VERSION" == "7" ]; then
+            summary_os_support="Yes"
+        else
+            summary_os_support="No"
+        fi
+    elif [ "$OS_NAME" == "Ubuntu" ]; then
+        if [ "$OS_VERSION" == "16.04" ]; then
+            summary_os_support="Yes"
+        else
+            summary_os_support="Unknown"
+        fi
+    else
+        summary_os_support="Unknown"
+    fi
+}
+function check_hw_configuration {
+ echo -e "\nChecking CPU configuration ..."
+ lscpu | head -n -1
+ echo -e "\nChecking disk space ..."
+ df -hT
+}
+function find_rdma_support {
+ echo -e "\nChecking RDMA network support ..."
+ if ls /dev/infiniband/rdma_cm > /dev/null 2>&1; then
+ if ls /dev/infiniband/uverbs* > /dev/null 2>&1; then
+ rdma_driver=1
+ for path in ${default_lib_path[@]}; do
+ if [ -d $path ]; then
+ result=$(find $path -name "librdmacm.so")
+ if [ -n "$result" ]; then
+ rdma_support=1
+ fi
+ fi
+ done
+ fi
+ fi
+ if [ $rdma_support -eq 0 ]; then
+ if [ $rdma_driver -eq 0 ]; then
+ echo "No RDMA network support is found in the system."
+ else
+ echo "RDMA device is found but rdmacm library is not found in default searching path."
+ fi
+ else
+ echo "RDMA support is found in the system."
+ summary_rdma_support="Yes"
+ # Try to get information by using Mellanox OFED tools
+ if which ibdev2netdev > /dev/null 2>&1; then
+ printf "\n RDMA-Port RDMA-Rate Interface Status\n"
+ printf " ------------------------------------------------\n"
+ i=0
+ ibdev2netdev |
+ while IFS= read -r line
+ do
+ mlx_port_name[$i]=$(echo $line | awk '{print $1}')
+ mlx_port_interface[$i]=$(echo $line | awk '{print $5}')
+ mlx_port_interface_status[$i]=$(echo $line | awk '{print $6}')
+ if which ibstatus > /dev/null 2>&1; then
+ mlx_port_rate[$i]=$(ibstatus ${mlx_port_name[$i]} | grep -F "rate:" | awk '{print $2, $3}')
+ fi
+ printf "%8s %14s %14s %10s\n" "${mlx_port_name[$i]}" "${mlx_port_rate[$i]}" "${mlx_port_interface[$i]}" "${mlx_port_interface_status[$i]}"
+ let i++
+ done
+ fi
+ fi
+}
+function find_cuda {
+ echo -e "\nSearching CUDA ..."
+ if [ -n "$CUDA_HOME" ]; then
+ echo "CUDA_HOME is set to ${CUDA_HOME}"
+ fi
+ possible_path=$(find $cuda_install_path -maxdepth 1 -type d -name "cuda*")
+ if [ -z "$possible_path" ]; then
+ echo -e "\033[31m[Error] Fail to find cuda in default path $cuda_install_path:\033[0m"
+ return 1
+ fi
+ i=0
+ for path in $possible_path; do
+ if ls $path/version.txt > /dev/null 2>&1; then
+ version=$(cat $path/version.txt | head -n 1)
+ echo "Find $version in $path"
+ if echo $version | grep "\<9.0." > /dev/null; then
+ summary_cuda_support="Yes"
+ fi
+ cuda_path[$i]=$path
+ let i++
+ fi
+ done
+}
+function find_cudnn {
+ echo -e "\nSearching CUDNN ..."
+ i=0
+ for path in ${cuda_path[@]}; do
+ version=
+ pushd $path/lib64 > /dev/null
+ libcudnn=$(find -type l -name "libcudnn.so*" -o -type f -name "libcudnn.so*" | awk -F '/' '{print $2}')
+ if [ -n "$libcudnn" ]; then
+ version=$(find -type f -name "libcudnn.so*" | awk -F 'so.' '{print $2}')
+ if [ -z "$version" ]; then
+ version="(unknown version)"
+ if [ $summary_cudnn_support == "No" ]; then
+ summary_cudnn_support="Unknown"
+ fi
+ fi
+ echo "CUDNN $version is installed in $path"
+ ls -l libcudnn.so* | awk '{print " ", $9, $10, $11}'
+ cudnn_version[$i]=${version}
+ cudnn_major=$(echo $version | awk -F '.' '{print $1}')
+ cudnn_mid=$(echo $version | awk -F '.' '{print $2}')
+ if [ $cudnn_major == "7" ]; then
+ if [ $cudnn_mid -gt 1 ]; then
+ summary_cudnn_support="Yes"
+ fi
+ fi
+ let i++
+ fi
+ popd > /dev/null
+ done
+ if [ $i -eq 0 ]; then
+ echo "No CUDNN library is found."
+ fi
+}
+function find_nvidia_gpu {
+ cuda_driver_version=
+ nvidia_gpu=
+ echo -e "\nSearching NVIDIA GPU ..."
+ if nvidia-smi -i 0 -q > /dev/null; then
+ cuda_driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
+ echo "CUDA driver $cuda_driver_version is installed."
+ gpus=$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader)
+ if [ -z "$gpus" ]; then
+ echo -e "\033[33m[Warning] No NVIDIA GPU is found in the system.\033[0m"
+ return 0
+ fi
+ summary_nvidia_gpu_support="Yes"
+ i=0
+ tmp_ifs=$IFS
+ IFS=$'\n'
+ for name in $gpus; do
+ nvidia_gpu[$i]="$name"
+ let i++
+ done
+ IFS=$tmp_ifs
+ if [ $i -gt 1 ]; then
+ echo "$i NVIDIA GPUs are found :"
+ else
+ echo "$i NVIDIA GPU is found :"
+ fi
+ i=0
+ for name in "${nvidia_gpu[@]}"; do
+ echo " $i :" "$name"
+ let i++
+ done
+ else
+ echo -e "\033[31m[Error] Fail to get NVIDIA driver version.\033[0m"
+ return 1
+ fi
+}
+function find_mps_support {
+ echo -e "\nChecking NVIDIA MPS ..."
+ user=$(ps -aux | grep -v grep | grep -F "nvidia-cuda-mps-control" | awk '{print $1}')
+ if [ -n "$user" ]; then
+ enable_nvidia_mps=1
+ summary_nvidia_mps="ON"
+ echo "NVIDIA CUDA MPS is running by Linux account : $user"
+ echo -e "\033[33m[Info] Orion only supports enabling MPS with NVIDIA Volta and later GPU.\033[0m"
+ else
+ echo "NVIDIA CUDA MPS is off."
+ fi
+}
+function find_etcd {
+ echo -e "\nSearching etcd service ..."
+ bin=
+ running_bin=$(ps -aux | grep -v grep | grep -w etcd | awk '{print $11}')
+ if [ -z "$running_bin" ]; then
+ if which etcd > /dev/null 2>&1; then
+ bin=$(which etcd)
+ fi
+ else
+ bin=$running_bin
+ etcd_running=1
+ summary_etcd_support="Yes"
+ fi
+ if [ -z "$bin" ]; then
+ echo "No etcd is running or installed in the system."
+ return 1
+ fi
+ etcd_version=$($bin --version | grep "etcd Version" | awk '{print $3}')
+ echo "etcd (version $etcd_version) is installed in $bin"
+ major=${etcd_version:0:1}
+ if [ $major -eq 2 ]; then
+ etcd_v2=1
+ elif [ $major -eq 3 ]; then
+ etcd_v3=1
+ fi
+}
+function find_qemu_kvm {
+ echo -e "\nSearching VM support ..."
+ running_bin=$(ps -aux | grep -v grep | grep -w libvirtd | awk '{print $11}')
+ if [ -z "$running_bin" ]; then
+ echo "libvirtd is not running. Please install libvirt-bin and start libvirtd before launching Orion Server."
+ return 1
+ fi
+ qemu_api_version=$(virsh version | grep "Using API" | awk '{print $4}')
+ qemu_version=$(virsh version | grep "Running hypervisor" | awk '{print $4}')
+ qemu_version_major=$(echo $qemu_version | awk -F '.' '{print $1}')
+ qemu_version_mid=$(echo $qemu_version | awk -F '.' '{print $2}')
+ echo "QEMU API version : $qemu_api_version"
+ echo "QEMU version : $qemu_version"
+ if [ $qemu_version_major == "2" ]; then
+ summary_qemu_kvm_support="Yes"
+ fi
+ nets=$(virsh net-list | tail -n +3 | head -n -1 | awk '{print $1}')
+ i=0
+ for net in $nets; do
+ qemu_net_list[$i]=$(virsh net-dumpxml $net | grep -F " /dev/null 2>&1; then
+ docker_installed=1
+ if docker images 2>&1 | grep "Cannot connect" > /dev/null; then
+ echo "Docker is not launched in the system"
+ return
+ fi
+ if docker images 2>&1 | grep "permission denied" > /dev/null; then
+ echo "Permission denied to check docker environment."
+ return
+ fi
+ summary_docker_support="Yes"
+ docker version
+ docker_gateway=$(docker inspect -f '{{range .IPAM.Config}}{{.Gateway}}{{end}}' bridge)
+ if [ -z "$docker_gateway" ]; then
+ docker_gateway=$(ip addr show docker0 2>/dev/null | grep inet | awk '{print $2}' | awk -F '/' '{print $1}')
+ fi
+ else
+ echo "Docker is not installed in the system"
+ fi
+}
+function check_server_install {
+ echo -e "\nChecking Orion Server binary ..."
+ if [ ! -f ${server_name} ]; then
+ echo "Can not find installation file \"${server_name}\""
+ return 1
+ fi
+ if [ -z "$CUDA_HOME" ]; then
+ echo -e "\033[33mCUDA_HOME is not set in current environment. You may want to set it before doing the checking.\033[0m"
+ fi
+ unfound_lib=$(ldd ${server_name} | grep "not found" | awk '{print $1}')
+ if [ -n "$unfound_lib" ]; then
+ echo -e "\033[31mFollowing libraries are needed but not found :\033[0m"
+ echo "$unfound_lib"
+ return 1
+ fi
+ summary_server_support="Yes"
+}
+function check_controller_runtime {
+ config_path=
+ if [ "$1" == "server" ]; then
+ config_path="/etc/orion/server.conf"
+ else
+ config_path="/etc/orion/client.conf"
+ fi
+ echo -e "\nChecking Orion Controller status ..."
+ echo -e "\033[33m[Info] Orion Controller setting may be different in different SHELL.\033[0m"
+ echo -e "\033[33m[Info] Environment variable ORION_CONTROLLER has the first priority.\033[0m\n"
+ controller_addr_env=$ORION_CONTROLLER
+ controller_addr=
+ if [ -r $config_path ]; then
+ controller_addr=$(sed -n 's/^\s*controller_addr\s*=\s*\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*:[0-9]*\).*/\1/p' $config_path)
+ controller_addr=${controller_addr:-""}
+ fi
+ if [ -z "$controller_addr_env" -a -z "$controller_addr" ]; then
+ echo -e "\033[31m[Error] No Orion Controller address is set in either environment variable ORION_CONTROLLER or configuration file.\033[0m"
+ return
+ fi
+ controller_ip=
+ controller_port=
+ target_addr=
+ if [ -n "$controller_addr_env" ]; then
+ target_addr=$controller_addr_env
+ echo "Environment variable ORION_CONTROLLER is set as ${controller_addr_env}. Using this address to diagnose Orion Controller."
+ else
+ target_addr=$controller_addr
+ echo "Orion Controller address is set as $controller_addr in configuration file. Using this address to diagnose Orion Controller."
+ fi
+ controller_ip=$(echo $target_addr | awk -F ':' '{print $1}')
+ controller_port=$(echo $target_addr | awk -F ':' '{print $2}')
+ if [ -z "$controller_port" ]; then
+ echo -e "\033[31m[Error] Invalid Orion Controller address. No port is specified.\033[0m"
+ return
+ fi
+ if which nc > /dev/null 2>&1; then
+ if nc -zv $controller_ip $controller_port > /dev/null 2>&1; then
+ echo "Address $target_addr is reached."
+ else
+ echo -e "\033[31m[Error] Can not reach ${target_addr}. Please make sure Orion Controller is launched at the address, and the firewall is correctly set.\033[0m"
+ return
+ fi
+ fi
+ if which curl > /dev/null 2>&1; then
+ result=$(curl -s "http://$target_addr/info?data_version&api_version")
+ if [ $? -ne 0 ]; then
+ echo -e "\033[31m[Error] Can not reach ${target_addr}. Please make sure Orion Controller is launched at the address, and the firewall is correctly set.\033[0m"
+ return
+ else
+ echo "Orion Controller Version Information : $result"
+ fi
+ data_version=$(echo $result | awk -F ',' '{print $1}')
+ api_version=$(echo $result | awk -F ',' '{print $2}')
+ data_version=${data_version/=/:}
+ api_version=${api_version/=/:}
+ result=$(curl -s -H "${data_version}" -H "${api_version}" "http://$target_addr/devices?res=nvidia_cuda&used=true")
+ if [ $? -ne 0 ]; then
+ echo -e "\033[31m[Error] Can not fetch vGPU status from Orion Controller.\033[0m"
+ return
+ else
+ free_num=${result##*=}
+ fi
+ result=$(curl -s -H "${data_version}" -H "${api_version}" "http://$target_addr/devices?res=nvidia_cuda")
+ if [ $? -ne 0 ]; then
+ echo -e "\033[31m[Error] Can not fetch vGPU status from Orion Controller.\033[0m"
+ return
+ else
+ total_num=${result##*=}
+ fi
+ echo "There are $total_num vGPUs managed by Orion Controller. $free_num vGPUs are free now."
+ else
+ echo -e "\033[33mLinux curl is needed to diagnose Orion Controller.\033[0m"
+ return
+ fi
+}
+function check_oriond_runtime {
+ echo -e "\nChecking Orion Server status ..."
+ if ! which netstat > /dev/null 2>&1; then
+ echo "Linux tool netstat is not found. Installing the tool helps to diagnose the system."
+ fi
+ if ! which nc > /dev/null 2>&1; then
+ echo "Linux tool net-cat is not found. Installing the tool helps to diagnose the system."
+ fi
+ if ! which curl > /dev/null 2>&1; then
+ echo "Linux tool curl is not found. Installing the tool helps to diagnose the system."
+ fi
+ running_bin=$(ps -aux | grep -v grep | grep -w oriond)
+ if [ -z "$running_bin" ]; then
+ echo -e "\033[33mOrion Server is not running.\033[0m\n"
+ bin_path=
+ if [ -f ${server_name} ]; then
+ bin_path=${server_name}
+ else
+ if [ -f /usr/bin/${server_name} ]; then
+ bin_path="/usr/bin/${server_name}"
+ else
+ echo "Can not find Orion Server binary \"oriond\" in either `pwd` or /usr/bin."
+ fi
+ fi
+ if [ -r /etc/systemd/system/oriond.service ]; then
+ bin_path=$(cat /etc/systemd/system/oriond.service | grep -F 'ExecStart=' | awk -F '=' '{print $2}')
+ echo "Orion Server has been registered as system service. Using binary $bin_path to infer the runtime environment."
+ ld_path=$(cat /etc/systemd/system/oriond.service | grep -F 'Environment="LD_LIBRARY_PATH=' | awk -F '[="]' '{print $4}')
+ path_path=$(cat /etc/systemd/system/oriond.service | grep -F 'Environment="PATH=' | awk -F '[="]' '{print $4}')
+ if [ -n "$ld_path" ]; then
+ export LD_LIBRARY_PATH=$ld_path
+ echo "Injecting oriond service environment LD_LIBRARY_PATH=$ld_path"
+ fi
+ if [ -n "$path_path" ]; then
+ export PATH=$path_path
+ echo "Injecting oriond service environment PATH=$path_path"
+ fi
+ fi
+ if [ -n "$bin_path" ]; then
+ unfound_lib=$(ldd ${bin_path} | grep "not found" | awk '{print $1}')
+ if [ -n "$unfound_lib" ]; then
+ echo -e "\033[31mFollowing libraries are needed but not found in current environment:\033[0m"
+ echo "$unfound_lib"
+ return 1
+ fi
+ fi
+ controller_addr_env=$ORION_CONTROLLER
+ if [ -n "$controller_addr_env" ]; then
+ if echo $controller_addr_env | grep ":" > /dev/null; then
+ echo "Environment variable ORION_CONTROLLER=$controller_addr_env is set in current SHELL."
+ else
+ echo "Environment variable ORION_CONTROLLER=$controller_addr_env is set in current SHELL."
+ echo -e "\033[33m[Warning] Invalid format. No port is specified.\033[0m"
+ fi
+ fi
+ controller_addr=
+ if [ -r /etc/orion/server.conf ]; then
+ controller_addr=$(sed -n 's/^\s*controller_addr\s*=\s*\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*:[0-9]*\).*/\1/p' /etc/orion/server.conf)
+ bind_ip=$(sed -n 's/^\s*bind_addr\s*=\s*\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\).*/\1/p' /etc/orion/server.conf)
+ bind_port=$(sed -n 's/^\s*listen_port\s*=\s*\([0-9]*\).*/\1/p' /etc/orion/server.conf)
+ controller_addr=${controller_addr:-""}
+ bind_ip=${bind_ip:-""}
+ bind_port=${bind_port:-"9960"}
+ echo ""
+ echo "Configuration file is found at /etc/orion/server.conf"
+ echo "Orion Server will connect to Orion Controller at $controller_addr unless the setting is overwritten by environment variable \"ORION_CONTROLLER\""
+ echo "Orion Server will listen on port $bind_port unless the setting is overwritten by -p option"
+ valid_ip=0
+ if ip addr > /dev/null 2>&1; then
+ while read line
+ do
+ if [ "$bind_ip" == "${line}" ]; then
+ valid_ip=1
+ break
+ fi
+ done <<< "$(ip addr | grep -w inet | awk '{print $2}' | awk -F '/' '{print $1}')"
+ else
+ valid_ip=1
+ fi
+ if [ $valid_ip -eq 1 ]; then
+ echo "Orion Server will bind to address $bind_ip unless the setting is overwritten by -b option"
+ else
+ echo -e "\033[33mOrion Server is configured to bind at address \"${bind_ip}\" which may be invalid.\033[0m"
+ fi
+ cfg_enable_shm=0
+ cfg_enable_rdma=0
+ cfg_enable_kvm=0
+ if [ -f /etc/orion/server.conf ]; then
+ result=$(sed -n 's/^\s*enable_shm\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_shm=1
+ fi
+ result=$(sed -n 's/^\s*enable_rdma\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_rdma=1
+ fi
+ result=$(sed -n 's/^\s*enable_kvm\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_kvm=1
+ fi
+ fi
+ printf "%-40s" "Enable SHM"
+ if [ $cfg_enable_shm == 1 ]; then
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+ printf "%-40s" "Enable RDMA"
+ if [ $cfg_enable_rdma == 1 ]; then
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+ printf "%-40s" "Enable Local QEMU-KVM with SHM"
+ if [ $cfg_enable_kvm == 1 ]; then
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+ else
+ echo "No configuration is set in the system. Default setting and environment variables will be used to configure Orion Server."
+ echo -e "Orion Server will connect to Orion Controller set by environment variable \033[32mORION_CONTROLLER\033[0m"
+ echo -e "Orion Server will bind to address \033[32m127.0.0.1\033[0m unless the setting is overwritten by \033[32m-b\033[0m option"
+ echo -e "Orion Server will listen on port \033[32m9960\033[0m unless the setting is overwritten by \033[32m-p\033[0m option"
+ bind_port=9960
+ fi
+ if which netstat > /dev/null 2>&1; then
+ result=$(netstat -tulpn 2>/dev/null | grep -w LISTEN | awk '{print $4}' | grep ":${bind_port}\>")
+ if [ -n "$result" ]; then
+ echo -e "\033[33m[Warning] Linux port $bind_port is in use by another program.\033[0m"
+ fi
+ fi
+ else
+ pid=$(ps -aux | awk '{print $2,$11}' | grep -v grep | grep -w "oriond" | awk '{print $1}')
+ pid_controller=$(strings /proc/${pid}/environ | grep ORION_CONTROLLER | awk -F '=' '{print $2}')
+ if [ -n "$pid_controller" ]; then
+ echo "Orion Server runs with environment ORION_CONTROLLER=${pid_controller}"
+ export ORION_CONTROLLER=${pid_controller}
+ fi
+ cfg_enable_shm=0
+ cfg_enable_rdma=0
+ cfg_enable_kvm=0
+ if [ -f /etc/orion/server.conf ]; then
+ result=$(sed -n 's/^\s*enable_shm\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_shm=1
+ fi
+ result=$(sed -n 's/^\s*enable_rdma\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_rdma=1
+ fi
+ result=$(sed -n 's/^\s*enable_kvm\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_kvm=1
+ fi
+ fi
+ user_name=$(ps -aux | grep -v grep | grep "oriond" | awk '{print $1}')
+ command_line=$(ps -aux | grep -v grep | grep "oriond" | awk '{for(i=11;i<=NF;i++){printf "%s ", $i}; printf "\n"}')
+ echo "Orion Server is running with Linux user : $user_name"
+ echo "Orion Server is running with command line : $command_line"
+ cudart_path=$(ls -l /proc/${pid}/map_files | grep libcudart | awk '{print $11}' | head -n 1 | awk -F 'so.' '{print $2}')
+ if [ -n "$cudart_path" ]; then
+ echo "Orion Server is running with CUDA version $cudart_path"
+ fi
+ cudnn_path=$(ls -l /proc/${pid}/map_files | grep libcudnn | awk '{print $11}' | head -n 1 | awk -F 'so.' '{print $2}')
+ if [ -n "$cudnn_path" ]; then
+ echo "Orion Server is running with CUDNN version $cudnn_path"
+ fi
+ enable_shm=0
+ enable_rdma=0
+ enable_kvm=0
+ bind_ip=
+ bind_port=9960
+ printf "%-40s" "Enable SHM"
+ if echo $command_line | grep -e " -m " > /dev/null; then
+ enable_shm=1
+ printf "[Yes]\n"
+ elif [ $cfg_enable_shm == 1 ]; then
+ enable_shm=1
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+ printf "%-40s" "Enable RDMA"
+ if echo $command_line | grep -e " -r " > /dev/null; then
+ enable_rdma=1
+ printf "[Yes]\n"
+ elif [ $cfg_enable_rdma == 1 ]; then
+ enable_rdma=1
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+ printf "%-40s" "Enable Local QEMU-KVM with SHM"
+ if echo $command_line | grep -e " -k " > /dev/null; then
+ enable_kvm=1
+ printf "[Yes]\n"
+ elif [ $cfg_enable_kvm == 1 ]; then
+ enable_kvm=1
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+ if which netstat > /dev/null 2>&1; then
+ listen_addr=$(netstat -nap 2>/dev/null | grep oriond | grep LISTEN | awk '{print $4}' | sort | head -n 1)
+ if [ -n "$listen_addr" ]; then
+ bind_ip=$(echo $listen_addr | awk -F ':' '{print $1}')
+ listen_port=$(echo $listen_addr | awk -F ':' '{print $2}')
+ fi
+ else
+ if echo $command_line | grep -e " -b " > /dev/null; then
+ bind_ip=$(echo $command_line | sed -n 's/.*\s\+-b\s\+\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\).*/\1/p')
+ else
+ if [ -r /etc/orion/server.conf ]; then
+ bind_ip=$(sed -n 's/^\s*bind_addr\s*=\s*\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)/\1/p' /etc/orion/server.conf)
+ bind_port=$(sed -n 's/^\s*listen_port\s*=\s*\([0-9]*\)/\1/p' /etc/orion/server.conf)
+ bind_ip=${bind_ip:-""}
+ bind_port=${bind_port:-"9960"}
+ fi
+ fi
+ fi
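+ # Fall back to the configured address and port when netstat did not provide them
+ listen_port=${listen_port:-$bind_port}
+ listen_addr=${listen_addr:-"$bind_ip:$listen_port"}
+ printf "%-40s%s\n" "Binding IP Address :" "$bind_ip"
+ printf "%-40s%s\n\n" "Listening Port :" "$listen_port"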
+ if which nc > /dev/null 2>&1; then
+ echo "Testing the Orion Server network ..."
+ if nc -zv $bind_ip $listen_port > /dev/null 2>&1; then
+ echo "Orion Server can be reached through $listen_addr"
+ else
+ echo "Orion Server can not be reached through $listen_addr"
+ echo "Please check the firewall setting."
+ fi
+ fi
+ fi
+}
+if [ "$1" == "install" ]; then
+ if [ "$2" == "all" ]; then
+ check_os
+ find_rdma_support
+ find_cuda
+ find_cudnn
+ find_nvidia_gpu
+ find_mps_support
+ find_etcd
+ find_qemu_kvm
+ find_docker
+ check_server_install
+ echo -e "\n==============================================="
+ echo -e "Installation summaries :\n"
+ printf "%-40s [%s]\n" "OS :" "$summary_os_support"
+ printf "%-40s [%s]\n" "RDMA :" "$summary_rdma_support"
+ printf "%-40s [%s]\n" "CUDA :" "$summary_cuda_support"
+ printf "%-40s [%s]\n" "CUDNN :" "$summary_cudnn_support"
+ printf "%-40s [%s]\n" "NVIDIA GPU :" "$summary_nvidia_gpu_support"
+ printf "%-40s [%s]\n" "NVIDIA CUDA MPS :" "$summary_nvidia_mps"
+ printf "%-40s [%s]\n" "etcd service :" "$summary_etcd_support"
+ printf "%-40s [%s]\n" "QEMU-KVM environment :" "$summary_qemu_kvm_support"
+ printf "%-40s [%s]\n" "Docker container environment :" "$summary_docker_support"
+ printf "%-40s [%s]\n" "Orion Server binary:" "$summary_server_support"
+ elif [ "$2" == "server" ]; then
+ check_os
+ find_rdma_support
+ find_cuda
+ find_cudnn
+ find_nvidia_gpu
+ find_mps_support
+ find_qemu_kvm
+ find_docker
+ check_server_install
+ echo -e "\n==============================================="
+ echo -e "Installation summaries :\n"
+ printf "%-40s [%s]\n" "OS :" "$summary_os_support"
+ printf "%-40s [%s]\n" "RDMA :" "$summary_rdma_support"
+ printf "%-40s [%s]\n" "CUDA :" "$summary_cuda_support"
+ printf "%-40s [%s]\n" "CUDNN :" "$summary_cudnn_support"
+ printf "%-40s [%s]\n" "NVIDIA GPU :" "$summary_nvidia_gpu_support"
+ printf "%-40s [%s]\n" "NVIDIA CUDA MPS :" "$summary_nvidia_mps"
+ printf "%-40s [%s]\n" "QEMU-KVM environment :" "$summary_qemu_kvm_support"
+ printf "%-40s [%s]\n" "Docker container environment :" "$summary_docker_support"
+ printf "%-40s [%s]\n" "Orion Server binary :" "$summary_server_support"
+ elif [ "$2" == "client" ]; then
+ check_os
+ find_rdma_support
+ find_qemu_kvm
+ find_docker
+ echo -e "\n==============================================="
+ echo -e "Installation summaries :\n"
+ printf "%-40s [%s]\n" "OS :" "$summary_os_support"
+ printf "%-40s [%s]\n" "RDMA :" "$summary_rdma_support"
+ printf "%-40s [%s]\n" "QEMU-KVM environment :" "$summary_qemu_kvm_support"
+ printf "%-40s [%s]\n" "Docker container environment :" "$summary_docker_support"
+ elif [ "$2" == "controller" ]; then
+ check_os
+ find_etcd
+ echo -e "\n==============================================="
+ echo -e "Installation summaries :\n"
+ printf "%-40s [%s]\n" "OS :" "$summary_os_support"
+ printf "%-40s [%s]\n" "etcd service :" "$summary_etcd_support"
+ if [ "$summary_etcd_support" == "No" ]; then
+ echo -e "\n\033[31mOrion Controller can not be installed in this environment.\033[0m"
+ fi
+ else
+ echo "Invalid parameters."
+ print_help
+ exit 1
+ fi
+elif [ "$1" == "runtime" ]; then
+ if [ "$2" == "server" ]; then
+ find_nvidia_gpu
+ find_mps_support
+ check_oriond_runtime
+ check_controller_runtime server
+ elif [ "$2" == "client" ]; then
+ check_controller_runtime client
+ else
+ echo "Invalid parameters."
+ print_help
+ exit 1
+ fi
+else
+ echo "Invalid parameters."
+ print_help
+ exit 1
+fi
+; this is an example of orion configuration
+ listen_port = 9960
+ bind_addr =
+ enable_shm = "true"
+ enable_rdma = "false"
+ enable_kvm = "false"
+ log_with_time = 1
+ log_to_screen = 0
+ log_to_file = 1
+ log_level = INFO
+ file_log_level = INFO
+ shm_path_base = "/dev/shm/"
+ shm_group_name = "kvm"
+ shm_user_name = "libvirt-qemu"
+ shm_buffer_size = 134217728
+ controller_addr =
