`shared memory?
+
+If the problem is one of the above, the Orion Server must be restarted after the fix for it to take effect.
+
+For Orion Client installation and configuration issues, refer to the [Using a local Docker container](container.md) section of the quick-start guide. For firewall issues, see [the corresponding section of this appendix](#firewall).
+
+For a more complete FAQ, see the [relevant chapter](../Orion-User-Guide.md#常见问题) of the User Guide.
diff --git a/doc/quick-start/container.md b/doc/quick-start/container.md
new file mode 100755
index 0000000..1beefaf
--- /dev/null
+++ b/doc/quick-start/container.md
@@ -0,0 +1,308 @@
+# Scenario 1: Using Local GPU Resources from a Docker Container
+
+The node used in this chapter is configured as follows:
+* A single workstation with one NVIDIA GTX 1080Ti GPU (11 GB of device memory)
+* Ubuntu 16.04 LTS
+* Docker CE 18.09
+
+After deploying the Orion vGPU software, we will start a Jupyter Notebook inside an ordinary Docker container (no physical GPU passed through, no dependency on `nvidia-docker`) and run TensorFlow 1.12, training and running inference on the pix2pix model using Orion vGPU resources.
+
+
+
+![Docker](./figures/arch-docker.png)
+
+
+
+Before proceeding to the next steps, we assume that:
+* Orion Server has been installed successfully following the [Orion Server installation](README.md#server) section
+* Orion Controller has been installed and started normally following the [Orion Controller installation](README.md#controller) section
+
+## **Orion Server Configuration and Startup**
+As described in [Orion Server configuration](README.md#server-config), two groups of settings in `/etc/orion/server.conf` need to be configured:
+
+* The data path on which Orion Server accepts traffic from Clients, i.e. `bind_addr`
+* The mode in which Orion Server runs
+
+### Orion Server Data Path
+
+The `bind_addr` property is the data-path address on which Orion Server listens; Clients must be able to reach this address. For a local container environment, the simplest approach is to pass `--net host` to `docker run`, so that the container shares the host's network namespace. In that case the default `bind_addr` of `127.0.0.1` works as-is.
+
+This convenience comes at the cost of network isolation between the container and the host OS. Interested readers can refer to the last section of this scenario, [Using a separate Docker subnet](#docker-native), which uses Orion vGPU resources in a container without the `--net host` flag.
+
+### Orion Server Mode
+
+For a local virtualization environment, shared memory is the best-performing data-transfer option. To use it, set `enable_shm` to `true` and `enable_rdma` to `false`.
+
+In addition, since this is a container environment, `enable_kvm` should be set to `false`.
+
+### Starting Orion Server
+
+Based on the discussion above, the `[server]` section of `/etc/orion/server.conf` should look like the following (these are the default values):
+
+```bash
+[server]
+ listen_port = 9960
+ bind_addr = 127.0.0.1
+ enable_shm = "true"
+ enable_rdma = "false"
+    enable_kvm = "false"
+```
+
+After changing the configuration file, start (or restart) the Orion Server service:
+
+```bash
+systemctl restart oriond
+```
+
+We can then use the `orion-check` tool to inspect the state of the Orion vGPU software:
+
+```bash
+orion-check runtime server
+```
+
+```bash
+# (omit output)
+Checking Orion Server status ...
+Orion Server is running with Linux user : root
+Orion Server is running with command line : /usr/bin/oriond
+Enable SHM [Yes]
+Enable RDMA [No]
+Enable Local QEMU-KVM with SHM [No]
+Binding IP Address : 127.0.0.1
+Listening Port : 9960
+
+Testing the Orion Server network ...
+Orion Server can be reached through 127.0.0.1:9960
+# (omit output)
+Orion Controller addrress is set as 127.0.0.1:9123 in configuration file. Using this address to diagnose Orion Controller
+Address 127.0.0.1:9123 is reached.
+Orion Controller Version Infomation : api_version=0.1,data_version=0.1
+There are 4 vGPU under managered by Orion Controller. 4 vGPU are free now.
+```
+
+Under normal conditions, each physical GPU is virtualized into 4 Orion vGPUs, all of which should be in the available state.
+
+
+## **Creating Shared Memory for Communication**
+
+So that Orion Server and the Client application inside the container can accelerate data transfer through shared memory, a shared-memory segment must be created before the container starts; it is later mounted into the container with the `-v` flag of `docker run`.
+
+```bash
+orion-shm
+```
+
+The command above creates a 128 MB shared-memory segment `/dev/shm/orionsock0` under `/dev/shm`, which can be verified with `ls /dev/shm/`.
+
+Note that:
+
+* If a `/dev/shm/orionsock` segment that is already in use is deleted or overwritten, the Orion Server service must be restarted;
+
+* If multiple containers run concurrently, each container must mount its own `/dev/shm/orionsock` segment. These segments can be created separately with `orion-shm -i <index>`.
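Before mounting a segment into a container, a quick sanity check can confirm that it exists and has the expected size. The helper below is an illustrative sketch (it is not part of the Orion tooling); the 128 MiB size comes from the `orion-shm` default described above:

```shell
#!/bin/sh
# Illustrative sketch: verify that an orionsock shared-memory segment
# exists and has the expected 128 MiB size before mounting it with -v.
check_orionsock() {
    seg="$1"
    expected=$((128 * 1024 * 1024))   # 128 MiB, the orion-shm default
    if [ ! -e "$seg" ]; then
        echo "missing: $seg"
        return 1
    fi
    size=$(stat -c %s "$seg")
    if [ "$size" -ne "$expected" ]; then
        echo "unexpected size for $seg: $size bytes"
        return 1
    fi
    echo "ok: $seg"
}
```

On a host where the segment has been created, `check_orionsock /dev/shm/orionsock0` should print `ok: /dev/shm/orionsock0`.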
+
+## **Starting the Orion Client Container**
+
+### Getting a container image with the Orion Client runtime
+
+We provide Docker images with the Orion Client runtime preconfigured and the official upstream TensorFlow 1.12 preinstalled. Taking the Python 3.5 variant as an example:
+
+```bash
+docker pull virtaitech/orion-client:tf1.12-py3
+```
+
+Interested readers can refer to our [Dockerfile](#) to build their own image.
+
+### Orion Client configuration
+
+The Orion Client container needs the following environment variables set:
+* `ORION_CONTROLLER=<controller_ip>:9123`: the network address to which the Orion Client sends its RESTful API requests when asking the Orion Controller for Orion vGPU resources. In this scenario the container shares the host network, so `controller_ip` can simply be set to `127.0.0.1`.
+* `ORION_VGPU`: the number of Orion vGPUs each process in the container requests. By default, up to 4 times the number of physical GPUs may be requested. In this example we set `ORION_VGPU=1`.
+* `ORION_GMEM`: the amount of device memory (in MB) each requested Orion vGPU may use. Since our GTX 1080Ti has 11 GB of device memory, we set `ORION_GMEM=10500`.
+
+### Starting the container
+
+As mentioned above, when starting the container with `docker run` we pass the environment variables from the previous section with `-e`, and mount the `/dev/shm/orionsock0` segment into the container's `/dev/shm` directory with `-v`. To make it easy to train TensorFlow's official `pix2pix` example in a Jupyter Notebook, we assume that in the directory where `docker run` is executed you have already run
+
+```bash
+git clone https://github.com/tensorflow/tensorflow.git
+```
+
+to clone the TF repo locally. We also mount the TF repo into the container when starting it.
+
+```bash
+docker run -it --rm \
+ -v /dev/shm/orionsock0:/dev/shm/orionsock0:rw \
+ -v $(pwd)/tensorflow:/root/tensorflow \
+ --net host \
+ -e ORION_CONTROLLER=127.0.0.1:9123 \
+ -e ORION_VGPU=1 \
+ -e ORION_GMEM=10500 \
+ virtaitech/orion-client:tf1.12-py3
+```
+
+Inside the container, you can run `ls /dev | grep nvidia` to confirm that no NVIDIA GPU devices are mounted in the container.
+
+Before running the Jupyter Notebook, we can use the `orion-check` tool to verify that the Orion Client container can communicate with the Orion Controller:
+
+```bash
+# From inside Orion Client container
+orion-check runtime client
+```
+
+Under normal conditions, the output should be:
+
+```bash
+# (omit output)
+Environment variable ORION_CONTROLLER is set as 127.0.0.1:9123 Using this address to diagnose Orion Controller.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 4 vGPU under managered by Orion Controller. 4 vGPU are free now.
+```
+
+This output indicates that applications inside the Orion Client container can request resources from the Orion Controller. If it does not appear, first check the state of the Orion Controller following the [Orion Controller installation](README.md#controller) chapter, then verify that `ORION_CONTROLLER=<controller_ip>:9123` is set correctly as described above.
+
+## **Running the Jupyter Notebook**
+
+We assume the tensorflow folder has been mounted into the container.
+
+```bash
+# From inside Orion Client container
+cd tensorflow/tensorflow/contrib/eager/python/examples/
+jupyter notebook --no-browser --allow-root
+```
+
+You will see output like the following:
+
+```bash
+To access the notebook, open this file in a browser:
+ file:///root/.local/share/jupyter/runtime/nbserver-26-open.html
+Or copy and paste one of these URLs:
+ http://localhost:8888/?token=
+```
+
+If you have access to this machine's graphical desktop, open a browser and enter the address above;
+
+otherwise, set up SSH port forwarding from a local machine with a graphical desktop (i.e. one that can open a browser):
+
+```bash
+ssh -Nf -L 8888:localhost:8888 <user>@<workstation_ip>
+```
+
+Then open the address in your local browser to access the Jupyter Notebook:
+
+![Jupyter](./figures/pix2pix/jupyter.png)
+
+
+## **Training and Inference of the pix2pix Model with TensorFlow 1.12 Eager Execution**
+
+Enter the `pix2pix` directory, open `pix2pix_eager.ipynb`, and step through the cells with `shift+enter` to watch the model train:
+
+![train-epoch-2](./figures/pix2pix/train-epoch-2.png)
+
+During training, we can monitor physical GPU usage from outside the container with `nvidia-smi`:
+
+![nvidia-smi](./figures/pix2pix/nvidia-smi.png)
+
+As the figure shows, the process actually using the physical GPU is the Orion Server process `oriond`, not the Python script running the TensorFlow training job inside the container. This is because the application in the container uses Orion vGPU resources, and all access to the physical GPU is taken over by Orion Server.
+
+After training for 200 epochs, the model can be run on the test set:
+
+![inference](./figures/pix2pix/inference.png)
+
+If anything goes wrong, refer to the [corresponding appendix section](appendix.md#trouble-client) for troubleshooting.
+
+## (Optional) Using a Separate Docker Subnet
+
+In this section we show how to set the various parameters when the container uses a separate Docker subnet. We assume the reader is already familiar with, and has successfully completed, the full workflow described earlier in this chapter with the `--net host` flag, so here we list only what differs from the preceding sections.
+
+### Orion Server data path
+
+The key to setting the `bind_addr` data path is ensuring that the Orion Client can exchange data with Orion Server through this address. In the present scenario, applications inside the container can by default only reach the Docker subnet, so we need to set `bind_addr` to the gateway of the Docker subnet.
+
+Run `ifconfig` on the host to check the network configuration. In a default Docker installation, the bridge created is `docker0`:
+
+```bash
+docker0 Link encap:Ethernet HWaddr 02:42:46:9f:27:13
+ inet addr:172.17.0.1 Bcast:172.17.255.255 Mask:255.255.0.0
+ inet6 addr: fe80::42:46ff:fe9f:2713/64 Scope:Link
+ UP BROADCAST MULTICAST MTU:1500 Metric:1
+ RX packets:416541 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:652846 errors:0 dropped:0 overruns:0 carrier:0
+ collisions:0 txqueuelen:0
+ RX bytes:24865042 (24.8 MB) TX bytes:3116550526 (3.1 GB)
+```
+
+We should therefore set `bind_addr` to the `docker0` gateway, `172.17.0.1`.
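If you prefer to read the gateway address programmatically (e.g. in a provisioning script), it can be extracted from `ifconfig` output of the form shown above. A small sketch, assuming the classic `inet addr:` output format:

```shell
#!/bin/sh
# Illustrative sketch: pull the first IPv4 address of a bridge out of
# classic ifconfig-style output ("inet addr:172.17.0.1  Bcast:...").
extract_inet_addr() {
    # Reads ifconfig output on stdin, prints the first "inet addr:" value.
    sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p' | head -n 1
}
```

`ifconfig docker0 | extract_inet_addr` would then print `172.17.0.1`, which can be substituted into `server.conf`. Note that newer `ifconfig`/`ip` versions format addresses differently, so the pattern may need adjusting.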
+
+### Orion Server configuration example
+
+The updated `[server]` section of `/etc/orion/server.conf` should look like the following:
+
+```bash
+[server]
+ listen_port = 9960
+ bind_addr = 172.17.0.1
+ enable_shm = "true"
+ enable_rdma = "false"
+    enable_kvm = "false"
+```
+
+Note that if `oriond` was already running, the service must be restarted for the new configuration to take effect:
+
+```bash
+systemctl restart oriond
+```
+
+### Orion Client configuration
+
+Here we set the Orion Controller address to the Docker subnet gateway, `ORION_CONTROLLER=172.17.0.1:9123`, so that resource requests sent by applications in the container reach the Orion Controller.
+
+### Running the container
+
+Besides the changes above, we also need to publish port `8888` with `-p 8888:8888` so that the Jupyter Notebook can be reached from outside the container. The final command to start the container is:
+
+```bash
+docker run -it --rm \
+ -v /dev/shm/orionsock0:/dev/shm/orionsock0:rw \
+ -v $(pwd)/tensorflow:/root/tensorflow \
+ -p 8888:8888 \
+ -e ORION_CONTROLLER=172.17.0.1:9123 \
+ -e ORION_VGPU=1 \
+ -e ORION_GMEM=10500 \
+ virtaitech/orion-client:tf1.12-py3
+```
+
+As before, we check from inside the container that resources can be requested from the Orion Controller:
+
+```bash
+# From inside Orion Client container
+orion-check runtime client
+```
+
+Under normal conditions, the output should be:
+
+```bash
+# (omit output)
+Environment variable ORION_CONTROLLER is set as 172.17.0.1:9123 Using this address to diagnose Orion Controller.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 4 vGPU under managered by Orion Controller. 4 vGPU are free now.
+```
+
+This output indicates that applications inside the Orion Client container can request resources from the Orion Controller. If it does not appear, first check the state of the Orion Controller following the [Orion Controller installation](README.md#controller) chapter, then verify that `ORION_CONTROLLER=<controller_ip>:9123` is set correctly as described above.
+
+Note: firewall issues may arise when using a separate Docker subnet; refer to the [firewall settings](appendix.md#firewall) section of the appendix to diagnose and configure them.
+
+## **Running the Jupyter Notebook**
+
+We assume the tensorflow folder has been mounted into the container. This time we must pass `--ip=0.0.0.0` explicitly when starting the Jupyter Notebook:
+
+```bash
+cd tensorflow/tensorflow/contrib/eager/python/examples/
+jupyter notebook --ip=0.0.0.0 --no-browser --allow-root
+```
+
+Then set up SSH port forwarding on a node with a graphical desktop (e.g. a laptop):
+
+```bash
+ssh -Nf -L 8888:localhost:8888 <user>@<workstation_ip>
+```
+
+The Jupyter Notebook can then finally be reached in the browser at `localhost:8888/?token=<token>`.
\ No newline at end of file
diff --git a/doc/quick-start/figures/arch-docker.png b/doc/quick-start/figures/arch-docker.png
new file mode 100755
index 0000000..8410bde
Binary files /dev/null and b/doc/quick-start/figures/arch-docker.png differ
diff --git a/doc/quick-start/figures/arch-kvm.png b/doc/quick-start/figures/arch-kvm.png
new file mode 100755
index 0000000..7f8f903
Binary files /dev/null and b/doc/quick-start/figures/arch-kvm.png differ
diff --git a/doc/quick-start/figures/arch-local.png b/doc/quick-start/figures/arch-local.png
new file mode 100755
index 0000000..1b5dabd
Binary files /dev/null and b/doc/quick-start/figures/arch-local.png differ
diff --git a/doc/quick-start/figures/arch-rdma.png b/doc/quick-start/figures/arch-rdma.png
new file mode 100755
index 0000000..976ac9d
Binary files /dev/null and b/doc/quick-start/figures/arch-rdma.png differ
diff --git a/doc/quick-start/figures/architecture.png b/doc/quick-start/figures/architecture.png
new file mode 100755
index 0000000..11a5976
Binary files /dev/null and b/doc/quick-start/figures/architecture.png differ
diff --git a/doc/quick-start/figures/cifar10/nvidia-smi.png b/doc/quick-start/figures/cifar10/nvidia-smi.png
new file mode 100755
index 0000000..ea49842
Binary files /dev/null and b/doc/quick-start/figures/cifar10/nvidia-smi.png differ
diff --git a/doc/quick-start/figures/inception3/nvidia-smi-synthetic.png b/doc/quick-start/figures/inception3/nvidia-smi-synthetic.png
new file mode 100755
index 0000000..75e763a
Binary files /dev/null and b/doc/quick-start/figures/inception3/nvidia-smi-synthetic.png differ
diff --git a/doc/quick-start/figures/pix2pix/inference.png b/doc/quick-start/figures/pix2pix/inference.png
new file mode 100755
index 0000000..bce7d8c
Binary files /dev/null and b/doc/quick-start/figures/pix2pix/inference.png differ
diff --git a/doc/quick-start/figures/pix2pix/jupyter.png b/doc/quick-start/figures/pix2pix/jupyter.png
new file mode 100755
index 0000000..7e59a4c
Binary files /dev/null and b/doc/quick-start/figures/pix2pix/jupyter.png differ
diff --git a/doc/quick-start/figures/pix2pix/nvidia-smi.png b/doc/quick-start/figures/pix2pix/nvidia-smi.png
new file mode 100755
index 0000000..66365a7
Binary files /dev/null and b/doc/quick-start/figures/pix2pix/nvidia-smi.png differ
diff --git a/doc/quick-start/figures/pix2pix/train-epoch-2.png b/doc/quick-start/figures/pix2pix/train-epoch-2.png
new file mode 100755
index 0000000..486b963
Binary files /dev/null and b/doc/quick-start/figures/pix2pix/train-epoch-2.png differ
diff --git a/doc/quick-start/kvm.md b/doc/quick-start/kvm.md
new file mode 100755
index 0000000..f1cf994
--- /dev/null
+++ b/doc/quick-start/kvm.md
@@ -0,0 +1,329 @@
+# Scenario 2: Using Local GPU Resources from a KVM Virtual Machine
+
+The node used in this chapter is configured as follows:
+* A single server with two NVIDIA Tesla V100 cards, 16 GB of device memory each
+* Ubuntu Server 16.04 LTS
+* Docker CE 18.09
+* libvirt 1.3.1
+* QEMU 2.5.0
+
+We use a VM named `ubuntu-client0` running Ubuntu 16.04 as the Orion Client. The VM has neither the host's GPU passed through (PCI passthrough) nor any NVIDIA driver or CUDA components installed. We installed the necessary Python 3 libraries and the GPU build of TensorFlow 1.12:
+
+```bash
+# From inside VM
+sudo apt install python3-dev python3-pip
+sudo pip3 install tensorflow-gpu==1.12.0
+```
+
+Since the VM cannot access a GPU and has no NVIDIA software environment, TensorFlow is currently unusable there. Once the Orion vGPU software is configured, TensorFlow inside the VM will be able to train models and run inference using Orion vGPUs.
+
+After deploying the Orion vGPU software, we will run TensorFlow's official CIFAR10 Estimator example in `ubuntu-client0`, training and evaluating the model with two Orion vGPUs (each backed by a different physical GPU).
+
+
+
+![KVM](./figures/arch-kvm.png)
+
+
+
+Before proceeding to the next steps, we assume that:
+* Orion Server has been installed successfully following the [Orion Server installation](README.md#server) section
+* Orion Controller has been installed and started normally following the [Orion Controller installation](README.md#controller) section
+
+
+## **Orion Server Configuration and Startup**
+
+Before starting the Orion Server service, we need to edit the configuration file to set the data path and enable Orion Server's support for KVM.
+
+### Data path
+
+The `bind_addr` property is the data-path address on which Orion Server listens; Clients must be able to reach it. For a KVM VM, we set it to the gateway address of the VM's network.
+
+Use `virsh` to inspect the KVM VM's network configuration:
+
+```bash
+# From host OS
+sudo virsh domifaddr ubuntu-client0
+```
+
+
+```bash
+ Name MAC address Protocol Address
+-------------------------------------------------------------------------------
+ vnet1 52:54:00:04:82:10 ipv4 33.31.0.10/24
+```
+
+As shown, the VM's current IP address is `33.31.0.10`, so we should set `bind_addr=33.31.0.1`.
+
+Note: if the KVM VM is attached to multiple virtual subnets, for example:
+
+```bash
+ Name MAC address Protocol Address
+-------------------------------------------------------------------------------
+ vnet0 52:54:00:04:82:10 ipv4 33.31.0.10/24
+ vnet1 52:54:00:c5:43:10 ipv4 33.32.0.10/24
+```
+
+then the gateway of any one of the subnets can be used as `bind_addr`.
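The `.1` gateway convention used above can be derived mechanically from the VM address reported by `virsh` (e.g. `33.31.0.10/24`). A sketch, assuming the subnet gateway is the `.1` host of the VM's /24 network, as in all the examples above:

```shell
#!/bin/sh
# Illustrative sketch: given a VM address in CIDR form (e.g. 33.31.0.10/24),
# print the conventional .1 gateway of its /24 network (33.31.0.1).
# Only valid for /24 networks with a .1 gateway, as in this chapter.
gateway_of() {
    addr="${1%/*}"        # strip the /24 prefix-length suffix
    prefix="${addr%.*}"   # keep the first three octets
    echo "${prefix}.1"
}
```

For example, `gateway_of 33.31.0.10/24` prints `33.31.0.1`, the value used for `bind_addr` below.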
+
+### Orion Server mode
+
+In this scenario we still use local shared memory to accelerate data transfer, so set `enable_shm=true` and `enable_rdma=false`. In addition, we must explicitly enable Orion vGPU support for KVM VMs by setting `enable_kvm=true`.
+
+### Orion Server configuration example
+In this scenario, the `[server]` section of `/etc/orion/server.conf` should be configured as follows:
+```bash
+[server]
+ listen_port = 9960
+ bind_addr = 33.31.0.1
+ enable_shm = "true"
+ enable_rdma = "false"
+ enable_kvm = "true"
+```
+
+### Starting Orion Server
+Restart Orion Server for the new configuration to take effect, then use the `orion-check` tool to confirm that Orion Server and Orion Controller can communicate:
+
+```bash
+# From host OS
+sudo systemctl restart oriond
+sudo orion-check runtime server
+```
+
+Normal output looks like the following:
+
+``` bash
+Searching NVIDIA GPU ...
+CUDA driver 418.67
+418.67 is installed.
+2 NVIDIA GPUs are found :
+ 0 : Tesla V100-PCIE-16GB
+ 1 : Tesla V100-PCIE-16GB
+
+Checking NVIDIA MPS ...
+NVIDIA CUDA MPS is off.
+
+Checking Orion Server status ...
+Orion Server is running with Linux user : root
+Orion Server is running with command line : /usr/bin/oriond
+Enable SHM [Yes]
+Enable RDMA [No]
+Enable Local QEMU-KVM with SHM [Yes]
+Binding IP Address : 33.31.0.1
+Listening Port : 9960
+
+Testing the Orion Server network ...
+Orion Server can be reached through 33.31.0.1:9960
+
+Checking Orion Controller status ...
+[Info] Orion Controller setting may be different in different SHELL.
+[Info] Environment variable ORION_CONTROLLER has the first priority.
+
+Orion Controller addrress is set as 127.0.0.1:9123 in configuration file. Using this address to diagnose Orion Controller
+Address 127.0.0.1:9123 is reached.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
+```
+
+As shown, this Orion Server node has two Tesla V100 cards, which the Orion Controller has virtualized into 8 Orion vGPUs in total.
+
+## Installing the Orion Client Runtime inside the VM
+
+### Installing to the default path
+
+Inside the VM, run the Orion Client installer:
+
+```bash
+# From inside VM
+sudo ./install-client
+```
+
+Since no installation path was specified, the installer asks whether to install the Orion Client runtime to the default path `/usr/lib/orion`. After the user confirms, the installer registers the runtime on the system's dynamic-library search path via the `ldconfig` mechanism.
+
+```bash
+Orion client environment will be installed to /usr/lib/orion
+Do you want to continue [n/y] ?y
+
+Configuration file is generated to /etc/orion/client.conf
+Please edit the "controller_addr" setting and make it point to the controller address in your environment.
+
+Orion vGPU client environment has been installed in /usr/lib/orion
+To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
+export LD_LIBRARY_PATH=/usr/lib/orion:$LD_LIBRARY_PATH
+```
+
+Because the installer has already configured the search path, the `export LD_LIBRARY_PATH=/usr/lib/orion:$LD_LIBRARY_PATH` line shown on screen is not strictly required.
+
+### (Optional) Installing to a custom path
+Taking `/orion` as an example:
+
+```bash
+# From inside VM
+INSTALLATION_PATH=/orion
+sudo mkdir -p $INSTALLATION_PATH
+sudo ./install-client -d $INSTALLATION_PATH
+```
+In this case, the installer installs the Orion Client runtime directly into the user-specified path `INSTALLATION_PATH=/orion` and prints the following:
+
+```bash
+Configuration file is generated to /etc/orion/client.conf
+Please edit the "controller_addr" setting and make it point to the controller address in your environment.
+
+Orion vGPU client environment has been installed in /orion
+To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
+export LD_LIBRARY_PATH=/orion:$LD_LIBRARY_PATH
+```
+
+Before running applications in a terminal, always make sure the Orion Client runtime is on the system's dynamic-library search path:
+
+```bash
+# From current working terminal inside VM
+export LD_LIBRARY_PATH=/orion:$LD_LIBRARY_PATH
+```
+
+Note that this command only affects the current terminal. For convenience, you can append the line to the end of `~/.bashrc` and run `source ~/.bashrc` to make it effective; later logins to the VM will then not need to set it again.
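The append-to-`~/.bashrc` step can be made idempotent so repeated runs do not pile up duplicate lines. A sketch (the target file is a parameter so it can be rehearsed on any file):

```shell
#!/bin/sh
# Illustrative sketch: append a line to a shell rc file only if an
# identical line is not already present, so re-running is harmless.
append_once() {
    line="$1"
    rcfile="$2"
    grep -qxF "$line" "$rcfile" 2>/dev/null || echo "$line" >> "$rcfile"
}
```

`append_once 'export LD_LIBRARY_PATH=/orion:$LD_LIBRARY_PATH' ~/.bashrc` then adds the line exactly once, no matter how many times it is run.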
+
+
+## Orion Client Configuration
+
+As described in [Using a Docker container](container.md), the Orion Client must send its requests for Orion vGPU resources to the Orion Controller. In the container environment we set the Controller address with the `ORION_CONTROLLER=<controller_ip>:9123` environment variable at container startup; for a KVM VM, we can instead configure it by editing `/etc/orion/client.conf`.
+
+Since the Orion Controller listens on `0.0.0.0:9123` on the host, we simply set `controller_addr` to the gateway address of the VM subnet:
+
+```bash
+[controller]
+ controller_addr = 33.31.0.1:9123
+```
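Rather than editing the file by hand, the `controller_addr` line can be rewritten with `sed`. A sketch (the file path is a parameter, so the change can be rehearsed on a copy before touching `/etc/orion/client.conf`):

```shell
#!/bin/sh
# Illustrative sketch: rewrite the controller_addr value in a
# client.conf-style file, preserving the original indentation.
set_controller_addr() {
    conf="$1"
    addr="$2"
    sed -i "s|^\([[:space:]]*controller_addr[[:space:]]*=[[:space:]]*\).*|\1${addr}|" "$conf"
}
```

For example, `set_controller_addr /etc/orion/client.conf 33.31.0.1:9123` (run with sufficient privileges) applies the setting shown above.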
+
+After changing the setting, check the status with the `orion-check` tool:
+
+```bash
+# From inside VM
+orion-check runtime client
+```
+
+If the Orion Client VM can reach the Orion Controller, the output is:
+
+```bash
+# (omit output)
+Orion Controller addrress is set as 33.31.0.1:9123 in configuration file. Using this address to diagnose Orion Controller
+Address 33.31.0.1:9123 is reached.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
+```
+
+## Running the Official TF CIFAR10 Estimator Example
+
+Before running the application, we use environment variables to specify the number of Orion vGPUs and the amount of device memory the application will request from the Orion Controller:
+
+```bash
+export ORION_VGPU=2
+export ORION_GMEM=12000
+```
+
+Each of our Tesla V100 cards has 16 GB of device memory, so if `ORION_GMEM` were set below 8 GB, both Orion vGPUs would be scheduled onto the same physical GPU. Here we set the vGPU memory to 12000 MB, so the two Orion vGPUs are scheduled onto two different physical GPUs, letting us demonstrate dual-GPU training.
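The placement rule just described can be checked with a little shell arithmetic. The sketch below is an illustration of the documented rule (vGPUs spread across cards once their memory no longer fits together on one card), not Orion's actual scheduler:

```shell
#!/bin/sh
# Illustrative sketch of the placement rule described above: how many
# vGPUs of gmem_mb megabytes fit on one physical card of phys_mb.
vgpus_per_card() {
    phys_mb="$1"
    gmem_mb="$2"
    echo $((phys_mb / gmem_mb))
}
```

With `ORION_GMEM=12000` on a 16 GB card, only one vGPU fits per card, so the two requested vGPUs land on different physical GPUs; below 8 GB, two would share a card.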
+
+Below we use TensorFlow's official CIFAR10 Estimator example to demonstrate model training and inference:
+https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/README.md
+
+First, `git clone` the official TF models repo:
+
+```bash
+# From inside VM
+git clone https://github.com/tensorflow/models
+```
+
+Then enter the CIFAR10 Estimator directory:
+
+```bash
+cd models/tutorials/image/cifar10_estimator/
+```
+
+Before training the model for the first time, download the CIFAR10 dataset and convert it to TFRecord format:
+
+```bash
+mkdir data
+python3 generate_cifar10_tfrecords.py --data-dir ./data
+```
+
+Once processing finishes, the `data` directory should contain the following, 520 MB in total:
+
+```bash
+user@ubuntu-client0:~/models/tutorials/image/cifar10_estimator/data$ ls
+cifar-10-batches-py cifar-10-python.tar.gz eval.tfrecords train.tfrecords validation.tfrecords
+```
+
+Now we train the model with two Orion vGPUs, with a batch_size of 128 per Orion vGPU (256 in total):
+
+```bash
+python3 cifar10_main.py \
+ --data-dir=${PWD}/data \
+ --job-dir=/tmp/cifar10 \
+ --variable-strategy=GPU \
+ --num-gpus=2 \
+ --train-steps=10000 \
+ --train-batch-size=256 \
+ --learning-rate=0.1
+```
+
+TensorFlow prints logs like the following:
+
+```bash
+VirtaiTech Resource. Build-cuda-7675815-20190624_081551
+2019-06-25 15:43:43.493814: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2019-06-25 15:43:43.493882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
+name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
+pciBusID: 0000:00:09.0
+totalMemory: 11.72GiB freeMemory: 11.72GiB
+2019-06-25 15:43:43.604945: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
+2019-06-25 15:43:43.605002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
+name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
+pciBusID: 0000:00:09.0
+totalMemory: 11.72GiB freeMemory: 11.72GiB
+2019-06-25 15:43:43.606527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
+2019-06-25 15:43:43.606568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
+2019-06-25 15:43:43.606577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
+2019-06-25 15:43:43.606582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
+2019-06-25 15:43:43.606589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
+2019-06-25 15:43:43.606657: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 11400 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
+2019-06-25 15:43:43.607202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 11400 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 7.0)
+# (omit output)
+INFO:tensorflow:global_step/sec: 14.2797
+INFO:tensorflow:loss = 0.48649728, step = 9900 (7.003 sec)
+INFO:tensorflow:learning_rate = 0.1, loss = 0.48649728 (7.003 sec)
+INFO:tensorflow:Average examples/sec: 3639.99 (4009.58), step = 9900
+INFO:tensorflow:Average examples/sec: 3640.07 (3717.26), step = 9910
+INFO:tensorflow:Average examples/sec: 3640.09 (3655.01), step = 9920
+INFO:tensorflow:Average examples/sec: 3640.31 (3873.63), step = 9930
+INFO:tensorflow:Average examples/sec: 3640.45 (3788.08), step = 9940
+INFO:tensorflow:Average examples/sec: 3640.79 (4017.58), step = 9950
+INFO:tensorflow:Average examples/sec: 3641.19 (4089.74), step = 9960
+INFO:tensorflow:Average examples/sec: 3641.23 (3679.08), step = 9970
+INFO:tensorflow:Average examples/sec: 3641.43 (3847.37), step = 9980
+INFO:tensorflow:Average examples/sec: 3641.4 (3615.53), step = 9990
+INFO:tensorflow:Saving checkpoints for 10000 into /tmp/cifar10/model.ckpt.
+INFO:tensorflow:Loss for final step: 0.46667284.
+# (omit output)
+INFO:tensorflow:Evaluation [100/100]
+INFO:tensorflow:Finished evaluation at 2019-06-25-08:06:14
+INFO:tensorflow:Saving dict for global step 10000: accuracy = 0.7628, global_step = 10000, loss = 1.0683168
+INFO:tensorflow:Saving 'checkpoint_path' summary for global step 10000: /tmp/cifar10/model.ckpt-10000
+2019-06-25 16:06:15 [INFO] Client exits with allocation ID fda38164-711b-4809-9984-2759a3a2165b
+```
+
+The log shows that:
+
+* At startup, the Orion Client runtime prints `VirtaiTech Resource. Build-cuda-xxx`, indicating that the application loaded the Orion Client runtime successfully.
+* At exit, the Orion Client runtime prints `Client exits with allocation ID xxx`, indicating that during its lifetime the application successfully obtained Orion vGPU resources from the Orion Controller and released them on exit.
+* TensorFlow detected two GPUs at startup, each with 11.72 GiB of device memory (corresponding to our `ORION_GMEM=12000` setting)
+
+During training, we run `nvidia-smi` in the host OS to check physical GPU usage:
+
+![cifar10-nvidia-smi](./figures/cifar10/nvidia-smi.png)
+
+The result shows that:
+
+* Access to the physical GPUs is completely taken over by the Orion Server process `oriond`
+* The two Orion vGPUs were scheduled onto two different physical GPUs
+* Orion vGPU device-memory usage is capped as configured
+
+If anything goes wrong, refer to the [corresponding appendix section](appendix.md#trouble-client) for troubleshooting.
\ No newline at end of file
diff --git a/doc/quick-start/remote_rdma.md b/doc/quick-start/remote_rdma.md
new file mode 100755
index 0000000..c3c7d3f
--- /dev/null
+++ b/doc/quick-start/remote_rdma.md
@@ -0,0 +1,391 @@
+# Scenario 3: Using Remote GPU Resources from a Node without GPUs
+
+The environment used in this chapter is configured as follows:
+* Two servers: one (`server1`) with two NVIDIA Tesla V100 cards, 16 GB of device memory each; the other (`server0`) with no physical GPU
+* Both servers have Mellanox ConnectX-5 25Gb NICs, connected to each other through an optical switch
+* MLNX_OFED_LINUX 4.5.2 driver and userspace libraries
+* Ubuntu Server 16.04 LTS
+
+We deploy both Orion Server and Orion Controller on `server1`, the machine with physical GPUs, and use `server0` as the Orion Client. On `server0` we installed the necessary Python 3 libraries and the GPU build of TensorFlow 1.12:
+
+```bash
+# On server0 (which has no GPU)
+sudo apt install python3-dev python3-pip
+sudo pip3 install tensorflow-gpu==1.12.0
+```
+
+`server0` has no physical GPU and no NVIDIA software environment, so TensorFlow cannot use GPU acceleration there. Once the Orion vGPU software is deployed, we can run TensorFlow on `server0` and accelerate model training through Orion vGPUs. We will run the official TensorFlow benchmark in two modes: with randomly generated data and with the real ImageNet dataset.
+
+
+
+![RDMA](./figures/arch-rdma.png)
+
+
+
+Before proceeding to the next steps, we assume that:
+* Orion Server has been installed successfully on `server1` following the [Orion Server installation](README.md#server) section
+* Orion Controller has been installed on `server1` and started normally following the [Orion Controller installation](README.md#controller) section
+
+## Orion Server Configuration
+
+Before starting the Orion Server service on `server1`, we need to edit the configuration file so that the Orion Client on `server0` can connect to Orion Server over the data path; in addition, we disable the default shared-memory mode and enable the RDMA channel.
+
+### Data path
+
+Since we use RDMA to accelerate data exchange, `bind_addr` should be an address that the Orion Client can reach on the RDMA network. In this environment, the RDMA network shared by the two servers is `192.168.25.xxx`, so we set `bind_addr` to `server1`'s RDMA address, `192.168.25.21`.
+
+### Orion Server mode
+We choose RDMA for data transfer, so set `enable_shm=false` and `enable_rdma=true`. Our Orion Client is not a local KVM VM, so set `enable_kvm=false`.
+
+### Orion Server configuration example
+In this scenario, the `[server]` section of `/etc/orion/server.conf` should be configured as follows:
+```bash
+[server]
+ listen_port = 9960
+ bind_addr = 192.168.25.21
+ enable_shm = "false"
+ enable_rdma = "true"
+ enable_kvm = "false"
+```
+
+### Starting Orion Server
+Restart Orion Server for the new configuration to take effect, then check its status:
+
+```bash
+# From server1
+sudo systemctl restart oriond
+
+sudo orion-check runtime server
+```
+
+Normal output looks like the following:
+
+```bash
+Searching NVIDIA GPU ...
+CUDA driver 418.67
+418.67 is installed.
+2 NVIDIA GPUs are found :
+ 0 : Tesla V100-PCIE-16GB
+ 1 : Tesla V100-PCIE-16GB
+
+Checking NVIDIA MPS ...
+NVIDIA CUDA MPS is off.
+
+Checking Orion Server status ...
+Orion Server is running with Linux user : root
+Orion Server is running with command line : /usr/bin/oriond
+Enable SHM [No]
+Enable RDMA [Yes]
+Enable Local QEMU-KVM with SHM [No]
+Binding IP Address : 192.168.25.21
+Listening Port : 9960
+
+Testing the Orion Server network ...
+Orion Server can be reached through 192.168.25.21:9960
+
+Checking Orion Controller status ...
+[Info] Orion Controller setting may be different in different SHELL.
+[Info] Environment variable ORION_CONTROLLER has the first priority.
+
+Orion Controller addrress is set as 127.0.0.1:9123 in configuration file. Using this address to diagnose Orion Controller
+Address 127.0.0.1:9123 is reached.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
+```
+
+This shows that the Orion Controller has virtualized the two physical GPUs into 8 Orion vGPUs in total, all currently available.
+
+## Installing the Orion Client Runtime
+
+### Installing to the default path
+
+On `server0`, run the Orion Client installer:
+
+```bash
+# From inside server0
+sudo ./install-client
+```
+
+Since no installation path was specified, the installer asks whether to install the Orion Client runtime to the default path `/usr/lib/orion`. After the user confirms, the installer registers the runtime on the system's dynamic-library search path via the `ldconfig` mechanism.
+
+```bash
+Orion client environment will be installed to /usr/lib/orion
+Do you want to continue [n/y] ?y
+
+Configuration file is generated to /etc/orion/client.conf
+Please edit the "controller_addr" setting and make it point to the controller address in your environment.
+
+Orion vGPU client environment has been installed in /usr/lib/orion
+To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
+export LD_LIBRARY_PATH=/usr/lib/orion:$LD_LIBRARY_PATH
+```
+
+Because the installer has already configured the search path, the `export LD_LIBRARY_PATH=/usr/lib/orion:$LD_LIBRARY_PATH` line shown on screen is not strictly required.
+
+### (Optional) Installing to a custom path
+Taking `/orion` as an example:
+
+```bash
+# From inside server0
+INSTALLATION_PATH=/orion
+sudo mkdir -p $INSTALLATION_PATH
+sudo ./install-client -d $INSTALLATION_PATH
+```
+In this case, the installer installs the Orion Client runtime directly into the user-specified path `INSTALLATION_PATH=/orion` and prints the following:
+
+```bash
+Configuration file is generated to /etc/orion/client.conf
+Please edit the "controller_addr" setting and make it point to the controller address in your environment.
+
+Orion vGPU client environment has been installed in /orion
+To run application with Orion vGPU environment, please make sure Orion environment is loaded. e.g.
+export LD_LIBRARY_PATH=/orion:$LD_LIBRARY_PATH
+```
+
+Before running applications in a terminal, always make sure the Orion Client runtime is on the system's dynamic-library search path:
+
+```bash
+# From current working terminal inside server0
+export LD_LIBRARY_PATH=/orion:$LD_LIBRARY_PATH
+```
+
+Note that this command only affects the current terminal. For convenience, you can append the line to the end of `~/.bashrc` and run `source ~/.bashrc` to make it effective; later logins to `server0` as the current user will then not need to set it again.
+
+
+## Orion Client Configuration
+
+The Orion Client must send its requests for Orion vGPU resources to the Orion Controller. We configure this by editing `/etc/orion/client.conf`.
+
+Since the Orion Controller listens on `0.0.0.0:9123` on `server1`, we can set `controller_addr` to any IP address of `server1` that is reachable from `server0`. Here we could keep using the RDMA address `192.168.25.21`, or use one of `server1`'s TCP addresses, `10.10.1.21`. We choose the latter as a demonstration:
+
+```bash
+[controller]
+ controller_addr = 10.10.1.21:9123
+```
+
+After changing the setting, check the status with the `orion-check` tool:
+
+```bash
+# From inside server0
+orion-check runtime client
+```
+
+If `server0`, the Orion Client, can reach the Orion Controller, the output is:
+
+```bash
+# (omit output)
+Orion Controller addrress is set as 10.10.1.21:9123 in configuration file. Using this address to diagnose Orion Controller
+Address 10.10.1.21:9123 is reached.
+Orion Controller Version Infomation : data_version=0.1,api_version=0.1
+There are 8 vGPU under managered by Orion Controller. 8 vGPU are free now.
+```
+
+## Running the Official TF Benchmark
+Before running the application on `server0`, we use environment variables to specify the number of Orion vGPUs and the amount of device memory to request from the Orion Controller:
+
+```bash
+export ORION_VGPU=2
+export ORION_GMEM=15500
+```
+
+Each of our Tesla V100 cards has 16 GB of device memory, so if `ORION_GMEM` were set below 8 GB, both Orion vGPUs would be scheduled onto the same physical GPU. Here we set the vGPU memory to 15500 MB, so the two Orion vGPUs are scheduled onto two different physical GPUs, letting us demonstrate dual-GPU training.
+
+First, clone the official TF benchmark repo:
+
+```bash
+# From inside server0
+git clone --branch=cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks.git
+```
+
+The official TF benchmark supports two modes: randomly generated data, or the ImageNet dataset converted to TFRecord format. We cover each in turn.
+
+### Using synthetic data
+
+The command below trains the inception_v3 model with two Orion vGPUs, with a batch_size of 128 per vGPU (256 in total).
+
+Since training accuracy cannot improve on random data, 500 batches of training is sufficient:
+
+```bash
+python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
+ --data_name=imagenet \
+ --model=inception3 \
+ --optimizer=rmsprop \
+ --num_batches=500 \
+ --num_gpus=2 \
+ --batch_size=128
+```
+The output is:
+```bash
+VirtaiTech Resource. Build-cuda-7675815-20190624_081551
+2019-06-25 19:55:37.099719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
+name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
+pciBusID: 0000:d9:00.0
+totalMemory: 15.14GiB freeMemory: 15.14GiB
+2019-06-25 19:55:37.218239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
+name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
+pciBusID: 0000:d9:00.0
+totalMemory: 15.14GiB freeMemory: 15.14GiB
+2019-06-25 19:55:37.222562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
+2019-06-25 19:55:37.222765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
+2019-06-25 19:55:37.222795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
+2019-06-25 19:55:37.222815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y
+2019-06-25 19:55:37.222831: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N
+2019-06-25 19:55:37.222994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14725 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d9:00.0, compute capability: 7.0)
+2019-06-25 19:55:37.225850: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14725 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0000:d9:00.0, compute capability: 7.0)
+TensorFlow: 1.12
+Model: inception3
+Dataset: imagenet (synthetic)
+Mode: BenchmarkMode.TRAIN
+SingleSess: False
+Batch size: 256 global
+ 128.0 per device
+Num batches: 500
+Num epochs: 0.10
+Devices: ['/gpu:0', '/gpu:1']
+Data format: NCHW
+Optimizer: rmsprop
+Variables: parameter_server
+==========
+Generating training model
+# (omit output)
+Running warm up
+Done warm up
+Step Img/sec total_loss
+1 images/sec: 442.6 +/- 0.0 (jitter = 0.0) 7.416
+# (omit output)
+490 images/sec: 434.6 +/- 1.2 (jitter = 19.5) 7.384
+500 images/sec: 435.1 +/- 1.2 (jitter = 19.9) 7.378
+----------------------------------------------------------------
+total images/sec: 435.00
+----------------------------------------------------------------
+2019-06-25 20:01:06 [INFO] Client exits with allocation ID b928be93-0b40-4252-b6b4-291ca4c99462
+```
+
+From the log we can see:
+
+* When the application starts, the Orion Client runtime prints `VirtaiTech Resource. Build-cuda-xxx`. This line shows that the application successfully loaded the Orion Client runtime.
+* When the application exits, the Orion Client runtime prints `Client exits with allocation ID xxx`. This line shows that the application successfully obtained Orion vGPU resources from the Orion Controller during its lifetime, and released them on exit.
+* At startup TensorFlow detected two GPUs, each with 15.14GB of device memory (corresponding to our setting `ORION_GMEM=15500`)
+
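Assuming the training output was redirected to a file (the name `train.log` here is our assumption), a quick grep confirms that both log lines appeared:

```shell
# Create a tiny sample log for illustration, then check it the same way
# you would check a real training log (train.log is an assumed name).
cat > train.log << 'EOF'
VirtaiTech Resource. Build-cuda-7675815-20190624_081551
total images/sec: 435.00
2019-06-25 20:01:06 [INFO] Client exits with allocation ID b928be93-0b40-4252-b6b4-291ca4c99462
EOF

# Both lines must appear: runtime loaded, and vGPU allocated then released.
grep -c -E "VirtaiTech Resource|allocation ID" train.log   # prints 2
```

If the count is less than 2, check the Orion Client environment variables and connectivity as described in the troubleshooting appendix.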
+While the model is training, we run `nvidia-smi` in the host OS to check physical GPU usage:
+
+![inception3-nvidia-smi-synthetic](./figures/inception3/nvidia-smi-synthetic.png)
+
+The output shows that:
+
+* Access to the physical GPUs is fully taken over by the Orion Server process `oriond`
+* The two Orion vGPUs were scheduled onto two different physical GPUs
+* The device memory available to each Orion vGPU is capped as we configured
+
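The first observation can also be checked without the screenshot: every compute process that `nvidia-smi` reports on the physical GPUs should be `oriond`. A minimal sketch that parses the CSV output of `nvidia-smi --query-compute-apps` (the sample string below stands in for real output; the PIDs and paths are made up):

```python
def gpu_owned_by_oriond(csv_text):
    """Return True if every compute process reported by nvidia-smi is oriond.

    csv_text is the output of:
      nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
    """
    rows = [line.split(", ") for line in csv_text.strip().splitlines() if line.strip()]
    return all(row[1].endswith("oriond") for row in rows)

# Sample output for illustration (PIDs, paths, and memory figures are made up):
sample = """12345, /usr/bin/oriond, 14725 MiB
12345, /usr/bin/oriond, 14725 MiB"""
print(gpu_owned_by_oriond(sample))  # True
```

Any other process name in this list would mean an application bypassed the Orion Server and is using the physical GPU directly.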
+### (Optional) Using the Imagenet dataset in TFRecord format
+
+First, follow the steps in
+
+https://github.com/tensorflow/models/tree/master/research/inception#getting-started
+
+to download the Imagenet dataset and convert it to TFRecord format. If the raw Imagenet dataset is already stored locally, you can modify the script accordingly to do the conversion locally. The converted TFRecord files take 144GB in total.
+
+The command below uses two Orion vGPUs to train the inception3 model on the real Imagenet dataset with the `rmsprop` optimizer, with batch_size=128 per vGPU and a global batch size of 256. We train for 5 full epochs over the whole training set; checkpoints are saved to the `./train_dir` directory during training.
+
+```bash
+export IMAGENET_DIR=
+python3 ./benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
+ --data_dir=$IMAGENET_DIR \
+ --data_name=imagenet \
+ --print_training_accuracy=True \
+ --train_dir=./train_dir \
+ --save_model_steps=1000 \
+ --eval_during_training_every_n_steps=5000 \
+ --save_summaries_steps=1000 \
+ --summary_verbosity=3 \
+ --model=inception3 \
+ --optimizer=rmsprop \
+ --num_epochs=5 \
+ --num_gpus=2 \
+ --batch_size=128
+```
+
+TensorFlow prints logs like the following:
+
+```bash
+VirtaiTech Resource. Build-cuda-7675815-20190624_081551
+# (omit output)
+TensorFlow: 1.12
+Model: inception3
+Dataset: imagenet
+Mode: BenchmarkMode.TRAIN_AND_EVAL
+SingleSess: False
+Batch size: 256 global
+ 128.0 per device
+Num batches: 25022
+Num epochs: 5.00
+Devices: ['/gpu:0', '/gpu:1']
+Data format: NCHW
+Optimizer: rmsprop
+Variables: parameter_server
+==========
+Generating training model
+# (omit output)
+Running warm up
+Done warm up
+Step Img/sec total_loss top_1_accuracy top_5_accuracy
+1 images/sec: 318.7 +/- 0.0 (jitter = 0.0) 7.415 0.004 0.004
+10 images/sec: 357.5 +/- 7.6 (jitter = 15.8) 7.364 0.000 0.000
+# (omit output)
+1490 images/sec: 370.6 +/- 0.3 (jitter = 7.9) 6.736 0.008 0.043
+1500 images/sec: 370.6 +/- 0.3 (jitter = 7.9) 6.654 0.020 0.070
+# (omit output)
+24990 images/sec: 368.8 +/- 0.1 (jitter = 7.8) 3.411 0.352 0.613
+25000 images/sec: 368.8 +/- 0.1 (jitter = 7.8) 3.493 0.344 0.629
+Running evaluation at global_step 25010
+# (omit output)
+Accuracy @ 1 = 0.3692 Accuracy @ 5 = 0.6404 [249856 examples]
+2019-06-26 01:49:26 [INFO] Client exits with allocation ID a2062e12-8199-4515-a8ec-59dfe9723b4d
+```
+
+We can see the `loss` steadily decreasing and the accuracy rising, finally reaching 36.92% top-1 accuracy and 64.04% top-5 accuracy after 5 epochs of training. During training, TF runs a full evaluation every 5000 batches. Interested readers can open a new terminal on `server0` and run TensorBoard to monitor training from a browser:
+
+```bash
+# From inside server0
+tensorboard --logdir ./train_dir
+```
+
+Then visit `localhost:6006` in a browser (ssh port forwarding may be needed to access it from a terminal with a graphical interface).
+
+If anything behaves unexpectedly, refer to the [corresponding appendix section](appendix.md#trouble-client) for checks.
+
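As a consistency check on the log above, the reported `Num batches: 25022` follows directly from the epoch count and the global batch size, assuming the standard Imagenet training split of 1,281,167 images:

```python
# Sketch: reproduce the batch count reported by tf_cnn_benchmarks.
num_images = 1_281_167   # standard Imagenet train split (assumption)
num_epochs = 5
global_batch = 256       # 128 per vGPU x 2 vGPUs

num_batches = num_images * num_epochs // global_batch
print(num_batches)  # 25022
```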
+
+## Appendix: Verifying that the Orion platform runs in RDMA mode
+
+If the RDMA driver on the Orion Server or Orion Client side does not work properly, or if the Orion Server's `bind_addr` is not set to an address on the RDMA network segment, the Orion vGPU software automatically falls back to TCP for data transfer, with a clear performance drop. We therefore confirm through the Orion Server log that RDMA mode was enabled successfully.
+
+On `server1`, the Orion Server writes its logs to `/var/log/orion/session`. We enter this directory and locate the newest log file with `ls -lrt`:
+
+```bash
+# From inside server1
+cd /var/log/orion/session
+ls -lrt
+```
+
+The file at the bottom of the listing is the newest log; we inspect its contents:
+
+```bash
+3686900184038864: 2019-06-25 20:24:52 [INFO] Resource successfully confirmed with controller
+3686900190516972: 2019-06-25 20:24:52 [INFO] Resource successfully confirmed with controller
+3686900190589780: 2019-06-25 20:24:52 [INFO] Final virtual gpu list : 0:0:15500,0:1:15500, begin to initialize CUDA device manager
+3686904167685066: 2019-06-25 20:24:54 [INFO] Registered v-GPU 0 on p-GPU 0.
+3686904167700354: 2019-06-25 20:24:54 [INFO] Registered v-GPU 0 on p-GPU 1.
+3686904167738402: 2019-06-25 20:24:54 [INFO] Architecture initialization is done. Resource is confirmed.
+3686904331240842: 2019-06-25 20:24:54 [INFO] Client supports RDMA mode, then server runs in RDMA mode.
+3686904345514000: 2019-06-25 20:24:54 [INFO] Launching workers ...
+```
+
+This shows that the Orion platform is working in RDMA mode. Conversely, if the log contains
+
+```bash
+[INFO] Client supports TCP mode, then server also falls back to TCP mode.
+```
+
+then the Orion platform has fallen back to TCP mode, and you should check the RDMA environment as well as the Orion Server data-path settings.
+
+If both `enable_shm` and `enable_rdma` are set to `false` in the configuration file when the Orion Server starts, the Orion vGPU software works in TCP mode by default, and the `Client supports XXX mode...` line will not appear in the Orion Server log.
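The manual check above can be scripted. A minimal sketch that classifies the transport mode from the newest session log (the function name is ours; the matched phrases and the default log directory are the ones quoted in this section):

```python
import glob
import os

def orion_transport_mode(session_dir="/var/log/orion/session"):
    """Classify the transport mode from the newest Orion Server session log.

    Returns 'RDMA', 'TCP', or 'unknown' based on the log lines quoted above.
    """
    logs = sorted(glob.glob(os.path.join(session_dir, "*")), key=os.path.getmtime)
    if not logs:
        return "unknown"
    with open(logs[-1], errors="ignore") as f:
        text = f.read()
    if "server runs in RDMA mode" in text:
        return "RDMA"
    if "falls back to TCP mode" in text:
        return "TCP"
    return "unknown"
```

Note that when the server works in SHM or default TCP mode (both `enable_shm` and `enable_rdma` set to `false`), neither phrase is logged and the function reports `unknown`.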
\ No newline at end of file
diff --git a/doc/quick-start/trouble_shooting.md b/doc/quick-start/trouble_shooting.md
new file mode 100755
index 0000000..3faf7b2
--- /dev/null
+++ b/doc/quick-start/trouble_shooting.md
@@ -0,0 +1,37 @@
+# Frequently Asked Questions
+
+This chapter answers questions users may run into while reading and using the Quick Start. For a more complete FAQ, see the [relevant part of the User Guide](../Orion-User-Guide.md#常见问题).
+
+## Installation and deployment issues
+
+## Runtime failures
+
+The Orion Client log is the first thing to check: whether `VirtaiTech Resource` is printed at startup, and whether the process exits with or without an allocation ID (`[INFO] Client exits without allocation ID.` versus obtaining an allocation ID but then failing to connect to the Orion Server).
+
+Common causes behind these failures include:
+
+* The CUDA environment on the GPU node is misconfigured (or was installed via `deb` packages)
+* The `CUDA_HOME` environment variable was not specified at installation time
+* The Orion Controller cannot connect to the `etcd` service already running in the system
+* The Orion Server bind address is wrong
+* The Orion Client's ORION_CONTROLLER setting is wrong (or client.conf is wrong)
+* The Orion Client did not set the ORION_VGPU environment variable
+* Firewall settings prevent the Orion Client from communicating with the Orion Controller and Orion Server
+* No SHM is mounted inside the container
+* Multiple containers use the same SHM
+* The whole /dev/shm directory was mounted into the container
+* The SHM was deleted by mistake without restarting the server
+* The Controller was killed and restarted, but the server was not restarted
+* `/etc/orion/server.conf` was modified without restarting the `server`
+
+## Checking Orion Client status
+
+Firewall settings
+
+## **Resource allocation errors**
+
+## **Device memory quota errors**
+
+## **MPS-related errors**
+
diff --git a/dockerfiles/README.md b/dockerfiles/README.md
new file mode 100755
index 0000000..4d4da06
--- /dev/null
+++ b/dockerfiles/README.md
@@ -0,0 +1,67 @@
+# Docker Images
+
+We provide images with the Orion Client runtime plus TensorFlow or PyTorch pre-installed. In these images,
+* TensorFlow 1.12 is installed directly from the `pip` repository
+* PyTorch 1.1.0 is compiled directly from the official sources
+* the OS inside every image is `Ubuntu 16.04`
+* some images additionally include the `MLNX_OFED 4.5.1` RDMA driver
+
+The Dockerfiles in this repo correspond to the official [Docker Hub Registry](https://hub.docker.com/r/virtaitech/orion-client) of the Orion vGPU software.
+
+Note that before `docker build` can succeed, you must place into each image's directory:
+* the `install-client` installation package
+* the MLNX_OFED 4.5-1.0.1.0 driver
+* and the PyTorch wheel compiled from source
+
+## [TensorFlow 1.12 base image](./client-tf1.12-base)
+
+```bash
+docker pull virtaitech/orion-client:tf1.12-base
+```
+
+This image installs the official TensorFlow via `pip3 install tensorflow-gpu==1.12`, then installs the Orion Client runtime via the `install-client` installer.
+
+## [TensorFlow 1.12 with MLNX driver, Python 3.5](./client-tf1.12-py3)
+
+```bash
+docker pull virtaitech/orion-client:tf1.12-py3
+```
+
+This image installs the official TensorFlow via `pip3 install tensorflow-gpu==1.12`, then installs the Orion Client runtime via the `install-client` installer.
+
+In addition, we installed the `MLNX_OFED 4.5.1` RDMA driver. If you pass a Mellanox RDMA device through into the container, you can follow the quick-start chapter [Using remote GPU resources over RDMA](./quick-start/remote_rdma.md) to use remote GPU resources inside the container.
+
+For ease of demonstration, we also installed Jupyter Notebook and some Python packages.
+
+## [TensorFlow 1.12 with MLNX driver, Python 2.7](./client-tf1.12-py2)
+
+```bash
+docker pull virtaitech/orion-client:tf1.12-py2
+```
+
+This image installs the official TensorFlow via `pip install tensorflow-gpu==1.12`, then installs the Orion Client runtime via the `install-client` installer.
+
+In addition, we installed the `MLNX_OFED 4.5.1` RDMA driver. If you pass a Mellanox RDMA device through into the container, you can follow the quick-start chapter [Using remote GPU resources over RDMA](./quick-start/remote_rdma.md) to use remote GPU resources inside the container.
+
+This image also includes the Python packages needed to use the [TensorFlow Object Detection](https://github.com/tensorflow/models/tree/master/research/object_detection) model and the other [official models](https://github.com/tensorflow/models).
+
+## [PyTorch 1.1.0, Python 3.5](./client-pytorch-1.1.0-py3)
+
+The official PyTorch wheels on `pip` are compiled with many components, some of which this version of the Orion vGPU software does not yet support, so we compiled a 1.1.0 wheel from the PyTorch sources ourselves. We did not modify the sources in any way; we only changed the build options.
+
+Likewise, we compiled torchvision 0.3.0 from source with the default build options and packaged it into the image. We also installed some Python packages so that the official PyTorch examples can be run directly inside the image: https://github.com/pytorch/examples
+
+Finally, we installed the Orion Client runtime via the `install-client` installer.
+
+The steps for compiling PyTorch 1.1.0 and TorchVision 0.3.0 and installing the Orion Client runtime are described in the [PyTorch 1.1.0 Python 3.5 image](./client-pytorch-1.1.0-py3) README for reference.
+
+### Caveats
+Since the PyTorch DataLoader communicates over IPC, start the container with `--shm-size=8G` so that the DataLoader works properly. The same applies to native environments.
+
+Also, since our PyTorch support is still under active development, note that:
+* we do not yet support PyTorch using remote GPU resources over the RDMA network
+* multi-GPU training must use GLOO as the backend instead of the default NCCL
+
+In [one of our blog posts](../blogposts/use-pytorch.md) we describe how to train a Resnet50 model on the Imagenet dataset with PyTorch using multiple Orion vGPUs.
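The backend constraint above can be captured in a small helper; the function name is ours, and the `torch.distributed` call in the docstring is sketched only as an assumed usage:

```python
def pick_dist_backend(on_orion_vgpu: bool) -> str:
    """Choose the torch.distributed backend for multi-GPU training.

    NCCL is PyTorch's usual default for GPU training, but on Orion vGPU
    this guide requires GLOO, e.g.:
        torch.distributed.init_process_group(backend=pick_dist_backend(True), ...)
    """
    return "gloo" if on_orion_vgpu else "nccl"

print(pick_dist_backend(True))   # gloo
```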
\ No newline at end of file
diff --git a/dockerfiles/client-pytorch-1.1.0-py3/Dockerfile b/dockerfiles/client-pytorch-1.1.0-py3/Dockerfile
new file mode 100755
index 0000000..1fd112c
--- /dev/null
+++ b/dockerfiles/client-pytorch-1.1.0-py3/Dockerfile
@@ -0,0 +1,42 @@
+FROM ubuntu:16.04
+MAINTAINER zoumao@virtaitech.com
+
+RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
+
+RUN apt update -y &&\
+ apt install -y libcurl4-openssl-dev &&\
+ apt install -y python3-dev python3-pip &&\
+ apt install -y git wget curl bc net-tools &&\
+ apt install -y lsb-core &&\
+ apt install -y libjpeg-dev zlib1g-dev libopenmpi-dev libomp-dev &&\
+ apt clean
+
+# Setup pip source
+COPY pip.conf /etc/
+
+WORKDIR /root
+
+# Install PyTorch, torchvision and other python packages
+COPY torch-1.1.0-cp35-cp35m-linux_x86_64.whl .
+RUN pip3 install torch-1.1.0-cp35-cp35m-linux_x86_64.whl && rm torch-1.1.0-cp35-cp35m-linux_x86_64.whl
+COPY requirement.txt .
+RUN pip3 install -r requirement.txt && rm requirement.txt
+
+COPY torchvision /usr/local/lib/python3.5/dist-packages/torchvision
+
+# Install Orion Client runtime
+ENV CUDA_HOME=/usr/local/cuda-9.0
+RUN mkdir -p $CUDA_HOME && mkdir -p $CUDA_HOME/lib64
+ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
+
+COPY install-client .
+RUN chmod +x install-client && ./install-client -d $CUDA_HOME/lib64 -q && rm install-client
+
+RUN ln -sf $CUDA_HOME/lib64/liborion.so $CUDA_HOME/lib64/libnvToolsExt.so.1 &&\
+ ln -sf $CUDA_HOME/lib64/liborion.so $CUDA_HOME/lib64/libnccl.so.2
+
+# Set the num of Orion vGPU each process requests from Orion Controller
+ENV ORION_VGPU=1
+
+WORKDIR /root
+CMD ["/bin/bash"]
diff --git a/dockerfiles/client-pytorch-1.1.0-py3/README.md b/dockerfiles/client-pytorch-1.1.0-py3/README.md
new file mode 100755
index 0000000..5733ac8
--- /dev/null
+++ b/dockerfiles/client-pytorch-1.1.0-py3/README.md
@@ -0,0 +1,124 @@
+# PyTorch 1.1.0 Python 3.5 Image
+
+## Caveats
+Since the PyTorch DataLoader communicates over IPC, start the container with `--shm-size=8G` so that the DataLoader works properly. The same applies to native environments.
+
+Also, since our PyTorch support is still under active development, note that:
+* we do not yet support PyTorch using remote GPU resources over the RDMA network
+* multi-GPU training must use GLOO as the backend instead of the default NCCL
+
+In [one of our blog posts](../../blogposts/use-pytorch.md) we describe how to train a Resnet50 model on the Imagenet dataset with PyTorch using multiple Orion vGPUs.
+
+To build the image, follow the steps below to compile PyTorch and TorchVision from source.
+
+## Compiling the Python 3.5 wheel of PyTorch 1.1.0 from source
+
+We use an Ubuntu 16.04 environment as the example.
+
+First, `git clone` the repo together with its third-party dependencies:
+
+```bash
+git clone --recursive https://github.com/pytorch/pytorch
+cd pytorch
+git checkout v1.1.0 # check out the v1.1.0 tag
+```
+
+If you have cloned the repo before, update the third-party submodules with:
+
+```bash
+git checkout v1.1.0
+git submodule sync
+git submodule update --init --recursive
+```
+
+Next, we install the dependencies:
+
+```bash
+apt install python3-dev python3-pip cmake g++ \
+ libopenmpi-dev libomp-dev libjpeg-dev zlib1g-dev
+
+pip3 install numpy pillow
+```
+
+Before building, we set the compile options via environment variables:
+
+```bash
+export NO_TEST=1
+export NO_FBGEMM=1
+export NO_MIOPEN=1
+export NO_MKLDNN=1
+export NO_NNPACK=1
+export NO_QNNPACK=1
+export USE_STATIC_NCCL=0
+export TORCH_CUDA_ARCH_LIST="3.5;6.0;6.1;7.0"
+```
+
+Finally, build the wheel package:
+
+```bash
+cd pytorch
+python3 setup.py bdist_wheel
+```
+
+The generated `torch-1.1.0-cp35-cp35m-linux_x86_64.whl` can be found in the `dist` directory.
+
+## Compiling TorchVision 0.3.0 from source
+
+As of this writing (2019/06/29), the latest TorchVision 0.3.0 matches PyTorch 1.1.0. After building PyTorch from source, TorchVision needs to be rebuilt as well.
+
+```bash
+git clone https://github.com/pytorch/vision.git
+cd vision
+git checkout v0.3.0
+```
+
+Then build with the default options:
+
+```bash
+python3 setup.py build
+```
+
+Depending on the environment, you may need to install Python libraries such as `pillow`. Once the build finishes, you will find the generated `torchvision` directory under `build/lib.linux-x86_64-3.5`:
+
+```bash
+ls build/lib.linux-x86_64-3.5/torchvision
+
+_C.cpython-35m-x86_64-linux-gnu.so __init__.py ops utils.py
+datasets models transforms version.py
+```
+
+When building the Docker image, simply copy this directory into the Python 3.5 dist-packages path inside the container:
+
+```dockerfile
+COPY torchvision /usr/local/lib/python3.5/dist-packages/torchvision
+```
+
+## Final steps
+
+Before running `docker build`, place the `install-client` installer, the PyTorch wheel produced above, and the TorchVision directory into the directory containing the Dockerfile.
+
+## Appendix: Installing the Orion Client runtime
+
+Here we explain in more detail the Dockerfile steps related to installing the Orion Client runtime.
+
+PyTorch's CMake build bakes an RPATH into the binaries. If PyTorch was built with `CUDA_HOME=/usr/local/cuda-9.0`, the Orion Client runtime inside the container must be installed to that same path for PyTorch to be able to use Orion vGPUs.
+When installing to a non-default path, `LD_LIBRARY_PATH` must be configured manually.
+
+```dockerfile
+ENV CUDA_HOME=/usr/local/cuda-9.0
+RUN mkdir -p $CUDA_HOME && mkdir -p $CUDA_HOME/lib64
+ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
+
+COPY install-client .
+RUN chmod +x install-client && ./install-client -d $CUDA_HOME/lib64 -q && rm install-client
+```
+
+Since PyTorch by default also depends on `libnvToolsExt.so.1` and `libnccl.so.2`, we create symlinks so that PyTorch can run normally:
+
+```dockerfile
+RUN ln -sf $CUDA_HOME/lib64/liborion.so $CUDA_HOME/lib64/libnvToolsExt.so.1 &&\
+ ln -sf $CUDA_HOME/lib64/liborion.so $CUDA_HOME/lib64/libnccl.so.2
+```
+
+
+
diff --git a/dockerfiles/client-pytorch-1.1.0-py3/install-client b/dockerfiles/client-pytorch-1.1.0-py3/install-client
new file mode 100755
index 0000000..49a5a00
Binary files /dev/null and b/dockerfiles/client-pytorch-1.1.0-py3/install-client differ
diff --git a/dockerfiles/client-pytorch-1.1.0-py3/pip.conf b/dockerfiles/client-pytorch-1.1.0-py3/pip.conf
new file mode 100755
index 0000000..9fc6e2d
--- /dev/null
+++ b/dockerfiles/client-pytorch-1.1.0-py3/pip.conf
@@ -0,0 +1,3 @@
+[global]
+index-url=https://pypi.doubanio.com/simple/
+trusted-host=pypi.doubanio.com
diff --git a/dockerfiles/client-pytorch-1.1.0-py3/requirement.txt b/dockerfiles/client-pytorch-1.1.0-py3/requirement.txt
new file mode 100755
index 0000000..a9e51d2
--- /dev/null
+++ b/dockerfiles/client-pytorch-1.1.0-py3/requirement.txt
@@ -0,0 +1,8 @@
+pillow
+scipy==1.2.0
+matplotlib==3.0
+pandas
+cython
+tqdm
+contextlib2
+lxml
diff --git a/dockerfiles/client-tf1.12-base/Dockerfile b/dockerfiles/client-tf1.12-base/Dockerfile
new file mode 100755
index 0000000..d1197cb
--- /dev/null
+++ b/dockerfiles/client-tf1.12-base/Dockerfile
@@ -0,0 +1,37 @@
+FROM ubuntu:16.04
+MAINTAINER zoumao@virtaitech.com
+
+RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
+
+RUN apt update -y &&\
+ apt install -y libcurl4-openssl-dev &&\
+ apt install -y python3-dev python3-pip &&\
+ apt install -y git wget curl bc net-tools &&\
+ apt install -y lsb-core &&\
+ apt install -y vim &&\
+ apt clean
+
+# Configure pip source
+COPY pip.conf /etc/
+
+# Install TensorFlow 1.12 GPU version
+RUN pip3 install tensorflow-gpu==1.12.0
+
+# Install Python packages
+COPY requirements.txt .
+RUN pip3 install -r requirements.txt && rm requirements.txt
+
+WORKDIR /root
+
+# TensorFlow official benchmark
+RUN git clone --branch=cnn_tf_v1.12_compatible https://github.com/tensorflow/benchmarks.git
+
+# Install Orion Client runtime
+COPY install-client .
+RUN chmod +x install-client && ./install-client -q && rm install-client
+
+# Set default ORION_VGPU for each process requesting vgpu resources
+ENV ORION_VGPU 1
+
+WORKDIR /root
+CMD ["/bin/bash"]
diff --git a/dockerfiles/client-tf1.12-base/README.md b/dockerfiles/client-tf1.12-base/README.md
new file mode 100755
index 0000000..112674d
--- /dev/null
+++ b/dockerfiles/client-tf1.12-base/README.md
@@ -0,0 +1,2 @@
+# Building the image
+Simply place the `install-client` installer into the directory containing the Dockerfile, then build the image with `docker build`.
\ No newline at end of file
diff --git a/dockerfiles/client-tf1.12-base/build-docker-images.sh b/dockerfiles/client-tf1.12-base/build-docker-images.sh
new file mode 100755
index 0000000..fbe9df2
--- /dev/null
+++ b/dockerfiles/client-tf1.12-base/build-docker-images.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+set -e
+
+cd `dirname $0`
+
+docker build -t orion-client:tf1.12-base .
diff --git a/dockerfiles/client-tf1.12-base/install-client b/dockerfiles/client-tf1.12-base/install-client
new file mode 100755
index 0000000..49a5a00
Binary files /dev/null and b/dockerfiles/client-tf1.12-base/install-client differ
diff --git a/dockerfiles/client-tf1.12-base/pip.conf b/dockerfiles/client-tf1.12-base/pip.conf
new file mode 100755
index 0000000..9fc6e2d
--- /dev/null
+++ b/dockerfiles/client-tf1.12-base/pip.conf
@@ -0,0 +1,3 @@
+[global]
+index-url=https://pypi.doubanio.com/simple/
+trusted-host=pypi.doubanio.com
diff --git a/dockerfiles/client-tf1.12-base/requirements.txt b/dockerfiles/client-tf1.12-base/requirements.txt
new file mode 100755
index 0000000..a4a8a64
--- /dev/null
+++ b/dockerfiles/client-tf1.12-base/requirements.txt
@@ -0,0 +1,2 @@
+pillow
+scipy==1.2.0
diff --git a/dockerfiles/client-tf1.12-py2/Dockerfile b/dockerfiles/client-tf1.12-py2/Dockerfile
new file mode 100755
index 0000000..9930c71
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py2/Dockerfile
@@ -0,0 +1,45 @@
+FROM ubuntu:16.04
+MAINTAINER zoumao@virtaitech.com
+
+RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
+
+RUN apt update -y &&\
+ apt install -y libcurl4-openssl-dev &&\
+ apt install -y python-dev python-pip &&\
+ apt install -y git wget curl bc net-tools &&\
+ apt install -y lsb-core &&\
+ apt install -y vim &&\
+ apt install -y python-tk &&\
+ apt clean
+
+# Install RDMA driver
+WORKDIR /tmp
+
+COPY MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.tgz .
+
+RUN tar xvf MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.tgz &&\
+ cd MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64 &&\
+ ./mlnxofedinstall --user-space-only --without-fw-update --all --force -q &&\
+ cd /tmp && rm -rf *
+
+# Configure pip source
+COPY pip.conf /etc/
+
+# Install TensorFlow 1.12 GPU version
+RUN pip install tensorflow-gpu==1.12.0
+
+# Install Python packages
+COPY requirements.txt .
+RUN pip install -r requirements.txt && rm requirements.txt
+
+WORKDIR /root
+
+# Install Orion Client runtime
+COPY install-client .
+RUN chmod +x install-client && ./install-client -q && rm install-client
+
+# Set default ORION_VGPU for each process requesting vgpu resources
+ENV ORION_VGPU 1
+
+WORKDIR /root
+CMD ["/bin/bash"]
diff --git a/dockerfiles/client-tf1.12-py2/README.md b/dockerfiles/client-tf1.12-py2/README.md
new file mode 100755
index 0000000..750d7a7
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py2/README.md
@@ -0,0 +1,9 @@
+# Building the image
+Place the `install-client` installer into the directory containing the Dockerfile.
+
+Then download the MLNX_OFED 4.5-1.0.1.0 driver from the Mellanox website:
+http://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-4.5-1.0.1.0&mname=MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.tgz
+
+The download starts in the browser only after accepting the license agreement.
+
+Finally, build the image with `docker build`.
\ No newline at end of file
diff --git a/dockerfiles/client-tf1.12-py2/build-docker-images.sh b/dockerfiles/client-tf1.12-py2/build-docker-images.sh
new file mode 100755
index 0000000..fbe9df2
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py2/build-docker-images.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+set -e
+
+cd `dirname $0`
+
+docker build -t orion-client:tf1.12-py2 .
diff --git a/dockerfiles/client-tf1.12-py2/pip.conf b/dockerfiles/client-tf1.12-py2/pip.conf
new file mode 100755
index 0000000..9fc6e2d
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py2/pip.conf
@@ -0,0 +1,3 @@
+[global]
+index-url=https://pypi.doubanio.com/simple/
+trusted-host=pypi.doubanio.com
diff --git a/dockerfiles/client-tf1.12-py2/requirements.txt b/dockerfiles/client-tf1.12-py2/requirements.txt
new file mode 100755
index 0000000..a43707c
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py2/requirements.txt
@@ -0,0 +1,7 @@
+pillow
+scipy==1.2.0
+matplotlib==2.2.4
+contextlib2
+lxml
+cython
+requests
diff --git a/dockerfiles/client-tf1.12-py3/Dockerfile b/dockerfiles/client-tf1.12-py3/Dockerfile
new file mode 100755
index 0000000..fca3172
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py3/Dockerfile
@@ -0,0 +1,44 @@
+FROM ubuntu:16.04
+MAINTAINER zoumao@virtaitech.com
+
+RUN sed -i 's/archive.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list
+
+RUN apt update -y &&\
+ apt install -y libcurl4-openssl-dev &&\
+ apt install -y python3-dev python3-pip &&\
+ apt install -y git wget curl bc net-tools &&\
+ apt install -y lsb-core &&\
+ apt install -y vim &&\
+ apt clean
+
+# Install RDMA driver
+WORKDIR /tmp
+
+COPY MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.tgz .
+
+RUN tar xvf MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.tgz &&\
+ cd MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64 &&\
+ ./mlnxofedinstall --user-space-only --without-fw-update --all --force -q &&\
+ cd /tmp && rm -rf *
+
+# Configure pip source
+COPY pip.conf /etc/
+
+# Install TensorFlow 1.12 GPU version
+RUN pip3 install tensorflow-gpu==1.12.0
+
+# Install Python packages
+COPY requirements.txt .
+RUN pip3 install -r requirements.txt && rm requirements.txt
+
+WORKDIR /root
+
+# Install Orion Client runtime
+COPY install-client .
+RUN chmod +x install-client && ./install-client -q && rm install-client
+
+# Set default ORION_VGPU for each process requesting vgpu resources
+ENV ORION_VGPU 1
+
+WORKDIR /root
+CMD ["/bin/bash"]
diff --git a/dockerfiles/client-tf1.12-py3/README.md b/dockerfiles/client-tf1.12-py3/README.md
new file mode 100755
index 0000000..7bd7112
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py3/README.md
@@ -0,0 +1,9 @@
+# Building the image
+Place the `install-client` installer into the directory containing the Dockerfile.
+
+Then download the MLNX_OFED 4.5-1.0.1.0 driver from the Mellanox website:
+http://www.mellanox.com/page/mlnx_ofed_eula?mtag=linux_sw_drivers&mrequest=downloads&mtype=ofed&mver=MLNX_OFED-4.5-1.0.1.0&mname=MLNX_OFED_LINUX-4.5-1.0.1.0-ubuntu16.04-x86_64.tgz
+
+The download starts in the browser only after accepting the license agreement.
+
+Finally, build the image with `docker build`.
\ No newline at end of file
diff --git a/dockerfiles/client-tf1.12-py3/build-docker-images.sh b/dockerfiles/client-tf1.12-py3/build-docker-images.sh
new file mode 100755
index 0000000..fbe9df2
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py3/build-docker-images.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+
+set -e
+
+cd `dirname $0`
+
+docker build -t orion-client:tf1.12-py3 .
diff --git a/dockerfiles/client-tf1.12-py3/install-client b/dockerfiles/client-tf1.12-py3/install-client
new file mode 100755
index 0000000..49a5a00
Binary files /dev/null and b/dockerfiles/client-tf1.12-py3/install-client differ
diff --git a/dockerfiles/client-tf1.12-py3/pip.conf b/dockerfiles/client-tf1.12-py3/pip.conf
new file mode 100755
index 0000000..9fc6e2d
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py3/pip.conf
@@ -0,0 +1,3 @@
+[global]
+index-url=https://pypi.doubanio.com/simple/
+trusted-host=pypi.doubanio.com
diff --git a/dockerfiles/client-tf1.12-py3/requirements.txt b/dockerfiles/client-tf1.12-py3/requirements.txt
new file mode 100755
index 0000000..1c74518
--- /dev/null
+++ b/dockerfiles/client-tf1.12-py3/requirements.txt
@@ -0,0 +1,8 @@
+pillow
+scipy==1.2.0
+matplotlib==3.0
+contextlib2
+lxml
+cython
+jupyter
+requests
diff --git a/install-client b/install-client
new file mode 100755
index 0000000..49a5a00
Binary files /dev/null and b/install-client differ
diff --git a/install-server.sh b/install-server.sh
new file mode 100755
index 0000000..2ea4f59
--- /dev/null
+++ b/install-server.sh
@@ -0,0 +1,114 @@
+#!/bin/bash
+
+cd `dirname $0`
+SELF_PATH=$(pwd)
+
+function print_help {
+ echo "Usage: install-server.sh [-h|-d [target path]]"
+    echo "  -d  install target path. Default /usr/bin"
+ echo " -h print this help"
+}
+
+
+install_path="/usr/bin"
+
+while getopts "d:h" opt
+do
+ case $opt in
+ d) install_path=$OPTARG;;
+ h)
+ print_help
+ exit 0;;
+ ?)
+ print_help
+ exit 1;;
+ esac
+done
+
+if [ ! -f oriond ]; then
+    echo "Cannot find binary oriond. Please check your install package."
+    exit 1
+fi
+
+if [ ! -f orion-check ]; then
+    echo "Cannot find binary orion-check. Please check your install package."
+    exit 1
+fi
+
+if [ ! -f orion-shm ]; then
+    echo "Cannot find binary orion-shm. Please check your install package."
+    exit 1
+fi
+
+if [ "$(id -u)" != "0" ]; then
+ echo "Error. Root privilege is required to install Orion Server."
+ exit 1
+fi
+
+if systemctl status oriond > /dev/null 2>&1; then
+ systemctl stop oriond
+fi
+
+mkdir -p /var/log/orion
+chmod 777 /var/log/orion
+
+cp oriond orion-check orion-shm $install_path
+chmod 755 $install_path/oriond
+chmod 755 $install_path/orion-check
+chmod 755 $install_path/orion-shm
+
+if which virsh > /dev/null 2>&1; then
+ if virsh capabilities | grep -F "apparmor" > /dev/null 2>&1; then
+ armor_qemu_file=/etc/apparmor.d/abstractions/libvirt-qemu
+ if [ -f $armor_qemu_file ]; then
+ if grep -F "orionsock*" $armor_qemu_file > /dev/null; then
+ :
+ else
+ sed -i '/^\s*\/[{]*dev\>.*\/shm\>\s*r,.*/a \ \ \/dev\/shm\/orionsock* rw,' $armor_qemu_file
+ systemctl reload apparmor.service
+ fi
+ fi
+ fi
+fi
+
+if [ -f orion.conf.template ]; then
+ mkdir -p /etc/orion
+ cp orion.conf.template /etc/orion/server.conf
+ chmod 755 /etc/orion
+ chmod 644 /etc/orion/server.conf
+ echo "orion.conf.template is copied to /etc/orion/server.conf as Orion Server configuration file."
+fi
+
+echo "Orion Server is successfully installed to $install_path"
+CUDA_HOME=${CUDA_HOME:-"/usr/local/cuda-9.0"}
+
+cat > /etc/systemd/system/oriond.service << EOF
+[Unit]
+Description=Orion Server Daemon Service
+
+[Service]
+Type=simple
+ExecStart=$install_path/oriond
+KillMode=process
+KillSignal=SIGINT
+SendSIGKILL=yes
+Environment="LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
+Environment="PATH=$CUDA_HOME/bin:$PATH"
+
+[Install]
+WantedBy=multi-user.target
+EOF
+
+systemctl daemon-reload > /dev/null 2>&1
+systemctl enable oriond > /dev/null 2>&1
+
+echo "Orion Server is registered as system service."
+echo "Using following commands to interact with Orion Server :"
+echo -e "\n\tsystemctl start oriond # start oriond daemon"
+echo -e "\tsystemctl status oriond # print oriond daemon status and screen output"
+echo -e "\tsystemctl stop oriond # stop oriond daemon"
+echo -e "\tjournalctl -u oriond # print oriond stdout"
+
+echo -e "\nBefore launching Orion Server, please change settings in /etc/orion/server.conf according to your environment.\n"
+
+
diff --git a/md5sum.txt b/md5sum.txt
new file mode 100755
index 0000000..a5c7cdd
--- /dev/null
+++ b/md5sum.txt
@@ -0,0 +1,7 @@
+74872f49042cf6570b821e1cba7c892e ./oriond
+93eded304497cff77b4f3ef013108e16 ./install-server.sh
+69d642f58f3c793c885111188409792b ./orion-shm
+93fd51ccba8a56c695d58ecb7a83c6e2 ./orion.conf.template
+c5974ac53ff4b55e64e3023a26dd91d3 ./orion-controller
+2a593887296675fd80a3a2d8fb190dcb ./install-client
+5b2745b18ca619eebcce4c86d5258081 ./orion-check
diff --git a/orion-check b/orion-check
new file mode 100755
index 0000000..8c6ba08
--- /dev/null
+++ b/orion-check
@@ -0,0 +1,865 @@
+#!/bin/bash
+
+cd `dirname $0`
+SELF_PATH=$(pwd)
+
+function print_help {
+ echo "
+NAME:
+ Orion Health Check Tool
+
+VERSION:
+ v0.1
+
+COMMANDS:
+ install
+ server Check health for installing Orion Server
+ client Check health for installing Orion Client
+ controller Check health for installing Orion Controller
+ all Check health for installing all Orion components
+
+ runtime
+ server Diagnose the status for Orion Server running
+ client Diagnose the status for Orion Client running
+
+EXAMPLES:
+ orion-check install server
+ orion-check install client
+ orion-check runtime server
+"
+}
+
+
+while getopts "h" opt
+do
+ case $opt in
+ h)
+ print_help
+ exit 0;;
+ ?)
+ print_help
+ exit 1;;
+ esac
+done
+
+if [ -z "$1" -o -z "$2" ]; then
+ echo "Invalid usage for Orion Health Check Tool."
+ print_help
+ exit 1
+fi
+
+summary_os_support=
+function check_os {
+ OS_NAME=
+ OS_VERSION=
+ KERNAL_VERSION=$(uname -r)
+ if [ ! -f /etc/os-release ]; then
+ echo -e "\nOS information : unknown OS"
+ echo -e " : Kernel $KERNAL_VERSION"
+ return 0
+ fi
+
+ OS_NAME=$(cat /etc/os-release | grep -w NAME | awk -F '"' '{print $2}' | awk '{print $1}')
+ OS_VERSION=$(cat /etc/os-release | grep -w VERSION_ID | awk -F '"' '{print $2}')
+ echo -e "\nOS information : $OS_NAME $OS_VERSION"
+ echo -e " : Kernel $KERNAL_VERSION"
+
+ if [ $OS_NAME == "CentOS" ]; then
+ if [ $OS_VERSION == "7" ]; then
+ summary_os_support="Yes"
+ else
+ summary_os_support="No"
+ fi
+ elif [ $OS_NAME == "Ubuntu" ]; then
+ if [ $OS_VERSION == "16.04" ]; then
+ summary_os_support="Yes"
+ else
+ summary_os_support="Unknown"
+ fi
+ else
+ summary_os_support="Unknown"
+ fi
+}
+
+
+function check_hw_configuration {
+ echo -e "\nChecking CPU configuration ..."
+ lscpu | head -n -1
+
+ echo -e "\nChecking disk space ..."
+ df -hT
+}
+
+
+default_lib_path=(
+/lib
+/lib64
+/usr/lib
+/usr/lib64
+/usr/local/lib
+/usr/local/lib64
+)
+rdma_driver=0
+rdma_support=0
+summary_rdma_support="No"
+function find_rdma_support {
+ echo -e "\nChecking RDMA network support ..."
+ if ls /dev/infiniband/rdma_cm > /dev/null 2>&1; then
+ if ls /dev/infiniband/uverbs* > /dev/null 2>&1; then
+ rdma_driver=1
+ for path in ${default_lib_path[@]}; do
+ if [ -d $path ]; then
+ result=$(find $path -name "librdmacm.so")
+ if [ -n "$result" ]; then
+ rdma_support=1
+ fi
+ fi
+ done
+ fi
+ fi
+
+ if [ $rdma_support -eq 0 ]; then
+ if [ $rdma_driver -eq 0 ]; then
+ echo "No RDMA network support is found in the system."
+ else
+ echo "RDMA device is found but rdmacm library is not found in default searching path."
+ fi
+ else
+ echo "RDMA support is found in the system."
+ summary_rdma_support="Yes"
+ # Try to get information by using Mellanox OFED tools
+ if which ibdev2netdev > /dev/null 2>&1; then
+            printf "\n    RDMA-Port     RDMA-Rate    Interface   Status\n"
+ printf " ------------------------------------------------\n"
+ i=0
+ ibdev2netdev |
+ while IFS= read -r line
+ do
+ mlx_port_name[$i]=$(echo $line | awk '{print $1}')
+ mlx_port_interface[$i]=$(echo $line | awk '{print $5}')
+ mlx_port_interface_status[$i]=$(echo $line | awk '{print $6}')
+
+ if which ibstatus > /dev/null 2>&1; then
+ mlx_port_rate[$i]=$(ibstatus ${mlx_port_name[$i]} | grep -F "rate:" | awk '{print $2, $3}')
+ fi
+ printf "%8s %14s %14s %10s\n" "${mlx_port_name[$i]}" "${mlx_port_rate[$i]}" "${mlx_port_interface[$i]}" "${mlx_port_interface_status[$i]}"
+ let i++
+ done
+ fi
+ fi
+}
+
+
+cuda_install_path="/usr/local"
+cuda_path=
+summary_cuda_support="No"
+function find_cuda {
+ echo -e "\nSearching CUDA ..."
+ if [ -n "$CUDA_HOME" ]; then
+ echo "CUDA_HOME is set to ${CUDA_HOME}"
+ fi
+
+ possible_path=$(find $cuda_install_path -maxdepth 1 -type d -name "cuda*")
+ if [ -z "$possible_path" ]; then
+        echo -e "\033[31m[Error] Failed to find CUDA in default path $cuda_install_path.\033[0m"
+ return 1
+ fi
+
+ i=0
+ for path in $possible_path; do
+ if ls $path/version.txt > /dev/null 2>&1; then
+ version=$(cat $path/version.txt | head -n 1)
+ echo "Find $version in $path"
+ if echo $version | grep "\<9.0." > /dev/null; then
+ summary_cuda_support="Yes"
+ fi
+ cuda_path[$i]=$path
+ let i++
+ fi
+ done
+}
+
+cudnn_version=
+summary_cudnn_support="No"
+function find_cudnn {
+ echo -e "\nSearching CUDNN ..."
+ i=0
+ for path in ${cuda_path[@]}; do
+ version=
+ pushd $path/lib64 > /dev/null
+ libcudnn=$(find -type l -name "libcudnn.so*" -o -type f -name "libcudnn.so*" | awk -F '/' '{print $2}')
+ if [ -n "$libcudnn" ]; then
+ version=$(find -type f -name "libcudnn.so*" | awk -F 'so.' '{print $2}')
+ if [ -z "$version" ]; then
+ version="(unknown version)"
+ if [ $summary_cudnn_support == "No" ]; then
+ summary_cudnn_support="Unknown"
+ fi
+ fi
+ echo "CUDNN $version is installed in $path"
+ ls -l libcudnn.so* | awk '{print " ", $9, $10, $11}'
+ cudnn_version[$i]=${version}
+
+ cudnn_major=$(echo $version | awk -F '.' '{print $1}')
+ cudnn_mid=$(echo $version | awk -F '.' '{print $2}')
+ if [ "$cudnn_major" == "7" ]; then
+ if [ "$cudnn_mid" -gt 1 ] 2>/dev/null; then
+ summary_cudnn_support="Yes"
+ fi
+ fi
+ let i++
+ fi
+ popd > /dev/null
+ done
+
+ if [ $i -eq 0 ]; then
+ echo "No CUDNN library is found."
+ fi
+}
+
+
+summary_nvidia_gpu_support="No"
+function find_nvidia_gpu {
+ cuda_driver_version=
+ nvidia_gpu=
+
+ echo -e "\nSearching NVIDIA GPU ..."
+ if nvidia-smi -i 0 -q > /dev/null; then
+ cuda_driver_version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
+ echo "CUDA driver $cuda_driver_version is installed."
+
+ gpus=$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader)
+ if [ -z "$gpus" ]; then
+ echo -e "\033[33m[Warning] No NVIDIA GPU is found in the system.\033[0m"
+ return 0
+ fi
+
+ summary_nvidia_gpu_support="Yes"
+ i=0
+ tmp_ifs=$IFS
+ IFS=$'\n'
+ for name in $gpus; do
+ nvidia_gpu[$i]="$name"
+ let i++
+ done
+ IFS=$tmp_ifs
+
+ if [ $i -gt 1 ]; then
+ echo "$i NVIDIA GPUs are found :"
+ else
+ echo "$i NVIDIA GPU is found :"
+ fi
+
+ i=0
+ for name in "${nvidia_gpu[@]}"; do
+ echo " $i :" "$name"
+ let i++
+ done
+ else
+ echo -e "\033[31m[Error] Fail to get NVIDIA driver version.\033[0m"
+ return 1
+ fi
+}
+
+
+enable_nvidia_mps=0
+summary_nvidia_mps="OFF"
+function find_mps_support {
+ echo -e "\nChecking NVIDIA MPS ..."
+ user=$(ps -aux | grep -v grep | grep -F "nvidia-cuda-mps-control" | awk '{print $1}')
+ if [ -n "$user" ]; then
+ enable_nvidia_mps=1
+ summary_nvidia_mps="ON"
+ echo "NVIDIA CUDA MPS is running by Linux account : $user"
+ echo -e "\033[33m[Info] Orion only supports enabling MPS with NVIDIA Volta and later GPUs.\033[0m"
+ else
+ echo "NVIDIA CUDA MPS is off."
+ fi
+}
+
+
+etcd_running=0
+etcd_version=
+etcd_v2=0
+etcd_v3=0
+summary_etcd_support="No"
+function find_etcd {
+ echo -e "\nSearching etcd service ..."
+ bin=
+ running_bin=$(ps -aux | grep -v grep | grep -w etcd | awk '{print $11}')
+ if [ -z "$running_bin" ]; then
+ if which etcd > /dev/null 2>&1; then
+ bin=$(which etcd)
+ fi
+ else
+ bin=$running_bin
+ etcd_running=1
+ summary_etcd_support="Yes"
+ fi
+
+ if [ -z "$bin" ]; then
+ echo "No etcd is running or installed in the system."
+ return 1
+ fi
+
+ etcd_version=$($bin --version | grep "etcd Version" | awk '{print $3}')
+ echo "etcd (version $etcd_version) is installed in $bin"
+
+ major=${etcd_version:0:1}
+ if [ $major -eq 2 ]; then
+ etcd_v2=1
+ elif [ $major -eq 3 ]; then
+ etcd_v3=1
+ fi
+}
+
+
+qemu_api_version=
+qemu_version=
+qemu_version_major=
+qemu_version_mid=
+qemu_net_list=
+summary_qemu_kvm_support="No"
+function find_qemu_kvm {
+ echo -e "\nSearching VM support ..."
+ running_bin=$(ps -aux | grep -v grep | grep -w libvirtd | awk '{print $11}')
+ if [ -z "$running_bin" ]; then
+ echo "libvirtd is not running. Please install libvirt-bin and start libvirtd before launching Orion Server."
+ return 1
+ fi
+
+ qemu_api_version=$(virsh version | grep "Using API" | awk '{print $4}')
+ qemu_version=$(virsh version | grep "Running hypervisor" | awk '{print $4}')
+ qemu_version_major=$(echo $qemu_version | awk -F '.' '{print $1}')
+ qemu_version_mid=$(echo $qemu_version | awk -F '.' '{print $2}')
+ echo "QEMU API version : $qemu_api_version"
+ echo "QEMU version : $qemu_version"
+
+ if [ $qemu_version_major == "2" ]; then
+ summary_qemu_kvm_support="Yes"
+ fi
+
+ nets=$(virsh net-list | tail -n +3 | head -n -1 | awk '{print $1}')
+ i=0
+ for net in $nets; do
+ qemu_net_list[$i]=$(virsh net-dumpxml $net | grep -F "<ip address" | awk -F "'" '{print $2}')
+ let i++
+ done
+}
+
+
+docker_installed=0
+docker_gateway=
+summary_docker_support="No"
+function find_docker {
+ echo -e "\nSearching Docker environment ..."
+ if which docker > /dev/null 2>&1; then
+ docker_installed=1
+ if docker images 2>&1 | grep "Cannot connect" > /dev/null; then
+ echo "Docker is not launched in the system"
+ return
+ fi
+
+ if docker images 2>&1 | grep "permission denied" > /dev/null; then
+ echo "Permission denied to check docker environment."
+ return
+ fi
+
+ summary_docker_support="Yes"
+ docker version
+ docker_gateway=$(docker inspect -f '{{range .IPAM.Config}}{{.Gateway}}{{end}}' bridge)
+ if [ -z "$docker_gateway" ]; then
+ docker_gateway=$(ip addr show docker0 2>/dev/null | grep inet | awk '{print $2}' | awk -F '/' '{print $1}')
+ fi
+ else
+ echo "Docker is not installed in the system"
+ fi
+}
+
+
+server_name="oriond"
+summary_server_support="No"
+function check_server_install {
+ echo -e "\nChecking Orion Server binary ..."
+ if [ ! -f ${server_name} ]; then
+ echo "Can not find installation file \"${server_name}\""
+ return 1
+ fi
+
+ if [ -z "$CUDA_HOME" ]; then
+ echo -e "\033[33mCUDA_HOME is not set in current enviornment. You may want to set it before doing the checking.\033[0m"
+ else
+ export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
+ fi
+
+ unfound_lib=$(ldd ${server_name} | grep "not found" | awk '{print $1}')
+ if [ -n "$unfound_lib" ]; then
+ echo -e "\033[31mFollowing libraries are needed but not found :\033[0m"
+ echo "$unfound_lib"
+ return 1
+ fi
+
+ summary_server_support="Yes"
+}
+
+
+function check_controller_runtime {
+ config_path=
+ if [ "$1" == "server" ]; then
+ config_path="/etc/orion/server.conf"
+ else
+ config_path="/etc/orion/client.conf"
+ fi
+
+ echo -e "\nChecking Orion Controller status ..."
+ echo -e "\033[33m[Info] Orion Controller setting may be different in different SHELL.\033[0m"
+ echo -e "\033[33m[Info] Environment variable ORION_CONTROLLER has the first priority.\033[0m\n"
+ controller_addr_env=$ORION_CONTROLLER
+ controller_addr=
+ if [ -r $config_path ]; then
+ controller_addr=$(sed -n 's/^\s*controller_addr\s*=\s*\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*:[0-9]*\).*/\1/p' $config_path)
+ controller_addr=${controller_addr:-"127.0.0.1:9123"}
+ fi
+
+ if [ -z "$controller_addr_env" -a -z "$controller_addr" ]; then
+ echo -e "\033[31m[Error] No Orion Controller address is set in either environment variable ORION_CONTROLLER or configuration file.\033[0m"
+ return
+ fi
+
+ controller_ip=
+ controller_port=
+ target_addr=
+ if [ -n "$controller_addr_env" ]; then
+ target_addr=$controller_addr_env
+ echo "Environment variable ORION_CONTROLLER is set as ${controller_addr_env}. Using this address to diagnose Orion Controller."
+ else
+ target_addr=$controller_addr
+ echo "Orion Controller address is set as $controller_addr in configuration file. Using this address to diagnose Orion Controller."
+ fi
+
+ controller_ip=$(echo $target_addr | awk -F ':' '{print $1}')
+ controller_port=$(echo $target_addr | awk -F ':' '{print $2}')
+ if [ -z "$controller_port" ]; then
+ echo -e "\033[31m[Error] Invalid Orion Controller address. No port is specified.\033[0m"
+ return
+ fi
+
+ if which nc > /dev/null 2>&1; then
+ if nc -zv $controller_ip $controller_port > /dev/null 2>&1; then
+ echo "Address $target_addr is reached."
+ else
+ echo -e "\033[31m[Error] Can not reach ${target_addr}. Please make sure Orion Controller is launched at the address, and the firewall is correctly set.\033[0m"
+ return
+ fi
+ fi
+
+ if which curl > /dev/null 2>&1; then
+ result=$(curl -s "http://$target_addr/info?data_version&api_version")
+ if [ $? -ne 0 ]; then
+ echo -e "\033[31m[Error] Can not reach ${target_addr}. Please make sure Orion Controller is launched at the address, and the firewall is correctly set.\033[0m"
+ return
+ else
+ echo "Orion Controller Version Information : $result"
+ fi
+
+ data_version=$(echo $result | awk -F ',' '{print $1}')
+ api_version=$(echo $result | awk -F ',' '{print $2}')
+
+ data_version=${data_version/=/:}
+ api_version=${api_version/=/:}
+
+ result=$(curl -s -H "${data_version}" -H "${api_version}" "http://$target_addr/devices?res=nvidia_cuda&used=true")
+ if [ $? -ne 0 ]; then
+ echo -e "\033[31m[Error] Can not fetch vGPU status from Orion Controller.\033[0m"
+ return
+ else
+ free_num=${result##*=}
+ fi
+
+ result=$(curl -s -H "${data_version}" -H "${api_version}" "http://$target_addr/devices?res=nvidia_cuda")
+ if [ $? -ne 0 ]; then
+ echo -e "\033[31m[Error] Can not fetch vGPU status from Orion Controller.\033[0m"
+ return
+ else
+ total_num=${result##*=}
+ fi
+
+ echo "There are $total_num vGPUs managed by Orion Controller. $free_num vGPUs are free now."
+ else
+ echo -e "\033[33mLinux curl is needed to diagnose Orion Controller.\033[0m"
+ return
+ fi
+}
+
+
+function check_oriond_runtime {
+ echo -e "\nChecking Orion Server status ..."
+ if ! which netstat > /dev/null 2>&1; then
+ echo "Linux tool netstat is not found. Installing the tool helps to diagnose the system."
+ fi
+ if ! which nc > /dev/null 2>&1; then
+ echo "Linux tool nc (netcat) is not found. Installing the tool helps to diagnose the system."
+ fi
+ if ! which curl > /dev/null 2>&1; then
+ echo "Linux tool curl is not found. Installing the tool helps to diagnose the system."
+ fi
+
+ running_bin=$(ps -aux | grep -v grep | grep -w oriond)
+ if [ -z "$running_bin" ]; then
+ echo -e "\033[33mOrion Server is not running.\033[0m\n"
+
+ bin_path=
+ if [ -f ${server_name} ]; then
+ bin_path=${server_name}
+ else
+ if [ -f /usr/bin/${server_name} ]; then
+ bin_path="/usr/bin/${server_name}"
+ else
+ echo "Can not find Orion Server binary \"oriond\" in either $(pwd) or /usr/bin."
+ fi
+ fi
+
+ if [ -r /etc/systemd/system/oriond.service ]; then
+ bin_path=$(cat /etc/systemd/system/oriond.service | grep -F 'ExecStart=' | awk -F '=' '{print $2}')
+ echo "Orion Server has been registered as a system service. Using binary $bin_path to infer the runtime environment."
+ ld_path=$(cat /etc/systemd/system/oriond.service | grep -F 'Environment="LD_LIBRARY_PATH=' | awk -F '[="]' '{print $4}')
+ path_path=$(cat /etc/systemd/system/oriond.service | grep -F 'Environment="PATH=' | awk -F '[="]' '{print $4}')
+ if [ -n "$ld_path" ]; then
+ export LD_LIBRARY_PATH=$ld_path
+ echo "Injecting oriond service environment LD_LIBRARY_PATH=$ld_path"
+ fi
+
+ if [ -n "$path_path" ]; then
+ export PATH=$path_path
+ echo "Injecting oriond service environment PATH=$path_path"
+ fi
+ fi
+
+ if [ -n "$bin_path" ]; then
+
+ unfound_lib=$(ldd ${bin_path} | grep "not found" | awk '{print $1}')
+ if [ -n "$unfound_lib" ]; then
+ echo -e "\033[31mFollowing libraries are needed but not found in current environment:\033[0m"
+ echo "$unfound_lib"
+ return 1
+ fi
+ fi
+
+ controller_addr_env=$ORION_CONTROLLER
+ if [ -n "$controller_addr_env" ]; then
+ if echo $controller_addr_env | grep ":" > /dev/null; then
+ echo "Environment variable ORION_CONTROLLER=$controller_addr_env is set in current SHELL."
+ else
+ echo "Environment variable ORION_CONTROLLER=$controller_addr_env is set in current SHELL."
+ echo -e "\033[33m[Warning] Invalid format. No port is specified.\033[0m"
+ fi
+ fi
+
+ controller_addr=
+ if [ -r /etc/orion/server.conf ]; then
+ controller_addr=$(sed -n 's/^\s*controller_addr\s*=\s*\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*:[0-9]*\).*/\1/p' /etc/orion/server.conf)
+ bind_ip=$(sed -n 's/^\s*bind_addr\s*=\s*\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\).*/\1/p' /etc/orion/server.conf)
+ bind_port=$(sed -n 's/^\s*listen_port\s*=\s*\([0-9]*\).*/\1/p' /etc/orion/server.conf)
+
+ controller_addr=${controller_addr:-"127.0.0.1:9123"}
+ bind_ip=${bind_ip:-"127.0.0.1"}
+ bind_port=${bind_port:-"9960"}
+
+ echo ""
+ echo "Configuration file is found at /etc/orion/server.conf"
+ echo "Orion Server will connect to Orion Controller at $controller_addr unless the setting is overwritten by environment variable \"ORION_CONTROLLER\""
+ echo "Orion Server will listen on port $bind_port unless the setting is overwritten by -p option"
+
+ valid_ip=0
+ if ip addr > /dev/null 2>&1; then
+ while read line
+ do
+ if [ $bind_ip == ${line} ]; then
+ valid_ip=1
+ break
+ fi
+ done <<< "$(ip addr | grep -w inet | awk '{print $2}' | awk -F '/' '{print $1}')"
+ else
+ valid_ip=1
+ fi
+
+ if [ $valid_ip -eq 1 ]; then
+ echo "Orion Server will bind to address $bind_ip unless the setting is overwritten by -b option"
+ else
+ echo -e "\033[33mOrion Server is configured to bind at address \"${bind_ip}\" which may be invalid.\033[0m"
+ fi
+
+ cfg_enable_shm=0
+ cfg_enable_rdma=0
+ cfg_enable_kvm=0
+ if [ -f /etc/orion/server.conf ]; then
+ result=$(sed -n 's/^\s*enable_shm\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_shm=1
+ fi
+ result=$(sed -n 's/^\s*enable_rdma\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_rdma=1
+ fi
+ result=$(sed -n 's/^\s*enable_kvm\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_kvm=1
+ fi
+ fi
+
+ printf "%-40s" "Enable SHM"
+ if [ $cfg_enable_shm == 1 ]; then
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+
+ printf "%-40s" "Enable RDMA"
+ if [ $cfg_enable_rdma == 1 ]; then
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+
+ printf "%-40s" "Enable Local QEMU-KVM with SHM"
+ if [ $cfg_enable_kvm == 1 ]; then
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+ else
+ echo "No configuration is set in the system. Default setting and environment variables will be used to configure Orion Server."
+ echo -e "Orion Server will connect to Orion Controller set by environment variable \033[32mORION_CONTROLLER\033[0m"
+ echo -e "Orion Server will bind to address \033[32m127.0.0.1\033[0m unless the setting is overwritten by \033[32m-b\033[0m option"
+ echo -e "Orion Server will listen on port \033[32m9960\033[0m unless the setting is overwritten by \033[32m-p\033[0m option"
+ bind_port=9960
+ fi
+
+ if which netstat > /dev/null 2>&1; then
+ result=$(netstat -tulpn 2>/dev/null | grep -w LISTEN | awk '{print $4}' | grep ":${bind_port}\>")
+ if [ -n "$result" ]; then
+ echo -e "\033[33m[Warning] Linux port $bind_port is in use by another program.\033[0m"
+ fi
+ fi
+ else
+ pid=$(ps -aux | awk '{print $2,$11}' | grep -v grep | grep -w "oriond" | awk '{print $1}')
+ pid_controller=$(strings /proc/${pid}/environ | grep ORION_CONTROLLER | awk -F '=' '{print $2}')
+ if [ -n "$pid_controller" ]; then
+ echo "Orion Server runs with environment ORION_CONTROLLER=${pid_controller}"
+ export ORION_CONTROLLER=${pid_controller}
+ fi
+
+ cfg_enable_shm=0
+ cfg_enable_rdma=0
+ cfg_enable_kvm=0
+ if [ -f /etc/orion/server.conf ]; then
+ result=$(sed -n 's/^\s*enable_shm\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_shm=1
+ fi
+ result=$(sed -n 's/^\s*enable_rdma\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_rdma=1
+ fi
+ result=$(sed -n 's/^\s*enable_kvm\s*=\s*"\([a-z]*\)".*/\1/p' /etc/orion/server.conf)
+ if [ "$result"x == "truex" ]; then
+ cfg_enable_kvm=1
+ fi
+ fi
+
+ user_name=$(ps -aux | grep -v grep | grep "oriond" | awk '{print $1}')
+ command_line=$(ps -aux | grep -v grep | grep "oriond" | awk '{for(i=11;i<=NF;i++){printf "%s ", $i}; printf "\n"}')
+ echo "Orion Server is running with Linux user : $user_name"
+ echo "Orion Server is running with command line : $command_line"
+
+ cudart_path=$(ls -l /proc/${pid}/map_files | grep libcudart | awk '{print $11}' | head -n 1 | awk -F 'so.' '{print $2}')
+ if [ -n "$cudart_path" ]; then
+ echo "Orion Server is running with CUDA version $cudart_path"
+ fi
+
+ cudnn_path=$(ls -l /proc/${pid}/map_files | grep libcudnn | awk '{print $11}' | head -n 1 | awk -F 'so.' '{print $2}')
+ if [ -n "$cudnn_path" ]; then
+ echo "Orion Server is running with CUDNN version $cudnn_path"
+ fi
+
+
+ enable_shm=0
+ enable_rdma=0
+ enable_kvm=0
+ bind_ip=
+ bind_port=9960
+
+ printf "%-40s" "Enable SHM"
+ if echo $command_line | grep -e " -m " > /dev/null; then
+ enable_shm=1
+ printf "[Yes]\n"
+ elif [ $cfg_enable_shm == 1 ]; then
+ enable_shm=1
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+
+ printf "%-40s" "Enable RDMA"
+ if echo $command_line | grep -e " -r " > /dev/null; then
+ enable_rdma=1
+ printf "[Yes]\n"
+ elif [ $cfg_enable_rdma == 1 ]; then
+ enable_rdma=1
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+
+ printf "%-40s" "Enable Local QEMU-KVM with SHM"
+ if echo $command_line | grep -e " -k " > /dev/null; then
+ enable_kvm=1
+ printf "[Yes]\n"
+ elif [ $cfg_enable_kvm == 1 ]; then
+ enable_kvm=1
+ printf "[Yes]\n"
+ else
+ printf "[No]\n"
+ fi
+
+ if which netstat > /dev/null 2>&1; then
+ listen_addr=$(netstat -nap 2>/dev/null | grep oriond | grep LISTEN | awk '{print $4}' | sort | head -n 1)
+ if [ -n "$listen_addr" ]; then
+ bind_ip=$(echo $listen_addr | awk -F ':' '{print $1}')
+ listen_port=$(echo $listen_addr | awk -F ':' '{print $2}')
+ fi
+ else
+ if echo $command_line | grep -e " -b " > /dev/null; then
+ bind_ip=$(echo $command_line | sed -n 's/.*\s\+-b\s\+\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\).*/\1/p')
+ else
+ if [ -r /etc/orion/server.conf ]; then
+ bind_ip=$(sed -n 's/^\s*bind_addr\s*=\s*\([0-9]*\.[0-9]*\.[0-9]*\.[0-9]*\)/\1/p' /etc/orion/server.conf)
+ bind_port=$(sed -n 's/^\s*listen_port\s*=\s*\([0-9]*\)/\1/p' /etc/orion/server.conf)
+ bind_ip=${bind_ip:-"127.0.0.1"}
+ bind_port=${bind_port:-"9960"}
+ fi
+ fi
+ fi
+
+ printf "%-40s%s\n" "Binding IP Address :" "$bind_ip"
+ printf "%-40s%s\n\n" "Listening Port :" "$bind_port"
+
+ if which nc > /dev/null 2>&1; then
+ echo "Testing the Orion Server network ..."
+ listen_port=${listen_port:-$bind_port}
+ listen_addr=${listen_addr:-"${bind_ip}:${listen_port}"}
+ if nc -zv $bind_ip $listen_port > /dev/null 2>&1; then
+ echo "Orion Server can be reached through $listen_addr"
+ else
+ echo "Orion Server can not be reached through $listen_addr"
+ echo "Please check the firewall setting."
+ fi
+ fi
+ fi
+}
+
+
+
+if [ "$1" == "install" ]; then
+ if [ "$2" == "all" ]; then
+ check_os
+ find_rdma_support
+ find_cuda
+ find_cudnn
+ find_nvidia_gpu
+ find_mps_support
+ find_etcd
+ find_qemu_kvm
+ find_docker
+ check_server_install
+
+ echo -e "\n==============================================="
+ echo -e "Installation summaries :\n"
+ printf "%-40s [%s]\n" "OS :" "$summary_os_support"
+ printf "%-40s [%s]\n" "RDMA :" "$summary_rdma_support"
+ printf "%-40s [%s]\n" "CUDA :" "$summary_cuda_support"
+ printf "%-40s [%s]\n" "CUDNN :" "$summary_cudnn_support"
+ printf "%-40s [%s]\n" "NVIDIA GPU :" "$summary_nvidia_gpu_support"
+ printf "%-40s [%s]\n" "NVIDIA CUDA MPS :" "$summary_nvidia_mps"
+ printf "%-40s [%s]\n" "etcd service :" "$summary_etcd_support"
+ printf "%-40s [%s]\n" "QEMU-KVM environment :" "$summary_qemu_kvm_support"
+ printf "%-40s [%s]\n" "Docker container environment :" "$summary_docker_support"
+ printf "%-40s [%s]\n" "Orion Server binary:" "$summary_server_support"
+ elif [ "$2" == "server" ]; then
+ check_os
+ find_rdma_support
+ find_cuda
+ find_cudnn
+ find_nvidia_gpu
+ find_mps_support
+ find_qemu_kvm
+ find_docker
+ check_server_install
+
+ echo -e "\n==============================================="
+ echo -e "Installation summaries :\n"
+ printf "%-40s [%s]\n" "OS :" "$summary_os_support"
+ printf "%-40s [%s]\n" "RDMA :" "$summary_rdma_support"
+ printf "%-40s [%s]\n" "CUDA :" "$summary_cuda_support"
+ printf "%-40s [%s]\n" "CUDNN :" "$summary_cudnn_support"
+ printf "%-40s [%s]\n" "NVIDIA GPU :" "$summary_nvidia_gpu_support"
+ printf "%-40s [%s]\n" "NVIDIA CUDA MPS :" "$summary_nvidia_mps"
+ printf "%-40s [%s]\n" "QEMU-KVM environment :" "$summary_qemu_kvm_support"
+ printf "%-40s [%s]\n" "Docker container environment :" "$summary_docker_support"
+ printf "%-40s [%s]\n" "Orion Server binary:" "$summary_server_support"
+ elif [ "$2" == "client" ]; then
+ check_os
+ find_rdma_support
+ find_qemu_kvm
+ find_docker
+
+ echo -e "\n==============================================="
+ echo -e "Installation summaries :\n"
+ printf "%-40s [%s]\n" "OS :" "$summary_os_support"
+ printf "%-40s [%s]\n" "RDMA :" "$summary_rdma_support"
+ printf "%-40s [%s]\n" "QEMU-KVM environment :" "$summary_qemu_kvm_support"
+ printf "%-40s [%s]\n" "Docker container environment :" "$summary_docker_support"
+ elif [ "$2" == "controller" ]; then
+ check_os
+ find_etcd
+
+ echo -e "\n==============================================="
+ echo -e "Installation summaries :\n"
+ printf "%-40s [%s]\n" "OS :" "$summary_os_support"
+ printf "%-40s [%s]\n" "etcd service :" "$summary_etcd_support"
+
+ if [ $summary_etcd_support == "No" ]; then
+ echo -e "\n\033[31mOrion Controller can not be installed in this environment.\033[0m"
+ fi
+ else
+ echo "Invalid parameters."
+ print_help
+ exit 1
+ fi
+elif [ "$1" == "runtime" ]; then
+ if [ "$2" == "server" ]; then
+ find_nvidia_gpu
+ find_mps_support
+ check_oriond_runtime
+ check_controller_runtime server
+ elif [ "$2" == "client" ]; then
+ check_controller_runtime client
+ else
+ echo "Invalid parameters."
+ print_help
+ exit 1
+ fi
+else
+ echo "Invalid parameters."
+ print_help
+ exit 1
+fi
+
+
diff --git a/orion-controller b/orion-controller
new file mode 100755
index 0000000..0114616
Binary files /dev/null and b/orion-controller differ
diff --git a/orion-shm b/orion-shm
new file mode 100755
index 0000000..e0b2588
Binary files /dev/null and b/orion-shm differ
diff --git a/orion.conf.template b/orion.conf.template
new file mode 100755
index 0000000..c90d7f1
--- /dev/null
+++ b/orion.conf.template
@@ -0,0 +1,24 @@
+; this is an example of orion configuration
+[server]
+ listen_port = 9960
+ bind_addr = 127.0.0.1
+ enable_shm = "true"
+ enable_rdma = "false"
+ enable_kvm = "false"
+
+[server-log]
+ log_with_time = 1
+ log_to_screen = 0
+ log_to_file = 1
+ log_level = INFO
+ file_log_level = INFO
+
+[server-shm]
+ shm_path_base = "/dev/shm/"
+ shm_group_name = "kvm"
+ shm_user_name = "libvirt-qemu"
+ shm_buffer_size = 134217728
+
+[controller]
+ controller_addr = 127.0.0.1:9123
+
diff --git a/oriond b/oriond
new file mode 100755
index 0000000..74a4c38
Binary files /dev/null and b/oriond differ