
Commit

Merge pull request #417 from oracle-japan/develop
Update the OCI tutorilas:8a63149d067c3db56941fe2b93ff3df970242fc4
fwiw6430 authored Jul 9, 2024
2 parents 07a25a5 + 8a63149 commit 3fb022f
Showing 2 changed files with 38 additions and 73 deletions.
2 changes: 1 addition & 1 deletion tutorials/_hpc/benchmark/run-nccltests.md
@@ -11,7 +11,7 @@ header:
***
# 0. Overview

-The **[NCCL Tests](https://github.com/nvidia/nccl-tests)** run described in this document starts the **[TensorFlow NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)**, provided by **[NGC Catalog](https://catalog.ngc.nvidia.com/)**, in a container runtime environment built on a GPU cluster with **Docker Community Edition** and **[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html)**, and uses the **[NCCL (NVIDIA Collective Communication Library)](https://developer.nvidia.com/nccl)** included in this container together with **NCCL Tests** built inside the container.
+The **[NCCL Tests](https://github.com/nvidia/nccl-tests)** run described in this document starts the **[TensorFlow NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)** in a container runtime environment built on a GPU cluster with **Docker Community Edition** and **[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html)**, and uses the **[NCCL (NVIDIA Collective Communication Library)](https://developer.nvidia.com/nccl)** included in this container together with **NCCL Tests** built inside the container.

The GPU cluster on which this document runs **NCCL Tests** consists of two instances of the bare metal shape for GPU workloads, **[BM.GPU4.8/BM.GPU.A100-v2.8](https://docs.oracle.com/ja-jp/iaas/Content/Compute/References/computeshapes.htm#bm-gpu)**, connected by a **[cluster network](/ocitutorials/hpc/#5-1-クラスタネットワーク)**. Prepare in advance an environment in which containers can use the GPUs through **Docker Community Edition** and **NVIDIA Container Toolkit** on the GPU nodes, for example by following the tutorial **[Build a GPU cluster (manual base infrastructure build)](/ocitutorials/hpc/spinup-gpu-cluster/)** or **[Build a GPU cluster (automated base infrastructure build)](/ocitutorials/hpc/spinup-gpu-cluster-withterraform/)** in the category **[Machine learning environments](/ocitutorials/hpc/#1-2-機械学習環境)** of the **[OCI HPC tutorial collection](/ocitutorials/hpc/#1-oci-hpcチュートリアル集)**.

109 changes: 37 additions & 72 deletions tutorials/_hpc/spinup-gpu-cluster.md
@@ -28,7 +28,8 @@ header:

[Software]
- Container runtime : **Docker Community Edition** 26.1.3
-- Container : **[TensorFlow NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)** 24.06-tf2-py3 from **[NGC Catalog](https://catalog.ngc.nvidia.com/)**
+- **NVIDIA Container Toolkit** : 1.15.0
+- Container : **[TensorFlow NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow)** 24.06-tf2-py3

[Cluster management]
- Shared storage : home directory sharing within the GPU cluster, with the Bastion node as the NFS server
@@ -387,83 +388,47 @@ $
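Run the following command as the opc user on the Bastion node and confirm that each of the 16 network interfaces used for the **[cluster network](/ocitutorials/hpc/#5-1-クラスタネットワーク)** connection (rdmax) is assigned an IP address of the form 10.224.[0 - 15].x/12 whose fourth field matches that of the network interface used for TCP/IP connections (eth0).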

```sh
-$ pdsh -g all 'ip a | grep -e eth0 -e rdma' | dshbak -c
+$ pdsh -g all 'ip a | grep -e eth0 -e rdma | grep inet' | dshbak -c
----------------
inst-xxxxx-gpu4-ol89
----------------
-6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
-inet 10.0.2.125/24 brd 10.0.2.255 scope global dynamic eth0
-22: rdma0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.0.125/12 brd 10.239.255.255 scope global noprefixroute rdma0
-23: rdma1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.1.125/12 brd 10.239.255.255 scope global noprefixroute rdma1
-26: rdma2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.2.125/12 brd 10.239.255.255 scope global noprefixroute rdma2
-27: rdma3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.3.125/12 brd 10.239.255.255 scope global noprefixroute rdma3
-30: rdma4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.4.125/12 brd 10.239.255.255 scope global noprefixroute rdma4
-31: rdma5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.5.125/12 brd 10.239.255.255 scope global noprefixroute rdma5
-34: rdma6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.6.125/12 brd 10.239.255.255 scope global noprefixroute rdma6
-35: rdma7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.7.125/12 brd 10.239.255.255 scope global noprefixroute rdma7
-38: rdma8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.8.125/12 brd 10.239.255.255 scope global noprefixroute rdma8
-39: rdma9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.9.125/12 brd 10.239.255.255 scope global noprefixroute rdma9
-42: rdma10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.10.125/12 brd 10.239.255.255 scope global noprefixroute rdma10
-43: rdma11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.11.125/12 brd 10.239.255.255 scope global noprefixroute rdma11
-46: rdma12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.12.125/12 brd 10.239.255.255 scope global noprefixroute rdma12
-47: rdma13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.13.125/12 brd 10.239.255.255 scope global noprefixroute rdma13
-50: rdma14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.14.125/12 brd 10.239.255.255 scope global noprefixroute rdma14
-51: rdma15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.15.125/12 brd 10.239.255.255 scope global noprefixroute rdma15
-
+inet 10.0.2.117/24 brd 10.0.2.255 scope global dynamic eth0
+inet 10.224.0.117/12 brd 10.239.255.255 scope global noprefixroute rdma0
+inet 10.224.1.117/12 brd 10.239.255.255 scope global noprefixroute rdma1
+inet 10.224.2.117/12 brd 10.239.255.255 scope global noprefixroute rdma2
+inet 10.224.3.117/12 brd 10.239.255.255 scope global noprefixroute rdma3
+inet 10.224.4.117/12 brd 10.239.255.255 scope global noprefixroute rdma4
+inet 10.224.5.117/12 brd 10.239.255.255 scope global noprefixroute rdma5
+inet 10.224.6.117/12 brd 10.239.255.255 scope global noprefixroute rdma6
+inet 10.224.7.117/12 brd 10.239.255.255 scope global noprefixroute rdma7
+inet 10.224.8.117/12 brd 10.239.255.255 scope global noprefixroute rdma8
+inet 10.224.9.117/12 brd 10.239.255.255 scope global noprefixroute rdma9
+inet 10.224.10.117/12 brd 10.239.255.255 scope global noprefixroute rdma10
+inet 10.224.11.117/12 brd 10.239.255.255 scope global noprefixroute rdma11
+inet 10.224.12.117/12 brd 10.239.255.255 scope global noprefixroute rdma12
+inet 10.224.13.117/12 brd 10.239.255.255 scope global noprefixroute rdma13
+inet 10.224.14.117/12 brd 10.239.255.255 scope global noprefixroute rdma14
+inet 10.224.15.117/12 brd 10.239.255.255 scope global noprefixroute rdma15
----------------
inst-yyyyy-gpu4-ol89
----------------
-6: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000
-inet 10.0.2.25/24 brd 10.0.2.255 scope global dynamic eth0
-22: rdma0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.0.25/12 brd 10.239.255.255 scope global noprefixroute rdma0
-23: rdma1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.1.25/12 brd 10.239.255.255 scope global noprefixroute rdma1
-26: rdma2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.2.25/12 brd 10.239.255.255 scope global noprefixroute rdma2
-27: rdma3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.3.25/12 brd 10.239.255.255 scope global noprefixroute rdma3
-30: rdma4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.4.25/12 brd 10.239.255.255 scope global noprefixroute rdma4
-31: rdma5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.5.25/12 brd 10.239.255.255 scope global noprefixroute rdma5
-34: rdma6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.6.25/12 brd 10.239.255.255 scope global noprefixroute rdma6
-35: rdma7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.7.25/12 brd 10.239.255.255 scope global noprefixroute rdma7
-38: rdma8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.8.25/12 brd 10.239.255.255 scope global noprefixroute rdma8
-39: rdma9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.9.25/12 brd 10.239.255.255 scope global noprefixroute rdma9
-42: rdma10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.10.25/12 brd 10.239.255.255 scope global noprefixroute rdma10
-43: rdma11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.11.25/12 brd 10.239.255.255 scope global noprefixroute rdma11
-46: rdma12: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.12.25/12 brd 10.239.255.255 scope global noprefixroute rdma12
-47: rdma13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.13.25/12 brd 10.239.255.255 scope global noprefixroute rdma13
-50: rdma14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.14.25/12 brd 10.239.255.255 scope global noprefixroute rdma14
-51: rdma15: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 4220 qdisc mq state UP group default qlen 20000
-inet 10.224.15.25/12 brd 10.239.255.255 scope global noprefixroute rdma15
-
+inet 10.0.2.17/24 brd 10.0.2.255 scope global dynamic eth0
+inet 10.224.0.17/12 brd 10.239.255.255 scope global noprefixroute rdma0
+inet 10.224.1.17/12 brd 10.239.255.255 scope global noprefixroute rdma1
+inet 10.224.2.17/12 brd 10.239.255.255 scope global noprefixroute rdma2
+inet 10.224.3.17/12 brd 10.239.255.255 scope global noprefixroute rdma3
+inet 10.224.4.17/12 brd 10.239.255.255 scope global noprefixroute rdma4
+inet 10.224.5.17/12 brd 10.239.255.255 scope global noprefixroute rdma5
+inet 10.224.6.17/12 brd 10.239.255.255 scope global noprefixroute rdma6
+inet 10.224.7.17/12 brd 10.239.255.255 scope global noprefixroute rdma7
+inet 10.224.8.17/12 brd 10.239.255.255 scope global noprefixroute rdma8
+inet 10.224.9.17/12 brd 10.239.255.255 scope global noprefixroute rdma9
+inet 10.224.10.17/12 brd 10.239.255.255 scope global noprefixroute rdma10
+inet 10.224.11.17/12 brd 10.239.255.255 scope global noprefixroute rdma11
+inet 10.224.12.17/12 brd 10.239.255.255 scope global noprefixroute rdma12
+inet 10.224.13.17/12 brd 10.239.255.255 scope global noprefixroute rdma13
+inet 10.224.14.17/12 brd 10.239.255.255 scope global noprefixroute rdma14
+inet 10.224.15.17/12 brd 10.239.255.255 scope global noprefixroute rdma15
$
```
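
The visual check described above could also be scripted. The following is a minimal sketch, not part of the tutorial: the function name `check_rdma_octets` is hypothetical, and it assumes input lines in the `inet <address>/<prefix> ... <interface>` form shown in the output above. It succeeds only when 16 rdma interfaces are listed and each has the same fourth address field as eth0.

```sh
# Hypothetical helper (not part of the tutorial): reads "inet ..." lines, as
# printed by `ip a | grep -e eth0 -e rdma | grep inet`, from standard input.
# Returns 0 only if 16 rdma interfaces are listed and each one has the same
# fourth address field as eth0; otherwise prints the problems and returns 1.
#
# Example on a single GPU node (assumed usage):
#   ip a | grep -e eth0 -e rdma | grep inet | check_rdma_octets
check_rdma_octets() {
    awk '
        $1 == "inet" {
            split($2, cidr, "/")      # cidr[1] is the address without the prefix length
            split(cidr[1], oct, ".")  # oct[4] is the fourth field of the address
            ifname = $NF              # the interface name is the last field
            if (ifname == "eth0") {
                eth = oct[4]
            } else if (ifname ~ /^rdma[0-9]+$/) {
                rdma[ifname] = oct[4]
                count++
            }
        }
        END {
            bad = 0
            if (eth == "") { print "eth0 address not found"; bad = 1 }
            if (count != 16) { print "expected 16 rdma interfaces, found " count + 0; bad = 1 }
            for (name in rdma) {
                if (rdma[name] != eth) { print "fourth field mismatch on " name; bad = 1 }
            }
            exit bad
        }
    '
}
```

Because the function only reads standard input, it can be tried against saved command output before being used on a live node.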

