add multinode support via slurm trainer, large scale race condition fix #63

lessw2020 · 2024-02-16T17:27:43Z

This PR:
1 - Adds multi-node training support via a multinode_trainer.slurm file. Verified with Llama 7B on 20 nodes / 160 A100s.
2 - Fixes a race condition in profiling that can occur in larger-scale training: process 1 checks for the trace dir and finds it missing, another process 2 creates the directory in the interim, and process 1 then errors out when it tries to create a dir that now exists.
The fix is simple: add exist_ok=True to both directory-creation calls (dump folder, trace folder).
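
For readers unfamiliar with the pattern, here is a minimal sketch of the check-then-create race and the `exist_ok=True` fix. It assumes the directories are created with Python's `os.makedirs`. The helper names (`make_dir_racy`, `make_dir_safe`, `simulate_ranks`) and the `dump/traces` path are hypothetical and not taken from the repository, and threads stand in for training processes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_dir_racy(path):
    """Check-then-create: unsafe when several ranks run it at once (hypothetical helper)."""
    if not os.path.exists(path):
        # Another rank may create `path` right here, after the check above,
        # in which case os.makedirs raises FileExistsError.
        os.makedirs(path)


def make_dir_safe(path):
    """Create-if-missing in one call; an existing directory is not an error."""
    os.makedirs(path, exist_ok=True)


def simulate_ranks(num_ranks=8):
    """Threads stand in for ranks; returns True if the directory exists afterwards."""
    with tempfile.TemporaryDirectory() as root:
        trace_dir = os.path.join(root, "dump", "traces")  # illustrative path only
        with ThreadPoolExecutor(max_workers=num_ranks) as pool:
            # Every "rank" tries to create the same directory concurrently.
            list(pool.map(make_dir_safe, [trace_dir] * num_ranks))
        return os.path.isdir(trace_dir)


if __name__ == "__main__":
    print(simulate_ranks())
```

With `exist_ok=True`, it no longer matters which rank creates the directory first, so no separate existence check or cross-rank synchronization is needed for this step.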

multinode_trainer.slurm

wanchaol

Awesome!!

add multinode support via slurm trainer

6ee8941

facebook-github-bot added the CLA Signed label Feb 16, 2024

lessw2020 requested review from wanchaol, wconstab and tianyu-l February 16, 2024 17:28

updated license agreement

d345408

wconstab approved these changes Feb 19, 2024

wanchaol approved these changes Feb 21, 2024

wanchaol reviewed Feb 21, 2024

lessw2020 added 2 commits February 22, 2024 09:16

Merge branch 'pytorch-labs:main' into expand_multi_node

5dda674

add info comments in readme and slurm file for usage tips.

3e59daf

lessw2020 merged commit 70be86e into pytorch:main Feb 22, 2024
3 checks passed

lessw2020 deleted the expand_multi_node branch February 22, 2024 18:31
