add multinode support via slurm trainer, large scale race condition fix #63

Merged: 4 commits into pytorch:main from expand_multi_node on Feb 22, 2024

lessw2020 (Contributor)

This PR
1 - adds multi-node training support via a multinode_trainer.slurm file. Verified with Llama 7B on 20 nodes / 160 A100s.
2 - fixes a race condition that can occur in profiling during larger-scale training: process 1 checks for the trace dir and finds it missing, another process 2 creates the directory in the interim, and process 1 then errors out when it tries to create a directory that now exists.
The fix is simple: add exist_ok=True to both directory-creation calls (dump folder, trace folder).
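
The race and the fix described above can be sketched in a few lines. This is a minimal illustration, not the code changed in this PR: it assumes the directories are created with Python's `os.makedirs`, and the helper names (`simulate_lost_race`, `make_dir_safe`, `create_from_many_workers`) are hypothetical. The first function replays the losing interleaving deterministically; the second shows the idempotent call that avoids it.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def simulate_lost_race(path: str) -> bool:
    """Replay the interleaving from the PR description in a single process.

    Returns True if the second creation attempt raised FileExistsError.
    """
    saw_missing = not os.path.exists(path)  # "process 1" checks: dir is missing
    os.makedirs(path)                       # "process 2" creates it in the interim
    try:
        if saw_missing:
            os.makedirs(path)               # "process 1" acts on its stale check
    except FileExistsError:
        return True
    return False


def make_dir_safe(path: str) -> None:
    # exist_ok=True makes the call idempotent, so no prior existence check
    # is needed and concurrent callers cannot trip over each other.
    os.makedirs(path, exist_ok=True)


def create_from_many_workers(path: str, num_workers: int = 8) -> int:
    """Call make_dir_safe concurrently; return the number of calls that succeeded."""
    def attempt(_: int) -> bool:
        make_dir_safe(path)
        return True

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return sum(pool.map(attempt, range(num_workers)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(simulate_lost_race(os.path.join(tmp, "racy", "traces")))
        print(create_from_many_workers(os.path.join(tmp, "safe", "traces")))
```

Threads stand in for the separate training processes here; the same reasoning applies when many ranks share a filesystem and all try to create the dump or trace folder at startup.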

[Two screenshots attached: "Screenshot 2024-02-15 at 10 53 18 PM" and "Screenshot 2024-02-15 at 10 55 02 PM"]

@facebook-github-bot added the CLA Signed label Feb 16, 2024
multinode_trainer.slurm (review thread resolved)
@wanchaol (Contributor) left a comment

Awesome!!

@lessw2020 merged commit 70be86e into pytorch:main Feb 22, 2024
3 checks passed
@lessw2020 deleted the expand_multi_node branch February 22, 2024 18:31
lessw2020 added a commit that referenced this pull request Apr 18, 2024
…ix (#63)

<img width="1047" alt="Screenshot 2024-02-15 at 10 53 18 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/20378637-4adb-425b-91d8-7fd36289d3b5">
<img width="545" alt="Screenshot 2024-02-15 at 10 55 02 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/28658614-cff6-42b5-ab57-bac578393d5c">
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
…ix (pytorch#63)

4 participants