
Llama #23

Merged
merged 3 commits into from
Jun 4, 2024
Conversation

jeffnvidia
Contributor

@jeffnvidia jeffnvidia commented May 19, 2024

Summary

Based on PR 20:
Enable running the Llama tests
Create the Llama TOML file and adapt the NeMo generation command

There is a FIXME inside this PR (in the Llama.toml file):

FIXME: ~training.model.position_embedding_type was added to the extra_cmd_args in order to work around a bug in the NeMo repository (https://github.com/NVIDIA/NeMo).
The commit that should fix this issue in NeMo is 5b296e8af832c67d361fdfb80a165db3affaf76a.
Once a new release of NeMoLauncher includes this commit (verify by downloading the corresponding container and looking inside /opt for this commit), ~training.model.position_embedding_type should be removed from the extra args.
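For reference, the workaround might look roughly like the fragment below. This is a hedged sketch based only on the description above: the section name and surrounding keys are assumptions, not the actual contents of the Llama.toml in this PR.

```toml
# Hypothetical sketch of the workaround described above; the real
# llama.toml in this PR may use different section and key names.
[cmd_args]
# ... other NeMo launcher arguments ...

# FIXME: the leading "~" is Hydra's delete-key syntax. Remove this
# override once the NeMoLauncher container ships NeMo commit
# 5b296e8af832c67d361fdfb80a165db3affaf76a.
extra_cmd_args = "~training.model.position_embedding_type"
```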

Test Plan

Test by @amaslenn: CI
Test by @jeffnvidia: Slurm command generation
$ python ./cloudaix.py --mode run --system_config_path conf/v0.6/general/system/... --test_scenario_path conf/v0.6/general/test_scenario/llama/llama.toml

Additional Notes

Llama 16 nodes alone (3, 5, 8): /auto/mtrsysgwork/jmahou/git/asap_cloudai/results/2024-05-01_12-17-26

(screenshot of results omitted)

cmd: $ python ./cloudaix.py --mode run --system_config_path conf/v0.6/general/system/... --test_scenario_path conf/v0.6/general/test_scenario/llama/llama.toml
Llama 16 nodes (3, 5, 8) + bisection noise on 72 nodes: /auto/mtrsysgwork/jmahou/git/asap_cloudai/results/2024-05-01_13-42-47

(screenshot of results omitted)

cmd: $ python ./cloudaix.py --mode run --system_config_path conf/v0.6/general/system/... --test_scenario_path conf/v0.6/general/test_scenario/llama/llama_with_noise.toml

@jeffnvidia jeffnvidia force-pushed the Llama branch 3 times, most recently from a0be834 to 56f7c46 Compare May 29, 2024 14:35
@jeffnvidia jeffnvidia force-pushed the Llama branch 2 times, most recently from ff437f8 to 5d7761a Compare June 3, 2024 13:14
@TaekyungHeo TaekyungHeo merged commit 121bab7 into NVIDIA:main Jun 4, 2024
2 checks passed
@jeffnvidia jeffnvidia deleted the Llama branch July 30, 2024 14:32