NeMo 2.0 support #293

Merged 14 commits on Nov 20, 2024

TaekyungHeo (Member) commented Oct 29, 2024

Summary

This PR introduces support for NeMo 2.0 in CloudAI. Initially, we planned to dump fiddle configurations to a file and load them in NeMo-Run. However, I changed the approach to execute the model with NeMo-Run directly. Marc Romejin informed me that NeMo-Run can run a task from a recipe without generating an sbatch script, a mode known as a "direct executor". To run NeMo 2.0, you can use the following command:

$ srun -t "60:00" --account=hw_nsw_misc --ntasks-per-node=8 --container-image=nvcr.io/nvidia/nemo:dev --pty nemo llm pretrain -y --factory llama3_8b trainer.max_steps=5 log.ckpt.save_on_train_epoch_end=False log.ckpt.save_last=False
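
A minimal sketch of how a command like the one above could be assembled programmatically. The helper name `build_nemo_run_command` and its parameters are hypothetical and are not CloudAI's actual API; the sketch only reproduces the flags and recipe overrides that appear in the command above.

```python
import shlex
from typing import Dict, List


def build_nemo_run_command(
    factory: str,
    overrides: Dict[str, str],
    container_image: str,
    account: str,
    time_limit: str = "60:00",
    ntasks_per_node: int = 8,
) -> List[str]:
    """Return the srun argument list for a direct `nemo llm pretrain` run.

    Hypothetical helper for illustration only; it mirrors the flags in the
    command quoted in the PR description and nothing else.
    """
    srun_part = [
        "srun",
        "-t", time_limit,
        f"--account={account}",
        f"--ntasks-per-node={ntasks_per_node}",
        f"--container-image={container_image}",
        "--pty",
    ]
    nemo_part = ["nemo", "llm", "pretrain", "-y", "--factory", factory]
    # Recipe overrides follow the factory name as key=value tokens.
    nemo_part += [f"{key}={value}" for key, value in overrides.items()]
    return srun_part + nemo_part


command = build_nemo_run_command(
    factory="llama3_8b",
    overrides={
        "trainer.max_steps": "5",
        "log.ckpt.save_on_train_epoch_end": "False",
        "log.ckpt.save_last": "False",
    },
    container_image="nvcr.io/nvidia/nemo:dev",
    account="hw_nsw_misc",
)
# shlex.join quotes tokens only where the shell needs it.
print(shlex.join(command))
```

Keeping the command as a list of tokens until the last moment avoids quoting mistakes when overrides contain spaces or shell metacharacters.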

Test Plan

  1. CI passes
  2. Ran on a server
$ cloudai run --system-config ~/cloudaix/conf/common/system/eos.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml                                 

/home/theo/scratch/miniconda3/envs/test4/lib/python3.9/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.20) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "
[INFO] System Name: EOS
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: nemo_run_llama3_8b
  Test Name: nemo_run_llama3_8b
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: nemo_run_llama3_8b
[INFO] Running test: nemo_run_llama3_8b

$ cd results/nemo_run_llama3_8b_2024-11-15_10-16-03/nemo_run_llama3_8b/0
$ tail stdout.txt 
        module.decoder.layers.0.self_attention.linear_proj.weight
        module.decoder.layers.0.self_attention.linear_qkv.layer_norm_weight
    Params for bucket 98 (206045184 elements):
        module.embedding.word_embeddings.weight
[NeMo I 2024-11-15 10:22:04 utils:259] Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0003, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.1, fp16=False, bf16=True, params_dtype=torch.bfloat16, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.95, adam_eps=1e-05, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_grad_reduce=False, overlap_param_gather=False, overlap_param_gather_with_optimizer_step=False, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='')
Training epoch 0, iteration 0/4 | lr: 1.499e-07 | consumed_samples: 512 | global_batch_size: 512 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 61.94
Training epoch 0, iteration 1/4 | lr: 2.999e-07 | consumed_samples: 1024 | global_batch_size: 512 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 53.67
Training epoch 0, iteration 2/4 | lr: 4.498e-07 | consumed_samples: 1536 | global_batch_size: 512 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 52.45
Training epoch 0, iteration 3/4 | lr: 5.997e-07 | consumed_samples: 2048 | global_batch_size: 512 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 52.54
Training epoch 0, iteration 4/4 | lr: 7.496e-07 | consumed_samples: 2560 | global_batch_size: 512 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 53.16
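
A small sketch of how the step timings in the log above could be summarized. The function names and the choice to treat the first iteration as warmup are assumptions for illustration; this is not how CloudAI itself parses reports. The sample text is an excerpt of the lines above with the middle fields omitted.

```python
import re
from statistics import mean
from typing import List

# Matches the per-iteration timing field printed at the end of each log line.
STEP_TIME_RE = re.compile(r"train_step_timing in s: ([0-9]+(?:\.[0-9]+)?)")


def extract_step_times(log_text: str) -> List[float]:
    """Return every train_step_timing value found in the log, in order."""
    return [float(m.group(1)) for m in STEP_TIME_RE.finditer(log_text)]


def steady_state_mean(step_times: List[float], warmup_steps: int = 1) -> float:
    """Average step time after dropping the first `warmup_steps` iterations."""
    steady = step_times[warmup_steps:]
    if not steady:
        raise ValueError("not enough iterations after warmup")
    return mean(steady)


# Excerpt of the stdout.txt lines shown above (middle fields omitted).
sample_log = """\
Training epoch 0, iteration 0/4 | global_step: 0 | reduced_train_loss: 11.03 | train_step_timing in s: 61.94
Training epoch 0, iteration 1/4 | global_step: 1 | reduced_train_loss: 11.03 | train_step_timing in s: 53.67
Training epoch 0, iteration 2/4 | global_step: 2 | reduced_train_loss: 11.03 | train_step_timing in s: 52.45
Training epoch 0, iteration 3/4 | global_step: 3 | reduced_train_loss: 11.03 | train_step_timing in s: 52.54
Training epoch 0, iteration 4/4 | global_step: 4 | reduced_train_loss: 11.03 | train_step_timing in s: 53.16
"""

times = extract_step_times(sample_log)
print(f"steps parsed: {len(times)}")
print(f"mean step time after warmup: {steady_state_mean(times):.3f} s")
```

In this run the first iteration (61.94 s) is noticeably slower than the remaining four, which is why the sketch excludes it from the average.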

@TaekyungHeo TaekyungHeo added feature Jan25 Jan'25 release feature labels Oct 29, 2024
@TaekyungHeo TaekyungHeo changed the title NeMo 2.0 NeMo 2.0 Support Oct 29, 2024
@TaekyungHeo TaekyungHeo changed the title NeMo 2.0 Support NeMo 2.0 support Oct 29, 2024
@TaekyungHeo TaekyungHeo force-pushed the nemo2.0 branch 3 times, most recently from 9349a3f to f07e0f1 Compare October 31, 2024 16:25
@TaekyungHeo TaekyungHeo force-pushed the nemo2.0 branch 21 times, most recently from e3ca13b to 68025cc Compare November 15, 2024 19:37
@TaekyungHeo TaekyungHeo force-pushed the nemo2.0 branch 2 times, most recently from a2e30ee to 15b2e87 Compare November 15, 2024 20:04
@TaekyungHeo TaekyungHeo marked this pull request as ready for review November 15, 2024 21:20
amaslenn (Contributor) left a comment

Please add a new case to test_acceptance.

TaekyungHeo (Member, Author) commented Nov 18, 2024

Design Discussion (Nov 18th, 2024)

  • Srivatsan - Let's see if you can support more models with this PR; additional complexities may arise.

srivatsankrishnan (Contributor) left a comment
The reason will be displayed to describe this comment to others. Learn more.

Based on the conversation in the last call, what we are calling NeMo 2.0 support is the direct executor method: calling srun directly, without sbatch.

#293 (comment)

  1. Can we ensure this works with test hooks? (@TaekyungHeo, if I recall, you mentioned this should be simpler with the NeMo 2.0 integration in CloudAI.) If yes, as part of calling NeMo 2.0 support complete, can we have example configurations that are also tested with test hooks? This could be a different PR, but I feel it should be there.

  2. If the direct executor is going to be a generic feature in NeMo 2.0, can we test it with other models to ensure this simpler interface holds across different models? Zsolt seems to be running more complex models via NeMo 2.0. Can we keep these models on the radar and ensure this approach works for them as well?

(If I recall, both @TaekyungHeo and @amaslenn mentioned they are okay with this PR as is, and a future PR should address these points.)

So I will approve this PR, but the items above should be added before we call NeMo 2.0 support complete, IMO.

cc: @srinivas212

TaekyungHeo (Member, Author) commented

Thanks, @srivatsankrishnan.

  1. PR #305, "Update test_acceptance to handle pre-test and non-pre-test cases for nemo-run", shows how pre-test works with NeMo-Run. Please check.
  2. This PR shows that the direct executor works for NeMo 2.0. However, it's hard to say that we support all models in NeMo 2.0 with this PR. Some models may need additional arguments or mount points. Still, this is a valid starting point, and we can claim that the NeMo 2.0 POC is ready. When we supported NeMo 1.0, we did not support all models in the first PR. The first PR introduced the idea while supporting a single model, and we gradually improved and refactored the code when needed. We can take the same approach here, understanding that refactoring or additional changes may be required.
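
To make the second point concrete, here is one possible way to record per-model extras (additional arguments and mount points) so that new models can be added without changing the command-building logic. Everything in this sketch is hypothetical: `ModelRunSpec`, the `some_other_model` entry, and its paths and override are placeholders, not part of CloudAI or NeMo-Run. It also assumes the cluster's container plugin accepts `--container-mounts` alongside the `--container-image` flag used in the PR description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelRunSpec:
    """Hypothetical per-model settings for a direct-executor run."""

    factory: str
    extra_overrides: Dict[str, str] = field(default_factory=dict)
    container_mounts: List[str] = field(default_factory=list)  # "src:dst" pairs

    def srun_container_args(self) -> List[str]:
        """Mount flag for srun, or nothing when the model needs no mounts."""
        if not self.container_mounts:
            return []
        return ["--container-mounts=" + ",".join(self.container_mounts)]

    def recipe_args(self, common_overrides: Dict[str, str]) -> List[str]:
        """Factory plus overrides; model-specific values win on conflict."""
        merged = {**common_overrides, **self.extra_overrides}
        return ["--factory", self.factory] + [f"{k}={v}" for k, v in merged.items()]


# Overrides shared by every model in this sketch.
COMMON = {"trainer.max_steps": "5"}

SPECS = {
    # The model exercised in this PR: no extra arguments or mounts.
    "llama3_8b": ModelRunSpec(factory="llama3_8b"),
    # Placeholder entry showing where extras would go (all values hypothetical).
    "some_other_model": ModelRunSpec(
        factory="some_other_model",
        extra_overrides={"trainer.max_steps": "10"},
        container_mounts=["/path/to/data:/data"],
    ),
}

for name, spec in SPECS.items():
    print(name, spec.srun_container_args(), spec.recipe_args(COMMON))
```

With this shape, supporting another model means adding one entry rather than refactoring, which matches the incremental approach described above.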

@TaekyungHeo TaekyungHeo merged commit 5c3fd22 into NVIDIA:main Nov 20, 2024
2 checks passed
Labels
feature Jan25 Jan'25 release feature
3 participants