Parser for DSE #321

srivatsankrishnan · 2025-01-06T09:19:12Z

Summary

As discussed and agreed before the break, this is the first PR implementing the parser for DSE. The ranges live in the test TOML rather than in the test scenario, as discussed. All variables are defined under cmd_args.

[Update]: Based on discussions with @amaslenn, the CmdArgs type is used throughout instead of a separate dict of str for DSE. The downside is that, since cmd_args is used in nearly all workloads, the type definitions and function signatures have to be defined correctly. The bulk of the changes in the files fix the cmd_args types to keep pyright and pytest happy.
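As a rough illustration of what "static value or list" typing can look like, the sketch below uses a plain dataclass as a stand-in. `ExampleCmdArgs` and `range_fields` are hypothetical names invented for this example; they are not the CloudAI CmdArgs classes, and the real definitions may differ.

```python
from dataclasses import dataclass, fields
from typing import List, Union


@dataclass
class ExampleCmdArgs:
    """Hypothetical stand-in for a workload's CmdArgs; not the CloudAI class."""

    # Each field holds either a single static value or a list of candidates.
    a: Union[int, List[int]] = 1
    num_layers: Union[str, List[str]] = "4"
    use_fp8: Union[str, List[str]] = "1"


def range_fields(args: ExampleCmdArgs) -> List[str]:
    """Return the names of the fields that carry a list (a DSE range)."""
    return [f.name for f in fields(args) if isinstance(getattr(args, f.name), list)]


print(range_fields(ExampleCmdArgs(a=[1, 16])))  # ['a']
print(range_fields(ExampleCmdArgs()))           # []
```

Keeping the range information in the same typed object means every function that accepts the arguments has to admit both shapes in its signature, which is the cost described above.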

The test TOML definition when ranges need to be specified:

name = "example DSE"
description = "Example DSE"
test_template_name = "ExampleEnv"

[cmd_args]
"a" = [1, 16]
"b" = [1, 2, 4, 8]
"c" = [1, 2, 4, 8, 16]
"d" = [1, 2, 4, 8]
"e" = [10, 100, 500]
"f" = [10, 100, 500]
num_layers = "4"
use_fp8 = "1"

[extra_env_vars]
ENV1 = "0"
ENV2 = "1"
ENV3 = "3221225472"

The test TOML definition when only static values need to be specified (as it is defined currently):

name = "CloudAI Test"
description = "Existing CloudAI test"
test_template_name = "ExampleTemplate"

[cmd_args]
"a" = "1"
"b" = "4"
"c" = "16"
"d" = "8"
"e" = "100"
"f" = "500"
num_layers = "4"
use_fp8 = "1"

[extra_env_vars]
ENV1 = "0"
ENV2 = "1"
ENV3 = "3221225472"

The parser now supports both static values and lists. The downstream logic determines how to use cmd_args. For instance, if all values in a list (when one is specified) need to be enumerated, the downstream logic will implement an iterator that manipulates cmd_args accordingly. This is typically handled by the agent or the environment itself. The parser's responsibility is only to parse the value ranges and retain them.
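A minimal sketch of how downstream logic might enumerate the combinations, assuming the parsed cmd_args arrive as a plain dict. `expand_cmd_args` and the type aliases are hypothetical names for illustration only; this PR does not implement the iterator, and the agent or environment may handle it quite differently.

```python
from itertools import product
from typing import Dict, Iterator, List, Union

CmdArgValue = Union[str, int, float, bool]
CmdArgs = Dict[str, Union[CmdArgValue, List[CmdArgValue]]]


def expand_cmd_args(cmd_args: CmdArgs) -> Iterator[Dict[str, CmdArgValue]]:
    """Yield one concrete cmd_args dict per combination of list-valued entries."""
    if not cmd_args:
        return  # assumption: nothing to enumerate when cmd_args is empty
    keys = list(cmd_args)
    # Treat each static value as a one-element range so product() covers both cases.
    choices = [v if isinstance(v, list) else [v] for v in cmd_args.values()]
    for combo in product(*choices):
        yield dict(zip(keys, combo))


parsed = {"a": [1, 16], "b": [1, 2, 4, 8], "num_layers": "4"}
combos = list(expand_cmd_args(parsed))
print(len(combos))  # 8, i.e. 2 * 4 * 1
print(combos[0])    # {'a': 1, 'b': 1, 'num_layers': '4'}
```

With only static values the generator yields exactly one dict, so the same code path serves both of the TOML styles shown above.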

Test Plan

CI/CD.
[Update]: These tests were removed following the discussion in today's design meeting. The details below are retained for history, together with the previous commit hashes. The rationale is that a unit test that does not improve coverage is not useful. For future PRs, I would expect code coverage to be a requirement for judging whether a unit test is useful.

Specifically, the following tests were added for the parser logic modifications:

| Test Purpose | Description |
| --- | --- |
| cmd_args with ranges | Tests that `cmd_args` with ranges are correctly parsed and retain lists. Asserts that the `cmd_args` in the `ConcreteTestDefinition` instance match the expected values. Creates a `Test` instance and asserts that `cmd_args` are the same as `raw_cmd_args`. |
| cmd_args with static values | Tests that `cmd_args` with static values are correctly parsed. Asserts that the `cmd_args` in the `ConcreteTestDefinition` instance match the expected values. Creates a `Test` instance and asserts that `cmd_args` are the same as `raw_cmd_args`. |
| Generate commands with ranges | Tests command generation when `cmd_args` contain ranges. Asserts that the generated commands match the expected commands for all combinations. |
| Generate commands with static values | Tests command generation when `cmd_args` contain only static values. Asserts that the generated commands match the expected single command. |
| Generate commands with empty cmd_args | Tests command generation with empty `cmd_args`. Asserts that no commands are generated. |
| Invalid TOML parsing: missing fields | Tests that `TestParser` raises an error for TOML data missing required fields. Asserts that a `ValidationError` is raised when required fields are missing. |
| Invalid TOML parsing: unexpected field | Tests that `TestDefinition` raises an error for unexpected fields in TOML. Asserts that a `ValidationError` is raised for extra fields. |

Using Existing Grok TestDefinition in CloudAI

| Test Function Name | Description |
| --- | --- |
| test_grok_cmd_args_with_static_values | Tests `GrokTestDefinitionWrapper` with all static values for `cmd_args`. |
| test_grok_cmd_args_with_mixed_values | Tests `GrokTestDefinitionWrapper` with a mix of static and list values for `cmd_args`. |
| test_grok_cmd_args_with_list_values | Tests `GrokTestDefinitionWrapper` with all list values for `cmd_args`. |
| test_grok_cmd_args_with_xla_flags_as_lists | Tests `GrokTestDefinitionWrapper` with static FDL values and list XLA flags. |
| test_grok_cmd_args_with_various_types | Parametrized test for `GrokTestDefinitionWrapper` with various types (single values and lists). |
| test_grok_cmd_args_with_incorrect_types | Parametrized test for `GrokTestDefinitionWrapper` with incorrect types to ensure validation errors. |

Real System Testing

Tested with the sanity-grok-proxy-1 model on IL1.

...
...
...
I0108 23:23:39.034415 140503331768640 programs.py:380] [PAX STATUS]: train_step() took 0.341928 seconds.
I0108 23:23:39.034525 140503331768640 programs.py:515] steps/sec: 2.720266
I0108 23:23:39.034636 140503331768640 py_utils.py:1040] [PAX STATUS]: Elapsed time for <run>: 0.37 seconds  (@ <.../paxml/executors.py:441>)
I0108 23:23:39.034677 140503331768640 executors.py:458] [PAX STATUS]:  Starting eval_step().
I0108 23:23:39.034754 140503331768640 executors.py:421] Training loop completed (step (`20`) greater than or equal to num_train_step (`20`).
I0108 23:23:39.034785 140503331768640 executors.py:559] [PAX STATUS]: Saving checkpoint for final step.
I0108 23:23:39.034812 140503331768640 checkpoint_creators.py:229] Saving a ckpt at final step: 20
I0108 23:23:39.034897 140503331768640 checkpoint.py:407] Closing _NonBlockingCheckpointMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=140503331768640 count=1 at 0x7fc56bba8600>, _store_impl=<orbax.checkpoint.metadata.checkpoint._CheckpointMetadataStoreImpl object at 0x7fc56bb7c790>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7fc56bb7cbe0>, _write_futures=[])
I0108 23:23:39.035296 140381926053440 deleter.py:176] Delete thread exited.
I0108 23:23:39.035467 140503331768640 executors.py:569] [PAX STATUS]: Final checkpoint saved.
I0108 23:23:39.035516 140503331768640 executors.py:309] [PAX STATUS]: Shutting down executor.
I0108 23:23:39.428077 140503331768640 summary_utils.py:473] Closed SummaryWriter `/opt/paxml/workspace/summaries/eval_train`.
I0108 23:23:39.428449 140503331768640 summary_utils.py:473] Closed SummaryWriter `/opt/paxml/workspace/summaries/train`.
I0108 23:23:39.428491 140503331768640 executors.py:315] [PAX STATUS]: Executor shutdown complete.
I0108 23:23:39.428538 140503331768640 py_utils.py:1040] [PAX STATUS]: Elapsed time for <train_and_evaluate>: 70.26 seconds  (@ <.../paxml/main.py:325>)
I0108 23:23:39.428574 140503331768640 py_utils.py:340] Starting sync_global_devices All tasks finish. across 8 devices globally
I0108 23:23:39.431515 140503331768640 py_utils.py:343] Finished sync_global_devices All tasks finish. across 8 devices globally
I0108 23:23:39.431588 140503331768640 py_utils.py:1040] [PAX STATUS]: Elapsed time for <run_experiment>: 70.60 seconds  (@ <.../paxml/main.py:470>)
I0108 23:23:39.431623 140503331768640 py_utils.py:1040] [PAX STATUS]: Elapsed time for <run>: 70.77 seconds  (@ <.../paxml/main.py:560>)
I0108 23:23:39.431658 140503331768640 py_utils.py:1040] [PAX STATUS]: E2E time: Elapsed time for <_main>: 74.58 seconds  (@ <.../paxml/main.py:495>)
Generated:
    /opt/paxml/workspace/nsys_profile_profile.nsys-rep

Additional Notes

Local linting passes, but linting fails in upstream CI/CD. pytest also fails on the golden checks for the acceptance test in sbatch generation. [Update]: This is fixed.

TaekyungHeo

Please review whether all comments are necessary. If not, remove them, leaving only the essential ones.

tests/test_dse_parser.py

src/cloudai/_core/test_template_strategy.py

tests/test_dse_parser.py

TaekyungHeo · 2025-01-06T17:00:57Z

Unit test suggestion: I would like to see whether the list representation works with non-numeric values, such as ["string1", "string2"] or [true, false].
It would be a good idea to update USER_GUIDE.md since Daria is likely to request it.

src/cloudai/_core/test_parser.py

src/cloudai/_core/test_template_strategy.py

srivatsankrishnan · 2025-01-07T08:10:25Z

> Unit test suggestion: I would like to see whether the list representation works with non-numeric values, such as ["string1", "string2"] or [true, false].
>
> It would be a good idea to update USER_GUIDE.md since Daria is likely to request it.

Added more unit tests for the first point. Regarding USER_GUIDE.md: yes, it will be updated after the end-to-end integration. Doing it in this PR, before that, would be premature.

src/cloudai/_core/test.py

tests/test_dse_parser.py

src/cloudai/_core/test.py

src/cloudai/systems/slurm/strategy/slurm_command_gen_strategy.py

tests/dse/test_dse_parser_grok.py

src/cloudai/_core/test.py

srivatsankrishnan added 2 commits January 5, 2025 23:36

Initial port from internal repo

423f1f3

More unit tests

362e8f3

TaekyungHeo requested changes Jan 6, 2025

amaslenn reviewed Jan 6, 2025

src/cloudai/_core/test_parser.py (outdated)

amaslenn reviewed Jan 6, 2025

src/cloudai/_core/test_template_strategy.py

ruff changes

74079b2

srivatsankrishnan force-pushed the dse_parser branch from 041f4c7 to 74079b2 Compare January 6, 2025 17:30

srivatsankrishnan added 3 commits January 6, 2025 21:21

remove comments

51e6d0a

more unit tests for non-integer values and mixed type values.

6983f41

fix for test_acceptance.py pytest failure.

9105eea

Merge branch 'main' into dse_parser

7737375

amaslenn reviewed Jan 7, 2025

src/cloudai/_core/test.py (outdated)

tests/test_dse_parser.py (outdated)

srivatsankrishnan added 12 commits January 7, 2025 17:16

remove the default condition check

8072173

preserves lists in cmd_args as is (for pydantic validation)

5c83b73

propate cmd_args type to all places in cloudAI for pyright errors

3c5253f

Add ClassVar to remove pydantic annonation error

8d6bccb

fix pytest

16282ff

Merge branch 'main' into dse_parser

0aef8ad

more unit tests for parser with Grok Test definition + pydantic of on…

ad8b7e5

…e of the variables that is list.

ruffing

f3e7295

Add more test to have ranges for FDL flags.

01bd99e

More test for XLA flags as list other fixed + fixing typing in Grok/J…

eac7353

…axtookbox definitions

All static values (benchmarking scenarios in CloudAI)

0dbcdcf

negative tests with various types in the list

b42e4f1

srivatsankrishnan marked this pull request as ready for review January 8, 2025 08:13

amaslenn reviewed Jan 8, 2025

src/cloudai/_core/test.py (outdated)

src/cloudai/systems/slurm/strategy/slurm_command_gen_strategy.py (outdated)

tests/dse/test_dse_parser_grok.py (outdated)

srivatsankrishnan added 2 commits January 8, 2025 11:45

remove the unit tests

490af18

remove instance check (assuming model_dump() never fails)

97f7096

fix the typing for slurm_args

ae32060

srivatsankrishnan requested a review from TaekyungHeo January 8, 2025 20:05

removing the old _parser_cmd method that is not used.

4cb6a05

TaekyungHeo previously approved these changes Jan 8, 2025

TaekyungHeo added the feature label Jan 8, 2025

amaslenn reviewed Jan 9, 2025

src/cloudai/_core/test.py (outdated)

Remove the cmd_args typing

7735d8b

srivatsankrishnan dismissed TaekyungHeo’s stale review via 7735d8b January 9, 2025 16:48

amaslenn approved these changes Jan 9, 2025

TaekyungHeo approved these changes Jan 9, 2025

srivatsankrishnan merged commit 8e9935b into NVIDIA:main Jan 9, 2025
2 checks passed
