Parser for DSE #321

Merged: 24 commits, Jan 9, 2025
Conversation

srivatsankrishnan
Contributor

@srivatsankrishnan srivatsankrishnan commented Jan 6, 2025

Summary

As discussed and agreed before the break, here is the first PR implementing the parser for DSE. The ranges will be in the test TOML instead of the test scenario, as discussed. All variable definitions go under cmd_args.

[Update]: Based on discussions with @amaslenn, the CmdArgs type is used throughout instead of a separate dict of str for DSE. The downside is that, since cmd_args is used in nearly all workloads, the type definitions and function signatures have to be defined correctly. The bulk of the changes in the files fix the types for cmd_args to make pyright/pytest happy.

The test TOML definition when ranges need to be specified:

name = "example DSE"
description = "Example DSE"
test_template_name = "ExampleEnv"

[cmd_args]
"a" = [1, 16]
"b" = [1, 2, 4, 8]
"c" = [1, 2, 4, 8, 16]
"d" = [1, 2, 4, 8]
"e" = [10, 100, 500]
"f" = [10, 100, 500]
num_layers = "4"
use_fp8 = "1"

[extra_env_vars]
ENV1 = "0"
ENV2 = "1"
ENV3 = "3221225472"

The test TOML definition when only static values are specified (as it is defined currently):

name = "CloudAI Test"
description = "Existing CloudAI test"
test_template_name = "ExampleTemplate"

[cmd_args]
"a" = "1"
"b" = "4"
"c" = "16"
"d" = "8"
"e" =  "100"
"f" = "500"
num_layers = "4"
use_fp8 = "1"

[extra_env_vars]
ENV1 = "0"
ENV2 = "1"
ENV3 = "3221225472"

The parser now supports both static values and lists. The downstream logic decides how to use the cmd_args. For instance, if all possible values in a list (when specified) need to be enumerated, the downstream logic will implement an iterator that manipulates the cmd_args accordingly. This is typically handled by the agent or the environment itself. The parser's responsibility is only to ensure that it parses and retains the value ranges.
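
One possible shape for such a downstream iterator is sketched below with `itertools.product`. The function name `enumerate_cmd_args` is an assumption made for illustration and does not come from this PR; the actual agent or environment may enumerate or sample the space differently.

```python
from itertools import product


def enumerate_cmd_args(cmd_args):
    """Yield one fully static cmd_args dict per combination of list values.

    Hypothetical helper, not CloudAI's implementation.
    """
    if not cmd_args:
        return  # empty cmd_args: nothing to generate
    keys = list(cmd_args)
    # Wrap static values in a one-element list so product() treats them uniformly.
    value_lists = [v if isinstance(v, list) else [v] for v in cmd_args.values()]
    for combo in product(*value_lists):
        yield dict(zip(keys, combo))


cmd_args = {"a": [1, 16], "b": [1, 2, 4, 8], "num_layers": "4"}
combos = list(enumerate_cmd_args(cmd_args))
print(len(combos))  # 8
print(combos[0])    # {'a': 1, 'b': 1, 'num_layers': '4'}
```

With only static values this yields a single dict, and with empty cmd_args it yields nothing. For the full DSE example TOML, exhaustive enumeration would give 2 × 4 × 5 × 4 × 3 × 3 = 1440 combinations.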

Test Plan

CI/CD.
[Update]: These tests were removed based on today's discussion in the design meeting. The details below are retained for history, alongside the previous commit hash. The rationale was that a unit test that doesn't improve coverage is not useful. For future PRs, I would expect code coverage to be a requirement for determining the usefulness of any unit test.

Specifically, we add the following tests for the parser logic modifications.

| Test | Purpose | Description |
| --- | --- | --- |
| cmd_args with ranges | Tests that cmd_args with ranges are correctly parsed and retain lists. | Asserts that the cmd_args in the ConcreteTestDefinition instance match the expected values. Creates a Test instance and asserts that cmd_args are the same as raw_cmd_args. |
| cmd_args with static values | Tests that cmd_args with static values are correctly parsed. | Asserts that the cmd_args in the ConcreteTestDefinition instance match the expected values. Creates a Test instance and asserts that cmd_args are the same as raw_cmd_args. |
| Generate commands with ranges | Tests command generation when cmd_args contain ranges. | Asserts that the generated commands match the expected commands for all combinations. |
| Generate commands with static values | Tests command generation when cmd_args contain only static values. | Asserts that the generated commands match the expected single command. |
| Generate commands with empty cmd_args | Tests command generation with empty cmd_args. | Asserts that no commands are generated. |
| Invalid TOML parsing: missing fields | Tests that TestParser raises an error for TOML data missing required fields. | Asserts that a ValidationError is raised when required fields are missing. |
| Invalid TOML parsing: unexpected field | Tests that TestDefinition raises an error for unexpected fields in TOML. | Asserts that a ValidationError is raised for extra fields. |

Using Existing Grok TestDefinition in CloudAI

| Test Function Name | Description |
| --- | --- |
| test_grok_cmd_args_with_static_values | Tests GrokTestDefinitionWrapper with all static values for cmd_args. |
| test_grok_cmd_args_with_mixed_values | Tests GrokTestDefinitionWrapper with a mix of static and list values for cmd_args. |
| test_grok_cmd_args_with_list_values | Tests GrokTestDefinitionWrapper with all list values for cmd_args. |
| test_grok_cmd_args_with_xla_flags_as_lists | Tests GrokTestDefinitionWrapper with static FDL values and list XLA flags. |
| test_grok_cmd_args_with_various_types | Parametrized test for GrokTestDefinitionWrapper with various types (single values and lists). |
| test_grok_cmd_args_with_incorrect_types | Parametrized test for GrokTestDefinitionWrapper with incorrect types to ensure validation errors. |

Real System Testing

Tested on sanity-grok-proxy-1 model on IL1.

...
...
...
I0108 23:23:39.034415 140503331768640 programs.py:380] [PAX STATUS]: train_step() took 0.341928 seconds.
I0108 23:23:39.034525 140503331768640 programs.py:515] steps/sec: 2.720266
I0108 23:23:39.034636 140503331768640 py_utils.py:1040] [PAX STATUS]: Elapsed time for <run>: 0.37 seconds  (@ <.../paxml/executors.py:441>)
I0108 23:23:39.034677 140503331768640 executors.py:458] [PAX STATUS]:  Starting eval_step().
I0108 23:23:39.034754 140503331768640 executors.py:421] Training loop completed (step (`20`) greater than or equal to num_train_step (`20`).
I0108 23:23:39.034785 140503331768640 executors.py:559] [PAX STATUS]: Saving checkpoint for final step.
I0108 23:23:39.034812 140503331768640 checkpoint_creators.py:229] Saving a ckpt at final step: 20
I0108 23:23:39.034897 140503331768640 checkpoint.py:407] Closing _NonBlockingCheckpointMetadataStore(enable_write=True, _write_lock=<locked _thread.RLock object owner=140503331768640 count=1 at 0x7fc56bba8600>, _store_impl=<orbax.checkpoint.metadata.checkpoint._CheckpointMetadataStoreImpl object at 0x7fc56bb7c790>, _single_thread_executor=<concurrent.futures.thread.ThreadPoolExecutor object at 0x7fc56bb7cbe0>, _write_futures=[])
I0108 23:23:39.035296 140381926053440 deleter.py:176] Delete thread exited.
I0108 23:23:39.035467 140503331768640 executors.py:569] [PAX STATUS]: Final checkpoint saved.
I0108 23:23:39.035516 140503331768640 executors.py:309] [PAX STATUS]: Shutting down executor.
I0108 23:23:39.428077 140503331768640 summary_utils.py:473] Closed SummaryWriter `/opt/paxml/workspace/summaries/eval_train`.
I0108 23:23:39.428449 140503331768640 summary_utils.py:473] Closed SummaryWriter `/opt/paxml/workspace/summaries/train`.
I0108 23:23:39.428491 140503331768640 executors.py:315] [PAX STATUS]: Executor shutdown complete.
I0108 23:23:39.428538 140503331768640 py_utils.py:1040] [PAX STATUS]: Elapsed time for <train_and_evaluate>: 70.26 seconds  (@ <.../paxml/main.py:325>)
I0108 23:23:39.428574 140503331768640 py_utils.py:340] Starting sync_global_devices All tasks finish. across 8 devices globally
I0108 23:23:39.431515 140503331768640 py_utils.py:343] Finished sync_global_devices All tasks finish. across 8 devices globally
I0108 23:23:39.431588 140503331768640 py_utils.py:1040] [PAX STATUS]: Elapsed time for <run_experiment>: 70.60 seconds  (@ <.../paxml/main.py:470>)
I0108 23:23:39.431623 140503331768640 py_utils.py:1040] [PAX STATUS]: Elapsed time for <run>: 70.77 seconds  (@ <.../paxml/main.py:560>)
I0108 23:23:39.431658 140503331768640 py_utils.py:1040] [PAX STATUS]: E2E time: Elapsed time for <_main>: 74.58 seconds  (@ <.../paxml/main.py:495>)
Generated:
    /opt/paxml/workspace/nsys_profile_profile.nsys-rep

Additional Notes

Local linting passes but fails in upstream CI/CD. pytest is failing the golden checks for the sbatch-generation acceptance test. [Update]: This is fixed.

Member

@TaekyungHeo TaekyungHeo left a comment

Please review whether all comments are necessary. If not, remove them, leaving only the essential ones.

tests/test_dse_parser.py Outdated Show resolved Hide resolved
src/cloudai/_core/test_template_strategy.py Outdated Show resolved Hide resolved
src/cloudai/_core/test_template_strategy.py Outdated Show resolved Hide resolved
tests/test_dse_parser.py Outdated Show resolved Hide resolved
tests/test_dse_parser.py Outdated Show resolved Hide resolved
tests/test_dse_parser.py Outdated Show resolved Hide resolved
tests/test_dse_parser.py Outdated Show resolved Hide resolved
tests/test_dse_parser.py Outdated Show resolved Hide resolved
@TaekyungHeo
Member

TaekyungHeo commented Jan 6, 2025

  • Unit test suggestion: I would like to see whether the list representation works with non-numeric values, such as ["string1", "string2"] or [true, false].
  • It would be a good idea to update USER_GUIDE.md since Daria is likely to request it.

@srivatsankrishnan
Contributor Author

  • Unit test suggestion: I would like to see whether the list representation works with non-numeric values, such as ["string1", "string2"] or [true, false].
  • It would be a good idea to update USER_GUIDE.md since Daria is likely to request it.

Added more unit tests for the first point. Regarding USER_GUIDE.md: yes, it will be updated after the end-to-end integration. Doing it in this PR, before that, would be premature.

src/cloudai/_core/test.py Outdated Show resolved Hide resolved
tests/test_dse_parser.py Outdated Show resolved Hide resolved
@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review January 8, 2025 08:13
src/cloudai/_core/test.py Outdated Show resolved Hide resolved
tests/dse/test_dse_parser_grok.py Outdated Show resolved Hide resolved
TaekyungHeo
TaekyungHeo previously approved these changes Jan 8, 2025
src/cloudai/_core/test.py Outdated Show resolved Hide resolved
@srivatsankrishnan srivatsankrishnan merged commit 8e9935b into NVIDIA:main Jan 9, 2025
2 checks passed