Results directory for DSE job #333

Draft. Wants to merge 54 commits into base: main

@srivatsankrishnan (Contributor) commented Jan 11, 2025

Summary

This PR introduces the concept of dse_iteration and bakes it into how the results are stored for each iteration. Each DSE iteration should generate its own iteration_x folder, which then holds the generated batch and srun scripts. The TestRun definition already has current_iteration and iteration fields, so this needs more discussion, and the test_slurm unit test must be updated accordingly.

A normal benchmarking job has no concept of iterations, so we should expect a single iteration_1 folder with a structure like this:

results
|--iteration_1
  |--test_scenario.name
      |--Test.1
            |---0
                  |--sbatch
                  |--run.sh

For a DSE job, each iteration gets its own iteration_x folder:

results
|--iteration_1
  |--test_scenario.name
      |--Test.1
            |---0
                  |--sbatch
                  |--run.sh
|--iteration_2
  |--test_scenario.name
      |--Test.1
            |---0
                  |--sbatch
                  |--run.sh
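
The layout above can be sketched as a small path helper. This is only an illustration, not the code in this PR: the function names, the parameter names, and the 1-based numbering (inferred from the iteration_1 folder in the trees) are assumptions.

```python
from pathlib import Path


def iteration_dir(results_root: Path, dse_iteration: int) -> Path:
    """Return the per-iteration folder, e.g. results/iteration_3.

    Hypothetical helper: numbering is assumed to start at 1.
    """
    if dse_iteration < 1:
        raise ValueError("dse_iteration is assumed to start at 1")
    return results_root / f"iteration_{dse_iteration}"


def run_output_dir(
    results_root: Path,
    scenario_name: str,
    test_name: str,
    dse_iteration: int = 1,
    run_index: int = 0,
) -> Path:
    """Build results/iteration_x/<scenario>/<test>/<run_index>.

    A normal benchmarking job would keep the default dse_iteration=1,
    while a DSE job would pass a new value for every iteration.
    """
    return (
        iteration_dir(results_root, dse_iteration)
        / scenario_name
        / test_name
        / str(run_index)
    )
```

With this sketch, a benchmarking job and the second DSE iteration differ only in the iteration_x component of the path.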

Test Plan

CI/CD
Dry-Run

$ cloudai dry-run --system-config conf/common/system/example_slurm_cluster.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_jaxtoolbox.toml
.
└── dse_jaxtoolbox_grok
    ├── iteration_1
    │   └── 2025-01-11_14-28-16
    │       └── Tests.1
    │           └── 0
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_2
    │   └── 2025-01-11_14-28-17
    │       └── Tests.1
    │           └── 1
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_3
    │   └── 2025-01-11_14-28-18
    │       └── Tests.1
    │           └── 2
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_4
    │   └── 2025-01-11_14-28-19
    │       └── Tests.1
    │           └── 3
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_5
    │   └── 2025-01-11_14-28-20
    │       └── Tests.1
    │           └── 4
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_6
    │   └── 2025-01-11_14-28-21
    │       └── Tests.1
    │           └── 5
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_7
    │   └── 2025-01-11_14-28-22
    │       └── Tests.1
    │           └── 6
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_8
    │   └── 2025-01-11_14-28-23
    │       └── Tests.1
    │           └── 7
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_9
    │   └── 2025-01-11_14-28-24
    │       └── Tests.1
    │           └── 8
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_10
    │   └── 2025-01-11_14-28-25
    │       └── Tests.1
    │           └── 9
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_11
    │   └── 2025-01-11_14-28-26
    │       └── Tests.1
    │           └── 10
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_12
    │   └── 2025-01-11_14-28-27
    │       └── Tests.1
    │           └── 11
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_13
    │   └── 2025-01-11_14-28-28
    │       └── Tests.1
    │           └── 12
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_14
    │   └── 2025-01-11_14-28-29
    │       └── Tests.1
    │           └── 13
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_15
    │   └── 2025-01-11_14-28-30
    │       └── Tests.1
    │           └── 14
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_16
    │   └── 2025-01-11_14-28-31
    │       └── Tests.1
    │           └── 15
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_17
    │   └── 2025-01-11_14-28-32
    │       └── Tests.1
    │           └── 16
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_18
    │   └── 2025-01-11_14-28-33
    │       └── Tests.1
    │           └── 17
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_19
    │   └── 2025-01-11_14-28-34
    │       └── Tests.1
    │           └── 18
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_20
    │   └── 2025-01-11_14-28-35
    │       └── Tests.1
    │           └── 19
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_21
    │   └── 2025-01-11_14-28-36
    │       └── Tests.1
    │           └── 20
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_22
    │   └── 2025-01-11_14-28-37
    │       └── Tests.1
    │           └── 21
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_23
    │   └── 2025-01-11_14-28-38
    │       └── Tests.1
    │           └── 22
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_24
    │   └── 2025-01-11_14-28-39
    │       └── Tests.1
    │           └── 23
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_25
    │   └── 2025-01-11_14-28-40
    │       └── Tests.1
    │           └── 24
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_26
    │   └── 2025-01-11_14-28-42
    │       └── Tests.1
    │           └── 25
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_27
    │   └── 2025-01-11_14-28-43
    │       └── Tests.1
    │           └── 26
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_28
    │   └── 2025-01-11_14-28-44
    │       └── Tests.1
    │           └── 27
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_29
    │   └── 2025-01-11_14-28-45
    │       └── Tests.1
    │           └── 28
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_30
    │   └── 2025-01-11_14-28-46
    │       └── Tests.1
    │           └── 29
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    ├── iteration_31
    │   └── 2025-01-11_14-28-47
    │       └── Tests.1
    │           └── 30
    │               ├── cloudai_sbatch_script.sh
    │               └── run.sh
    └── iteration_32
        └── 2025-01-11_14-28-48
            └── Tests.1
                └── 31
                    ├── cloudai_sbatch_script.sh
                    └── run.sh
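
When post-processing a tree like this one, note that a plain lexical sort would place iteration_10 before iteration_2. The sketch below lists the iteration folders in numeric order. It is a hypothetical helper, not part of cloudai, and it assumes only the iteration_<number> naming visible in the dry-run output.

```python
import re
from pathlib import Path

# Matches folder names such as "iteration_7" and captures the number.
_ITERATION_RE = re.compile(r"^iteration_(\d+)$")


def sorted_iteration_dirs(scenario_dir: Path) -> list[Path]:
    """Return the iteration_<n> subfolders of scenario_dir, ordered by n."""
    found = []
    for child in scenario_dir.iterdir():
        match = _ITERATION_RE.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Sort on the parsed integer so iteration_10 follows iteration_9.
    return [path for _, path in sorted(found, key=lambda item: item[0])]
```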

Additional Notes

amaslenn and others added 30 commits January 10, 2025 14:21
1. Inherit from ABC if there are @abstractmethods
2. Do not make gen_srun_success_check() abstract, simply return an empty
   string by default. When needed, this method will be overridden.
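
A minimal sketch of the pattern this commit message describes. Only gen_srun_success_check() comes from the message. The class names, gen_exec_command(), and the check string are hypothetical placeholders chosen for illustration.

```python
from abc import ABC, abstractmethod


class BaseCommandGen(ABC):
    """Hypothetical base class: inherits from ABC because it declares an abstract method."""

    @abstractmethod
    def gen_exec_command(self) -> str:
        """Subclasses must provide this (placeholder method name)."""

    def gen_srun_success_check(self) -> str:
        # Not abstract: the default is an empty string, meaning "no check".
        return ""


class PlainCommandGen(BaseCommandGen):
    """Relies on the default, empty success check."""

    def gen_exec_command(self) -> str:
        return "echo hello"


class CheckedCommandGen(BaseCommandGen):
    """Overrides the success check only because it needs one."""

    def gen_exec_command(self) -> str:
        return "echo hello"

    def gen_srun_success_check(self) -> str:
        # Placeholder check, not a command taken from the PR.
        return "grep -q hello stdout.txt"
```

Subclasses that need no check stay short, and the base class still cannot be instantiated directly because gen_exec_command() remains abstract.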
Some slurm setups do not allow running enroot from the head node. Let's
rely on actual 'enroot import' run via srun and report its real error
message to user.
* Remove conf/common/test/chakra_replay.toml

* Remove conf/common/test_scenario/chakra_replay.toml
3 participants