Bug fix for invalid job id for many parallel cloudAI jobs #314

Merged 6 commits on Dec 20, 2024

@srivatsankrishnan srivatsankrishnan commented Dec 17, 2024

Summary

The current CloudAI Slurm runner fails when many parallel jobs are submitted. It determines job status by running squeue with the job id. However, when many CloudAI jobs are submitted in parallel, some of them may complete at around the same time, and by the time CloudAI queries their status, squeue returns an invalid job id error.

This PR fixes the issue by using an alternative to squeue to determine job completion status.
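The idea can be sketched as follows. This is a minimal illustration, not the code in this PR: the function names, the injectable `run` callable, and the set of states treated as active are all assumptions made for the example. It asks squeue first and, when squeue no longer reports the job, falls back to the sacct accounting records.

```python
import subprocess
from typing import Callable, Tuple

# Hypothetical helper type: runs a shell command and returns (stdout, stderr).
Runner = Callable[[str], Tuple[str, str]]

# Slurm states treated as "still active" in this sketch (an assumption).
ACTIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING", "SUSPENDED"}


def run_shell(command: str) -> Tuple[str, str]:
    """Default runner: execute the command and capture its output."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.stdout, proc.stderr


def first_state(output: str) -> str:
    """Return the first state token, e.g. 'CANCELLED' from 'CANCELLED by 1000'."""
    tokens = output.split()
    return tokens[0] if tokens else ""


def is_job_running(job_id: int, run: Runner = run_shell) -> bool:
    """Return True while the job is still queued or running."""
    stdout, _ = run(f"squeue --noheader --format=%T --jobs={job_id}")
    state = first_state(stdout)
    if not state:
        # squeue no longer knows the job (for example "Invalid job id
        # specified"), so consult the accounting records instead.
        stdout, _ = run(
            f"sacct --noheader --allocations --parsable2 --format=State --jobs={job_id}"
        )
        state = first_state(stdout)
    return state in ACTIVE_STATES
```

Passing the runner as a parameter keeps the sketch usable without a Slurm installation, since a fake runner can return canned squeue and sacct output.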

Test Plan

CI/CD

Run on real system (Job completion status works and moves on to the next job).

(venv) $ python ./cloudaix.py run --system-config conf/common/system/xxxx --tests-dir conf/common/test --test-scenario conf/nccl_test_nightly.toml
[INFO] System Name: xxxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nccl-test
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nccl-test

Section Name: Tests.1
  Test Name: nccl_test_all_reduce
  Description: all_reduce
  No dependencies
Section Name: Tests.2
  Test Name: nccl_test_all_gather
  Description: all_gather
  Start Post Comp: Tests.1
Section Name: Tests.3
  Test Name: nccl_test_reduce_scatter
  Description: reduce_scatter
  Start Post Comp: Tests.2
Section Name: Tests.4
  Test Name: nccl_test_alltoall
  Description: alltoall
  Start Post Comp: Tests.3
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1262768
[INFO] Job completed: Tests.1
[INFO] Delayed start for test Tests.2 by 5 seconds.
[INFO] Starting test: Tests.2
[INFO] Running test: Tests.2
[INFO] Submitted slurm job: 1262771
[INFO] Job completed: Tests.2
[INFO] Delayed start for test Tests.3 by 5 seconds.
[INFO] Starting test: Tests.3
[INFO] Running test: Tests.3
[INFO] Submitted slurm job: 1262775

Stress test on another internal cluster, submitting 111 jobs simultaneously with CloudAI

$ python utils/dgx.py --auto_hosts --partition batch --max_num_nodes=4 --test_path conf/staging/acceptance_test/test --test_scenario_dir conf/staging/acceptance_test/test_scenario/ --output_dir results --mode run
No config file provided, proceeding with command-line arguments if available.
[INFO] System Name: xxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: xxxxx
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: xxxx

Section Name: Tests.0
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.1
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.2
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.3
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.4
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.5
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.6
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.7
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.8
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.9
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.10
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.11
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.12
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.13
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.14
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.15
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.16
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.17
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.18
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.19
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.20
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.21
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.22
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.23
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.24
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.25
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.26
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.27
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.28
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.29
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.30
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.31
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.32
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.33
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.34
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.35
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.36
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.37
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.38
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.39
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.40
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.41
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.42
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.43
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.44
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.45
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.46
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.47
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.48
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.49
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.50
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.51
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.52
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.53
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.54
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.55
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.56
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.57
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.58
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.59
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.60
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.61
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.62
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.63
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.64
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.65
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.66
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.67
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.68
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.69
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.70
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.71
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.72
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.73
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.74
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.75
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.76
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.77
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.78
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.79
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.80
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.81
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.82
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.83
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.84
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.85
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.86
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.87
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.88
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.89
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.90
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.91
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.92
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.93
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.94
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.95
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.96
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.97
  Test Name: [REDACTED]
  Description: all_reduce
  No dependencies
Section Name: Tests.98
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.99
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.100
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.101
  Test Name: [REDACTED]
  Description: alltoall
  No dependencies
Section Name: Tests.102
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.103
  Test Name: [REDACTED]
  Description: reduce_scatter
  No dependencies
Section Name: Tests.104
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.105
  Test Name: [REDACTED]
  Description: sendrecv
  No dependencies
Section Name: Tests.106
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.107
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.108
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.109
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.110
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
Section Name: Tests.111
  Test Name: [REDACTED]
  Description: all_gather
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: Tests.0
[INFO] Running test: Tests.0
[INFO] Submitted slurm job: 1655641
[INFO] Starting test: Tests.1
[INFO] Running test: Tests.1
[INFO] Submitted slurm job: 1655642
[INFO] Starting test: Tests.2
[INFO] Running test: Tests.2
[INFO] Submitted slurm job: 1655643
[INFO] Starting test: Tests.3
[INFO] Running test: Tests.3
[INFO] Submitted slurm job: 1655644
[INFO] Starting test: Tests.4
[INFO] Running test: Tests.4
[INFO] Submitted slurm job: 1655645
[INFO] Starting test: Tests.5
[INFO] Running test: Tests.5
[INFO] Submitted slurm job: 1655646
[INFO] Starting test: Tests.6
[INFO] Running test: Tests.6
[INFO] Submitted slurm job: 1655647
[INFO] Starting test: Tests.7
[INFO] Running test: Tests.7
[INFO] Submitted slurm job: 1655648
[INFO] Starting test: Tests.8
[INFO] Running test: Tests.8
[INFO] Submitted slurm job: 1655649
[INFO] Starting test: Tests.9
[INFO] Running test: Tests.9
[INFO] Submitted slurm job: 1655650
[INFO] Starting test: Tests.10
[INFO] Running test: Tests.10
[INFO] Submitted slurm job: 1655651
[INFO] Starting test: Tests.11
[INFO] Running test: Tests.11
[INFO] Submitted slurm job: 1655652
[INFO] Starting test: Tests.12
[INFO] Running test: Tests.12
[INFO] Submitted slurm job: 1655653
[INFO] Starting test: Tests.13
[INFO] Running test: Tests.13
[INFO] Submitted slurm job: 1655654
[INFO] Starting test: Tests.14
[INFO] Running test: Tests.14
[INFO] Submitted slurm job: 1655655
[INFO] Starting test: Tests.15
[INFO] Running test: Tests.15
[INFO] Submitted slurm job: 1655656
[INFO] Starting test: Tests.16
[INFO] Running test: Tests.16
[INFO] Submitted slurm job: 1655657
[INFO] Starting test: Tests.17
[INFO] Running test: Tests.17
[INFO] Submitted slurm job: 1655658
[INFO] Starting test: Tests.18
[INFO] Running test: Tests.18
[INFO] Submitted slurm job: 1655659
[INFO] Starting test: Tests.19
[INFO] Running test: Tests.19
[INFO] Submitted slurm job: 1655660
[INFO] Starting test: Tests.20
[INFO] Running test: Tests.20
[INFO] Submitted slurm job: 1655661
[INFO] Starting test: Tests.21
[INFO] Running test: Tests.21
[INFO] Submitted slurm job: 1655662
[INFO] Starting test: Tests.22
[INFO] Running test: Tests.22
[INFO] Submitted slurm job: 1655663
[INFO] Starting test: Tests.23
[INFO] Running test: Tests.23
[INFO] Submitted slurm job: 1655664
[INFO] Starting test: Tests.24
[INFO] Running test: Tests.24
[INFO] Submitted slurm job: 1655665
[INFO] Starting test: Tests.25
[INFO] Running test: Tests.25
[INFO] Submitted slurm job: 1655666
[INFO] Starting test: Tests.26
[INFO] Running test: Tests.26
[INFO] Submitted slurm job: 1655667
[INFO] Starting test: Tests.27
[INFO] Running test: Tests.27
[INFO] Submitted slurm job: 1655668
[INFO] Starting test: Tests.28
[INFO] Running test: Tests.28
[INFO] Submitted slurm job: 1655669
[INFO] Starting test: Tests.29
[INFO] Running test: Tests.29
[INFO] Submitted slurm job: 1655670
[INFO] Starting test: Tests.30
[INFO] Running test: Tests.30
[INFO] Submitted slurm job: 1655671
[INFO] Starting test: Tests.31
[INFO] Running test: Tests.31
[INFO] Submitted slurm job: 1655672
[INFO] Starting test: Tests.32
[INFO] Running test: Tests.32
[INFO] Submitted slurm job: 1655673
[INFO] Starting test: Tests.33
[INFO] Running test: Tests.33
[INFO] Submitted slurm job: 1655674
[INFO] Starting test: Tests.34
[INFO] Running test: Tests.34
[INFO] Submitted slurm job: 1655675
[INFO] Starting test: Tests.35
[INFO] Running test: Tests.35
[INFO] Submitted slurm job: 1655676
[INFO] Starting test: Tests.36
[INFO] Running test: Tests.36
[INFO] Submitted slurm job: 1655677
[INFO] Starting test: Tests.37
[INFO] Running test: Tests.37
[INFO] Submitted slurm job: 1655678
[INFO] Starting test: Tests.38
[INFO] Running test: Tests.38
[INFO] Submitted slurm job: 1655679
[INFO] Starting test: Tests.39
[INFO] Running test: Tests.39
[INFO] Submitted slurm job: 1655680
[INFO] Starting test: Tests.40
[INFO] Running test: Tests.40
[INFO] Submitted slurm job: 1655681
[INFO] Starting test: Tests.41
[INFO] Running test: Tests.41
[INFO] Submitted slurm job: 1655682
[INFO] Starting test: Tests.42
[INFO] Running test: Tests.42
[INFO] Submitted slurm job: 1655683
[INFO] Starting test: Tests.43
[INFO] Running test: Tests.43
[INFO] Submitted slurm job: 1655684
[INFO] Starting test: Tests.44
[INFO] Running test: Tests.44
[INFO] Submitted slurm job: 1655685
[INFO] Starting test: Tests.45
[INFO] Running test: Tests.45
[INFO] Submitted slurm job: 1655686
[INFO] Starting test: Tests.46
[INFO] Running test: Tests.46
[INFO] Submitted slurm job: 1655687
[INFO] Starting test: Tests.47
[INFO] Running test: Tests.47
[INFO] Submitted slurm job: 1655688
[INFO] Starting test: Tests.48
[INFO] Running test: Tests.48
[INFO] Submitted slurm job: 1655689
[INFO] Starting test: Tests.49
[INFO] Running test: Tests.49
[INFO] Submitted slurm job: 1655690
[INFO] Starting test: Tests.50
[INFO] Running test: Tests.50
[INFO] Submitted slurm job: 1655691
[INFO] Starting test: Tests.51
[INFO] Running test: Tests.51
[INFO] Submitted slurm job: 1655692
[INFO] Starting test: Tests.52
[INFO] Running test: Tests.52
[INFO] Submitted slurm job: 1655693
[INFO] Starting test: Tests.53
[INFO] Running test: Tests.53
[INFO] Submitted slurm job: 1655694
[INFO] Starting test: Tests.54
[INFO] Running test: Tests.54
[INFO] Submitted slurm job: 1655695
[INFO] Starting test: Tests.55
[INFO] Running test: Tests.55
[INFO] Submitted slurm job: 1655696
[INFO] Starting test: Tests.56
[INFO] Running test: Tests.56
[INFO] Submitted slurm job: 1655697
[INFO] Starting test: Tests.57
[INFO] Running test: Tests.57
[INFO] Submitted slurm job: 1655698
[INFO] Starting test: Tests.58
[INFO] Running test: Tests.58
[INFO] Submitted slurm job: 1655699
[INFO] Starting test: Tests.59
[INFO] Running test: Tests.59
[INFO] Submitted slurm job: 1655700
[INFO] Starting test: Tests.60
[INFO] Running test: Tests.60
[INFO] Submitted slurm job: 1655701
[INFO] Starting test: Tests.61
[INFO] Running test: Tests.61
[INFO] Submitted slurm job: 1655702
[INFO] Starting test: Tests.62
[INFO] Running test: Tests.62
[INFO] Submitted slurm job: 1655703
[INFO] Starting test: Tests.63
[INFO] Running test: Tests.63
[INFO] Submitted slurm job: 1655704
[INFO] Starting test: Tests.64
[INFO] Running test: Tests.64
[INFO] Submitted slurm job: 1655705
[INFO] Starting test: Tests.65
[INFO] Running test: Tests.65
[INFO] Submitted slurm job: 1655706
[INFO] Starting test: Tests.66
[INFO] Running test: Tests.66
[INFO] Submitted slurm job: 1655707
[INFO] Starting test: Tests.67
[INFO] Running test: Tests.67
[INFO] Submitted slurm job: 1655708
[INFO] Starting test: Tests.68
[INFO] Running test: Tests.68
[INFO] Submitted slurm job: 1655709
[INFO] Starting test: Tests.69
[INFO] Running test: Tests.69
[INFO] Submitted slurm job: 1655710
[INFO] Starting test: Tests.70
[INFO] Running test: Tests.70
[INFO] Submitted slurm job: 1655711
[INFO] Starting test: Tests.71
[INFO] Running test: Tests.71
[INFO] Submitted slurm job: 1655712
[INFO] Starting test: Tests.72
[INFO] Running test: Tests.72
[INFO] Submitted slurm job: 1655713
[INFO] Starting test: Tests.73
[INFO] Running test: Tests.73
[INFO] Submitted slurm job: 1655714
[INFO] Starting test: Tests.74
[INFO] Running test: Tests.74
[INFO] Submitted slurm job: 1655715
[INFO] Starting test: Tests.75
[INFO] Running test: Tests.75
[INFO] Submitted slurm job: 1655716
[INFO] Starting test: Tests.76
[INFO] Running test: Tests.76
[INFO] Submitted slurm job: 1655717
[INFO] Starting test: Tests.77
[INFO] Running test: Tests.77
[INFO] Submitted slurm job: 1655718
[INFO] Starting test: Tests.78
[INFO] Running test: Tests.78
[INFO] Submitted slurm job: 1655719
[INFO] Starting test: Tests.79
[INFO] Running test: Tests.79
[INFO] Submitted slurm job: 1655720
[INFO] Starting test: Tests.80
[INFO] Running test: Tests.80
[INFO] Submitted slurm job: 1655721
[INFO] Starting test: Tests.81
[INFO] Running test: Tests.81
[INFO] Submitted slurm job: 1655722
[INFO] Starting test: Tests.82
[INFO] Running test: Tests.82
[INFO] Submitted slurm job: 1655723
[INFO] Starting test: Tests.83
[INFO] Running test: Tests.83
[INFO] Submitted slurm job: 1655724
[INFO] Starting test: Tests.84
[INFO] Running test: Tests.84
[INFO] Submitted slurm job: 1655725
[INFO] Starting test: Tests.85
[INFO] Running test: Tests.85
[INFO] Submitted slurm job: 1655726
[INFO] Starting test: Tests.86
[INFO] Running test: Tests.86
[INFO] Submitted slurm job: 1655727
[INFO] Starting test: Tests.87
[INFO] Running test: Tests.87
[INFO] Submitted slurm job: 1655728
[INFO] Starting test: Tests.88
[INFO] Running test: Tests.88
[INFO] Submitted slurm job: 1655729
[INFO] Starting test: Tests.89
[INFO] Running test: Tests.89
[INFO] Submitted slurm job: 1655730
[INFO] Starting test: Tests.90
[INFO] Running test: Tests.90
[INFO] Submitted slurm job: 1655731
[INFO] Starting test: Tests.91
[INFO] Running test: Tests.91
[INFO] Submitted slurm job: 1655732
[INFO] Starting test: Tests.92
[INFO] Running test: Tests.92
[INFO] Submitted slurm job: 1655733
[INFO] Starting test: Tests.93
[INFO] Running test: Tests.93
[INFO] Submitted slurm job: 1655734
[INFO] Starting test: Tests.94
[INFO] Running test: Tests.94
[INFO] Submitted slurm job: 1655735
[INFO] Starting test: Tests.95
[INFO] Running test: Tests.95
[INFO] Submitted slurm job: 1655736
[INFO] Starting test: Tests.96
[INFO] Running test: Tests.96
[INFO] Submitted slurm job: 1655737
[INFO] Starting test: Tests.97
[INFO] Running test: Tests.97
[INFO] Submitted slurm job: 1655738
[INFO] Starting test: Tests.98
[INFO] Running test: Tests.98
[INFO] Submitted slurm job: 1655739
[INFO] Starting test: Tests.99
[INFO] Running test: Tests.99
[INFO] Submitted slurm job: 1655740
[INFO] Starting test: Tests.100
[INFO] Running test: Tests.100
[INFO] Submitted slurm job: 1655741
[INFO] Starting test: Tests.101
[INFO] Running test: Tests.101
[INFO] Submitted slurm job: 1655742
[INFO] Starting test: Tests.102
[INFO] Running test: Tests.102
[INFO] Submitted slurm job: 1655743
[INFO] Starting test: Tests.103
[INFO] Running test: Tests.103
[INFO] Submitted slurm job: 1655744
[INFO] Starting test: Tests.104
[INFO] Running test: Tests.104
[INFO] Submitted slurm job: 1655745
[INFO] Starting test: Tests.105
[INFO] Running test: Tests.105
[INFO] Submitted slurm job: 1655746
[INFO] Starting test: Tests.106
[INFO] Running test: Tests.106
[INFO] Submitted slurm job: 1655747
[INFO] Starting test: Tests.107
[INFO] Running test: Tests.107
[INFO] Submitted slurm job: 1655748
[INFO] Starting test: Tests.108
[INFO] Running test: Tests.108
[INFO] Submitted slurm job: 1655749
[INFO] Starting test: Tests.109
[INFO] Running test: Tests.109
[INFO] Submitted slurm job: 1655750
[INFO] Starting test: Tests.110
[INFO] Running test: Tests.110
[INFO] Submitted slurm job: 1655751
[INFO] Starting test: Tests.111
[INFO] Running test: Tests.111
[INFO] Submitted slurm job: 1655752
[INFO] Job completed: Tests.9
[INFO] Job completed: Tests.3
[INFO] Job completed: Tests.11
[INFO] Job completed: Tests.15
[INFO] Job completed: Tests.19
[INFO] Job completed: Tests.23
[INFO] Job completed: Tests.13
[INFO] Job completed: Tests.25
[INFO] Job completed: Tests.29
[INFO] Job completed: Tests.37
[INFO] Job completed: Tests.39
[INFO] Job completed: Tests.41
[INFO] Job completed: Tests.1
[INFO] Job completed: Tests.5
[INFO] Job completed: Tests.7
[INFO] Job completed: Tests.17
[INFO] Job completed: Tests.21
[INFO] Job completed: Tests.27
[INFO] Job completed: Tests.31
[INFO] Job completed: Tests.35
[INFO] Job completed: Tests.43
[INFO] Job completed: Tests.47
[INFO] Job completed: Tests.49
[INFO] Job completed: Tests.53
[INFO] Job completed: Tests.61
[INFO] Job completed: Tests.33
[INFO] Job completed: Tests.45
[INFO] Job completed: Tests.63
[INFO] Job completed: Tests.55
[INFO] Job completed: Tests.71
[INFO] Job completed: Tests.73
[INFO] Job completed: Tests.77
[INFO] Job completed: Tests.81
[INFO] Job completed: Tests.83
[INFO] Job completed: Tests.87
[INFO] Job completed: Tests.51
[INFO] Job completed: Tests.67
[INFO] Job completed: Tests.91
[INFO] Job completed: Tests.93
[INFO] Job completed: Tests.57
[INFO] Job completed: Tests.69
[INFO] Job completed: Tests.75
[INFO] Job completed: Tests.97
[INFO] Job completed: Tests.59
[INFO] Job completed: Tests.103
[INFO] Job completed: Tests.107
[INFO] Job completed: Tests.109
[INFO] Job completed: Tests.111
[INFO] Job completed: Tests.2
[INFO] Job completed: Tests.8
[INFO] Job completed: Tests.85
[INFO] Job completed: Tests.14
[INFO] Job completed: Tests.10
[INFO] Job completed: Tests.18
[INFO] Job completed: Tests.22
[INFO] Job completed: Tests.24
[INFO] Job completed: Tests.28
[INFO] Job completed: Tests.36
[INFO] Job completed: Tests.65
[INFO] Job completed: Tests.34
[INFO] Job completed: Tests.38
[INFO] Job completed: Tests.42
[INFO] Job completed: Tests.40
[INFO] Job completed: Tests.46
[INFO] Job completed: Tests.48
[INFO] Job completed: Tests.99
[INFO] Job completed: Tests.52
[INFO] Job completed: Tests.30
[INFO] Job completed: Tests.101
[INFO] Job completed: Tests.60
[INFO] Job completed: Tests.70
[INFO] Job completed: Tests.12
[INFO] Job completed: Tests.16
[INFO] Job completed: Tests.20
[INFO] Job completed: Tests.62
[INFO] Job completed: Tests.72
[INFO] Job completed: Tests.76
[INFO] Job completed: Tests.80
[INFO] Job completed: Tests.79
[INFO] Job completed: Tests.86
[INFO] Job completed: Tests.90
[INFO] Job completed: Tests.92
[INFO] Job completed: Tests.44
[INFO] Job completed: Tests.26
[INFO] Job completed: Tests.54
[INFO] Job completed: Tests.82
[INFO] Job completed: Tests.95
[INFO] Job completed: Tests.89
[INFO] Job completed: Tests.96
[INFO] Job completed: Tests.0
[INFO] Job completed: Tests.66
[INFO] Job completed: Tests.68
[INFO] Job completed: Tests.106
[INFO] Job completed: Tests.102
[INFO] Job completed: Tests.108
[INFO] Job completed: Tests.105
[INFO] Job completed: Tests.110
[INFO] Job completed: Tests.4
[INFO] Job completed: Tests.6
[INFO] Job completed: Tests.74
[INFO] Job completed: Tests.84
[INFO] Job completed: Tests.98
[INFO] Job completed: Tests.100
[INFO] Job completed: Tests.32
[INFO] Job completed: Tests.56
[INFO] Job completed: Tests.58
[INFO] Job completed: Tests.50
[INFO] Job completed: Tests.64
[INFO] Job completed: Tests.88
[INFO] Job completed: Tests.78
[INFO] Job completed: Tests.94
[INFO] Job completed: Tests.104
[INFO] All test scenario results stored at: [REDACTED]
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.

Additional Notes

Context: Discussion thread.

@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review December 17, 2024 07:55

@amaslenn amaslenn left a comment

  1. SlurmInstaller has PREREQUISITES, let's add sacct there too.
  2. Please add unit tests that reproduce the issue before this change to ensure we won't repeat this problem.

@srivatsankrishnan

  1. SlurmInstaller has PREREQUISITES, let's add sacct there too.
  2. Please add unit tests that reproduce the issue before this change to ensure we won't repeat this problem.

Fixed 1.

Regarding 2, this is purely system behavior, and I am not sure how we can capture it in a unit test. The code relies on the squeue and sacct commands, which implicitly require a Slurm system. Also, the interface did not change at all.

If you have better ideas for capturing it, please follow the Slack thread and feel free to contribute a unit test that captures this behavior in a separate PR. I think this would be a good follow-up PR and should not block the customer request.

However, this PR has been tested with the verification team's NCCL test on an internal cluster. It has also been stress tested by simultaneously launching 111 jobs on a different production cluster. Without this PR, the original design choice CloudAI uses to check job completion would fail for this new customer setup. Given that this has been solidly tested on a production system (see the test plan), we should approve and merge it.

@amaslenn

... feel free to contribute to the unit test to capture this behavior in a separate PR.

We have an agreement to add unit tests for all new features and fixes. Let's stick to this agreement.

Here is how I would approach testing test_is_job_running:

@pytest.mark.parametrize("stdout,is_running", [("RUNNING", True), ("PENDING", True), ("COMPLETED", False)])
def test_is_job_running(stdout: str, is_running: bool, slurm_system: SlurmSystem):
    job = SlurmJob(Mock(), 1)
    pp = Mock()
    pp.communicate = Mock(return_value=(stdout, ""))
    slurm_system.cmd_shell.execute = Mock(return_value=pp)
    assert slurm_system.is_job_running(job) is is_running

A similar approach can be applied to testing with stderr (likely as a separate test function).

is_job_completed is almost the same (should we move the common part into a separate function?) and can be tested the same way.
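To make the mocking approach above concrete without a CloudAI checkout, here is a self-contained sketch. `FakeSlurmSystem` is a hypothetical stand-in for SlurmSystem written only for this example; its `is_job_running` logic, the sacct command string, and the choice to raise on stderr are assumptions, not the project's implementation. It also shows the separate stderr case mentioned above.

```python
from unittest.mock import Mock


class FakeSlurmSystem:
    """Hypothetical stand-in for SlurmSystem; only what the checks touch."""

    def __init__(self) -> None:
        self.cmd_shell = Mock()

    def is_job_running(self, job_id: int) -> bool:
        # Assumed behavior: run one status command and inspect its output.
        process = self.cmd_shell.execute(
            f"sacct --noheader --format=State --jobs={job_id}"
        )
        stdout, stderr = process.communicate()
        if stderr:
            raise RuntimeError(f"status query failed: {stderr}")
        return stdout.strip() in {"RUNNING", "PENDING"}


def query_with_fake_output(stdout: str, stderr: str = "") -> bool:
    """Wire canned command output into the system and ask for the job status."""
    system = FakeSlurmSystem()
    process = Mock()
    process.communicate = Mock(return_value=(stdout, stderr))
    system.cmd_shell.execute = Mock(return_value=process)
    return system.is_job_running(1)


def stderr_raises(stderr: str) -> bool:
    """Return True if a non-empty stderr surfaces as an exception."""
    try:
        query_with_fake_output("", stderr)
    except RuntimeError:
        return True
    return False
```

In a pytest suite, the three stdout cases would typically become one parametrized test and the stderr case a second test function.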

@srivatsankrishnan

srivatsankrishnan commented Dec 18, 2024

You are right, and I am trying to stick to the agreement here. Authors adding a new feature should also extend the unit tests for that feature. However, this is not a new feature. It is a bug fix: under high job submission load, CloudAI breaks because of a design choice we made ~7 months ago. This PR addresses that bug and does not change the interface. To cover these system-related issues, I have tested extensively on two clusters (including stress testing based on the customer requirement).

The original design discussion also covers why adding more unit tests for this system class would not extend coverage, and this bug further validates that point.

We agreed and approved this PR (including the comments on CI test plan).

The unit test you are proposing will not provide coverage for this behavior either. I can explain the behavior, and maybe you can see whether unit testing can support or fake it. Merely capturing stdout and faking the outputs will not cover this or future system-related bugs. We would need to launch 100+ fake processes and one master process (the cloudai executable) and capture their interactions and feedback. If you think the unit test infrastructure we have today can simulate this runtime behavior, please see this as an opportunity to solidify it. But this shouldn't block this PR and should be a separate PR, imo.
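One possible way to approximate the runtime behavior described here, without real processes, is a toy scheduler model. Everything below is hypothetical and written for illustration: `FakeScheduler`, its `squeue` and `sacct` methods, and `poll_until_done` are not part of CloudAI or Slurm, and the model simply assumes that finished jobs disappear from the live queue while remaining in the accounting records.

```python
import random
from typing import Callable, Iterable


class FakeScheduler:
    """Toy model (an assumption, not Slurm itself): finished jobs vanish from
    the live queue but stay in the accounting records."""

    def __init__(self, num_jobs: int, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self.queue = {job_id: "RUNNING" for job_id in range(num_jobs)}
        self.accounting = dict(self.queue)

    def tick(self) -> None:
        """Finish a random batch of jobs and purge them from the queue."""
        running = list(self.queue)
        for job_id in self._rng.sample(running, k=min(10, len(running))):
            del self.queue[job_id]
            self.accounting[job_id] = "COMPLETED"

    def squeue(self, job_id: int) -> str:
        """Queue lookup: fails once the job has been purged."""
        if job_id not in self.queue:
            raise LookupError("Invalid job id specified")
        return self.queue[job_id]

    def sacct(self, job_id: int) -> str:
        """Accounting lookup: still answers after the job has finished."""
        return self.accounting[job_id]


def poll_until_done(
    scheduler: FakeScheduler, job_ids: Iterable[int], query: Callable[[int], str]
) -> int:
    """Poll every job with `query` until none is running; return the error count."""
    errors = 0
    pending = set(job_ids)
    while pending:
        scheduler.tick()
        for job_id in list(pending):
            try:
                if query(job_id) != "RUNNING":
                    pending.discard(job_id)
            except LookupError:
                errors += 1
                pending.discard(job_id)
    return errors


queue_only = FakeScheduler(111)
queue_errors = poll_until_done(queue_only, range(111), queue_only.squeue)

with_accounting = FakeScheduler(111)
accounting_errors = poll_until_done(with_accounting, range(111), with_accounting.sacct)
```

In this model every job polled through the queue lookup eventually hits the invalid-id error, while polling through the accounting lookup never does. Whether such a model is faithful enough to stand in for testing on a real cluster remains an open question.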

@amaslenn amaslenn added the bug Something isn't working label Dec 19, 2024
amaslenn previously approved these changes Dec 20, 2024
amaslenn previously approved these changes Dec 20, 2024
@amaslenn amaslenn closed this Dec 20, 2024
@amaslenn amaslenn reopened this Dec 20, 2024
@srivatsankrishnan srivatsankrishnan merged commit 1220577 into main Dec 20, 2024
2 checks passed
@srivatsankrishnan srivatsankrishnan deleted the acceptance branch December 20, 2024 17:00