Bug fix for invalid job id for many parallel CloudAI jobs #314
Conversation
- `SlurmInstaller` has `PREREQUISITES`; let's add `sacct` there too (a sketch follows below).
- Please add unit tests that reproduce the issue before this change, to ensure we won't repeat this problem.
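A minimal sketch of the first suggestion, assuming `PREREQUISITES` is a collection of required command names checked by `SlurmInstaller`; the other entries shown are illustrative, not the project's actual list:

```python
class SlurmInstaller:
    # Commands that must be available on the system before running.
    # Adding "sacct" ensures the new completion check can execute.
    PREREQUISITES = ("sbatch", "squeue", "sacct")
```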
Fixed 1. Regarding 2, this is purely system behavior, and I am not sure how we can capture it in a unit test; these commands require a real Slurm system. If you have better ideas on capturing it, please follow the Slack thread and feel free to contribute a unit test covering this behavior in a separate PR. I think that would be a good follow-up PR and should not block the customer request. This PR has been tested with the verification team's NCCL test on an internal cluster, and it has also been stress tested by simultaneously launching 111 jobs on a different production cluster. Without this PR, the original design choice CloudAI used for checking job completion would fail for this new customer setup. Given that this has been solidly tested on production systems (see the test plan), we should approve and merge this.
We have an agreement to add unit tests for all new features and fixes. Let's stick to this agreement. Here is how I would approach testing:

```python
import pytest
from unittest.mock import Mock

# SlurmJob and SlurmSystem come from the project; slurm_system is a fixture.

@pytest.mark.parametrize("stdout,is_running", [("RUNNING", True), ("PENDING", True), ("COMPLETED", False)])
def test_is_job_running(stdout: str, is_running: bool, slurm_system: SlurmSystem):
    job = SlurmJob(Mock(), 1)
    pp = Mock()
    pp.communicate = Mock(return_value=(stdout, ""))  # fake the command's (stdout, stderr)
    slurm_system.cmd_shell.execute = Mock(return_value=pp)
    assert slurm_system.is_job_running(job) is is_running
```

A similar approach can be applied for testing with `sacct`.
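A hypothetical companion test in the same style, assuming the fix exposes something like an `is_job_completed` method whose underlying command reports job state (the method name and parametrized states are assumptions, not the PR's actual interface):

```python
@pytest.mark.parametrize("stdout,is_completed", [("RUNNING", False), ("PENDING", False), ("COMPLETED", True)])
def test_is_job_completed(stdout: str, is_completed: bool, slurm_system: SlurmSystem):
    job = SlurmJob(Mock(), 1)
    pp = Mock()
    pp.communicate = Mock(return_value=(stdout, ""))  # fake sacct-style stdout
    slurm_system.cmd_shell.execute = Mock(return_value=pp)
    assert slurm_system.is_job_completed(job) is is_completed
```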
You are right, and I am trying to stick to the agreement here. Authors adding a new feature should also extend the unit tests for that feature. However, this is not a new feature. This is a bug fix: under high job-submission load, CloudAI breaks due to a design choice we made ~7 months ago. This PR addresses and fixes that bug; it does not change the interface. To cover these system-related issues, I have extensively tested on two clusters (including stress testing based on the customer requirement). The original design discussion also explains why adding more unit tests for this system class would not extend coverage, and this bug further validates that. We agreed on and approved this PR (including the comments on the CI test plan).

The unit test you are proposing will not provide coverage of this behavior either. Let me explain the behavior, and maybe you can see whether unit testing can support or fake it: merely capturing stdout and faking the outputs will not cover this or future system-related bugs. We would need to launch 100+ fake processes, have one master process (the cloudai executable), and capture its interactions and feedback; a toy illustration follows below. If you think the unit test infrastructure we have today can simulate this runtime behavior, please see this as an opportunity to solidify it. But that shouldn't block this PR and should be a separate PR, imo.
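To make the race concrete, here is a toy simulation (an illustration only, not CloudAI code): jobs vanish from a queue-style registry on completion but persist in an accounting-style registry, so a master process polling the queue can miss jobs that finished between submission and the poll. All names and timings below are made up:

```python
import multiprocessing as mp
import random
import time

NUM_JOBS = 111  # mirrors the 111-job stress test in the test plan

def fake_job(job_id: int, queue, accounting) -> None:
    """Run for a short random time, then complete."""
    time.sleep(random.uniform(0.0, 0.2))
    accounting[job_id] = "COMPLETED"  # accounting record persists (sacct-like)
    del queue[job_id]                 # job disappears from the queue (squeue-like)

if __name__ == "__main__":
    manager = mp.Manager()
    queue, accounting = manager.dict(), manager.dict()
    workers = []
    for job_id in range(NUM_JOBS):
        queue[job_id] = "RUNNING"
        p = mp.Process(target=fake_job, args=(job_id, queue, accounting))
        p.start()
        workers.append(p)

    # The master process polls each job id, as the runner would.
    vanished = 0
    for job_id in range(NUM_JOBS):
        time.sleep(0.005)
        if job_id not in queue:  # queue-style lookup would raise "invalid job id"
            vanished += 1
            assert accounting[job_id] == "COMPLETED"  # accounting lookup still answers
    for p in workers:
        p.join()
    print(f"{vanished}/{NUM_JOBS} queue lookups raced with job completion")
```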
Force-pushed from 5cb4c51 to 2bbfc15.
Summary
The current CloudAI Slurm runner fails when many parallel jobs are submitted. The current strategy is to query `squeue` with the job id. However, when many CloudAI jobs are submitted in parallel, some of them may complete by the time CloudAI queries their status, resulting in an invalid job id error. This PR fixes the issue by using an alternative to `squeue` to determine job completion status.
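Based on the review discussion (the request to add `sacct` to `SlurmInstaller`'s `PREREQUISITES`), the alternative appears to be `sacct`, which reads the Slurm accounting database and therefore still answers for jobs that have already left the queue. A minimal sketch of such a check follows; this is not the PR's actual code, and the function name and parsing are assumptions:

```python
import subprocess

def is_job_completed(job_id: int) -> bool:
    # sacct queries Slurm accounting, so it can report jobs that have
    # already left the queue; -X restricts output to the allocation itself.
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=State", "--noheader", "-X"],
        capture_output=True, text=True, check=True,
    )
    fields = result.stdout.strip().split()
    return bool(fields) and fields[0].startswith("COMPLETED")
```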
Test Plan
- CI/CD
- Run on a real system (job completion status works and the runner moves on to the next job).
- Stress test on another internal cluster with 111 simultaneous job submissions using CloudAI.
Additional Notes
Context: Discussion thread.