Bug fix in job completion checks #340

Merged on Jan 15, 2025 (2 commits)

@TaekyungHeo (Member) commented Jan 15, 2025

Summary

This is a bug fix for https://redmine.mellanox.com/issues/4248881. @Bohatchuk reported that Chakra replay fails to generate reports when pre-test hooks are enabled. The failure comes from how job completion is checked: the system currently looks for keywords such as "COMPLETED" in the output of sacct. When a single job contains multiple tasks, that output can mix "RUNNING" and "COMPLETED" lines, for example:

RUNNING
COMPLETED

A job with any task still in the "RUNNING" state must be treated as still running. Otherwise, the job is marked complete prematurely, as soon as one task (such as a pre-test hook) finishes.
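
For illustration, the difference between the keyword check and an all-tasks check can be sketched as follows. This is a minimal sketch, not the code changed in this PR: the function names `naive_is_job_completed` and `is_job_completed` are hypothetical, and it assumes the sacct output has been reduced to one state per line, as in the sample above.

```python
def naive_is_job_completed(sacct_output: str) -> bool:
    """Keyword check: true if COMPLETED appears anywhere in the output."""
    return "COMPLETED" in sacct_output


def is_job_completed(sacct_output: str) -> bool:
    """Stricter check: true only when every reported state is COMPLETED.

    Assumes one state per line (hypothetical input format for this sketch).
    """
    states = [line.strip() for line in sacct_output.splitlines() if line.strip()]
    # No states reported yet: treat the job as not finished.
    if not states:
        return False
    return all(state == "COMPLETED" for state in states)


mixed = "RUNNING\nCOMPLETED\n"
print(naive_is_job_completed(mixed))  # True: job would be marked complete too early
print(is_job_completed(mixed))        # False: one task is still running
```

Requiring every line to be COMPLETED is only one possible policy. A real check would also need to decide how to treat failure states, which this sketch does not cover.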

Test Plan

  1. CI passes
  2. Ran on a server (EOS)
$ python ./cloudaix.py --log-level DEBUG --log-file run_chakra_replay_3.log run \
    --system-config conf/common/system/eos.toml \
    --tests-dir conf/common/test \
    --test-scenario conf/devops/verification/test_scenario/chakra_replay.toml
...                                                                 
[INFO] Test Scenario: chakra_replay                                                                                                                                    
                                                                                                                                                                       
Section Name: Tests.1                                                                                                                                                  
  Test Name: chakra_replay                                                                                                                                             
  Description: chakra_replay                                                                                                                                           
  No dependencies                                                                                                                                                      
[INFO] Initializing Runner [RUN] mode                                                                                                                                  
[INFO] Creating SlurmRunner                                                                                                                                            
[DEBUG] SlurmRunner initialized                                                                                                                                        
[INFO] Starting test scenario execution.                                                                                                                               
[INFO] Starting test: Tests.1                                                                                                                                          
[INFO] Running test: Tests.1                                                                                                                                           
[DEBUG] Executing command for test Tests.1: sbatch results/chakra_replay_2025-01-14_16-35-35/Tests.1/0/cloudai_sbatch_script.sh                                        
[INFO] Submitted slurm job: 1828634   

After job completion

$ ls $RESULTS
chakra_replay_report.html  cloudai_sbatch_script.sh  stderr.txt  stdout.txt

pre_test:
nccl_test_all_gather

Before pre-test hook completion

   RUNNING
   RUNNING
   RUNNING
   RUNNING
   RUNNING

After pre-test hook completion

   RUNNING
   RUNNING
   RUNNING
 COMPLETED
   RUNNING
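
The two samples above can be summarized per state, which makes it easy to log why a job is still considered running. The sketch below is hedged: `build_sacct_command`, `summarize_states`, and `query_state_summary` are hypothetical helpers, and the exact sacct invocation used by the project is not shown in this PR. The options `-j`, `--format=State`, and `--noheader` are standard sacct options, but their use here is an assumption.

```python
import subprocess
from collections import Counter
from typing import Callable, List


def build_sacct_command(job_id: str) -> List[str]:
    """Build a sacct command that prints only the State column (assumed invocation)."""
    return ["sacct", "-j", job_id, "--format=State", "--noheader"]


def summarize_states(sacct_output: str) -> Counter:
    """Count how many lines report each state, ignoring padding and blank lines."""
    return Counter(line.strip() for line in sacct_output.splitlines() if line.strip())


def query_state_summary(job_id: str, run: Callable = subprocess.run) -> Counter:
    """Run sacct and summarize the states; `run` is injectable for testing."""
    result = run(build_sacct_command(job_id), capture_output=True, text=True, check=True)
    return summarize_states(result.stdout)


before_hook = "   RUNNING\n   RUNNING\n   RUNNING\n   RUNNING\n   RUNNING\n"
after_hook = "   RUNNING\n   RUNNING\n   RUNNING\n COMPLETED\n   RUNNING\n"

print(summarize_states(before_hook)["RUNNING"])   # 5
print(summarize_states(after_hook)["RUNNING"])    # 4
print(summarize_states(after_hook)["COMPLETED"])  # 1
```

With a summary like this, the job counts as finished only when no state other than COMPLETED has a non-zero count.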

@TaekyungHeo added the bug label (Something isn't working) on Jan 15, 2025
@TaekyungHeo merged commit d2cc9b4 into NVIDIA:main on Jan 15, 2025
2 checks passed