
Add slurm backend #1

Open · wants to merge 38 commits into master
Conversation

bouthilx (Owner)
No description provided.

bouthilx and others added 30 commits October 17, 2017 20:50
It is difficult to debug resuming while important processes are taking place in
the PBS script automatically built by SmartDispatch.

We add a verbose option to the smart-dispatch script and add debugging prints in the epilog.
Why:

For each option, add_sbatch_option would add the option in both forms:
--[OPTION_NAME] and [OPTION_NAME].
Many conversions will be needed, not only for resources, so it is better
to make this clean.
Slurm has no queues, so the PBS option -q is invalid and non-convertible.
$PBS_JOBID was used to set the stdout/err of the job as well as in the
commands. Replace them with $SLURM_JOB_ID.

Also, workers were accessing os.environ['PBS_JOBID'], so we added a second
fetch on SLURM_JOB_ID in case os.environ['PBS_JOBID'] is undefined.
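A minimal sketch of both changes, assuming commands are kept as plain strings; the helper names are illustrative, not the actual SmartDispatch API:

```python
import os

def convert_job_id_references(command):
    # Rewrite $PBS_JOBID so it resolves under Slurm, both in the
    # commands and in the stdout/stderr paths built from them.
    return command.replace("$PBS_JOBID", "$SLURM_JOB_ID")

def get_job_id():
    # Workers try PBS_JOBID first, then fall back to SLURM_JOB_ID
    # when the former is undefined.
    return os.environ.get("PBS_JOBID") or os.environ.get("SLURM_JOB_ID")
```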
Slurm cannot be passed environment variables defined locally on the
command line, like PBS_FILENAME is. To bypass this, we add a definition
in the prolog, making PBS_FILENAME available to all commands and the epilog.

NOTE: We leave the PBS_FILENAME definition on the command line too, so that
any user using $PBS_FILENAME inside a custom pbsFlag can still do so.
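A sketch of the idea, assuming the prolog is kept as a list of shell lines (the function name is hypothetical):

```python
def add_pbs_filename_to_prolog(prolog_lines, pbs_filename):
    # Exporting in the prolog makes PBS_FILENAME visible to all
    # commands and to the epilog, since Slurm does not forward
    # variables defined locally on the submit command line.
    return ['export PBS_FILENAME="%s"' % pbs_filename] + prolog_lines
```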
The PBS option -V is not converted properly to SBATCH --export=ALL.
We remove it and replace it with --export=ALL in the sbatch options.
Slurm does not have an equivalent environment variable to
PBS_WALLTIME. To avoid confusion, all PBS_WALLTIME variables are renamed
to SBATCH_TIMELIMIT (the environment variable one would use to set --time
with sbatch). As SBATCH_TIMELIMIT is not set automatically, we add it to
the prolog to make it available to all commands and the epilog.

NOTE: PBS_WALLTIME is set in seconds, but we only have HH:MM:SS-like strings
at the time of building the PBS file. We needed to add a
walltime_to_seconds helper function to convert HH:MM:SS-like strings
into seconds, so that SBATCH_TIMELIMIT is set in seconds like
PBS_WALLTIME.
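A minimal sketch of such a helper, assuming colon-separated fields from seconds up to days; this is an illustration, not necessarily the committed implementation:

```python
def walltime_to_seconds(walltime):
    # Accepts "SS", "MM:SS", "HH:MM:SS" or "DD:HH:MM:SS".
    seconds_per_unit = [1, 60, 60 * 60, 24 * 60 * 60]
    fields = [int(field) for field in walltime.split(":")]
    return sum(value * unit
               for value, unit in zip(reversed(fields), seconds_per_unit))
```

For example, walltime_to_seconds("01:30:00") gives 5400.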
It is possible to query the system to see if some commands are available
using distutils.spawn.find_executable(command_name).

Clusters where more than one launcher is available will still get
launchers selected based on string matching. For instance,
get_launcher("helios") would always return msub no matter what is
available on the system.
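For example (the launcher list here is illustrative):

```python
from distutils.spawn import find_executable

LAUNCHER_COMMANDS = ["sbatch", "msub", "qsub"]  # illustrative list

def detect_available_launchers():
    # Keep only the launchers actually present on this system.
    return [command for command in LAUNCHER_COMMANDS
            if find_executable(command) is not None]
```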
JobGenerators are selected by job_generator_factory based on the
cluster's name. We use a more flexible, duck-typing approach for Slurm
clusters. If the cluster name is not known, or does not match any of the
if-case clauses in the factory, we look at which launchers are available
on the system. If sbatch is one of them, a SlurmJobGenerator is built,
a JobGenerator otherwise.
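A condensed sketch of that fallback logic, with hypothetical class and variable names:

```python
from distutils.spawn import find_executable

class JobGenerator(object):
    pass

class SlurmJobGenerator(JobGenerator):
    pass

KNOWN_CLUSTER_GENERATORS = {}  # filled by the explicit if-case clauses

def job_generator_factory(cluster_name, *args, **kwargs):
    generator_cls = KNOWN_CLUSTER_GENERATORS.get(cluster_name)
    if generator_cls is None:
        # Duck typing: unknown cluster, decide from the launcher found.
        generator_cls = (SlurmJobGenerator
                         if find_executable("sbatch") is not None
                         else JobGenerator)
    return generator_cls(*args, **kwargs)
```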
The command `sacctmgr` fails on some computers (namely mila01), but the
current behavior gives the impression that sbatch is simply not available.

Printing the stderr makes it more obvious that sbatch should be
available, but something is broken behind sacctmgr. The message only
appears when using the -vv option, though.
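A sketch of the improved behavior; the exact sacctmgr invocation here is an assumption:

```python
import logging
from subprocess import PIPE, Popen

logger = logging.getLogger(__name__)

def get_slurm_cluster_name():
    # sacctmgr can fail even where sbatch is installed (e.g. mila01);
    # log its stderr (shown with -vv) instead of silently behaving as
    # if sbatch were unavailable.
    process = Popen(["sacctmgr", "show", "cluster", "-n"],
                    stdout=PIPE, stderr=PIPE)
    stdout, stderr = process.communicate()
    if process.returncode != 0:
        logger.debug("sacctmgr failed: %s", stderr.decode())
        return None
    fields = stdout.decode().split()
    return fields[0] if fields else None
```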
Add a script to run automatic verifications asserting the validity of the
current code.

The verifications are not automated unit tests: they automatically check
that the process executed successfully, but the administrator still
needs to verify manually, by reading the logs, that the requested
resources were provided.

Verifications can easily be combined, with complex ones building on top
of simpler ones (see the sketch after the list below).

Here is a list of all the verifications currently implemented for Slurm
clusters:

 1. very_simple_task                              (1  CPU)
 2. verify_simple_task_with_one_gpu               (1  CPU 1 GPU)
 3. verify_simple_task_with_many_gpus             (1  CPU X GPU)
 4. verify_many_task                              (X  CPU)
 5. verify_many_task_with_many_cores              (XY CPU)
 6. verify_many_task_with_one_gpu                 (X CPU X GPU)
 7. verify_many_task_with_many_gpus               (X CPU Y GPU)
 8. verify_simple_task_with_autoresume_unneeded   (1 CPU)
 9. verify_simple_task_with_autoresume_needed     (1 CPU)
10. verify_many_task_with_autoresume_needed       (X CPU)
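A sketch of how such verifications can build on one another (all names are illustrative; the real script's helpers are not shown in this PR excerpt):

```python
def run_verification(commands, cpus=1, gpus=0):
    # Hypothetical helper: submits through smart-dispatch, waits for
    # completion and asserts a zero exit status; the administrator then
    # reads the logs to confirm which resources were granted.
    ...

def verify_simple_task():
    run_verification(["echo simple task"], cpus=1)

def verify_simple_task_with_one_gpu():
    # Builds directly on the simpler verification, adding one GPU.
    run_verification(["echo gpu task"], cpus=1, gpus=1)
```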
bouthilx and others added 8 commits October 17, 2017 20:56
My initial thought was that get_launcher should raise an error when no
launcher is found on the system, since no job could be launched.
I realized that this would break the --doNotLaunch option that users may
want to use on systems with no launcher, just to create the files.
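So, rather than raising, get_launcher can simply report that nothing was found; a sketch, reusing the hypothetical detect_available_launchers above:

```python
LAUNCHERS_BY_CLUSTER = {"helios": "msub"}  # illustrative mapping

def get_launcher(cluster_name):
    launcher = LAUNCHERS_BY_CLUSTER.get(cluster_name)
    if launcher is not None:
        return launcher
    available = detect_available_launchers()
    # Return None instead of raising: --doNotLaunch users may only
    # want the files generated, on a system with no launcher at all.
    return available[0] if available else None
```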
The tests were failing because the account was not specified.
There was a missing parenthesis which was causing a bad conversion of
"DD:HH:MM:SS" to seconds.

The unit test was also missing the same parenthesis. I added a unit test
to make sure such an error could not occur again.
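The commit does not show the exact expression, but here is a plausible reconstruction of this kind of precedence bug, plus a regression test in the spirit described (reusing the walltime_to_seconds sketch above):

```python
# Buggy: days are multiplied by 24 but never scaled down to seconds.
#   total = days * 24 + hours * 60 * 60 + minutes * 60 + seconds
# Fixed: group days with hours before converting to seconds.
#   total = ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def test_walltime_to_seconds_with_days():
    # Regression test: "DD:HH:MM:SS" must fold days into the total.
    assert walltime_to_seconds("02:03:04:05") == \
        ((2 * 24 + 3) * 60 + 4) * 60 + 5
```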
Nodes on Cedar, for instance, do not have access to slurmdb. We could
turn to $CC_CLUSTER, but sacctmgr is still the more flexible solution.

Add a look_up_cluster_name_env_var helper function to look at
specific environment variables for cluster names. The current ones are
CC_CLUSTER for Compute Canada clusters and CLUSTER for Calcul Québec.
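A minimal sketch of such a helper, assuming it simply walks a list of known variables:

```python
import os

CLUSTER_NAME_ENV_VARS = ["CC_CLUSTER",  # Compute Canada clusters
                         "CLUSTER"]     # Calcul Quebec clusters

def look_up_cluster_name_env_var():
    for env_var in CLUSTER_NAME_ENV_VARS:
        cluster_name = os.environ.get(env_var)
        if cluster_name:
            return cluster_name
    return None
```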
Why:

Currently, calls with Popen in utils functions to detect the cluster
would crash with no informative error message if the executed command
line crashed.
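A sketch of a wrapper making such failures informative (the helper name is hypothetical):

```python
from subprocess import PIPE, Popen

def check_output_or_raise(command):
    # Surface the command line, exit code and stderr instead of
    # letting cluster detection crash with an opaque error.
    process = Popen(command, stdout=PIPE, stderr=PIPE)
    stdout, stderr = process.communicate()
    if process.returncode != 0:
        raise RuntimeError(
            "Command {0!r} failed with exit code {1}:\n{2}".format(
                " ".join(command), process.returncode, stderr.decode()))
    return stdout.decode()
```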
Note: used 2to3 + manual verifications prior to this commit.