retry docker image enroot when cluster requires specifying GPU resource #335

Open · wants to merge 4 commits into main from fix/ImageEnrootWithGPUsSpecified

Conversation

@lilyw97 lilyw97 commented Jan 13, 2025

Summary

Currently, the docker image enroot command only contains "account", "partition", and "docker image url".
But some clusters, such as CW-DFW, may require explicitly specifying the GPU resource via --gpus=N.
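
For illustration, the command is roughly of the form below, with --gpus appended on clusters that need it (the exact flags and placeholders here are an assumption, not copied from the code):

    srun --account=<account> --partition=<partition> --gpus=1 enroot import -o <image>.sqsh docker://<docker_image_url>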

Test Plan

Use Cloudaix to do the testing:

  • Log in to the CW-DFW data-copy node or a login node
  • Set up the Cloudaix environment
  • Clone this branch and install cloudai manually
  • Run: python cloudaix.py install --tests-dir ./conf/staging/dgx_cloud_acceptance_test/test/ --system-config ./conf/common/system/coreweave.toml
  • Confirm that the docker image has been downloaded to ./install successfully

Additional Notes

--gpus=1 is only added to the enroot command when the original command fails with a "Cannot find GPU specification" error:

except subprocess.CalledProcessError as e:
    # Retry enroot with `--gpus=1` if the cluster requires specifying a GPU resource
    if not retry and e.stderr and "Cannot find GPU specification" in e.stderr:
        srun_prefix += " --gpus=1"
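
A minimal, self-contained sketch of that retry flow (the names import_docker_image, srun_prefix, docker_image_url, and output_path are illustrative, not the actual CloudAI identifiers):

    import subprocess

    def import_docker_image(srun_prefix: str, docker_image_url: str, output_path: str, retry: bool = False) -> None:
        # Build the enroot import command and run it through srun.
        command = f"{srun_prefix} enroot import -o {output_path} docker://{docker_image_url}"
        try:
            subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
        except subprocess.CalledProcessError as e:
            # Retry once with `--gpus=1` if the cluster rejects commands without
            # an explicit GPU specification (e.g. CW-DFW).
            if not retry and e.stderr and "Cannot find GPU specification" in e.stderr:
                import_docker_image(srun_prefix + " --gpus=1", docker_image_url, output_path, retry=True)
            else:
                raise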

Contributor commented:

Does it hurt if --gpus is always set? Will that work for IL1 or EOS? Alternatively, can we use the SlurmSystem.gpus_per_node value to check whether --gpus should be specified?

The main idea is to construct the command correctly from the start and not retry based on an error.

P.S. Thanks for moving the code into its own function, it totally makes sense.
P.P.S. Please extend the unit tests (tests/test_docker_image_cache_manager.py); you'll find many examples there.


It hurts. Unfortunately, EOS fails if --gpus is set.

@lilyw97 lilyw97 (Author) commented Jan 14, 2025

I am asking the EoS team why --gpus causes an error.
SlurmSystem.gpus_per_node might work; below is what I get from EoS and DFW:

DFW:
(venv) lilyw@cw-dfw-cs-001-login-01:/lustre/fsw/portfolios/general/users/lilyw/cloudaix$ sinfo -eO "NodeList,Gres"
NODELIST GRES
cw-dfw-h100-001-004-gpu:8
cw-dfw-cpu1-003-017-(null)

EoS:
lilyw@login-eos01:/lustre/fsw/sw_aidot/lilyw$ sinfo -eO "NodeList,Gres"
NODELIST GRES
eos[0001-0096,0113-0(null)

It seems EoS doesn't specify any GRES on its nodes, so --gpus may fail,
while DFW has GRES specified, so --gpus works.

Contributor commented:

CloudAI is using SlurmSystem.gpus_per_node for sbatch script generation:

...
if self.system.gpus_per_node:
    batch_script_content.append(f"#SBATCH --gpus-per-node={self.system.gpus_per_node}")
    batch_script_content.append(f"#SBATCH --gres=gpu:{self.system.gpus_per_node}")
...

So since we already rely on it for sbatch, it makes sense to use it for srun as well. Please double-check on IL1.
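
A minimal sketch of that alternative, building the srun prefix up front from gpus_per_node instead of retrying on error (the system object stands in for SlurmSystem; the account and default_partition attribute names are assumptions for this sketch):

    def build_enroot_srun_prefix(system) -> str:
        # Construct the srun prefix once, adding --gpus only when the system
        # config declares GPUs per node.
        parts = ["srun", f"--account={system.account}", f"--partition={system.default_partition}"]
        if system.gpus_per_node:
            parts.append(f"--gpus={system.gpus_per_node}")
        return " ".join(parts)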

@lilyw97 lilyw97 (Author) commented Jan 14, 2025

I see, so system.gpus_per_node relies on the same parameter set in the input system.toml file.

And I think those parameters in the system.toml file are set manually? In cloudaix, IL1 and EoS don't have gpus_per_node, but Coreweave does.

Is there a formal way to determine these parameters for different clusters when generating system.toml?
cc @srivatsankrishnan

I will test on IL1 once I get cluster permission.

Contributor commented:

From my understanding, such requirements are per-cluster and there is no way to know that upfront.

@TaekyungHeo TaekyungHeo (Member) commented Jan 14, 2025

@lilyw97,

  • Please use the gpus_per_node field in the system schema, and add the GPU option when it is present.
  • Please ensure that your PR works on IL-1, EOS, and Coreweave.

@TaekyungHeo TaekyungHeo added the bug Something isn't working label Jan 14, 2025
@lilyw97 lilyw97 force-pushed the fix/ImageEnrootWithGPUsSpecified branch from be75b0d to 589f2ae Compare January 15, 2025 08:43