retry docker image enroot when cluster requires specifying GPU resource #335

Open · wants to merge 4 commits into main from fix/ImageEnrootWithGPUsSpecified

Conversation

@lilyw97 lilyw97 commented Jan 13, 2025

Summary

Currently, the docker image enroot command only contains "account", "partition", and "docker image url".
But some clusters, such as CW-DFW, may require explicitly specifying the GPU resource via --gpus=N.
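
For illustration, the command is roughly of the form below, with --gpus appended on clusters that need it (the exact flags and placeholders here are an assumption, not copied from the code):

    srun --account=<account> --partition=<partition> --gpus=1 enroot import -o <image>.sqsh docker://<docker_image_url>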

Test Plan

Use Cloudaix to do the testing:

  • Log in to the CW-DFW data-copy node or a login node
  • Set up the Cloudaix environment
  • Clone this branch and install cloudai manually
  • Run: python cloudaix.py install --tests-dir ./conf/staging/dgx_cloud_acceptance_test/test/ --system-config ./conf/common/system/coreweave.toml
  • Confirm that the docker image has been downloaded to ./install successfully

Additional Notes

--gpus=1 is only added to the enroot command when the original command fails with a "Cannot find GPU specification" error:

except subprocess.CalledProcessError as e:
    # Retry enroot with `--gpus=1` if the cluster requires specifying a GPU resource
    if not retry and e.stderr and "Cannot find GPU specification" in e.stderr:
        srun_prefix += " --gpus=1"
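
A minimal, self-contained sketch of that retry flow (the names import_docker_image, srun_prefix, docker_image_url, and output_path are illustrative, not the actual CloudAI identifiers):

    import subprocess

    def import_docker_image(srun_prefix: str, docker_image_url: str, output_path: str, retry: bool = False) -> None:
        # Build the enroot import command and run it through srun.
        command = f"{srun_prefix} enroot import -o {output_path} docker://{docker_image_url}"
        try:
            subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
        except subprocess.CalledProcessError as e:
            # Retry once with `--gpus=1` if the cluster rejects commands without
            # an explicit GPU specification (e.g. CW-DFW).
            if not retry and e.stderr and "Cannot find GPU specification" in e.stderr:
                import_docker_image(srun_prefix + " --gpus=1", docker_image_url, output_path, retry=True)
            else:
                raise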

Contributor commented:

Does it hurt if --gpus is always set? Will that work for IL1 or EOS? Alternatively, can we use the SlurmSystem.gpus_per_node value to check whether --gpus should be specified?

The main idea is to construct the command correctly from the start and not retry based on an error.

P.S. Thanks for moving the code into its own function, it totally makes sense.
P.P.S. Please extend the unit tests (tests/test_docker_image_cache_manager.py); you'll find many examples there.


It hurts. Unfortunately, EOS fails if --gpus is set.

@lilyw97 lilyw97 (Author) commented Jan 14, 2025

I am asking the EoS team why --gpus causes an error.
SlurmSystem.gpus_per_node might work; below is what I get from EoS and DFW:

DFW:
(venv) lilyw@cw-dfw-cs-001-login-01:/lustre/fsw/portfolios/general/users/lilyw/cloudaix$ sinfo -eO "NodeList,Gres"
NODELIST GRES
cw-dfw-h100-001-004-gpu:8
cw-dfw-cpu1-003-017-(null)

EoS:
lilyw@login-eos01:/lustre/fsw/sw_aidot/lilyw$ sinfo -eO "NodeList,Gres"
NODELIST GRES
eos[0001-0096,0113-0(null)

It seems EoS doesn't specify any GRES on its nodes, so --gpus may fail,
while DFW has GRES specified, so --gpus works.

Contributor commented:

CloudAI is using SlurmSystem.gpus_per_node for sbatch script generation:

...
if self.system.gpus_per_node:
    batch_script_content.append(f"#SBATCH --gpus-per-node={self.system.gpus_per_node}")
    batch_script_content.append(f"#SBATCH --gres=gpu:{self.system.gpus_per_node}")
...

So since we already rely on it for sbatch, it makes sense to use it for srun as well. Please double-check on IL1.
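
A minimal sketch of that alternative, building the srun prefix up front from gpus_per_node instead of retrying on error (the system object stands in for SlurmSystem; the account and default_partition attribute names are assumptions for this sketch):

    def build_enroot_srun_prefix(system) -> str:
        # Construct the srun prefix once, adding --gpus only when the system
        # config declares GPUs per node.
        parts = ["srun", f"--account={system.account}", f"--partition={system.default_partition}"]
        if system.gpus_per_node:
            parts.append(f"--gpus={system.gpus_per_node}")
        return " ".join(parts)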

@lilyw97 lilyw97 (Author) commented Jan 14, 2025

I see, so system.gpus_per_node relies on the same parameter set in the input system.toml file.

And I think those parameters in the system.toml file are set manually? In cloudaix, IL1 and EoS don't have gpus_per_node, but Coreweave does.

Is there a formal way to determine these parameters for different clusters when generating system.toml?
cc @srivatsankrishnan

I will test on IL1 once I get cluster permission.

Contributor commented:

From my understanding, such requirements are per-cluster and there is no way to know that upfront.

@TaekyungHeo TaekyungHeo (Member) commented Jan 14, 2025

@lilyw97,

  • Please use the gpus_per_node field in the system schema, and add the GPU option when it is present.
  • Please ensure that your PR works on IL-1, EOS, and Coreweave.

@TaekyungHeo TaekyungHeo added the bug Something isn't working label Jan 14, 2025
@lilyw97 lilyw97 force-pushed the fix/ImageEnrootWithGPUsSpecified branch from be75b0d to 589f2ae Compare January 15, 2025 08:43