retry docker image enroot when cluster requires specifying GPU resource #335
base: main
Conversation
```python
except subprocess.CalledProcessError as e:
    # Retry enroot by adding `--gpus=1` if the cluster requires specifying a GPU resource
    if not retry and e.stderr and "Cannot find GPU specification" in e.stderr:
        srun_prefix += " --gpus=1"
```
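The retry flow this diff implements could be sketched roughly as follows (a hedged sketch; `should_retry_with_gpus` and `run_with_gpu_retry` are hypothetical names, not CloudAI's actual helpers):

```python
import subprocess

GPU_ERROR = "Cannot find GPU specification"

def should_retry_with_gpus(err: subprocess.CalledProcessError, already_retried: bool) -> bool:
    # Mirror the diff's condition: retry at most once, and only when
    # stderr mentions the missing GPU specification.
    return not already_retried and bool(err.stderr) and GPU_ERROR in err.stderr

def run_with_gpu_retry(srun_prefix: str, command: str) -> subprocess.CompletedProcess:
    """Run `command` under srun; retry once with --gpus=1 on GPU-spec errors."""
    try:
        return subprocess.run(f"{srun_prefix} {command}", shell=True,
                              check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        if should_retry_with_gpus(e, already_retried=False):
            return subprocess.run(f"{srun_prefix} --gpus=1 {command}", shell=True,
                                  check=True, capture_output=True, text=True)
        raise
```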
Does it hurt if `--gpus` is always set? Will that work for IL1 or EOS? Alternatively, can we use the `SlurmSystem.gpus_per_node` value to check whether `--gpus` should be specified?
The main idea is to construct the CMD correctly from the start rather than retrying based on an error.
P.S. Thanks for moving the code into its own function, it totally makes sense.
P.P.S. Please extend the unit tests (`tests/test_docker_image_cache_manager.py`), you'll find many examples there.
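A unit test for the retry path might look like this (a sketch only; the real tests in `tests/test_docker_image_cache_manager.py` will differ, and `run_with_gpu_retry` is a minimal hypothetical stand-in for the function under review):

```python
import subprocess
from unittest import mock

def run_with_gpu_retry(srun_prefix: str, command: str):
    """Minimal stand-in for the function under review (hypothetical)."""
    try:
        return subprocess.run(f"{srun_prefix} {command}", shell=True,
                              check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        if e.stderr and "Cannot find GPU specification" in e.stderr:
            return subprocess.run(f"{srun_prefix} --gpus=1 {command}", shell=True,
                                  check=True, capture_output=True, text=True)
        raise

def test_retries_with_gpus_on_gpu_spec_error():
    # First srun call fails with the GPU-spec error, the retry succeeds.
    error = subprocess.CalledProcessError(
        1, "srun", stderr="srun: error: Cannot find GPU specification")
    ok = subprocess.CompletedProcess("srun", 0, stdout="imported", stderr="")
    with mock.patch("subprocess.run", side_effect=[error, ok]) as fake_run:
        result = run_with_gpu_retry("srun -p batch", "enroot import docker://img")
    assert result.stdout == "imported"
    # The second invocation must include --gpus=1.
    assert "--gpus=1" in fake_run.call_args_list[1].args[0]
```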
It hurts. EOS fails if `--gpus` is set, unfortunately.
I am asking the EoS team why `--gpus` causes an error.
`SlurmSystem.gpus_per_node` might work; below is what I get from EoS and DFW.

DFW:

```
(venv) lilyw@cw-dfw-cs-001-login-01:/lustre/fsw/portfolios/general/users/lilyw/cloudaix$ sinfo -eO "NodeList,Gres"
NODELIST GRES
cw-dfw-h100-001-004-gpu:8
cw-dfw-cpu1-003-017-(null)
```

EoS:

```
lilyw@login-eos01:/lustre/fsw/sw_aidot/lilyw$ sinfo -eO "NodeList,Gres"
NODELIST GRES
eos[0001-0096,0113-0(null)
```

It seems EoS doesn't assign any GRES to its nodes, so `--gpus` may fail, while DFW has GRES specified, so `--gpus` works.
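This manual check could in principle be automated by inspecting the `sinfo` output; a minimal sketch (not CloudAI code, and `cluster_advertises_gpu_gres` is a hypothetical helper; real `sinfo` output has column padding this substring check glosses over):

```python
def cluster_advertises_gpu_gres(sinfo_output: str) -> bool:
    """Return True if any node line in `sinfo -eO "NodeList,Gres"` output
    advertises a gpu GRES (e.g. `gpu:8`), suggesting --gpus is safe to pass."""
    for line in sinfo_output.splitlines()[1:]:  # skip the NODELIST/GRES header row
        if "gpu:" in line:
            return True
    return False
```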
CloudAI is using `SlurmSystem.gpus_per_node` for sbatch script generation:

```python
...
if self.system.gpus_per_node:
    batch_script_content.append(f"#SBATCH --gpus-per-node={self.system.gpus_per_node}")
    batch_script_content.append(f"#SBATCH --gres=gpu:{self.system.gpus_per_node}")
...
```

So if we are relying on it for sbatch, it makes sense to use it for srun as well. Please double check on IL1.
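Mirroring that sbatch logic for the srun prefix could look like this (a sketch assuming a `SlurmSystem`-like object; `SlurmSystemStub` and `build_srun_prefix` are illustrative names, not CloudAI's API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SlurmSystemStub:
    """Minimal stand-in for SlurmSystem (only the fields used here)."""
    account: str
    partition: str
    gpus_per_node: Optional[int] = None

def build_srun_prefix(system: SlurmSystemStub) -> str:
    """Construct the srun prefix up front, adding a GPU request only when
    the system config declares gpus_per_node (no error-driven retry)."""
    prefix = f"srun --account={system.account} --partition={system.partition}"
    if system.gpus_per_node:
        prefix += f" --gres=gpu:{system.gpus_per_node}"
    return prefix
```

This keeps the decision in the system config, so clusters like EoS that reject GPU flags simply leave `gpus_per_node` unset.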
I see, so `system.gpus_per_node` relies on the same parameter set in the input `system.toml` file.
And I think those parameters in the `system.toml` file are being set manually? In cloudaix, IL1 and EoS don't have `gpus_per_node`, while Coreweave has it.
I'm wondering whether there is a formal way to determine these parameters for different clusters when generating `system.toml`?
cc @srivatsankrishnan
I will test on IL1 once I get cluster permission.
From my understanding, such requirements are per-cluster and there is no way to know that upfront.
@lilyw97 ,
- Please use the `gpus_per_node` field in the system schema, and add the GPU option when it is present.
- Please ensure that your PR works on IL-1, EOS, and Coreweave.
Force-pushed `…require in command line` from be75b0d to 589f2ae
Summary
Currently the docker image enroot command only contains "account", "partition", and "docker image url".
But some clusters require explicitly specifying the GPU resource with `--gpus=N`, like CW-DFW.
Test Plan
Use Cloudaix to do the testing:

```
python cloudaix.py install --tests-dir ./conf/staging/dgx_cloud_acceptance_test/test/ --system-config ./conf/common/system/coreweave.toml
```
Additional Notes
`--gpus=1` will only be added to the enroot command when the original command fails with a `Cannot find GPU specification` error.