Do not require enroot on head node #323
Conversation
Some Slurm setups do not allow running enroot from the head node. Let's rely on an actual `enroot import` run via srun and report its real error message to the user.
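For reference, the fallback this PR relies on can be sketched as follows. This is a minimal illustration, not CloudAI's actual code: the function names are hypothetical, and the srun flags are assumptions based on the command visible in this thread.

```python
import subprocess


def build_import_command(url: str, partition: str, account: str,
                         sqsh_path: str) -> str:
    # Wrap `enroot import` in srun so it runs on a compute node rather
    # than the head node (flags are illustrative, not CloudAI's exact ones).
    return (
        f"srun --export=ALL --partition={partition} --account={account} "
        f"enroot import -o {sqsh_path} docker://{url}"
    )


def import_docker_image(url: str, partition: str, account: str,
                        sqsh_path: str) -> None:
    cmd = build_import_command(url, partition, account, sqsh_path)
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface srun/enroot's real error message to the user instead of
        # pre-checking URL accessibility from the head node.
        raise RuntimeError(
            f"Failed to import Docker image {url}. "
            f"Command: {cmd}. Error: {result.stderr.strip()}"
        )
```

The key design point is that no `enroot` binary is needed where CloudAI itself runs; any failure (bad URL, auth error, busy cluster) is reported from srun's own stderr.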
- It looks good. However, I have one concern: what if the system does not have enough nodes to run `srun enroot`? In general, all EOS nodes are busy. If that happens, any URL checks could be blocked indefinitely.
- I see two types of changes: bug fixes and refactoring. We might need to spin off the refactoring changes as a separate PR. This step is optional and entirely up to you.
On systems like IL1, checking on the head node is useful, since most of the time all the nodes are reserved and enroot is available on the head node. On CW, we do have enroot on the head node (in the data-copier and VS Code nodes). On EOS, it seems to be unavailable.
Is this going to be a policy decision that we will always use a resource allocation when installing CloudAI? Also, as Taekyung mentioned, getting an allocation on busy clusters could be an issue. I don't have an issue with this PR if it's a policy decision: if users want to install CloudAI on these clusters, they'd better have a valid allocation in the first place, and that would be a requirement.
Thanks @TaekyungHeo and @srivatsankrishnan! This PR removes the URL accessibility check completely; we now fully rely on the actual `srun ... enroot import` result, e.g.:

```
[ERROR] 4/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.09): Failed to import Docker image nvcr.io/nvidia/nemo:24.09. Command: srun --export=ALL --partition=batch --account=... enroot import -o /cloudai-install/nvcr.io_nvidia__nemo__24.09.sqsh docker://nvcr.io/nvidia/nemo:24.09. Error: srun: WARNING: Please set a name for this job, formatted like this:
srun: ...-<subproject>.<details>
srun: job 1763168 queued and waiting for resources
srun: job 1763168 has been allocated resources
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: <anonymous>
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[ERROR] URL https://registry-1.docker.io/v2/nvcr.io/nvidia/nemo/manifests/24.09 returned error code: 401 Unauthorized
srun: error: eos0469: task 0: Exited with exit code 1
srun: Terminating StepId=1763168.0
```
Verified on IL1, works as expected.
Summary
Do not check image accessibility from the head node using `enroot`: many systems don't allow such actions. Instead, rely on the actual `srun ... enroot import ...` result and report its error to the user.

Addresses https://redmine.mellanox.com/issues/4212401.
Test Plan
For both runs below, failure is expected because the test TOML has an invalid URL (`docker://DOCKER_IMAGE`).

First run:

Second run:
Additional Notes
—