Do not require enroot on head node #323

amaslenn · 2025-01-06T15:49:08Z

Summary

Do not check image accessibility from the head node using enroot, because many systems don't allow it. Instead, rely on the result of the actual srun ... enroot import ... command and report its error to the user.

Addresses https://redmine.mellanox.com/issues/4212401.
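A minimal sketch of the approach described in the summary, assuming the srun flags seen in the logs of this PR. The names `build_import_cmd` and `import_image` and the injectable `run` argument are hypothetical and used only for illustration; they are not CloudAI's actual API.

```python
import shlex
import subprocess
from typing import Callable, List, Optional, Tuple


def build_import_cmd(
    url: str, out_path: str, partition: str, account: Optional[str] = None
) -> List[str]:
    """Build an `srun ... enroot import ...` command (hypothetical helper)."""
    cmd = ["srun", "--export=ALL", f"--partition={partition}"]
    if account:
        cmd.append(f"--account={account}")
    cmd += ["enroot", "import", "-o", out_path, f"docker://{url}"]
    return cmd


def import_image(
    url: str,
    out_path: str,
    partition: str,
    account: Optional[str] = None,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> Tuple[bool, str]:
    """Run the import on a compute node; no accessibility pre-check on the head node."""
    cmd = build_import_cmd(url, out_path, partition, account)
    proc = run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return True, f"Docker image cached successfully at {out_path}."
    # Report the exact command and the captured stderr so the user can
    # reproduce the failure outside of the tool.
    return False, (
        f"Failed to import Docker image {url}. "
        f"Command: {shlex.join(cmd)}. Error: {proc.stderr}"
    )
```

Passing `run` as a parameter is only a convenience for testing the sketch without a Slurm cluster; by default it calls `subprocess.run`.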

Test Plan

CI
Manual run on EOS with caching enabled.

For both runs below, the failure is expected because the test TOML has an invalid URL (docker://DOCKER_IMAGE).

First run:

cloudai install --system-config eos.toml --tests-dir conf/common/test/
[INFO] System Name: EOS
[INFO] Scheduler: slurm
[INFO] Not all components are ready
[INFO] Going to install 5 item(s)
[INFO] 1/5 Installation of PythonExecutable(git_url=https://github.com/NVIDIA/NeMo-Framework-Launcher.git, commit_hash=599ecfcbbd64fd2de02f2cc093b1610d73854022): OK
[ERROR] 2/5 Installation of DockerImage(url=DOCKER_IMAGE): Failed to import Docker image DOCKER_IMAGE. Command: srun --export=ALL --partition=batch --account=... enroot import -o /cloudai-install/__DOCKER_IMAGE__notag.sqsh docker://DOCKER_IMAGE. Error: srun: WARNING: Please set a name for this job, formatted like this:
srun: 	...-<subproject>.<details>
srun: job 1760479 queued and waiting for resources
srun: job 1760479 has been allocated resources
srun: error: eos0549: task 0: Exited with exit code 1
srun: Terminating StepId=1760479.0
[ERROR] Invalid image reference: docker://DOCKER_IMAGE

[INFO] 3/5 Installation of DockerImage(url=nvcr.io/nvidia/pytorch:24.02-py3): Docker image cached successfully at /cloudai-install/nvcr.io_nvidia__pytorch__24.02-py3.sqsh.
[INFO] 4/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.09): Docker image cached successfully at /cloudai-install/nvcr.io_nvidia__nemo__24.09.sqsh.
[INFO] 5/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.05.01): Docker image cached successfully at /cloudai-install/nvcr.io_nvidia__nemo__24.05.01.sqsh.
[ERROR] 1 item(s) failed to install.
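The cached file names in the log above appear to follow a simple pattern derived from the image URL. The sketch below reproduces the three name shapes seen in this log; the function name `cache_filename` is hypothetical, and the rule is inferred from the output only, so the actual CloudAI implementation may differ.

```python
def cache_filename(url: str) -> str:
    """Derive a .sqsh file name from an image URL (rule inferred from the logs)."""
    # Everything before the last "/" is the registry/namespace prefix.
    prefix, _, name = url.rpartition("/")
    # A ":" after the last "/" separates the image name from the tag.
    image, sep, tag = name.partition(":")
    if not sep:
        tag = "notag"
    prefix = prefix.replace("/", "_")
    return f"{prefix}__{image}__{tag}.sqsh"
```

For the invalid test URL `DOCKER_IMAGE` there is no prefix and no tag, which is consistent with the `__DOCKER_IMAGE__notag.sqsh` path shown in the failing command.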

Second run:

cloudai install --system-config eos.toml --tests-dir conf/common/test/
[INFO] System Name: EOS
[INFO] Scheduler: slurm
[INFO] Not all components are ready
[INFO] Going to install 5 item(s)
[WARNING] Git repository already exists at /cloudai-install/NeMo-Framework-Launcher__599ecfcbbd64fd2de02f2cc093b1610d73854022.
[WARNING] Virtual environment already exists at /cloudai-install/NeMo-Framework-Launcher__599ecfcbbd64fd2de02f2cc093b1610d73854022-venv.
[INFO] 1/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.09): Cached Docker image already exists at /cloudai-install/nvcr.io_nvidia__nemo__24.09.sqsh.
[INFO] 2/5 Installation of DockerImage(url=nvcr.io/nvidia/pytorch:24.02-py3): Cached Docker image already exists at /cloudai-install/nvcr.io_nvidia__pytorch__24.02-py3.sqsh.
[INFO] 3/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.05.01): Cached Docker image already exists at /cloudai-install/nvcr.io_nvidia__nemo__24.05.01.sqsh.
[INFO] 4/5 Installation of PythonExecutable(git_url=https://github.com/NVIDIA/NeMo-Framework-Launcher.git, commit_hash=599ecfcbbd64fd2de02f2cc093b1610d73854022): OK
[ERROR] 5/5 Installation of DockerImage(url=DOCKER_IMAGE): Failed to import Docker image DOCKER_IMAGE. Command: srun --export=ALL --partition=batch --account=... enroot import -o /cloudai-install/__DOCKER_IMAGE__notag.sqsh docker://DOCKER_IMAGE. Error: srun: WARNING: Please set a name for this job, formatted like this:
srun: 	...-<subproject>.<details>
srun: job 1760594 queued and waiting for resources
srun: job 1760594 has been allocated resources
[ERROR] Invalid image reference: docker://DOCKER_IMAGE
srun: error: eos0022: task 0: Exited with exit code 1
srun: Terminating StepId=1760594.0

[ERROR] 1 item(s) failed to install.
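The second run shows that already cached images are skipped, while the failed image is attempted again because no file was produced for it. A minimal sketch of that behavior, assuming a hypothetical `ensure_cached` helper and a caller-supplied `do_import` callback (neither is taken from the CloudAI code base):

```python
from pathlib import Path
from typing import Callable, Tuple


def ensure_cached(
    cache_path: Path, do_import: Callable[[Path], Tuple[bool, str]]
) -> Tuple[bool, str]:
    """Skip the import when the cached file exists; otherwise run it."""
    if cache_path.is_file():
        return True, f"Cached Docker image already exists at {cache_path}."
    # No cached file: delegate to the real import (for example, srun + enroot)
    # and pass its success flag and message through unchanged.
    return do_import(cache_path)
```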

Additional Notes

—

Some Slurm setups do not allow running enroot from the head node. Let's rely on the actual 'enroot import' run via srun and report its real error message to the user.

TaekyungHeo

It looks good. However, I have one concern: what if the system does not have enough nodes to run srun enroot? In general, all EOS nodes are busy. If that happens, any URL checks could be blocked indefinitely.
I see two types of changes: bug fixes and refactoring. We might need to spin off the refactoring changes as a separate PR. This step is optional and entirely up to you.

srivatsankrishnan

On systems like IL1, checking on the head node is useful: most of the time all the nodes are reserved, and enroot is available on the head node. On CW, we do have enroot on the head node (on the data copier and vscode nodes). On EOS, it does not seem to be available.

Is this going to be a policy decision that we always use a resource allocation when installing CloudAI? Also, as Taekyung mentioned, getting an allocation on busy clusters could be an issue. I don't have an issue with this PR if it's a policy decision. If users want to install CloudAI, they should have a valid allocation on these clusters in the first place, and that would be a requirement.

amaslenn · 2025-01-07T08:10:17Z

Thanks @TaekyungHeo and @srivatsankrishnan!

This PR removes the URL accessibility check completely; we now rely fully on the enroot import ... result and output. I find its output more understandable and actionable, as we also provide the enroot command. In many cases, token issues were reported as CloudAI issues; now it should be clear to users where the problem is. For example, here is the output from a run with the access token removed:

[ERROR] 4/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.09): Failed to import Docker image nvcr.io/nvidia/nemo:24.09. Command: srun --export=ALL --partition=batch --account=... enroot import -o /cloudai-install/nvcr.io_nvidia__nemo__24.09.sqsh docker://nvcr.io/nvidia/nemo:24.09. Error: srun: WARNING: Please set a name for this job, formatted like this:
srun: 	...-<subproject>.<details>
srun: job 1763168 queued and waiting for resources
srun: job 1763168 has been allocated resources
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: <anonymous>
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[ERROR] URL https://registry-1.docker.io/v2/nvcr.io/nvidia/nemo/manifests/24.09 returned error code: 401 Unauthorized
srun: error: eos0469: task 0: Exited with exit code 1
srun: Terminating StepId=1763168.0

Busy nodes concern. Installation uses srun anyway, so in this sense the behavior is unchanged. Indeed, some environments could be a problem, which is why caching can be disabled. I've tested this on EOS and was able to run 4 installations in <2 hours.
Bugfix & refactoring. I'm not sure which is which here: removing the function left an unused argument, and testing on EOS required passing the account, which in turn required passing the whole SlurmSystem. I've tried not to change anything unrelated.
IL1 question. I can't run enroot on the IL1 head node anymore, can you? I believe there was a change in the system, which is why the mentioned bug was opened and other users are reporting similar problems as well.

amaslenn · 2025-01-07T08:47:13Z

Verified on IL1, works as expected.

amaslenn added 5 commits January 6, 2025 15:54

Do not check image accessibility using "local" enroot

67909de

Some Slurm setups do not allow running enroot from the head node. Let's rely on the actual 'enroot import' run via srun and report its real error message to the user.

Do not require enroot binary on head node

55d8be4

Pass SlurmSystem into DockerImageCacheManager

6a1c5b8

Specify account while caching images

b777d6f

Reduce noise in CLI output

e32649a

amaslenn requested review from TaekyungHeo, srivatsankrishnan, srinivas212 and Bohatchuk January 6, 2025 15:49

amaslenn added 2 commits January 6, 2025 16:52

Merge branch 'main' into am/no-access-check

13a6dd2

Make ruff happy

e9f2ee9

TaekyungHeo added the bug label Jan 6, 2025

TaekyungHeo reviewed Jan 6, 2025


srivatsankrishnan reviewed Jan 7, 2025


TaekyungHeo approved these changes Jan 7, 2025


srivatsankrishnan approved these changes Jan 7, 2025


amaslenn merged commit 1bfdef8 into main Jan 7, 2025
2 checks passed

amaslenn deleted the am/no-access-check branch January 7, 2025 16:05

amaslenn mentioned this pull request Jan 8, 2025

Do not require enroot on head node (backport #323) #326

Open
