Do not require enroot on head node #323

Merged
merged 7 commits into from
Jan 7, 2025

Conversation

amaslenn
Contributor

@amaslenn amaslenn commented Jan 6, 2025

Summary

Do not check image accessibility from the head node using enroot: many systems don't allow running enroot there. Instead, rely on the actual result of srun ... enroot import ... and report its error to the user.

Addresses https://redmine.mellanox.com/issues/4212401.
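The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names ImportResult, build_import_cmd and import_image are hypothetical, and the srun/enroot flags are simply the ones visible in the logs in the Test Plan. The idea is to skip any pre-check on the head node, run the import through srun, and on failure return both the exact command and the captured stderr.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ImportResult:
    """Outcome of one import attempt (hypothetical helper type)."""
    success: bool
    message: str


def build_import_cmd(url: str, out_path: str, partition: str,
                     account: Optional[str] = None) -> List[str]:
    """Build an 'srun ... enroot import ...' command like the one in the logs."""
    cmd = ["srun", "--export=ALL", f"--partition={partition}"]
    if account:
        cmd.append(f"--account={account}")
    cmd += ["enroot", "import", "-o", out_path, f"docker://{url}"]
    return cmd


def import_image(cmd: List[str],
                 run: Callable[..., subprocess.CompletedProcess] = subprocess.run
                 ) -> ImportResult:
    """Run the import and surface the command's real error on failure.

    No accessibility check is done beforehand; the command's exit code and
    stderr are the only source of truth. 'run' is injectable for testing.
    """
    printable = " ".join(cmd)
    try:
        proc = run(cmd, capture_output=True, text=True)
    except FileNotFoundError as exc:
        # srun itself is missing on this machine.
        return ImportResult(False, f"Command: {printable}. Error: {exc}")
    if proc.returncode != 0:
        return ImportResult(
            False,
            f"Failed to import Docker image. Command: {printable}. "
            f"Error: {proc.stderr}",
        )
    return ImportResult(True, f"Docker image imported. Command: {printable}.")
```

Including the full command in the message lets a user re-run it by hand and see whether the problem is a bad URL, a missing token, or the cluster itself.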

Test Plan

  1. CI
  2. Manual run on EOS with caching enabled.

In both runs below the failure is expected, because the test TOML has an invalid URL (docker://DOCKER_IMAGE).

First run:

cloudai install --system-config eos.toml --tests-dir conf/common/test/
[INFO] System Name: EOS
[INFO] Scheduler: slurm
[INFO] Not all components are ready
[INFO] Going to install 5 item(s)
[INFO] 1/5 Installation of PythonExecutable(git_url=https://github.com/NVIDIA/NeMo-Framework-Launcher.git, commit_hash=599ecfcbbd64fd2de02f2cc093b1610d73854022): OK
[ERROR] 2/5 Installation of DockerImage(url=DOCKER_IMAGE): Failed to import Docker image DOCKER_IMAGE. Command: srun --export=ALL --partition=batch --account=... enroot import -o /cloudai-install/__DOCKER_IMAGE__notag.sqsh docker://DOCKER_IMAGE. Error: srun: WARNING: Please set a name for this job, formatted like this:
srun: 	...-<subproject>.<details>
srun: job 1760479 queued and waiting for resources
srun: job 1760479 has been allocated resources
srun: error: eos0549: task 0: Exited with exit code 1
srun: Terminating StepId=1760479.0
[ERROR] Invalid image reference: docker://DOCKER_IMAGE

[INFO] 3/5 Installation of DockerImage(url=nvcr.io/nvidia/pytorch:24.02-py3): Docker image cached successfully at /cloudai-install/nvcr.io_nvidia__pytorch__24.02-py3.sqsh.
[INFO] 4/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.09): Docker image cached successfully at /cloudai-install/nvcr.io_nvidia__nemo__24.09.sqsh.
[INFO] 5/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.05.01): Docker image cached successfully at /cloudai-install/nvcr.io_nvidia__nemo__24.05.01.sqsh.
[ERROR] 1 item(s) failed to install.

Second run:

cloudai install --system-config eos.toml --tests-dir conf/common/test/
[INFO] System Name: EOS
[INFO] Scheduler: slurm
[INFO] Not all components are ready
[INFO] Going to install 5 item(s)
[WARNING] Git repository already exists at /cloudai-install/NeMo-Framework-Launcher__599ecfcbbd64fd2de02f2cc093b1610d73854022.
[WARNING] Virtual environment already exists at /cloudai-install/NeMo-Framework-Launcher__599ecfcbbd64fd2de02f2cc093b1610d73854022-venv.
[INFO] 1/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.09): Cached Docker image already exists at /cloudai-install/nvcr.io_nvidia__nemo__24.09.sqsh.
[INFO] 2/5 Installation of DockerImage(url=nvcr.io/nvidia/pytorch:24.02-py3): Cached Docker image already exists at /cloudai-install/nvcr.io_nvidia__pytorch__24.02-py3.sqsh.
[INFO] 3/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.05.01): Cached Docker image already exists at /cloudai-install/nvcr.io_nvidia__nemo__24.05.01.sqsh.
[INFO] 4/5 Installation of PythonExecutable(git_url=https://github.com/NVIDIA/NeMo-Framework-Launcher.git, commit_hash=599ecfcbbd64fd2de02f2cc093b1610d73854022): OK
[ERROR] 5/5 Installation of DockerImage(url=DOCKER_IMAGE): Failed to import Docker image DOCKER_IMAGE. Command: srun --export=ALL --partition=batch --account=... enroot import -o /cloudai-install/__DOCKER_IMAGE__notag.sqsh docker://DOCKER_IMAGE. Error: srun: WARNING: Please set a name for this job, formatted like this:
srun: 	...-<subproject>.<details>
srun: job 1760594 queued and waiting for resources
srun: job 1760594 has been allocated resources
[ERROR] Invalid image reference: docker://DOCKER_IMAGE
srun: error: eos0022: task 0: Exited with exit code 1
srun: Terminating StepId=1760594.0

[ERROR] 1 item(s) failed to install.
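The difference between the two runs (images imported on the first run, reported as already cached on the second) can be sketched as a simple existence check on the cache file. This is an illustrative assumption about the behavior seen in the logs, not the project's actual code; ensure_cached and do_import are hypothetical names.

```python
from pathlib import Path
from typing import Callable


def ensure_cached(cache_file: Path, do_import: Callable[[Path], None]) -> str:
    """Import an image only if its cache file is not present yet.

    'do_import' stands in for whatever actually produces the .sqsh file
    (for example an 'srun ... enroot import' call) and is expected to
    raise on failure.
    """
    if cache_file.is_file():
        return f"Cached Docker image already exists at {cache_file}."
    do_import(cache_file)
    return f"Docker image cached successfully at {cache_file}."
```

Under this scheme a failed import leaves no cache file behind, so the failing item is retried on every run, as with DOCKER_IMAGE above.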

Additional Notes

Some Slurm setups do not allow running enroot from the head node. Let's
rely on the actual 'enroot import' run via srun and report its real error
message to the user.
@TaekyungHeo TaekyungHeo added the bug Something isn't working label Jan 6, 2025
Member

@TaekyungHeo TaekyungHeo left a comment

  • It looks good. However, I have one concern: what if the system does not have enough nodes to run srun enroot? In general, all EOS nodes are busy. If that happens, any URL checks could be blocked indefinitely.
  • I see two types of changes: bug fixes and refactoring. We might need to spin off the refactoring changes as a separate PR. This step is optional and entirely up to you.

Contributor

@srivatsankrishnan srivatsankrishnan left a comment

On systems like IL1, checking on the head node is useful, since most of the time all the nodes are reserved and enroot is available on the head node. In CW, we do have enroot on the head node (on the data copier and vscode nodes). On EOS it does not seem to be available.

Is this going to be a policy decision that we will always use a resource allocation when installing CloudAI? Also, as Taekyung mentioned, getting an allocation on busy clusters could be an issue. I don't have an issue with this PR if it's a policy decision. If users want to install CloudAI, they had better have a valid allocation on these clusters in the first place, and that would be a requirement.

@amaslenn
Contributor Author

amaslenn commented Jan 7, 2025

Thanks @TaekyungHeo and @srivatsankrishnan!

This PR removes the URL accessibility check completely; we now rely fully on the result and output of enroot import .... I find its output more understandable and actionable, as we also print the enroot command. In many cases token issues were reported as CloudAI issues; now it should be clear to users where the problem is. For example, here is the output from a run with the access token removed:

[ERROR] 4/5 Installation of DockerImage(url=nvcr.io/nvidia/nemo:24.09): Failed to import Docker image nvcr.io/nvidia/nemo:24.09. Command: srun --export=ALL --partition=batch --account=... enroot import -o /cloudai-install/nvcr.io_nvidia__nemo__24.09.sqsh docker://nvcr.io/nvidia/nemo:24.09. Error: srun: WARNING: Please set a name for this job, formatted like this:
srun: 	...-<subproject>.<details>
srun: job 1763168 queued and waiting for resources
srun: job 1763168 has been allocated resources
[INFO] Querying registry for permission grant
[INFO] Authenticating with user: <anonymous>
[INFO] Authentication succeeded
[INFO] Fetching image manifest list
[INFO] Fetching image manifest
[ERROR] URL https://registry-1.docker.io/v2/nvcr.io/nvidia/nemo/manifests/24.09 returned error code: 401 Unauthorized
srun: error: eos0469: task 0: Exited with exit code 1
srun: Terminating StepId=1763168.0
  1. Busy nodes concern. Installation uses srun anyway, so in this sense the behavior is unchanged. Some environments could indeed be a problem, which is why caching can be disabled. I've tested this on EOS and was able to run 4 installations in <2 hours.
  2. Bugfix & refactoring. I'm not sure which is which here: the removed function left an unused arg, and testing on EOS required passing the account, which in turn required passing the whole SlurmSystem. I've tried not to change anything unrelated.
  3. IL1 question. I can't run enroot on the IL1 head node anymore, can you? I believe there was a change in the system, which is why the mentioned bug was opened and why other users are reporting similar problems.

@amaslenn
Contributor Author

amaslenn commented Jan 7, 2025

Verified on IL1, works as expected.

@amaslenn amaslenn merged commit 1bfdef8 into main Jan 7, 2025
2 checks passed
@amaslenn amaslenn deleted the am/no-access-check branch January 7, 2025 16:05
Labels
bug Something isn't working
3 participants