Bug fix for invalid job id for many parallel CloudAI jobs #314
Conversation
- `SlurmInstaller` has `PREREQUISITES`; let's add `sacct` there too (a sketch follows below).
- Please add unit tests that reproduce the issue before this change, to ensure we won't repeat this problem.
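A minimal sketch of the first suggestion, assuming `PREREQUISITES` is a collection of required command names checked by `SlurmInstaller`; the other entries shown are illustrative, not the project's actual list:

```python
class SlurmInstaller:
    # Commands that must be available on the system before running.
    # Adding "sacct" ensures the new completion check can execute.
    PREREQUISITES = ("sbatch", "squeue", "sacct")
```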
Fixed 1. Regarding 2, this is purely system behavior, and I am not sure how we can capture it in a unit test; these commands require a real Slurm system. If you have better ideas on capturing it, please follow the Slack thread and feel free to contribute a unit test covering this behavior in a separate PR. I think that would be a good follow-up PR and should not block the customer request. This PR has been tested with the verification team's NCCL test on an internal cluster, and it has also been stress tested by simultaneously launching 111 jobs on a different production cluster. Without this PR, the original design choice CloudAI used for checking job completion would fail for this new customer setup. Given that this has been solidly tested on production systems (see the test plan), we should approve and merge this.
We have an agreement to add unit tests for all new features and fixes. Let's stick to this agreement. Here is how I would approach testing:

```python
import pytest
from unittest.mock import Mock

# SlurmJob and SlurmSystem come from the project; slurm_system is a fixture.

@pytest.mark.parametrize("stdout,is_running", [("RUNNING", True), ("PENDING", True), ("COMPLETED", False)])
def test_is_job_running(stdout: str, is_running: bool, slurm_system: SlurmSystem):
    job = SlurmJob(Mock(), 1)
    pp = Mock()
    pp.communicate = Mock(return_value=(stdout, ""))  # fake the command's (stdout, stderr)
    slurm_system.cmd_shell.execute = Mock(return_value=pp)
    assert slurm_system.is_job_running(job) is is_running
```

A similar approach can be applied for testing with `sacct`.
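A hypothetical companion test in the same style, assuming the fix exposes something like an `is_job_completed` method whose underlying command reports job state (the method name and parametrized states are assumptions, not the PR's actual interface):

```python
@pytest.mark.parametrize("stdout,is_completed", [("RUNNING", False), ("PENDING", False), ("COMPLETED", True)])
def test_is_job_completed(stdout: str, is_completed: bool, slurm_system: SlurmSystem):
    job = SlurmJob(Mock(), 1)
    pp = Mock()
    pp.communicate = Mock(return_value=(stdout, ""))  # fake sacct-style stdout
    slurm_system.cmd_shell.execute = Mock(return_value=pp)
    assert slurm_system.is_job_completed(job) is is_completed
```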
You are right, and I am trying to stick to the agreement here. Authors adding a new feature should also extend the unit tests for that feature. However, this is not a new feature. This is a bug fix: under high job-submission load, CloudAI breaks due to a design choice we made ~7 months ago. This PR addresses and fixes that bug; it does not change the interface. To cover these system-related issues, I have extensively tested on two clusters (including stress testing based on the customer requirement). The original design discussion also explains why adding more unit tests for this system class would not extend coverage, and this bug further validates that. We agreed on and approved this PR (including the comments on the CI test plan).

The unit test you are proposing will not provide coverage of this behavior either. Let me explain the behavior, and maybe you can see whether unit testing can support or fake it: merely capturing stdout and faking the outputs will not cover this or future system-related bugs. We would need to launch 100+ fake processes, have one master process (the cloudai executable), and capture its interactions and feedback; a toy illustration follows below. If you think the unit test infrastructure we have today can simulate this runtime behavior, please see this as an opportunity to solidify it. But that shouldn't block this PR and should be a separate PR, imo.
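To make the race concrete, here is a toy simulation (an illustration only, not CloudAI code): jobs vanish from a queue-style registry on completion but persist in an accounting-style registry, so a master process polling the queue can miss jobs that finished between submission and the poll. All names and timings below are made up:

```python
import multiprocessing as mp
import random
import time

NUM_JOBS = 111  # mirrors the 111-job stress test in the test plan

def fake_job(job_id: int, queue, accounting) -> None:
    """Run for a short random time, then complete."""
    time.sleep(random.uniform(0.0, 0.2))
    accounting[job_id] = "COMPLETED"  # accounting record persists (sacct-like)
    del queue[job_id]                 # job disappears from the queue (squeue-like)

if __name__ == "__main__":
    manager = mp.Manager()
    queue, accounting = manager.dict(), manager.dict()
    workers = []
    for job_id in range(NUM_JOBS):
        queue[job_id] = "RUNNING"
        p = mp.Process(target=fake_job, args=(job_id, queue, accounting))
        p.start()
        workers.append(p)

    # The master process polls each job id, as the runner would.
    vanished = 0
    for job_id in range(NUM_JOBS):
        time.sleep(0.005)
        if job_id not in queue:  # queue-style lookup would raise "invalid job id"
            vanished += 1
            assert accounting[job_id] == "COMPLETED"  # accounting lookup still answers
    for p in workers:
        p.join()
    print(f"{vanished}/{NUM_JOBS} queue lookups raced with job completion")
```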
Force-pushed from 5cb4c51 to 2bbfc15.
Summary
The current CloudAI Slurm runner fails when many parallel jobs are submitted. The current strategy is to query `squeue` with the job id. However, when many CloudAI jobs are submitted in parallel, some of them may complete by the time CloudAI queries their status, resulting in an invalid job id error. This PR fixes the issue by using an alternative to `squeue` to determine job completion status.
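Based on the review discussion (the request to add `sacct` to `SlurmInstaller`'s `PREREQUISITES`), the alternative appears to be `sacct`, which reads the Slurm accounting database and therefore still answers for jobs that have already left the queue. A minimal sketch of such a check follows; this is not the PR's actual code, and the function name and parsing are assumptions:

```python
import subprocess

def is_job_completed(job_id: int) -> bool:
    # sacct queries Slurm accounting, so it can report jobs that have
    # already left the queue; -X restricts output to the allocation itself.
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "--format=State", "--noheader", "-X"],
        capture_output=True, text=True, check=True,
    )
    fields = result.stdout.strip().split()
    return bool(fields) and fields[0].startswith("COMPLETED")
```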
Test Plan
- CI/CD
- Run on a real system (job completion status works and the runner moves on to the next job).
- Stress test on another internal cluster with 111 simultaneous job submissions using CloudAI.
Additional Notes
Context: Discussion thread.