[WIP] Update Jobframework's IsActive() for RayJob #3949
To allow for workload creation in MultiKueue clusters.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by: mszadkow.
The full list of commands accepted by this bot can be found here.
✅ Deploy Preview for kubernetes-sigs-kueue ready!
/cc @andrewsykim @mimowo
Could you clarify what the user-facing consequences of the bug are? Is it reliably reproducible on a running system, or is it a rare race condition?
Can you link to where this is used specifically, so that we can better understand the e2e consequences of the issue?
Also, would this issue be reproducible and testable with integration tests for the RayJob, like those being added in #3892?
Ray-operator is the one to change the JobDeploymentStatus from [...]
We need some solution for that problem.
I'm still not sure what makes it MK-specific. Can this happen if a regular RayJob is suspended, for example, due to preemption or setting the CQ to inactive? Also, in integration tests we could simulate what the Ray-operator would do.
Thank you for the summary @mszadkow - this clarifies a lot. Since this is MK-specific, I would prefer to combine the change with the main PR again, as was initially done.
This makes me wonder about the remaining question: why didn't we have an issue with IsActive for other Job CRDs before supporting managedBy in MultiKueue, for example, with Kubeflow Jobs?
What type of PR is this?
/kind bug
What this PR does / why we need it:
The default value for the JobDeploymentStatus is "", which equals JobDeploymentStatusNew.
Ray-operator is the one to change the JobDeploymentStatus from "New" to "Initialising".
When the RayJob is suspended, it sets the JobDeploymentStatus to "Suspending" and then to "Suspended".
However, in a MultiKueue setup we lack ray-operator in the manager cluster.
As a result, the JobDeploymentStatus prevents workloads from being created in the manager cluster and copied to worker clusters, because it remains "New" and is never updated to "Suspended".
Which issue(s) this PR fixes:
Relates #3822
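The status behaviour described under "What this PR does" can be illustrated with a small, self-contained Go sketch. It does not reproduce the actual diff of this PR or import KubeRay or Kueue: the type, the constants, and the two helper functions below are local stand-ins invented for illustration. It assumes the existing check treats every status other than "Suspended" as active, and shows one possible relaxation that also treats the default "" (New) status as inactive, which is the state a RayJob stays in when no ray-operator runs in the manager cluster.

```go
package main

import "fmt"

// JobDeploymentStatus is a local stand-in for the RayJob status field
// discussed above; it is NOT the KubeRay type.
type JobDeploymentStatus string

const (
	// StatusNew is the default (empty) value, set before any operator acts.
	StatusNew JobDeploymentStatus = ""
	// StatusSuspending and StatusSuspended are set by ray-operator on suspend.
	StatusSuspending JobDeploymentStatus = "Suspending"
	StatusSuspended  JobDeploymentStatus = "Suspended"
)

// isActiveStrict models the assumed current check: anything that is not
// Suspended counts as active, including the default New status.
func isActiveStrict(s JobDeploymentStatus) bool {
	return s != StatusSuspended
}

// isActiveRelaxed is a hypothetical variant: New is also treated as
// inactive, so a job that no operator has touched is not considered running.
func isActiveRelaxed(s JobDeploymentStatus) bool {
	return s != StatusSuspended && s != StatusNew
}

func main() {
	statuses := []JobDeploymentStatus{StatusNew, StatusSuspending, StatusSuspended}
	for _, s := range statuses {
		fmt.Printf("status=%q strict=%t relaxed=%t\n",
			s, isActiveStrict(s), isActiveRelaxed(s))
	}
}
```

With the strict check, a RayJob stuck in the default status is reported as active indefinitely in a cluster without ray-operator; the relaxed check reports it as inactive. Whether treating New as inactive is safe for non-MultiKueue clusters is the open question raised in the conversation above.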
Special notes for your reviewer:
Does this PR introduce a user-facing change?