[Feature] Kuberay RayJob MultiKueue adapter #3892
Conversation
Skipping CI for Draft Pull Request.
/ok-to-test
✅ Deploy Preview for kubernetes-sigs-kueue ready!
```diff
@@ -70,6 +70,7 @@ IMAGE_TAG ?= $(IMAGE_REPO):$(GIT_TAG)
 JOBSET_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" sigs.k8s.io/jobset)
 KUBEFLOW_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" github.com/kubeflow/training-operator)
 KUBEFLOW_MPI_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" github.com/kubeflow/mpi-operator)
+KUBERAY_VERSION = $(shell $(GO_CMD) list -m -f "{{.Version}}" github.com/ray-project/kuberay/ray-operator)
```
So far it's the version without managedBy.
We don't have a ray-operator image that supports managedBy, that would have to be a custom one...
latest -> 1.2.2, which is the last release
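For context on where that value comes from: the new variable follows the same pattern as the JobSet and Kubeflow ones in the hunk above, asking the Go toolchain for the module version pinned in go.mod. A minimal sketch of what the shell substitution evaluates to (the printed version is illustrative, matching the 1.2.2 release mentioned above):

```sh
# Run from the kueue repo root; prints the kuberay ray-operator version
# pinned in go.mod, which is what $(KUBERAY_VERSION) expands to.
go list -m -f '{{.Version}}' github.com/ray-project/kuberay/ray-operator
# v1.2.2
```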
```sh
export KUBERAY_MANIFEST="${ROOT_DIR}/dep-crds/ray-operator/default/"
export KUBERAY_IMAGE=bitnami/kuberay-operator:${KUBERAY_VERSION/#v}
export KUBERAY_RAY_IMAGE=rayproject/ray:2.9.0
export KUBERAY_RAY_IMAGE_ARM=rayproject/ray:2.9.0-aarch64
```
This one is for those of us working on macOS; it's vital for development, not so much for prod. I think it should stay.
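One detail of the exports above worth spelling out: `${KUBERAY_VERSION/#v}` is bash pattern substitution that removes a leading `v`, since the Go module version is tagged like `v1.2.2` while the bitnami image tags omit the prefix. A small sketch with an illustrative value:

```sh
# ${VAR/#pattern} deletes "pattern" only when it matches at the start of $VAR.
KUBERAY_VERSION=v1.2.2
echo "bitnami/kuberay-operator:${KUBERAY_VERSION/#v}"
# prints: bitnami/kuberay-operator:1.2.2
```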
/retest
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mszadkow
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Although the PR for Kuberay RayJob was ready for review before New Year, I couldn't finish working on it due to a peculiar issue.
Namely, envtest failed on setup with a couple of different yet related errors:
Those errors were distributed over all cluster routines (as clusters run in separate goroutines). During the investigation with Mykhailo we were able to boil the issue down to the simple addition of the "ray-operator" CRDs to the suite. We also checked:
Finally, what worked was to reduce INTEGRATION_NPROCS. Why that helps is still not understood, but we assume the envtest apiserver got overwhelmed and the test-infra possibly requires a resource update. We don't have any proof that this will work, but some people reportedly saw similar issues (not the exact ones):
Did any of you encounter a similar issue in the past?
Given that you had to reduce INTEGRATION_NPROCS I suspect the cluster running the integration tests is resource constrained. However, I'm wondering if INTEGRATION_API_LOG_LEVEL=1 or --output-interceptor-mode=none could be making a difference here?
To check if the tests are stable:
Those were added later, to observe what was happening (after the behaviour started), but good point, I will remove them.
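For anyone reproducing this: `--output-interceptor-mode=none` is a standard Ginkgo v2 CLI flag that disables stdout/stderr capture in parallel runs, and `INTEGRATION_API_LOG_LEVEL` is the Makefile knob referenced above. A hedged sketch of how the two settings might be applied when rerunning the suite (treat the exact make-target wiring as an assumption):

```sh
# Assumption: Kueue's test-integration target reads these variables and
# forwards flags to Ginkgo; adjust names to the actual Makefile if they differ.
INTEGRATION_API_LOG_LEVEL=1 make test-integration
# or pass the interceptor flag straight to Ginkgo for a single suite:
ginkgo --output-interceptor-mode=none ./test/integration/...
```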
/test pull-kueue-test-integration-main
```diff
@@ -37,12 +37,12 @@ ENVTEST_K8S_VERSION ?= 1.31
 # Number of processes to use during integration tests to run specs within a
 # suite in parallel. Suites still run sequentially. User may set this value to 1
 # to run without parallelism.
-INTEGRATION_NPROCS ?= 4
+INTEGRATION_NPROCS ?= 3
```
Can you try to revert it and post a link to the failure log?
Actually, could you maybe split the PR into smaller parts:
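To make the knob under discussion concrete: per the comment in the hunk above, `INTEGRATION_NPROCS` controls how many parallel processes Ginkgo uses for the specs within a suite, and with envtest that typically means several apiservers running at once, which is why lowering it reduces peak load. A sketch, assuming the variable feeds Ginkgo's `--procs` flag (the test path is illustrative):

```sh
# One-off override of the Makefile default:
INTEGRATION_NPROCS=3 make test-integration
# roughly what a single suite invocation reduces to:
ginkgo --procs=3 ./test/integration/controller/jobs/rayjob/...
```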
Fix path for Kuberay CRDs
…d that overwhelms ApiServer.
I guess it's no longer valid, as there is no fix, rather an extension for MK.
What type of PR is this?
/kind feature
What this PR does / why we need it:
We want to be able to run Kuberay RayJob workloads in MultiKueue.
RayCluster will come as a separate PR.
Which issue(s) this PR fixes:
Relates to #3822
Special notes for your reviewer:
Does this PR introduce a user-facing change?