MVP support / extension of support for serving workloads #2717
Comments
/assign @trasc
/cc @liurupeng @ahg-g
For LWS, would including a suspend field be a better forward-thinking strategy?
For now, completely "suspending" a serving workload isn't a use case we hear about. The preference is to reduce capacity by preempting individual pods, so that stopping a serving workload entirely is the last-resort option. That said, it is hard to say "never" in the long run, but I would keep it out of scope for this enhancement.
Sounds good. I guess in the LWS case preemption would target the entire leader-worker group? Or just some workers?
For now, the entire group. |
/assign |
It looks like this covers both LWS and StatefulSet.
@tenzen-y: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
+1 on behalf of LWS. Evicting the entire group (leader pod + worker StatefulSet) is the right path, because they work as a unit; all you need to do is reduce the replicas down to the resource boundary. Some other feedback: as the maintainer of llmaz, another inference platform, what we need most is accelerator fungibility, i.e. the same model can be served by several different kinds of GPUs for the sake of cost and performance. I think Kueue can help in some ways; this is actually part of our integration roadmap. We will implement the capability ourselves as well, but since our customers are also using Kueue, it could act as a centralized control plane.
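To make the accelerator-fungibility point concrete, below is a rough sketch of how two interchangeable GPU flavors could be expressed against Kueue's v1beta1 Go API; the flavor names, quota values, and object names are made up, and a real setup would also need the matching ResourceFlavor objects and a LocalQueue pointing at this ClusterQueue.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

func main() {
	cq := &kueue.ClusterQueue{
		ObjectMeta: metav1.ObjectMeta{Name: "inference-cq"}, // hypothetical name
		Spec: kueue.ClusterQueueSpec{
			ResourceGroups: []kueue.ResourceGroup{{
				// One resource group covering GPUs, with two flavors; Kueue can
				// admit a workload under whichever flavor still has free quota.
				CoveredResources: []corev1.ResourceName{"nvidia.com/gpu"},
				Flavors: []kueue.FlavorQuotas{
					{
						Name: "gpu-a100", // assumes a ResourceFlavor "gpu-a100" exists
						Resources: []kueue.ResourceQuota{{
							Name:         "nvidia.com/gpu",
							NominalQuota: resource.MustParse("8"),
						}},
					},
					{
						Name: "gpu-l4", // assumes a ResourceFlavor "gpu-l4" exists
						Resources: []kueue.ResourceQuota{{
							Name:         "nvidia.com/gpu",
							NominalQuota: resource.MustParse("16"),
						}},
					},
				},
			}},
		},
	}
	fmt.Printf("ClusterQueue %q exposes %d GPU flavors\n",
		cq.Name, len(cq.Spec.ResourceGroups[0].Flavors))
}
```

With a layout like this, a serving workload that fits on either GPU type can land on whichever flavor has headroom, which is the centralized-control-plane angle mentioned above.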
@mimowo Couldn't we split LWS into a separate issue, as we mentioned in the next-release issue?
Sure, we can. Would you like to do so? Otherwise I can split it tomorrow.
I'm not in a hurry. So, I'm ok with tomorrow. |
Done: #3232. PTAL |
/reopen |
@mimowo: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What would you like to be added:
I would like to make sure we have basic support for running serving workloads for the use case of running AI inference.
In particular, I would like support for Deployments, StatefulSets, and LeaderWorkerSets.
In the MVP, the integrations are based on single plain Pods (for Deployments) or Pod groups (for StatefulSets).
This is a follow-up to "Document how to use Kueue for Deployments" #2677.
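As a rough illustration of the plain-pod path (a sketch, not the design this issue will settle on), the Go snippet below builds a Deployment whose pod template carries the kueue.x-k8s.io/queue-name label, so each replica would be gated and admitted as its own Workload; the names, namespace, image, and queue are placeholders.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	replicas := int32(2)
	podLabels := map[string]string{
		"app": "model-server",
		// Label consumed by Kueue's plain-pod integration; each replica is
		// admitted through the (hypothetical) "inference-queue" LocalQueue.
		"kueue.x-k8s.io/queue-name": "inference-queue",
	}
	dep := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "model-server", Namespace: "serving"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "model-server"},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: podLabels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "server",
						Image: "registry.example.com/model-server:latest", // placeholder
					}},
				},
			},
		},
	}
	fmt.Printf("Deployment %s/%s: %d replicas queued via %q\n",
		dep.Namespace, dep.Name, *dep.Spec.Replicas,
		dep.Spec.Template.Labels["kueue.x-k8s.io/queue-name"])
}
```

This lines up with the completion note below: the Deployment side should only need labels (and possibly a few annotations), not new APIs.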
What is needed:
In the longer run, to support scaling of StatefulSets, we may need #77, but that is out of scope for this issue.
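To show why scaling is the awkward part, here is a hedged sketch of the pod-template metadata a StatefulSet might carry under the pod-group variant of the plain-pod integration; the label/annotation keys reflect my reading of that integration, and the queue name, group name, and count are placeholders. Because the group size is pinned by an annotation, changing .spec.replicas alone is not enough, hence the pointer to #77.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podGroupTemplateMeta builds pod-template metadata for a StatefulSet whose
// replicas should be admitted together as a single Workload. The size must
// match the StatefulSet's .spec.replicas, which is what makes scaling hard.
func podGroupTemplateMeta(queue, group string, size int32) metav1.ObjectMeta {
	return metav1.ObjectMeta{
		Labels: map[string]string{
			"kueue.x-k8s.io/queue-name":     queue, // LocalQueue to admit through
			"kueue.x-k8s.io/pod-group-name": group, // all replicas share one Workload
		},
		Annotations: map[string]string{
			// Fixed group size expected by the pod-group integration.
			"kueue.x-k8s.io/pod-group-total-count": fmt.Sprint(size),
		},
	}
}

func main() {
	fmt.Printf("%+v\n", podGroupTemplateMeta("inference-queue", "model-shards", 4))
}
```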
Why is this needed:
To support use cases where AI training and inference run in the same clusters and access to GPUs is constrained by Kueue.
Completion requirements:
The API changes required are minimal (potentially just new labels / annotations), so I believe a new KEP is not required, but we need proper documentation.
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.