MVP support for serving workloads running as LeaderWorkerSet #3232

Open
Tracked by #3192 ...
mimowo opened this issue Oct 15, 2024 · 4 comments · May be fixed by #4023
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@mimowo
Contributor

mimowo commented Oct 15, 2024

What would you like to be added:

MVP support for LeaderWorkerSet in Kueue. It does not need to be ideal, but we want to have some support to unblock users and collect users' feedback.

The idea is to base the support on StatefulSets, so the integration would also use Pod Groups, similarly to regular StatefulSets. Each LeaderWorkerGroup creates a new Pod Group. In a single pod group we will have:

  • Leader pod, controlled by the Leader’s STS
  • Worker pods, controlled by a unique, dedicated STS

The size of the group will be taken from LeaderWorkerSet.Spec.LeaderWorkerTemplate.Size and increased by 1 (to include the leader).
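
To make the arithmetic concrete, here is a minimal sketch of the grouping described above. It only mirrors the wording of this issue and is not Kueue code: `PodGroup`, `pod_group_size`, `pod_groups_for`, and the group naming scheme are all hypothetical. It also assumes, as the description states, that `Size` does not already count the leader, which should be checked against the LeaderWorkerSet API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PodGroup:
    """Hypothetical stand-in for a pod group; not a real Kueue type."""
    name: str
    total_count: int  # leader pod plus worker pods


def pod_group_size(leader_worker_template_size: int) -> int:
    # As proposed in this issue: take LeaderWorkerTemplate.Size and add 1
    # for the leader. Assumes Size does not already include the leader.
    return leader_worker_template_size + 1


def pod_groups_for(lws_name: str, replicas: int, size: int) -> List[PodGroup]:
    # One pod group per leader-worker group (replica).
    # The "<name>-<index>" naming is illustrative only.
    return [
        PodGroup(name=f"{lws_name}-{i}", total_count=pod_group_size(size))
        for i in range(replicas)
    ]
```

For example, `pod_groups_for("my-lws", 2, 3)` would describe two pod groups, each expecting four pods under this assumption.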

This is a follow up to #2717.

Why is this needed:

We want to support serving primitives in Kueue, as there is increasing demand from users who run clusters mixing AI training and inference and want to manage expensive GPU resources.

LeaderWorkerSet is a new serving API which is gaining popularity as a primitive to host AI/ML inference.

@mimowo mimowo added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 15, 2024
@mimowo
Contributor Author

mimowo commented Oct 15, 2024

/assign @vladikkuzn

@mimowo
Contributor Author

mimowo commented Oct 15, 2024

/cc @mwielgus @tenzen-y

@mimowo
Contributor Author

mimowo commented Dec 2, 2024

I synced with @mbobrovskyi and @vladikkuzn on the feature, and it seems complex, so I propose writing a KEP for it and going through the Alpha phase so that we can easily update the implementation in the future.

@mimowo
Contributor Author

mimowo commented Jan 16, 2025

The follow-ups already identified as needed after #3515:
