Feat: affinity assignment with TAS #3941
base: main
Conversation
Signed-off-by: kerthcet <[email protected]>
Signed-off-by: kerthcet <[email protected]>
Skipping CI for Draft Pull Request.
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: kerthcet. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test all
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Force-pushed from 38a3700 to 4b49b46
/test all
@kerthcet: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This can work with:
- nodeLabel: "topology-key/rdma"
- nodeLabel: "topology-key/supernode"

However, it doesn't work with:
- nodeLabel: "topology-key/rdma"
- nodeLabel: "topology-key/supernode"
- nodeLabel: "kubernetes.io/hostname"

This will lead to an infinite loop.
This is because we just reset the states of the prefilledDomains and their parents, but didn't subtract the children's states.
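For illustration, here is a minimal sketch of the issue described above, using hypothetical types rather than the actual Kueue structs: when a prefilled domain's state is cleared only upwards, the children keep stale counts, so a reset that also propagates downwards would be needed.

```go
package sketch

// domain is a hypothetical stand-in for the TAS domain type.
type domain struct {
	id       string
	state    int32 // pods currently counted against this domain
	parent   *domain
	children []*domain
}

// resetUp clears the prefilled state of d and all of its ancestors,
// which is roughly what the current code does.
func resetUp(d *domain) {
	for cur := d; cur != nil; cur = cur.parent {
		cur.state = 0
	}
}

// resetDown additionally subtracts the children's states so every level
// of the tree agrees again; without this step the hostname level keeps
// its old counts and the fit loop never converges.
func resetDown(d *domain) {
	d.state = 0
	for _, child := range d.children {
		resetDown(child)
	}
}
```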
Signed-off-by: kerthcet <[email protected]>
Force-pushed from 4b49b46 to ba85310
Hi, we explored the MVP.

TL;DR: if we have n podsets for a single job, we make sure the [n+1] podset tries its best to colocate with the [n] podset in the same topology domain or its parent domains. Take the PyTorchJob for example: worker[0] will be colocated with the master. Basically, we just record the assigned domains as prefilledDomains.

However, there is a downside: because we don't have an overview of the total counts for the different parts of the job (master and worker for PyTorchJob), we have to allocate the master first. That means the quota might be enough for the master podsets but not the worker ones, and then the worker podsets may cross domains. This is not ideal, but I think the problem still exists with today's design, and it is hard to solve unless we schedule the job as a whole, rather than podsets separately. Any thoughts?
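A minimal sketch of how the recorded domains could bias the sorting, with hypothetical names rather than the actual Kueue types:

```go
package sketch

import "slices"

// domain is a hypothetical stand-in for the TAS domain type.
type domain struct {
	id     string
	parent *domain
	free   int32 // remaining capacity in this domain
}

// prefilledIDs records the domains assigned to the previous podset,
// together with all of their ancestors.
func prefilledIDs(assigned []*domain) map[string]bool {
	ids := map[string]bool{}
	for _, d := range assigned {
		for cur := d; cur != nil; cur = cur.parent {
			ids[cur.id] = true
		}
	}
	return ids
}

// sortWithAffinity puts domains already used by the previous podset first,
// and falls back to the existing "largest free capacity first" ordering.
func sortWithAffinity(candidates []*domain, prefilled map[string]bool) {
	slices.SortFunc(candidates, func(a, b *domain) int {
		aHit, bHit := prefilled[a.id], prefilled[b.id]
		switch {
		case aHit && !bHit:
			return -1
		case bHit && !aHit:
			return 1
		default:
			return int(b.free) - int(a.free)
		}
	})
}
```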
		loopDomain = loopDomain.parent
		parentDomainIDs = append(parentDomainIDs, loopDomain.id)
	}
}
It seems like a lot of this logic is similar to the one in the findLevelWithFitDomains function. Maybe we could commonize something?
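As an illustration of what commonizing could look like (a hypothetical helper, not the existing code), both call sites could share a small ancestor-walking function:

```go
package sketch

// domain mirrors only the fields used by the loop above.
type domain struct {
	id     string
	parent *domain
}

// ancestorIDs walks up from d and collects the IDs of its ancestors,
// bottom-up, which is what the loop above and (presumably) the one in
// findLevelWithFitDomains both do.
func ancestorIDs(d *domain) []string {
	var ids []string
	for cur := d.parent; cur != nil; cur = cur.parent {
		ids = append(ids, cur.id)
	}
	return ids
}
```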
Yeah, we can commonize them later; I just want to hear some high-level suggestions about the design.
We already applied it this way with a newly built image to avoid such unexpected behavior.
Right, I'm not yet convinced about the idea. The list of concerns I have:
- there is no mechanism for users to opt-out and I think it might be constraining for users where there is not much traffic between the master and worker pods. This is probably acceptable if we make sure the algorithm does not mean regression to some users,
- I would like to make sure there is no workload which is unschedulable but could be scheduled before. For example, a workload with two PodSets using "required" could be scheduled before by choosing distant domains, IIUC now the workload will fail, because the "affinity" will play a bigger role
- I think the ideal algorithm should not just prioritize the same vs other domain, but count the "distance" in the tree between a pair of domains, and score the domains according to the affinity. There would be two scores then - scoring large domains as currently, and scoring for domains close to the selected ones. The final sorting score would be some kind of mixture between them.
For now, (2.) is the main concern.
Regarding this proposal vs. full-blown (3.): I will yet need to think about it and consult with others. Hopefully we can find some sweet spot with a good gain / effort ratio, but I'm not sure the current proposal is there.
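For reference, a rough sketch of what (3.) could look like, with hypothetical names: the tree distance between a candidate and an already-selected domain becomes a proximity score, mixed with the existing capacity score.

```go
package sketch

// domain is a hypothetical stand-in for the TAS domain type.
type domain struct {
	id     string
	parent *domain
	free   int32
}

// depth returns how many ancestors d has.
func depth(d *domain) int {
	n := 0
	for cur := d.parent; cur != nil; cur = cur.parent {
		n++
	}
	return n
}

// treeDistance counts the edges between a and b via their lowest common
// ancestor; 0 means the same domain.
func treeDistance(a, b *domain) int {
	da, db := depth(a), depth(b)
	dist := 0
	for da > db {
		a, da, dist = a.parent, da-1, dist+1
	}
	for db > da {
		b, db, dist = b.parent, db-1, dist+1
	}
	for a != b {
		a, b, dist = a.parent, b.parent, dist+2
	}
	return dist
}

// score mixes the two signals; affinityWeight would be a tuning knob.
func score(candidate, selected *domain, affinityWeight float64) float64 {
	capacityScore := float64(candidate.free)
	proximityScore := -float64(treeDistance(candidate, selected))
	return capacityScore + affinityWeight*proximityScore
}
```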
Thanks for the suggestions, some feedback:

"there is no mechanism for users to opt-out"

This is just a rough idea; we definitely need some opt-in mechanism, like annotations. I just want to hear some advice.

"because the 'affinity' will play a bigger role"

If this is opt-in, then there's no problem, because it acts as you expect: you want affinity & required, and if it doesn't fit, it falls into failure.

I like the (3.) idea; it looks like a weighted tree. This is just an urgent quick fix from our side, lower risk, but we definitely want to find a long-term solution.
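For example, the opt-in could be as simple as an annotation check; the annotation key below is purely hypothetical and not part of the Kueue API:

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// hypothetical annotation key, used only for this sketch
const podSetAffinityAnnotation = "example.kueue.x-k8s.io/podset-affinity"

// podSetAffinityEnabled reports whether the workload opted in to the
// affinity-aware domain sorting.
func podSetAffinityEnabled(obj metav1.Object) bool {
	return obj.GetAnnotations()[podSetAffinityAnnotation] == "true"
}
```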
However, I think we still need a job-level view; then we can make better placements for the podsets as a whole.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
// lex sort domains by their levelValues instead of IDs, as leaves' IDs can only contain the hostname
slices.SortFunc(domains, func(a, b *domain) int {
I don't think we can drop it - when the final assignment is being constructed, it is important to return the domains in tree order (lexicographical), so that the TopologyUngater correctly does rank-based ordering.
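A sketch of what such a comparator could look like, assuming a levelValues []string field on a hypothetical domain type (illustrative, not necessarily the exact code in the PR):

```go
package sketch

import "slices"

// domain mirrors the fields relevant to the ordering discussion.
type domain struct {
	id          string
	levelValues []string
}

// sortDomainsLex orders domains lexicographically by their levelValues,
// so the final assignment comes out in tree order and the TopologyUngater
// can keep doing rank-based ordering consistently.
func sortDomainsLex(domains []*domain) {
	slices.SortFunc(domains, func(a, b *domain) int {
		return slices.Compare(a.levelValues, b.levelValues)
	})
}
```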
What type of PR is this?

What this PR does / why we need it:
Make sure the podsets will be colocated. This is an MVP implementation. We may add an annotation to enable this in the end.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?