
Feat: affinity assignment with TAS #3941

Draft
kerthcet wants to merge 3 commits into base: main from feat/affinity-assignment

Conversation

@kerthcet (Contributor) commented Jan 8, 2025

What type of PR is this?

Make sure the podsets will be colocated. This is an MVP implementation. We may add an annotation to enable this in the end.

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Signed-off-by: kerthcet <[email protected]>
@k8s-ci-robot

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the do-not-merge/work-in-progress and do-not-merge/release-note-label-needed labels Jan 8, 2025
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kerthcet
Once this PR has been reviewed and has the lgtm label, please assign mimowo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes and size/M labels Jan 8, 2025
@kerthcet (Contributor, Author) commented Jan 8, 2025

/test all


netlify bot commented Jan 8, 2025

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: ba85310
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6780c4642caa930008b97bbe

@kerthcet force-pushed the feat/affinity-assignment branch 2 times, most recently from 38a3700 to 4b49b46, January 8, 2025 03:25
@kerthcet (Contributor, Author) commented Jan 8, 2025

/test all

@k8s-ci-robot

@kerthcet: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kueue-test-tas-e2e-main
Commit: 4b49b46
Required: true
Rerun command: /test pull-kueue-test-tas-e2e-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kerthcet (Contributor, Author) commented Jan 8, 2025

This can work with

  - nodeLabel: "topology-key/rdma"
  - nodeLabel: "topology-key/supernode"

However, it doesn't work with

  - nodeLabel: "topology-key/rdma"
  - nodeLabel: "topology-key/supernode"
  - nodeLabel: "kubernetes.io/hostname"

This leads to an infinite loop.

This is because we only reset the states of the prefilledDomains and their parents, but didn't subtract the children's states.
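
To illustrate the missing step, here is a minimal, hypothetical Go sketch (the domain struct and helper names are invented for illustration, not this PR's actual code) of propagating the subtraction both up to the ancestors and down to the children, so the hostname level doesn't keep stale state:

```go
package main

import "fmt"

// Hypothetical, simplified domain node for illustration only; the real TAS
// snapshot types in Kueue differ.
type domain struct {
	id       string
	state    int32 // remaining pod capacity tracked at this level
	parent   *domain
	children []*domain
}

// subtractUp removes count from every ancestor of d.
func subtractUp(d *domain, count int32) {
	for p := d.parent; p != nil; p = p.parent {
		p.state -= count
	}
}

// subtractDown removes count from d and distributes it greedily among its
// children, recursing down to the leaves, so the hostname level is adjusted
// as well (the step the current draft skips).
func subtractDown(d *domain, count int32) {
	d.state -= count
	remaining := count
	for _, c := range d.children {
		if remaining <= 0 {
			break
		}
		take := min(remaining, c.state)
		subtractDown(c, take)
		remaining -= take
	}
}

func main() {
	host := &domain{id: "node-a", state: 4}
	super := &domain{id: "supernode-1", state: 4, children: []*domain{host}}
	host.parent = super
	rdma := &domain{id: "rdma-1", state: 4, children: []*domain{super}}
	super.parent = rdma

	// Account for 2 pods prefilled into supernode-1 at every level of the tree.
	subtractUp(super, 2)
	subtractDown(super, 2)
	fmt.Println(rdma.state, super.state, host.state) // prints: 2 2 2
}
```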

@kerthcet force-pushed the feat/affinity-assignment branch from 4b49b46 to ba85310, January 10, 2025 06:55
@k8s-ci-robot added the size/L label and removed the size/M label Jan 10, 2025
@kerthcet (Contributor, Author):

Hi, we explored the MVP affinity assignment between different podsets; here are our initial thoughts and the result.

TL;DR: if we have n podsets for a single job, we'll make sure the [n+1] podset tries its best to colocate with the [n] podset in the same topology domain or its ancestor domains. Take the PyTorchJob for example: worker[0] will be colocated with the master.

Basically, we record the assigned domains as affinityDomains, and when allocating we schedule the podsets to the prefilled domains first, then fall back to the normal allocation logic with several values adjusted, e.g. the pod count and the domain state, because we have to subtract the allocation result of the prefilled domains. More details in the code.
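
As a rough sketch of that flow (heavily simplified; the domain and podSet types and the fitInto/assignWithAffinity helpers are made-up names, not this PR's code):

```go
package main

import "fmt"

// Illustrative stand-ins for the real TAS snapshot types.
type domain struct {
	id   string
	free int32 // remaining pod capacity
}

type podSet struct {
	name  string
	count int32
}

// fitInto greedily places count pods into the given domains, returning the
// domains actually used and how many pods could not be placed.
func fitInto(domains []*domain, count int32) (used []*domain, remaining int32) {
	remaining = count
	for _, d := range domains {
		if remaining == 0 {
			break
		}
		if d.free == 0 {
			continue
		}
		take := min(remaining, d.free)
		d.free -= take
		remaining -= take
		used = append(used, d)
	}
	return used, remaining
}

// assignWithAffinity remembers the domains chosen for earlier podsets
// (the "prefilled" affinity domains) and tries them first for each new
// podset before falling back to the remaining domains.
func assignWithAffinity(podsets []podSet, all []*domain) map[string][]*domain {
	result := map[string][]*domain{}
	var affinityDomains []*domain
	for _, ps := range podsets {
		chosen, remaining := fitInto(affinityDomains, ps.count)
		if remaining > 0 {
			more, _ := fitInto(all, remaining)
			chosen = append(chosen, more...)
		}
		result[ps.name] = chosen
		affinityDomains = append(affinityDomains, chosen...)
	}
	return result
}

func main() {
	domains := []*domain{{id: "supernode-1", free: 4}, {id: "supernode-2", free: 8}}
	podsets := []podSet{{name: "master", count: 1}, {name: "worker", count: 3}}
	assignments := assignWithAffinity(podsets, domains)
	// Both podsets land in supernode-1: the master filled it first and there
	// is still room for the workers, so they colocate.
	for _, ps := range podsets {
		for _, d := range assignments[ps.name] {
			fmt.Println(ps.name, "->", d.id)
		}
	}
}
```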

However, there is a downside: because we don't have an overview of the total counts for the different parts of a job (master and worker for a PyTorchJob), we have to allocate the master first, which means the quota might be enough for the master podsets but not for the worker ones, and then the worker podsets may cross domains. This is not ideal, but I think the problem already exists with today's design and is hard to solve unless we schedule the job as a whole rather than the podsets separately.

Any thoughts?

@kerthcet (Contributor, Author):

ping @mimowo @PBundyra

loopDomain = loopDomain.parent
parentDomainIDs = append(parentDomainIDs, loopDomain.id)
}
}
Contributor:

It seems like a lot of this logic is similar to the logic in the findLevelWithFitDomains function. Maybe we could commonize something?

@kerthcet (Contributor, Author):

Yeah, we can commonize them later; for now I just want to hear some high-level suggestions about the design.

@kerthcet (Contributor, Author):

We already applied it this way with a newly built image to avoid such unexpected behavior.

Contributor:

Right, I'm not yet convinced about the idea. The list of concerns I have:

  1. There is no mechanism for users to opt-out, and I think it might be constraining for users where there is not much traffic between the master and worker pods. This is probably acceptable if we make sure the algorithm does not cause a regression for some users.
  2. I would like to make sure there is no workload which is unschedulable now but could be scheduled before. For example, a workload with two PodSets using "required" could previously be scheduled by choosing distant domains; IIUC the workload will now fail, because the "affinity" will play a bigger role.
  3. I think the ideal algorithm should not just prioritize the same domain vs. other domains, but count the "distance" in the tree between a pair of domains and score the domains according to the affinity. There would be two scores then: scoring large domains as currently, and scoring domains close to the selected ones. The final sorting score would be some kind of mixture between them (see the sketch after this comment).

For now, (2.) is the main concern.

Regarding this proposal vs. the full-blown (3.): I still need to think about it and consult with others. Hopefully we can find a sweet spot with a good gain/effort ratio, but I'm not sure the current proposal is there.
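
To make (3.) a bit more concrete, here is a rough, hypothetical sketch of such a blended score; none of these names exist in Kueue, and alpha is an imaginary tuning knob:

```go
package main

import "fmt"

// treeDistance counts edges between two domains given their paths from the
// root (e.g. {"rdma-1", "supernode-2", "node-b"}), via the length of the
// shared prefix (their lowest common ancestor).
func treeDistance(a, b []string) int {
	common := 0
	for common < len(a) && common < len(b) && a[common] == b[common] {
		common++
	}
	return (len(a) - common) + (len(b) - common)
}

// blendedScore mixes the existing "prefer large domains" signal with a
// "prefer domains close to already selected ones" signal; alpha in [0, 1]
// controls the trade-off between the two.
func blendedScore(capacity int32, distToSelected int, alpha float64) float64 {
	affinity := 1.0 / float64(1+distToSelected)
	return alpha*float64(capacity) + (1-alpha)*affinity
}

func main() {
	selected := []string{"rdma-1", "supernode-1", "node-a"}
	near := []string{"rdma-1", "supernode-1", "node-b"} // same supernode
	far := []string{"rdma-2", "supernode-9", "node-z"}  // different rdma block

	fmt.Println(blendedScore(8, treeDistance(selected, near), 0.5)) // closer domain scores higher
	fmt.Println(blendedScore(8, treeDistance(selected, far), 0.5))  // farther domain scores lower
}
```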

@kerthcet (Contributor, Author) commented Jan 16, 2025

Thanks for the suggestions, some feedback:

there is no mechanism for users to opt-out

This is just a rough idea; we definitely need some opt-in mechanism, like an annotation. I just want to hear some advice.

because the "affinity" will play a bigger role

If this is opt-in, then there's no problem, because it acts as you expect: you want affinity & required, and if it doesn't fit, it fails.

I like the (3.) idea; it looks like a weighted tree. This is just an urgent quick fix from our side, lower risk, but we definitely want to find a long-term solution.

@kerthcet (Contributor, Author):

However, I think we still need a job-level view; then we can make better placement decisions for the podsets as a whole.

@k8s-ci-robot added the needs-rebase label Jan 14, 2025
@k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

// lex sort domains by their levelValues instead of IDs, as leaves' IDs can only contain the hostname
slices.SortFunc(domains, func(a, b *domain) int {
Contributor:

I don't think we can drop it. When the final assignment is being constructed, it is important to return the domains in the tree order (lexicographical), so that the TopologyUngater correctly does rank-based ordering.
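
For reference, a self-contained sketch of lexicographic ordering over the full levelValues path (the domain type here is a minimal stand-in, and the comparator body is an assumption since the diff above is truncated):

```go
package main

import (
	"fmt"
	"slices"
)

// Minimal stand-in for the snapshot's domain type; only the field needed for
// ordering is shown.
type domain struct {
	levelValues []string // e.g. {"rdma-1", "supernode-2", "node-b"}
}

func main() {
	domains := []*domain{
		{levelValues: []string{"rdma-1", "supernode-2", "node-b"}},
		{levelValues: []string{"rdma-1", "supernode-1", "node-c"}},
		{levelValues: []string{"rdma-1", "supernode-1", "node-a"}},
	}
	// Sorting by the full levelValues path keeps the tree (lexicographical)
	// order that TopologyUngater relies on for rank-based ordering; sorting
	// by the leaf ID alone (the hostname) would interleave supernodes.
	slices.SortFunc(domains, func(a, b *domain) int {
		return slices.Compare(a.levelValues, b.levelValues)
	})
	for _, d := range domains {
		fmt.Println(d.levelValues)
	}
}
```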


Labels
cncf-cla: yes. Indicates the PR's author has signed the CNCF CLA.
do-not-merge/release-note-label-needed. Indicates that a PR should not merge because it's missing one of the release note labels.
do-not-merge/work-in-progress. Indicates that a PR should not merge because it is a work in progress.
needs-rebase. Indicates a PR cannot be merged because it has merge conflicts with HEAD.
size/L. Denotes a PR that changes 100-499 lines, ignoring generated files.