Feat: affinity assignment with TAS #3941
base: main
Conversation
Signed-off-by: kerthcet <[email protected]>
Signed-off-by: kerthcet <[email protected]>
Skipping CI for Draft Pull Request.
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: kerthcet. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test all
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Force-pushed from 38a3700 to 4b49b46
/test all
@kerthcet: The following test failed:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This can work with:
- nodeLabel: "topology-key/rdma"
- nodeLabel: "topology-key/supernode"

However, it doesn't work with:
- nodeLabel: "topology-key/rdma"
- nodeLabel: "topology-key/supernode"
- nodeLabel: "kubernetes.io/hostname"

This will lead to an infinite loop.
This is because we just reset the states of the prefilledDomains and their parents, but didn't subtract the children's states.
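For illustration, here is a minimal sketch of the issue described above, using hypothetical types rather than the actual Kueue structs: when a prefilled domain's state is cleared only upwards, the children keep stale counts, so a reset that also propagates downwards would be needed.

```go
package sketch

// domain is a hypothetical stand-in for the TAS domain type.
type domain struct {
	id       string
	state    int32 // pods currently counted against this domain
	parent   *domain
	children []*domain
}

// resetUp clears the prefilled state of d and all of its ancestors,
// which is roughly what the current code does.
func resetUp(d *domain) {
	for cur := d; cur != nil; cur = cur.parent {
		cur.state = 0
	}
}

// resetDown additionally subtracts the children's states so every level
// of the tree agrees again; without this step the hostname level keeps
// its old counts and the fit loop never converges.
func resetDown(d *domain) {
	d.state = 0
	for _, child := range d.children {
		resetDown(child)
	}
}
```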
Signed-off-by: kerthcet <[email protected]>
Force-pushed from 4b49b46 to ba85310
Hi, we explored the MVP.

TL;DR: if we have n podsets for a single job, we make sure the [n+1] podset tries its best to colocate with the [n] podset in the same topology domain or its parent domains. Take the PyTorchJob for example: worker[0] will be colocated with the master. Basically, we just record the assigned domains as prefilledDomains.

However, there is a downside: because we don't have an overview of the total counts for the different parts of the job (master and worker for PyTorchJob), we have to allocate the master first. That means the quota might be enough for the master podsets but not the worker ones, and then the worker podsets may cross domains. This is not ideal, but I think the problem still exists with today's design, and it is hard to solve unless we schedule the job as a whole, rather than podsets separately. Any thoughts?
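A minimal sketch of how the recorded domains could bias the sorting, with hypothetical names rather than the actual Kueue types:

```go
package sketch

import "slices"

// domain is a hypothetical stand-in for the TAS domain type.
type domain struct {
	id     string
	parent *domain
	free   int32 // remaining capacity in this domain
}

// prefilledIDs records the domains assigned to the previous podset,
// together with all of their ancestors.
func prefilledIDs(assigned []*domain) map[string]bool {
	ids := map[string]bool{}
	for _, d := range assigned {
		for cur := d; cur != nil; cur = cur.parent {
			ids[cur.id] = true
		}
	}
	return ids
}

// sortWithAffinity puts domains already used by the previous podset first,
// and falls back to the existing "largest free capacity first" ordering.
func sortWithAffinity(candidates []*domain, prefilled map[string]bool) {
	slices.SortFunc(candidates, func(a, b *domain) int {
		aHit, bHit := prefilled[a.id], prefilled[b.id]
		switch {
		case aHit && !bHit:
			return -1
		case bHit && !aHit:
			return 1
		default:
			return int(b.free) - int(a.free)
		}
	})
}
```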
		loopDomain = loopDomain.parent
		parentDomainIDs = append(parentDomainIDs, loopDomain.id)
	}
}
It seems like a lot of this logic is similar to the one in the findLevelWithFitDomains function. Maybe we could commonize something?
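As an illustration of what commonizing could look like (a hypothetical helper, not the existing code), both call sites could share a small ancestor-walking function:

```go
package sketch

// domain mirrors only the fields used by the loop above.
type domain struct {
	id     string
	parent *domain
}

// ancestorIDs walks up from d and collects the IDs of its ancestors,
// bottom-up, which is what the loop above and (presumably) the one in
// findLevelWithFitDomains both do.
func ancestorIDs(d *domain) []string {
	var ids []string
	for cur := d.parent; cur != nil; cur = cur.parent {
		ids = append(ids, cur.id)
	}
	return ids
}
```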
Yeah, we can commonize them later; I just want to hear some high-level suggestions about the design.
We already applied it this way with a newly built image to avoid such unexpected behavior.
Right, I'm not yet convinced about the idea. The list of concerns I have:
- there is no mechanism for users to opt-out and I think it might be constraining for users where there is not much traffic between the master and worker pods. This is probably acceptable if we make sure the algorithm does not mean regression to some users,
- I would like to make sure there is no workload which is unschedulable but could be scheduled before. For example, a workload with two PodSets using "required" could be scheduled before by choosing distant domains, IIUC now the workload will fail, because the "affinity" will play a bigger role
- I think the ideal algorithm should not just prioritize the same vs other domain, but count the "distance" in the tree between a pair of domains, and score the domains according to the affinity. There would be two scores then - scoring large domains as currently, and scoring for domains close to the selected ones. The final sorting score would be some kind of mixture between them.
For now, (2.) is the main concern.
Regarding this proposal vs. full-blown (3.): I will yet need to think about it and consult with others. Hopefully we can find some sweet spot with a good gain / effort ratio, but I'm not sure the current proposal is there.
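For reference, a rough sketch of what (3.) could look like, with hypothetical names: the tree distance between a candidate and an already-selected domain becomes a proximity score, mixed with the existing capacity score.

```go
package sketch

// domain is a hypothetical stand-in for the TAS domain type.
type domain struct {
	id     string
	parent *domain
	free   int32
}

// depth returns how many ancestors d has.
func depth(d *domain) int {
	n := 0
	for cur := d.parent; cur != nil; cur = cur.parent {
		n++
	}
	return n
}

// treeDistance counts the edges between a and b via their lowest common
// ancestor; 0 means the same domain.
func treeDistance(a, b *domain) int {
	da, db := depth(a), depth(b)
	dist := 0
	for da > db {
		a, da, dist = a.parent, da-1, dist+1
	}
	for db > da {
		b, db, dist = b.parent, db-1, dist+1
	}
	for a != b {
		a, b, dist = a.parent, b.parent, dist+2
	}
	return dist
}

// score mixes the two signals; affinityWeight would be a tuning knob.
func score(candidate, selected *domain, affinityWeight float64) float64 {
	capacityScore := float64(candidate.free)
	proximityScore := -float64(treeDistance(candidate, selected))
	return capacityScore + affinityWeight*proximityScore
}
```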
Thanks for the suggestions, some feedback:

"there is no mechanism for users to opt-out"

This is just a rough idea; we definitely need some opt-in mechanism, like annotations. I just want to hear some advice.

"because the 'affinity' will play a bigger role"

If this is opt-in, then there's no problem, because it acts as you expect: you want affinity & required, and if it doesn't fit, it falls into failure.

I like the (3.) idea; it looks like a weighted tree. This is just an urgent quick fix from our side, lower risk, but we definitely want to find a long-term solution.
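For example, the opt-in could be as simple as an annotation check; the annotation key below is purely hypothetical and not part of the Kueue API:

```go
package sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// hypothetical annotation key, used only for this sketch
const podSetAffinityAnnotation = "example.kueue.x-k8s.io/podset-affinity"

// podSetAffinityEnabled reports whether the workload opted in to the
// affinity-aware domain sorting.
func podSetAffinityEnabled(obj metav1.Object) bool {
	return obj.GetAnnotations()[podSetAffinityAnnotation] == "true"
}
```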
However, I think we still need a job-level view; then we can make better placements for the podsets as a whole.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
// lex sort domains by their levelValues instead of IDs, as leaves' IDs can only contain the hostname
slices.SortFunc(domains, func(a, b *domain) int {
I don't think we can drop it - when the final assignment is being constructed, it is important to return the domains in tree order (lexicographical), so that the TopologyUngater correctly does rank-based ordering.
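A sketch of what such a comparator could look like, assuming a levelValues []string field on a hypothetical domain type (illustrative, not necessarily the exact code in the PR):

```go
package sketch

import "slices"

// domain mirrors the fields relevant to the ordering discussion.
type domain struct {
	id          string
	levelValues []string
}

// sortDomainsLex orders domains lexicographically by their levelValues,
// so the final assignment comes out in tree order and the TopologyUngater
// can keep doing rank-based ordering consistently.
func sortDomainsLex(domains []*domain) {
	slices.SortFunc(domains, func(a, b *domain) int {
		return slices.Compare(a.levelValues, b.levelValues)
	})
}
```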
What type of PR is this?

What this PR does / why we need it:
Make sure the podsets will be colocated. This is an MVP implementation. We may add an annotation to enable this in the end.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?