[Discussion][TAS] Best effort placements for pods in lower tier of topology #3887
Comments
cc @mimowo |
IIUC the best you can do with the current API is to use Suggestion: you can also have the |
But they're not guaranteed to land under the same RDMA domain if that domain has insufficient resources, right?
Can you elaborate more on this? I remember the algorithm searches the topology from the bottom up. I'm still reading the code now. |
Yes, not guaranteed, but on a prod system with many Jobs competing for resources I believe using "required" is hard, except maybe for a hero Job. Maybe "required" will work better once we start supporting preemption with TAS, but that is for the future.
There are two BFS passes:
kueue/pkg/cache/tas_flavor_snapshot.go Lines 215 to 226 in b5e745a
EDIT: Also, If you use |
When using If sufficient quota exists within a single RDMA domain, the TAS will schedule all pods within that domain. However, this might utilize more than 4 supernodes. If insufficient quota exists, the job will not be scheduled. When using If sufficient quota exists within a single RDMA domain, TAS will schedule all pods within that domain. Otherwise, it will distribute the pods across multiple RDMA domains. |
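The two cases above can be illustrated with a toy model. This is only a sketch under assumptions, not Kueue's implementation: it assumes the two cases correspond to the "required" and "preferred" modes mentioned elsewhere in this thread, and the function name, the domain names, and the per-domain pod capacities are all hypothetical.

```python
from typing import Dict, Optional


def place_in_rdma_domains(free: Dict[str, int], pods: int,
                          mode: str) -> Optional[Dict[str, int]]:
    """Toy placement of `pods` across RDMA domains.

    `free` maps a (hypothetical) domain name to the number of pods that
    still fit there. Returns {domain: pods placed}, or None when the
    request cannot be admitted under the given mode.
    """
    if mode not in ("required", "preferred"):
        raise ValueError(f"unknown mode: {mode}")
    # Both modes use a single domain whenever one can hold every pod.
    single = sorted(d for d, cap in free.items() if cap >= pods)
    if single:
        return {single[0]: pods}
    # "required" gives up here; "preferred" needs enough total capacity.
    if mode == "required" or sum(free.values()) < pods:
        return None
    # "preferred" spills over, taking the largest domains first.
    placement: Dict[str, int] = {}
    remaining = pods
    for domain in sorted(free, key=lambda d: (-free[d], d)):
        take = min(free[domain], remaining)
        if take > 0:
            placement[domain] = take
            remaining -= take
        if remaining == 0:
            break
    return placement
```

With `{"rdma-a": 12, "rdma-b": 8}` and a 16-pod request, the "required" call returns `None`, while the "preferred" call spreads the pods over both domains.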
I checked the code @mimowo referred to, IIUC, when set the policy to kueue/pkg/cache/tas_flavor_snapshot.go Lines 294 to 312 in b5e745a
|
I tested with a PyTorchJob like this:
And I got unexpected results, as the status shows:
The master and the worker were placed on the same node, kwok-node-10 specifically, which I think is not as expected. What's more, since the topology constraint applies to the whole job, why not set the annotation at the job level rather than the spec level? Happy Xmas day anyway! 🎄 |
This is how we test: https://github.com/kerthcet/benchmark/tree/main/kueue-tas. |
It seems we don't update the leaf freeCapacity after earlier PodSets have been given their topology assignment. |
Thank you for providing a detailed explanation @kerthcet, and thanks for the wishes!
Could you please elaborate on why this is unexpected? Does one node have 32 GPUs? In the link you provided, you set the GPU capacity to 8 per node, but I imagine this configuration is outdated; please correct me if I'm wrong. If there are 32 GPUs per node, I believe this assignment is as intended |
Yes, the algorithm is greedy and minimizes the number of domains at each level. If there is enough capacity in one supernode, then only one supernode will be used |
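To illustrate the greedy behaviour described above, here is a minimal sketch for a single topology level. It is not the code in tas_flavor_snapshot.go; the function name and the domain names are hypothetical, and capacities are expressed as "pods that still fit" for simplicity.

```python
from typing import Dict, Optional


def pick_domains(free: Dict[str, int], pods: int) -> Optional[Dict[str, int]]:
    """Cover `pods` with as few domains as possible at one level.

    `free` maps a domain name to how many pods still fit in it.
    Returns {domain: pods placed}, or None if total capacity is too small.
    """
    if sum(free.values()) < pods:
        return None
    assignment: Dict[str, int] = {}
    remaining = pods
    # Taking the largest domains first minimizes how many are needed.
    for domain in sorted(free, key=lambda d: (-free[d], d)):
        if remaining == 0:
            break
        take = min(free[domain], remaining)
        if take > 0:
            assignment[domain] = take
            remaining -= take
    return assignment


supernodes = {"sn-a": 4, "sn-b": 4, "sn-c": 2}
one = pick_domains(supernodes, 4)   # fits in a single supernode
two = pick_domains(supernodes, 6)   # needs a second supernode
```

In this sketch a 4-pod request lands entirely in one supernode, and a second supernode is only touched once the first one is full.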
No, it's a supernode, which has 4 nodes and 4*8=32 GPUs as a unit. These four nodes have better network connectivity with each other; you could think of it as an NVL or SuperPod. |
Take the status for example:
The master podset assignment is placed on kwok-node-10, and the first assignment of the worker podset is also placed on kwok-node-10, which is not right. I'll take a deeper look at the code over the next few days. |
@kerthcet Can you run one more scenario where the master section is defined after the worker section? I wonder if this is going to make any change at all. |
Sorry, what do you mean? |
In the PyTorchJob specification:
|
If I understand correctly, are you referring to the role order?

status:
  admission:
    clusterQueue: tas-cluster-queue
    podSetAssignments:
    - count: 1
      flavors:
        nvidia.com/gpu: tas-flavor
      name: master
      resourceUsage:
        nvidia.com/gpu: "8"
      topologyAssignment:
        domains:
        - count: 1
          values:
          - node-0201
        levels:
        - kubernetes.io/hostname
    - count: 15
      flavors:
        nvidia.com/gpu: tas-flavor
      name: worker
      resourceUsage:
        nvidia.com/gpu: "120"
      topologyAssignment:
        domains:
        - count: 1
          values:
          - node-0201
        - count: 1
          values:
          - node-0202
        - count: 1
          values:
          - node-0203
        - count: 1
          values:
          - node-0204
        - count: 1
          values:
          - node-0205
        - count: 1
          values:
          - node-0206
        - count: 1
          values:
          - node-0207
        - count: 1
          values:
          - node-0208
        - count: 1
          values:
          - node-0209
        - count: 1
          values:
          - node-0210
        - count: 1
          values:
          - node-0211
        - count: 1
          values:
          - node-0212
        - count: 1
          values:
          - node-0213
        - count: 1
          values:
          - node-0214
        - count: 1
          values:
          - node-0215
        levels:
        - kubernetes.io/hostname

PyTorchJob:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: tas-local-queue
  name: job-eowu-15
  namespace: tas
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 15
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            kueue.x-k8s.io/podset-preferred-topology: supernode
            kueue.x-k8s.io/workload: pytorchjob-job-eowu-15-1d286
          labels:
            kueue.x-k8s.io/podset: worker
            kueue.x-k8s.io/tas: "true"
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-21320b6
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                nvidia.com/gpu: "8"
          nodeSelector:
            topology-key/zone: zone1
          schedulingGates:
          - name: kueue.x-k8s.io/topology
          tolerations:
          - effect: NoSchedule
            key: kwok.x-k8s.io/node
            operator: Exists
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            kueue.x-k8s.io/podset-preferred-topology: supernode
            kueue.x-k8s.io/workload: pytorchjob-job-eowu-15-1d286
          labels:
            kueue.x-k8s.io/podset: master
            kueue.x-k8s.io/tas: "true"
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-21320b6
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                nvidia.com/gpu: "8"
          nodeSelector:
            topology-key/zone: zone1
          schedulingGates:
          - name: kueue.x-k8s.io/topology
          tolerations:
          - effect: NoSchedule
            key: kwok.x-k8s.io/node
            operator: Exists
|
Thanks for providing such a detailed explanation. We'll look into it |
@kerthcet thank you for reporting the issue, I believe there is a bug indeed - the inflight usage from previously considered PodSets is not taken into account when computing placement for the new PodSet. I will leave more comments at your PR. |
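The effect of ignoring in-flight usage can be reproduced with a small model. This is a sketch under assumptions, not Kueue's code: the function and its `track_inflight` flag are hypothetical, the cluster is shrunk to four 8-GPU nodes, and PodSets are simply placed on nodes in name order.

```python
from typing import Dict, List, Tuple


def assign_podsets(free_gpus: Dict[str, int],
                   podsets: List[Tuple[str, int, int]],
                   track_inflight: bool) -> Dict[str, Dict[str, int]]:
    """Place each (name, pod_count, gpus_per_pod) PodSet in order.

    With track_inflight=False every PodSet sees the original free
    capacity, mimicking the reported behaviour. With True, capacity
    used by earlier PodSets is subtracted before placing the next one.
    """
    capacity = dict(free_gpus)
    result: Dict[str, Dict[str, int]] = {}
    for name, count, gpus_per_pod in podsets:
        placed: Dict[str, int] = {}
        remaining = count
        for node in sorted(capacity):
            if remaining == 0:
                break
            fits = min(capacity[node] // gpus_per_pod, remaining)
            if fits > 0:
                placed[node] = fits
                remaining -= fits
        if remaining > 0:
            raise ValueError(f"PodSet {name} does not fit")
        result[name] = placed
        if track_inflight:
            # Account for what this PodSet consumed before the next one.
            for node, pods in placed.items():
                capacity[node] -= pods * gpus_per_pod
    return result


nodes = {f"node-020{i}": 8 for i in range(1, 5)}
job = [("master", 1, 8), ("worker", 3, 8)]
buggy = assign_podsets(nodes, job, track_inflight=False)
fixed = assign_podsets(nodes, job, track_inflight=True)
```

In the `buggy` result both master and worker claim node-0201, matching the status shown earlier in the thread; in the `fixed` result the worker pods start at node-0202.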
What would you like to be added:
We have a rough topology like this: node -> superNode -> RDMA domain -> Zone.
We want to build the topology like:
So if we want to deploy a job with 4 replicas, each requesting 8 GPUs, we can simply set the annotation like
then we'll find a supernode.
However, if we want to deploy a 16-replica job, which means 4 superNodes, we have to set the job annotation like
because a single supernode obviously cannot fit it. Then the question is: is there any way to make sure the 16 pods are colocated within 4 superNodes, rather than spread across 8 superNodes, which means fragmentation?
I haven't tried this with TAS yet; I'm just asking the question ahead of time. And what scheduling instructions will be injected?
Ticketed the issue here just in case others have similar questions.
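The fragmentation concern can be made concrete with a little arithmetic. The sketch below assumes the sizes mentioned in the thread (8 GPUs per node, 4 nodes per supernode); the helper names and the supernode names are hypothetical.

```python
import math
from typing import Dict

GPUS_PER_NODE = 8        # per the thread: 8 GPUs on each node
NODES_PER_SUPERNODE = 4  # per the thread: 4 nodes form one supernode


def min_supernodes(replicas: int, gpus_per_pod: int) -> int:
    """Fewest supernodes that can hold `replicas` pods."""
    gpus_per_supernode = GPUS_PER_NODE * NODES_PER_SUPERNODE
    pods_per_supernode = gpus_per_supernode // gpus_per_pod
    if pods_per_supernode == 0:
        raise ValueError("a single pod does not fit in one supernode")
    return math.ceil(replicas / pods_per_supernode)


def is_fragmented(pods_by_supernode: Dict[str, int], gpus_per_pod: int) -> bool:
    """True if the placement uses more supernodes than strictly needed."""
    used = sum(1 for pods in pods_by_supernode.values() if pods > 0)
    total = sum(pods_by_supernode.values())
    return used > min_supernodes(total, gpus_per_pod)


packed = {f"sn-{i}": 4 for i in range(4)}   # 16 pods in 4 supernodes
spread = {f"sn-{i}": 2 for i in range(8)}   # 16 pods in 8 supernodes
```

For 16 replicas of 8 GPUs each, `min_supernodes` gives 4, so the `packed` placement is ideal while the `spread` one is the fragmented case described above.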
Why is this needed:
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.