Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion][TAS] Best effort placements for pods in lower tier of topology #3887

Open
3 tasks
kerthcet opened this issue Dec 19, 2024 · 19 comments · May be fixed by #3937
Open
3 tasks

[Discussion][TAS] Best effort placements for pods in lower tier of topology #3887

kerthcet opened this issue Dec 19, 2024 · 19 comments · May be fixed by #3937
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@kerthcet
Copy link
Contributor

kerthcet commented Dec 19, 2024

What would you like to be added:

We have a rough topology like this: node -> superNode -> RDMA domain -> Zone.

image

We want to build the topology like:

  levels:
  - nodeLabel: "cloud.provider.com/topology-zone"
  - nodeLabel: "cloud.provider.com/topology-rdma-domain"
  - nodeLabel: "cloud.provider.com/topology-supernode"

So if we want to deploy a job with 4 replicas, each requests 8 GPUs, we can simply set the annotation like

kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-supernode"

then we'll find a supernode.

However, if we want to deploy a 16 replica job, which means 4 superNodes, we have to set the job annotation like

kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rdma-domain"

because obviously a supernode is not fit. Then comes the question, is there any way to make sure the 16 pods are colocated within 4 superNodes, rather than 16 pods in 8 superNodes which means fragmentation.

I haven't tried with TAS yet, just ask this question ahead. And what scheduling instructions will be injected?

Ticketed the issue here just in case others have similar questions.

Why is this needed:

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@kerthcet kerthcet added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 19, 2024
@kerthcet kerthcet changed the title [Discussion] Best effort placements for pods in lower tier of topology [Discussion][TAS] Best effort placements for pods in lower tier of topology Dec 19, 2024
@kerthcet
Copy link
Contributor Author

cc @mimowo

@mimowo
Copy link
Contributor

mimowo commented Dec 19, 2024

IIUC the best you can do with the current API is to use kueue.x-k8s.io/podset-preferred-topology: cloud.provider.com/topology-supernode instead. If the cluster is empty the Job will land on 4 superNodes.
However, if the cluster already has some Jobs running you may overflow to more superNodes.

Suggestion: you can also have the kubernetes.io/hostname level at the bottom, and then even have kueue.x-k8s.io/podset-preferred-topology: hostname.io/nodename. The algorithm will find that the Job fits in one RDMA domain and will step down optimizing the number of superNodes and nodes used.

cc @mwysokin @PBundyra

@kerthcet
Copy link
Contributor Author

kerthcet commented Dec 19, 2024

use kueue.x-k8s.io/podset-preferred-topology: cloud.provider.com/topology-supernode instead. If the cluster is empty the Job will land on 4 superNodes. However, if the cluster already has some Jobs running you may overflow to more superNodes

But they're not guaranteed under the same rdma domain right, if insufficient resource in the rdma domain.

The algorithm will find that the Job fits in one RDMA domain and will step down optimizing the number of superNodes and nodes used.

Can you elaborate more on this? I remember the algo is searching the topo from bottom to up. I'm still reading the codes now.

@mimowo
Copy link
Contributor

mimowo commented Dec 19, 2024

But they're not guaranteed under the same rdma domain right, if insufficient resource in the rdma domain.

Yes, not guaranteed, but on prod system with many Jobs competing for resources I believe using "required" is hard - maybe for a hero Job. Maybe "required" will work better when we start supporting preemptions with TAS, but this is for future.

Can you elaborate more on this?

There are two BFS passes :

  • bottom up starting from the level indicated by the annotation to find the topology domain which fits the entire workload
  • top to bottom starting at the found level down to the last specified level

// Algorithm overview:
// Phase 1:
//
// determine pod counts for each topology domain. Start at the lowest level
// and bubble up the numbers to the top level
//
// Phase 2:
//
// a) select the domain at requested level with count >= requestedCount
// b) traverse the structure down level-by-level optimizing the number of used
// domains at each level
// c) build the assignment for the lowest level in the hierarchy

EDIT: Also, If you use kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rdma-domain" the algorithm will schedule on domain, you are not guarantteed 4 super nodes are used, but the algorithm will optimize the number of super nodes.

@PBundyra
Copy link
Contributor

because obviously a supernode is not fit. Then comes the question, is there any way to make sure the 16 pods are colocated within 4 superNodes, rather than 16 pods in 8 superNodes which means fragmentation.

When using kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rdma-domain":

If sufficient quota exists within a single RDMA domain, the TAS will schedule all pods within that domain. However, this might utilize more than 4 supernodes. If insufficient quota exists, the job will not be scheduled.

When using kueue.x-k8s.io/podset-preferred-topology: "cloud.provider.com/topology-rdma-domain":

If sufficient quota exists within a single RDMA domain, TAS will schedule all pods within that domain. Otherwise, it will distribute the pods across multiple RDMA domains.

@kerthcet
Copy link
Contributor Author

I checked the code @mimowo referred to, IIUC, when set the policy to kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rdma-domain", and let's assume that there's no pods in the cluster yet, then we'll make sure we'll fill the supernode domain one by one right? Because we're searching from the lowest ones and domains are sorted in descend.

sortedDomain := s.sortedDomains(levelDomains)
topDomain := sortedDomain[0]
if topDomain.state < count {
if required {
return 0, nil, s.notFitMessage(topDomain.state, count)
}
if levelIdx > 0 {
return s.findLevelWithFitDomains(levelIdx-1, required, count)
}
lastIdx := 0
remainingCount := count - sortedDomain[lastIdx].state
for remainingCount > 0 && lastIdx < len(sortedDomain)-1 && sortedDomain[lastIdx].state > 0 {
lastIdx++
remainingCount -= sortedDomain[lastIdx].state
}
if remainingCount > 0 {
return 0, nil, s.notFitMessage(count-remainingCount, count)
}
return 0, sortedDomain[:lastIdx+1], ""

@kerthcet
Copy link
Contributor Author

I tested with pytorchjob like this:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-simple
  namespace: tas
  labels:
    kueue.x-k8s.io/queue-name: tas-local-queue
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            kueue.x-k8s.io/podset-required-topology: "topology-key/supernode"
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-21320b6
#              If you have gpu, pytorch-mnist-gpu would be helpful. pytorch-mnist-gpu is approximately 22GB
#              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
              resources:
                requests:
                  nvidia.com/gpu: 8
          tolerations:
          - key: "kwok.x-k8s.io/node"
            operator: "Exists"
            effect: "NoSchedule"
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            kueue.x-k8s.io/podset-required-topology: "topology-key/supernode"
        spec:
          containers:
            - name: pytorch
              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-21320b6
#              If you have gpu, pytorch-mnist-gpu would be helpful. pytorch-mnist-gpu is approximately 22GB
#              image: docker.io/kubeflowkatib/pytorch-mnist-cpu:latest
              imagePullPolicy: Always
              command:
                - "python3"
                - "/opt/pytorch-mnist/mnist.py"
                - "--epochs=1"
              resources:
                requests:
                  nvidia.com/gpu: 8
          tolerations:
          - key: "kwok.x-k8s.io/node"
            operator: "Exists"
            effect: "NoSchedule"

And I got unexpected results as the status telling:

status:
  admission:
    clusterQueue: tas-cluster-queue
    podSetAssignments:
    - count: 1
      flavors:
        nvidia.com/gpu: tas-flavor
      name: master
      resourceUsage:
        nvidia.com/gpu: "8"
      topologyAssignment:
        domains:
        - count: 1
          values:
          - kwok-node-10
        levels:
        - kubernetes.io/hostname
    - count: 3
      flavors:
        nvidia.com/gpu: tas-flavor
      name: worker
      resourceUsage:
        nvidia.com/gpu: "24"
      topologyAssignment:
        domains:
        - count: 1
          values:
          - kwok-node-10
        - count: 1
          values:
          - kwok-node-11
        - count: 1
          values:
          - kwok-node-8
        levels:
        - kubernetes.io/hostname

The master and the worker located to the same node, kwok-node-10 specifically, I think this is not as expected.

Whats' more, since the topology constraint applies to the whole job, why not set the annotation at the job level rather than the spec level.

cc @mimowo @PBundyra

Happen xmas day anyway! 🎄

@kerthcet
Copy link
Contributor Author

This is how we test: https://github.com/kerthcet/benchmark/tree/main/kueue-tas.

@kerthcet
Copy link
Contributor Author

The master and the worker located to the same node, kwok-node-10 specifically, I think this is not as expected.

Seems we didn't update the leaf freeCapacity once former podsets assignedTopology.

@PBundyra
Copy link
Contributor

Thank you for providing detailed explanation @kerthcet and thanks for wishes!

The master and the worker located to the same node, kwok-node-10 specifically, I think this is not as expected.

Could you elaborate why is it unexpected please? Does one node have 32GPUs? In the link you have provided you set GPUs capacity to 8 per node, but I imagine this configuration is outdated - please correct me if I'm wrong. In case there are 32GPUs per node, I believe this assignment is as intended

@PBundyra
Copy link
Contributor

I checked the code @mimowo referred to, IIUC, when set the policy to kueue.x-k8s.io/podset-required-topology: "cloud.provider.com/topology-rdma-domain", and let's assume that there's no pods in the cluster yet, then we'll make sure we'll fill the supernode domain one by one right?

Yes, the algorithm is greedy and minimizes the number of domains at each level. If there is enough capacity in one supernode, then only one supernode will be used

@kerthcet
Copy link
Contributor Author

kerthcet commented Jan 3, 2025

Could you elaborate why is it unexpected please? Does one node have 32GPUs?

No, it's a super node which has 4 nodes and 4*8=32 GPUs as an unit. These four nodes has better network connecting together, you could take it as a NVL or suprePod.

@kerthcet
Copy link
Contributor Author

kerthcet commented Jan 3, 2025

Take the status for example:

status:
  admission:
    clusterQueue: tas-cluster-queue
    podSetAssignments:
    - count: 1
      flavors:
        nvidia.com/gpu: tas-flavor
      name: master
      resourceUsage:
        nvidia.com/gpu: "8"
      topologyAssignment:
        domains:
        - count: 1
          values:
          - kwok-node-10
        levels:
        - kubernetes.io/hostname
    - count: 3
      flavors:
        nvidia.com/gpu: tas-flavor
      name: worker
      resourceUsage:
        nvidia.com/gpu: "24"
      topologyAssignment:
        domains:
        - count: 1
          values:
          - kwok-node-10
        - count: 1
          values:
          - kwok-node-11
        - count: 1
          values:
          - kwok-node-8
        levels:
        - kubernetes.io/hostname

The master podset assignment will be located to kwok-node-10 and the first assignment of worker will also be located to kwok-node-10, which is not right.

I'll take a deep look of the code these days.

@mwysokin
Copy link
Contributor

mwysokin commented Jan 3, 2025

@kerthcet Can you run one more scenario where the master section is defined after the worker section? I wonder if this is going to make any change at all.

@kerthcet
Copy link
Contributor Author

kerthcet commented Jan 3, 2025

Sorry, what do you mean where the master section is defined after the worker section?

@mwysokin
Copy link
Contributor

mwysokin commented Jan 3, 2025

In the PyTorchJob specification:

spec:
  pytorchReplicaSpecs:
    Worker:
      ...
    Master:
      ...

@kebe7jun
Copy link
Contributor

kebe7jun commented Jan 3, 2025

If I understand correctly, are you referring to the role order?
After switching the order, there was still no change (pytorchReplicaSpecs is a map and does not have an order):

status:
  admission:
    clusterQueue: tas-cluster-queue
    podSetAssignments:
    - count: 1
      flavors:
        nvidia.com/gpu: tas-flavor
      name: master
      resourceUsage:
        nvidia.com/gpu: "8"
      topologyAssignment:
        domains:
        - count: 1
          values:
          - node-0201
        levels:
        - kubernetes.io/hostname
    - count: 15
      flavors:
        nvidia.com/gpu: tas-flavor
      name: worker
      resourceUsage:
        nvidia.com/gpu: "120"
      topologyAssignment:
        domains:
        - count: 1
          values:
          - node-0201
        - count: 1
          values:
          - node-0202
        - count: 1
          values:
          - node-0203
        - count: 1
          values:
          - node-0204
        - count: 1
          values:
          - node-0205
        - count: 1
          values:
          - node-0206
        - count: 1
          values:
          - node-0207
        - count: 1
          values:
          - node-0208
        - count: 1
          values:
          - node-0209
        - count: 1
          values:
          - node-0210
        - count: 1
          values:
          - node-0211
        - count: 1
          values:
          - node-0212
        - count: 1
          values:
          - node-0213
        - count: 1
          values:
          - node-0214
        - count: 1
          values:
          - node-0215
        levels:
        - kubernetes.io/hostname

PytorchJob:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    kueue.x-k8s.io/queue-name: tas-local-queue
  name: job-eowu-15
  namespace: tas
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 15
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            kueue.x-k8s.io/podset-preferred-topology: supernode
            kueue.x-k8s.io/workload: pytorchjob-job-eowu-15-1d286
          labels:
            kueue.x-k8s.io/podset: worker
            kueue.x-k8s.io/tas: "true"
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-21320b6
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                nvidia.com/gpu: "8"
          nodeSelector:
            topology-key/zone: zone1
          schedulingGates:
          - name: kueue.x-k8s.io/topology
          tolerations:
          - effect: NoSchedule
            key: kwok.x-k8s.io/node
            operator: Exists
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            kueue.x-k8s.io/podset-preferred-topology: supernode
            kueue.x-k8s.io/workload: pytorchjob-job-eowu-15-1d286
          labels:
            kueue.x-k8s.io/podset: master
            kueue.x-k8s.io/tas: "true"
        spec:
          containers:
          - command:
            - python3
            - /opt/pytorch-mnist/mnist.py
            - --epochs=1
            image: docker.io/kubeflowkatib/pytorch-mnist-cpu:v1beta1-21320b6
            imagePullPolicy: Always
            name: pytorch
            resources:
              limits:
                nvidia.com/gpu: "8"
          nodeSelector:
            topology-key/zone: zone1
          schedulingGates:
          - name: kueue.x-k8s.io/topology
          tolerations:
          - effect: NoSchedule
            key: kwok.x-k8s.io/node
            operator: Exists

@PBundyra
Copy link
Contributor

PBundyra commented Jan 3, 2025

Thanks for providing such a detailed explanation. We'll look into it

@kerthcet kerthcet linked a pull request Jan 7, 2025 that will close this issue
@mimowo
Copy link
Contributor

mimowo commented Jan 7, 2025

@kerthcet thank you for reporting the issue, I believe there is a bug indeed - the inflight usage from previously considered PodSets is not taken into account when computing placement for the new PodSet. I will leave more comments at your PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants