Zonal Volume Requirements Break Topology Spread Constraints #1239

Open
jmdeal opened this issue May 7, 2024 · 6 comments · May be fixed by #1907
Labels: kind/bug, triage/accepted

Comments

@jmdeal
Member

jmdeal commented May 7, 2024

Description

Background

Over the past few months we've received a couple of issues in the aws-provider repository regarding TopologySpreadConstraints with StatefulSets, including aws/karpenter-provider-aws#6141 (discussed below).

In both cases the observed behavior was that Karpenter initially provisioned nodes as expected, but after the StatefulSet was scaled down and back up, Karpenter either provisioned nodes that violate* the TopologySpreadConstraint or was unable to provision new nodes at all (reportedly due to TSC violation).

* I use "violate" loosely here: the TSC may be respected from kube-scheduler's perspective, but Karpenter does not spread pods across all of the domains it is aware of.

Reproduction and Validation

Reproducing this issue is relatively straightforward. The following NodePools and StatefulSet were used:

NodePools:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-np
spec:
  disruption:
    consolidateAfter: 10m
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        # Virtual capacity-spread domain, 2/3 of nodes should launch on on-demand
        - key: capacity-spread
          operator: In
          values: ['1', '2']
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-np
spec:
  disruption:
    consolidateAfter: 10m
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ['c', 'm', 'r']
      - key: karpenter.sh/capacity-type
        operator: In
        values: ['spot']
      # Virtual capacity-spread domain, 1/3 of nodes should launch on spot
      - key: capacity-spread
        operator: In
        values: ['3']

StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: repro
spec:
  serviceName: repro # names the governing headless Service; spec.serviceName is a required field on this Kubernetes version
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: repro
  template:
    metadata:
      labels:
        app.kubernetes.io/name: repro
    spec:
      affinity:
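        # Hostname anti-affinity: at most one repro pod per node, so each replica requires its own node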
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: repro
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 100m
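      # Spread pods evenly across the virtual capacity-spread domains and across zones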
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: capacity-spread
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: "repro"
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: "repro"
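  # One PVC per replica; for zonal storage, the bound volume's zone constrains where the pod can run on later scale-ups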
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      labels:
        app.kubernetes.io/name: repro
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      volumeMode: Filesystem

Each NodePool specifies a different capacity type: on-demand and spot. Additionally, each NodePool constrains the virtual capacity-spread label, giving three possible values across the two pools (1 and 2 on on-demand, 3 on spot). The StatefulSet defines a TopologySpreadConstraint on this label: with 3 replicas and maxSkew: 1, each capacity-spread value receives one pod, so 2/3 of the pods schedule to on-demand nodes and the remaining 1/3 schedule to spot nodes.

Follow these steps to reproduce:

  • Apply the NodePools and StatefulSet to your cluster
  • Wait for the StatefulSet to finish deploying (all volumes must exist)
  • Scale the StatefulSet to 0
  • Scale the StatefulSet back up to 3

On the initial scale-up, Karpenter correctly provisions nodes: two on-demand nodes and one spot node, each in a different AZ (due to the zonal TSC on the StatefulSet). When we scale down and back up, Karpenter provisions three spot nodes, each in a different AZ. The pods are still able to schedule, since kube-scheduler only sees the single capacity-spread domain (3), but this isn't what we expect Karpenter to do.

While the pods were able to schedule in that scenario, we can construct one where Karpenter gets stuck and pods are left pending. Consider the following node distribution (node names encode capacity type and capacity-spread value):

Zone   Nodes (capacity type + capacity-spread value)
A      spot-3, od-2
B      od-1
C      spot-3

We now scale our StatefulSet up to 3 replicas, and pods schedule to the spot node in zone A and the on-demand node in zone B. The remaining pod stays pending while Karpenter continues to nominate it for the spot node in zone C. Scheduling it there would violate the capacity-spread TopologySpreadConstraint: the domain counts would be 1, 0, and 2 for capacity-spread values 1, 2, and 3, a skew of 2 between domains 2 and 3.

Diagnosis

During scheduling, Karpenter needs to consider not only the requirements on the pod itself, but also the requirements of the pod's volumes. These typically come in the form of an additional zonal requirement. Internally, Karpenter models this by adding a zonal node affinity term to the pod (in-memory only); a sketch of this injection follows. Unfortunately, it causes issues for TopologySpreadConstraints.
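
For illustration, suppose a pod's PVC is bound to a volume in zone us-west-2a (a placeholder zone for this sketch). Conceptually, the pod Karpenter schedules behaves as if a required node affinity term like the following had been added to its spec (a simplified sketch, not the literal implementation):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Injected from the zone of the pod's bound volume (placeholder value)
            - key: topology.kubernetes.io/zone
              operator: In
              values: ['us-west-2a']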

When computing skew across domains, only nodes that match a pod's NodeAffinity are counted by default. Since we inject a NodeAffinity term onto the pod after discovering its volume requirements, the nodes included when computing skew are limited to the nodes in that volume's zone. This lines up with the behavior seen in the reproduction. In the first case, where all of the nodes brought up on the second scale-up were spot, we only counted the single node in each pod's volume zone, so we computed a skew of 0 for each pod's TSC. There's a similar story in the second example, where we computed a skew of 1 for the pod in zone A, and 0 in zones B and C.
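
For reference, the "counted by default" behavior above corresponds to the nodeAffinityPolicy field on TopologySpreadConstraints, whose default (Honor) includes only nodes matching the pod's nodeAffinity/nodeSelector in skew calculations. The snippet below is only meant to make that default concrete; it is not a proposed workaround, and whether Karpenter's scheduler honors this field is a separate question.

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: capacity-spread
    whenUnsatisfiable: DoNotSchedule
    # With Ignore, all nodes are counted when computing skew, regardless of the
    # pod's nodeAffinity/nodeSelector; the default (Honor) counts only matching nodes.
    nodeAffinityPolicy: Ignore
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: "repro"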

The issue laid out in aws/karpenter-provider-aws#6141 has slightly different symptoms (Karpenter is unable to provision at all), but the same zonal injection behavior is to blame. In that issue there were 5 pods in the StatefulSet, each with zonal and hostname anti-affinity. When the StatefulSet was scaled down to 3 pods and back up to 5, Karpenter failed to provision additional nodes, believing the TSC would be violated. Since Karpenter only counts nodes in the same zone as the pod's volume when computing skew, but is aware of all zones, it believes it must create a node in a different zone to satisfy the TSC. Since the volume pins the pod to a single zone, that is impossible, and provisioning fails.

Versions:

  • Chart Version: 0.36.1
  • Kubernetes Version (kubectl version): 1.29

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jmdeal jmdeal added the kind/bug label May 7, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage label May 7, 2024
@jmdeal
Member Author

jmdeal commented May 7, 2024

/assign @jmdeal

@jmdeal
Member Author

jmdeal commented May 7, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted label and removed the needs-triage label May 7, 2024
@vadasambar
Member

Any idea when this might be fixed? This prevents us from adopting Karpenter.

@hrlimaye

@jmdeal Any update on when we can expect a resolution to this issue? We are already using Karpenter with 3 replicas for a StatefulSet but are planning to go to 5 replicas in the near future.

@jukie

jukie commented Oct 28, 2024

Also running into this. Have there been any discussions on how this should be implemented? I'm motivated to contribute a fix but would like to make sure it aligns.

@leoryu

leoryu commented Jan 6, 2025

@jmdeal Are there any plans to address this issue?

@leoryu leoryu linked a pull request Jan 9, 2025 that will close this issue