Zonal Volume Requirements Break Topology Spread Constraints #1239

Open
jmdeal opened this issue May 7, 2024 · 6 comments · May be fixed by #1907
Labels: kind/bug, triage/accepted

Comments

@jmdeal
Member

jmdeal commented May 7, 2024

Description

Background

Over the past few months we've received a couple of issues in the aws-provider repository regarding TopologySpreadConstraints with StatefulSets, including aws/karpenter-provider-aws#6141 (discussed below).

In both cases the observed behavior was that Karpenter initially provisioned nodes as expected, but after the StatefulSet was scaled down and back up, Karpenter either provisioned nodes that violate* the TopologySpreadConstraint or was unable to provision new nodes at all (reportedly due to TSC violation).

* I use "violate" loosely here: the TSC may be respected from kube-scheduler's perspective, but Karpenter does not spread pods across all of the domains it is aware of.

Reproduction and Validation

Reproducing this issue is relatively straightforward. The following NodePools and StatefulSet were used:

NodePools:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-np
spec:
  disruption:
    consolidateAfter: 10m
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ['c', 'm', 'r']
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['on-demand']
        # Virtual capacity-spread domain, 2/3 of nodes should launch on on-demand
        - key: capacity-spread
          operator: In
          values: ['1', '2']
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-np
spec:
  disruption:
    consolidateAfter: 10m
    consolidationPolicy: WhenEmpty
    expireAfter: Never
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: karpenter.k8s.aws/instance-category
        operator: In
        values: ['c', 'm', 'r']
      - key: karpenter.sh/capacity-type
        operator: In
        values: ['spot']
      # Virtual capacity-spread domain, 1/3 of nodes should launch on spot
      - key: capacity-spread
        operator: In
        values: ['3']

StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: repro
spec:
  serviceName: repro # names the governing headless Service; spec.serviceName is a required field on this Kubernetes version
  replicas: 3
  selector:
    matchLabels:
      app.kubernetes.io/name: repro
  template:
    metadata:
      labels:
        app.kubernetes.io/name: repro
    spec:
      affinity:
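        # Hostname anti-affinity: at most one repro pod per node, so each replica requires its own node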
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: repro
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 100m
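      # Spread pods evenly across the virtual capacity-spread domains and across zones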
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: capacity-spread
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: "repro"
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: "repro"
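  # One PVC per replica; for zonal storage, the bound volume's zone constrains where the pod can run on later scale-ups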
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      labels:
        app.kubernetes.io/name: repro
      name: data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      volumeMode: Filesystem

Each NodePool specifies a different capacity type: on-demand and spot. Additionally, each NodePool constrains the virtual capacity-spread label, giving three possible values across the two pools (1 and 2 on on-demand, 3 on spot). The StatefulSet defines a TopologySpreadConstraint on this label: with 3 replicas and maxSkew: 1, each capacity-spread value receives one pod, so 2/3 of the pods schedule to on-demand nodes and the remaining 1/3 schedule to spot nodes.

Follow these steps to reproduce:

  • Apply the NodePools and StatefulSet to your cluster
  • Wait for the StatefulSet to finish deploying (all volumes must exist)
  • Scale the StatefulSet to 0
  • Scale the StatefulSet back up to 3

On the initial scale-up, Karpenter correctly provisions nodes: two on-demand nodes and one spot node, each in a different AZ (due to the zonal TSC on the StatefulSet). When we scale down and back up, Karpenter provisions three spot nodes, each in a different AZ. The pods are still able to schedule, since kube-scheduler only sees the single capacity-spread domain (3), but this isn't what we expect Karpenter to do.

While the pods were able to schedule in that scenario, we can construct one where Karpenter gets stuck and pods are left pending. Consider the following node distribution (node names encode capacity type and capacity-spread value):

Zone   Nodes (capacity type + capacity-spread value)
A      spot-3, od-2
B      od-1
C      spot-3

We now scale our StatefulSet up to 3 replicas, and pods schedule to the spot node in zone A and the on-demand node in zone B. The remaining pod stays pending while Karpenter continues to nominate it for the spot node in zone C. Scheduling it there would violate the capacity-spread TopologySpreadConstraint: the domain counts would be 1, 0, and 2 for capacity-spread values 1, 2, and 3, a skew of 2 between domains 2 and 3.

Diagnosis

During scheduling, Karpenter needs to consider not only the requirements on the pod itself, but also the requirements of the pod's volumes. These typically come in the form of an additional zonal requirement. Internally, Karpenter models this by adding a zonal node affinity term to the pod (in-memory only); a sketch of this injection follows. Unfortunately, it causes issues for TopologySpreadConstraints.
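
For illustration, suppose a pod's PVC is bound to a volume in zone us-west-2a (a placeholder zone for this sketch). Conceptually, the pod Karpenter schedules behaves as if a required node affinity term like the following had been added to its spec (a simplified sketch, not the literal implementation):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Injected from the zone of the pod's bound volume (placeholder value)
            - key: topology.kubernetes.io/zone
              operator: In
              values: ['us-west-2a']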

When computing skew across domains, only nodes that match a pod's NodeAffinity are counted by default. Since we inject a NodeAffinity term onto the pod after discovering its volume requirements, the nodes included when computing skew are limited to the nodes in that volume's zone. This lines up with the behavior seen in the reproduction. In the first case, where all of the nodes brought up on the second scale-up were spot, we only counted the single node in each pod's volume zone, so we computed a skew of 0 for each pod's TSC. There's a similar story in the second example, where we computed a skew of 1 for the pod in zone A, and 0 in zones B and C.
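
For reference, the "counted by default" behavior above corresponds to the nodeAffinityPolicy field on TopologySpreadConstraints, whose default (Honor) includes only nodes matching the pod's nodeAffinity/nodeSelector in skew calculations. The snippet below is only meant to make that default concrete; it is not a proposed workaround, and whether Karpenter's scheduler honors this field is a separate question.

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: capacity-spread
    whenUnsatisfiable: DoNotSchedule
    # With Ignore, all nodes are counted when computing skew, regardless of the
    # pod's nodeAffinity/nodeSelector; the default (Honor) counts only matching nodes.
    nodeAffinityPolicy: Ignore
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: "repro"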

The issue laid out in aws/karpenter-provider-aws#6141 has slightly different symptoms (Karpenter is unable to provision at all), but the same zonal injection behavior is to blame. In that issue there were 5 pods in the StatefulSet, each with zonal and hostname anti-affinity. When the StatefulSet was scaled down to 3 pods and back up to 5, Karpenter failed to provision additional nodes, believing the TSC would be violated. Since Karpenter only counts nodes in the same zone as the pod's volume when computing skew, but is aware of all zones, it believes it must create a node in a different zone to satisfy the TSC. Since the volume pins the pod to a single zone, that is impossible, and provisioning fails.

Versions:

  • Chart Version: 0.36.1
  • Kubernetes Version (kubectl version): 1.29

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jmdeal jmdeal added the kind/bug label May 7, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage label May 7, 2024
@jmdeal
Member Author

jmdeal commented May 7, 2024

/assign @jmdeal

@jmdeal
Member Author

jmdeal commented May 7, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted label and removed the needs-triage label May 7, 2024
@vadasambar
Member

Any idea when this might be fixed? This prevents us from adopting Karpenter.

@hrlimaye

@jmdeal Any update on when we can expect a resolution to this issue? We are already using Karpenter with 3 replicas for a StatefulSet but are planning to go to 5 replicas in the near future.

@jukie

jukie commented Oct 28, 2024

Also running into this. Have there been any discussions on how this should be implemented? I'm motivated to contribute a fix but would like to make sure it aligns.

@leoryu

leoryu commented Jan 6, 2025

@jmdeal Are there any plans to address this issue?

@leoryu leoryu linked a pull request Jan 9, 2025 that will close this issue