Zonal Volume Requirements Break Topology Spread Constraints #1239
Comments
/assign @jmdeal
/triage accepted
Any idea when this might be fixed? This prevents us from adopting Karpenter.
@jmdeal Any update on when we can expect a resolution to this issue? We are already using Karpenter with 3 replicas for a StatefulSet, but we plan to go to 5 replicas in the near future.
Also running into this. Have there been any discussions on how this should be implemented? I'm motivated to contribute a fix but would like to make sure it aligns.
@jmdeal Any plans for this issue?
Description
Background
Over the past few months we've received a couple of issues in the aws-provider repository regarding TopologySpreadConstraints with StatefulSets, including aws/karpenter-provider-aws#6141.
In both cases the observed behavior was that Karpenter initially provisioned nodes as expected, but after scaling down and back up, Karpenter either provisions nodes that violate* the TopologySpreadConstraint or is unable to provision new nodes at all (allegedly due to a TSC violation).
* I use violate loosely here: the TSC may be respected from kube-scheduler's perspective, but Karpenter does not spread pods across all the domains it is aware of.
Reproduction and Validation
Reproducing this issue is relatively straightforward. The following NodePools and StatefulSet were used:
NodePools:
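A minimal sketch of the two NodePools, consistent with the description below. The pool names, the nodeClassRef, and the exact capacity-type values assigned to each pool are assumptions for illustration; the original manifests are not reproduced here.

```yaml
# Sketch: two NodePools, one on-demand and one spot.
# The on-demand pool can label nodes capacity-type 1 or 2; the spot pool uses 3.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand
spec:
  template:
    spec:
      nodeClassRef:
        name: default        # assumed EC2NodeClass name
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: capacity-type # custom spread label described below
          operator: In
          values: ["1", "2"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot
spec:
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: capacity-type
          operator: In
          values: ["3"]
```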
StatefulSet:
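A minimal sketch of the StatefulSet. The name, image, replica count, and StorageClass are assumptions; the two TopologySpreadConstraints and the zonal volume claim match what the reproduction describes.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: inflate            # assumed name
spec:
  serviceName: inflate
  replicas: 3              # assumed; scenario 2 below scales to 3 replicas
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      topologySpreadConstraints:
        # Spread across the custom capacity-type domains (1, 2, 3).
        - maxSkew: 1
          topologyKey: capacity-type
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inflate
        # The "other TSC" mentioned in the reproduction: spread across zones.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inflate
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7  # assumed image
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ebs-gp3   # assumed zonal, WaitForFirstConsumer StorageClass
        resources:
          requests:
            storage: 1Gi
```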
Each NodePool specifies a different capacity type, spot and on-demand. Additionally, each NodePool specifies a capacity-type label, with three possible values across the two pools (1, 2, or 3). The StatefulSet defines a TopologySpreadConstraint on this label, which is meant to ensure that 2/3 of the pods schedule to on-demand nodes and the remaining 1/3 schedule to spot nodes. To reproduce, scale the StatefulSet up, scale it down to zero, then scale it back up.
On the initial scale-up, Karpenter correctly provisions nodes: two on-demand nodes and one spot node, each in a different AZ (due to another TSC on the StatefulSet). When we scale down and scale back up, Karpenter provisions three spot nodes, each in a different AZ. The pods are able to schedule, since kube-scheduler is only aware of the single domain (3), but this isn't what we expect Karpenter to do.
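For illustration, the nodes from the second scale-up could look like the following (node names and zones are assumptions). All three share capacity-type: "3", so from kube-scheduler's point of view there is only one domain and no skew to violate.

```yaml
# Relevant labels on the nodes after the second scale-up (illustrative).
- metadata:
    name: node-a
    labels:
      topology.kubernetes.io/zone: us-west-2a
      karpenter.sh/capacity-type: spot
      capacity-type: "3"
- metadata:
    name: node-b
    labels:
      topology.kubernetes.io/zone: us-west-2b
      karpenter.sh/capacity-type: spot
      capacity-type: "3"
- metadata:
    name: node-c
    labels:
      topology.kubernetes.io/zone: us-west-2c
      karpenter.sh/capacity-type: spot
      capacity-type: "3"
```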
While the pods in that previous scenario were able to schedule, we can create a scenario where Karpenter gets stuck and pods are left pending. Consider the following node distribution:
- Zone A: spot node (capacity-type: 3)
- Zone B: on-demand node (capacity-type: 1)
- Zone C: spot node (capacity-type: 3)
We now scale up our StatefulSet to 3 replicas, and pods schedule to the spot node in zone A and the on-demand node in zone B. The remaining pod stays pending while Karpenter continues to nominate it for the spot node in zone C; placing it there would violate the TopologySpreadConstraint with a skew of 2 between domains 2 and 3.
Diagnosis
During scheduling, Karpenter needs to consider not only the requirements on the pod itself, but also the requirements of the pod's volumes. These typically come in the form of an additional zonal requirement. Internally, Karpenter models this by adding a zonal node affinity term to the pod (in-memory only). Unfortunately, this causes issues for TopologySpreadConstraints.
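Conceptually, once a pod's PersistentVolume pins it to a zone, the pod Karpenter schedules against looks roughly like the sketch below. The pod name and zone are assumptions, and the affinity term exists only in memory; the real API object is not mutated.

```yaml
# Illustrative only: what the pod effectively looks like to Karpenter's
# scheduler after the volume's zone requirement has been injected.
apiVersion: v1
kind: Pod
metadata:
  name: inflate-2          # assumed pod name
  labels:
    app: inflate
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-west-2a"]   # zone of the pod's bound PersistentVolume
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: capacity-type
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: inflate
```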
When computing skew across domains, only nodes that match a pod's NodeAffinity are counted by default. Since we inject a NodeAffinity term onto the pod after discovering its volume requirements, the nodes included when computing skew are limited to those in the volume's zone. This lines up with the behavior seen in the reproduction. In the first case, where all nodes brought up in the second scale-up were spot, we only counted the single node in each zone for each pod, so we believed there was a skew of 0 for each pod's TSC. There's a similar story in the second example, where we believed there was a skew of 1 for the pod in zone A, 0 in B, and 0 in C.
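This default counting behavior corresponds to the constraint's nodeAffinityPolicy field, which is Honor when unset. Making it explicit (an illustrative sketch, using the same assumed names as above) shows why the injected zonal term shrinks the set of nodes counted for skew:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: capacity-type
    whenUnsatisfiable: DoNotSchedule
    nodeAffinityPolicy: Honor   # the default: only nodes matching the pod's
                                # nodeAffinity/nodeSelector count toward skew.
                                # With the injected zonal term, that is only the
                                # node in the volume's zone, so each pod appears
                                # to have a skew of 0 for this constraint.
    labelSelector:
      matchLabels:
        app: inflate
```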
The issue laid out in aws/karpenter-provider-aws#6141 is a little different in terms of symptoms (unable to provision), but the same zonal injection behavior is to blame. In that issue there were 5 pods in the StatefulSet, each with zonal and hostname anti-affinity. When the StatefulSet was scaled down to 3 pods and back to 5, Karpenter failed to provision additional nodes, believing the TSC would be violated. Since Karpenter only counts nodes in the same zone as the volume when computing skew, yet it is aware of all zones, it believes it must create a node in a different zone to satisfy the TSC. Since this is not possible, we fail to provision.
Versions:
Kubernetes Version (kubectl version): 1.29