What happened:
When adjusting the spec of an `etcd` resource, Gardener waits for the change to be rolled out completely by consulting the `status` (see `CheckEtcdObject`).
Since Etcd-Druid changed to a multi-stage status update (probably in v0.23.0), those checks have become racy.
Please consider the following example (only relevant steps are listed):

1. `Secret` reference is updated.

Steps in `reconcileSpec`:

2. Druid updates the `StatefulSet`.
3. Druid updates `status.observedGeneration`.
4. Druid updates `status.lastOperation` and `status.lastErrors`.
5. Druid removes the `gardener.cloud/operation` annotation.

Steps in `reconcileStatus`:

6. Druid updates `status.ready` and similar fields based on the backing `StatefulSet`.
Between steps 5 and 6, there is no hint in the resource that points towards an ongoing rollout of a spec change. Controllers/clients might accidentally continue their operations. This recently happened in the scope of credentials rotation for gardener/gardener, where the new peer CA bundle was not completely rolled out to all replicas, but Gardener had already continued with the rotation, which led to a certificate mismatch in the etcd cluster.
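For illustration, here is a minimal Go sketch of a check in the spirit of `CheckEtcdObject`; the types and field names are simplified stand-ins, not the actual gardener/gardener code:

```go
package example

// Simplified, hypothetical stand-ins for the relevant Etcd fields; the real
// types live in github.com/gardener/etcd-druid/api/v1alpha1.
type EtcdStatus struct {
	ObservedGeneration int64
	Ready              bool
}

type Etcd struct {
	Generation int64
	Status     EtcdStatus
}

// checkEtcd approximates the kind of check performed after a spec change: it
// only consults the observed generation and the ready flag. Between steps 5
// and 6 above, status.observedGeneration already matches the new generation
// (updated in reconcileSpec) while status.ready still reflects the pre-update
// state (it is only refreshed later by reconcileStatus), so this check can
// report success while the rollout is still in progress.
func checkEtcd(etcd *Etcd) bool {
	return etcd.Status.ObservedGeneration == etcd.Generation && etcd.Status.Ready
}
```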
What you expected to happen:
Clients should be able to tell when a spec rollout has successfully finished.
After a discussion between @unmarshall and myself, we propose the following solution:
Introduce a new condition called `MembersUpdated` in the `Etcd` status, which tells consumers whether all spec changes have been rolled out to all etcd members. This allows consumers to wait until the rollout is complete and, in conjunction with the `AllMembersReady` condition, to deterministically wait for the `Etcd` spec changes to be fully reflected and the etcd cluster to be ready.
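A rough sketch of how the proposed condition could sit next to the existing ones in the API types; exact naming and placement are assumptions, only `MembersUpdated` itself is part of the proposal:

```go
package example

type ConditionType string

const (
	// ConditionTypeAllMembersReady is True when all etcd members report ready.
	ConditionTypeAllMembersReady ConditionType = "AllMembersReady"
	// ConditionTypeMembersUpdated (proposed) is set to False at the start of a
	// spec reconciliation and only flipped to True once the change has been
	// rolled out to all etcd members.
	ConditionTypeMembersUpdated ConditionType = "MembersUpdated"
)
```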
Consider the case where `spec.replicas` of an `Etcd` is increased, along with another spec change that causes the `StatefulSet` pod template to change. In such a scenario, two new replicas will be launched with the new spec, and only after they are ready will the first pod be recreated with the updated spec. While the two new members are coming up and not yet ready, the `AllMembersReady` condition is set to `False`. But as soon as the two new members are ready, the condition becomes `True`, even while the first pod is being recreated and is not yet ready. This is because the status is only updated once every few seconds, determined by the `etcd-status-sync-period` config flag. This eventual consistency could cause consumers to assume that all members are ready, even while the first member is still being rolled out and is not yet ready.
With the newly proposed `MembersUpdated` condition, which is set to `False` at the beginning of the spec reconciliation, we can be sure that it will be set to `True` only after the first pod has also been updated. Thus, using these two conditions in conjunction helps correctly determine whether the cluster is truly updated and ready.
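A minimal sketch of the resulting consumer-side check, assuming simplified condition types; only the two condition names come from this proposal:

```go
package example

// Illustrative condition shape; real consumers would use the Condition type
// from the etcd-druid API.
type ConditionStatus string

const ConditionTrue ConditionStatus = "True"

type Condition struct {
	Type   string
	Status ConditionStatus
}

// isFullyRolledOutAndReady shows how a consumer could combine the two
// conditions: MembersUpdated guarantees the spec change has reached every
// member, AllMembersReady guarantees all members are currently healthy.
func isFullyRolledOutAndReady(conds []Condition) bool {
	var updated, ready bool
	for _, c := range conds {
		switch c.Type {
		case "MembersUpdated":
			updated = c.Status == ConditionTrue
		case "AllMembersReady":
			ready = c.Status == ConditionTrue
		}
	}
	return updated && ready
}
```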
How to categorize this issue?
/area control-plane
/kind bug
/cc @shreyas-s-rao @LucaBernstein @dguendisch @hendrikKahl