Machine fails to finish draining/volume detachment after successful completion #11591
Labels
kind/bug
Categorizes issue or PR as related to a bug.
priority/important-soon
Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted
Indicates an issue or PR is ready to be actively worked on.
What steps did you take and what happened?
After upgrading CAPI to 1.9, we observed an issue with the CAPRKE2 provider.
RKE2 uses kubelet local mode by default, so the etcd membership management logic behaves as it does for k/k 1.32 in kubeadm.
The problem causes loss of API server access after the etcd member is removed, which makes it impossible to proceed with infrastructure machine deletion.
In RKE2 deployments, the kubelet is configured to use the local API server (127.0.0.1:443), which in turn relies on the local etcd pod. Once the node is removed from the etcd cluster, the kubelet can no longer reach the API, so the node cannot be drained properly: all pods remain stuck in the Terminating state from the Kubernetes perspective.
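The dependency chain described above can be illustrated with a toy model. This is only a sketch of the ordering problem, not Cluster API or CAPRKE2 code: the `ControlPlaneNode` class, the `delete_machine` function, and the pod names are all hypothetical, and the model assumes that the local API server stops serving as soon as the local etcd member is removed.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlaneNode:
    """Hypothetical model of a control plane node running kubelet in local mode."""
    name: str
    etcd_member: bool = True  # local etcd is still part of the etcd cluster
    terminating_pods: list = field(default_factory=list)

    def local_api_reachable(self) -> bool:
        # Assumption: the local API server (127.0.0.1) only serves
        # requests while the local etcd is a cluster member.
        return self.etcd_member

    def kubelet_confirm_terminations(self) -> int:
        """Kubelet reports finished pods through the local API server.

        Returns the number of pods still stuck in Terminating.
        """
        if self.local_api_reachable():
            self.terminating_pods.clear()
        return len(self.terminating_pods)


def delete_machine(node: ControlPlaneNode, remove_etcd_first: bool) -> bool:
    """Simulate machine deletion; return True if the drain completes."""
    # Evicted pods wait for the kubelet to confirm their termination.
    node.terminating_pods = ["app-0", "app-1"]
    if remove_etcd_first:
        node.etcd_member = False  # member removed before the drain finishes
    stuck = node.kubelet_confirm_terminations()
    if not remove_etcd_first:
        node.etcd_member = False  # member removed only after the drain
    return stuck == 0


if __name__ == "__main__":
    print(delete_machine(ControlPlaneNode("cp-1"), remove_etcd_first=True))
    print(delete_machine(ControlPlaneNode("cp-2"), remove_etcd_first=False))
```

In this model the drain only completes when the etcd member is removed after the kubelet has confirmed pod termination, which matches the behavior reported here: removing the member first leaves the pods in Terminating.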
Logs from the cluster:
What did you expect to happen?
Draining and volume detachment should succeed, and the machine should be deleted without issues.
Cluster API version
v1.9.0
Kubernetes version
v1.29.2 - management
v1.31.0 - workload
Anything else you would like to add?
Logs from CI run with all details: https://github.com/rancher/cluster-api-provider-rke2/actions/runs/12372669685/artifacts/2332172988
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.