1. Quick Debug Checklist
Ubuntu 22.04.1
k8s 1.25.4 - via kubeadm
containerd.io 1.6.8 - held to avoid #432
1. Issue or feature description
Kubelets are configured with the CPU, memory, and topology managers enabled. Everything works initially, but at some point after a reboot and/or kernel upgrade on a GPU node the kubelet fails to start with "Memory states for the NUMA node and resource are different" and "Invalid state, please drain node and remove policy state file".
2. Steps to reproduce the issue
I use kubeadm to bring up the cluster with the CPU, memory, and topology managers enabled in the KubeletConfiguration. I install the network-operator and the gpu-operator via their Helm charts. I've run into this issue with and without the network-operator installed.
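For reference, a minimal KubeletConfiguration sketch with these managers enabled might look like the following. All values (reserved CPUs, memory sizes, policies) are illustrative assumptions, not taken from the report:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
memoryManagerPolicy: Static
topologyManagerPolicy: single-numa-node
reservedSystemCPUs: "0,1"
# reservedMemory must add up to kubeReserved + systemReserved + the hard
# eviction threshold for memory, per the Memory Manager requirements.
kubeReserved:
  memory: 500Mi
systemReserved:
  memory: 424Mi
evictionHard:
  memory.available: 100Mi
reservedMemory:
  - numaNode: 0
    limits:
      memory: 1024Mi
```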
3. Information to attach (optional if deemed irrelevant)
kubectl.txt
kubelet.log
I know there's a lot going on here and most of it is unrelated to the gpu-operator. I only have this problem on nodes with a GPU installed though. Let me know if you decide to dig into it and if I can help.
Anybody reading this with the same issue: try `rm /var/lib/kubelet/memory_manager_state` on the affected node. That at least gets the kubelet back up and running.
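A fuller recovery sequence, following the "please drain node" hint in the error, might look like this. The node name is a placeholder, and removing `cpu_manager_state` as well is my assumption, not something the report confirms is needed:

```shell
# Drain the node first so workloads are rescheduled elsewhere
kubectl drain gpu-node-1 --ignore-daemonsets --delete-emptydir-data

# On the affected node: stop the kubelet and remove the stale policy state
sudo systemctl stop kubelet
sudo rm /var/lib/kubelet/memory_manager_state
sudo rm /var/lib/kubelet/cpu_manager_state   # assumption: may also be stale
sudo systemctl start kubelet

# Re-enable scheduling once the kubelet is healthy again
kubectl uncordon gpu-node-1
```

The kubelet regenerates both state files on startup, so removing them is safe once the node is drained.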
Update:
I think this might be caused by memory reserved by the nouveau driver before the operator disables it. Blacklisting the nouveau driver seems to fix the issue.
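For anyone who wants to try the workaround above, the conventional way to blacklist nouveau on Ubuntu is a modprobe blacklist file plus an initramfs rebuild. The file name below is conventional, not taken from the report:

```shell
# Blacklist nouveau so it never claims the GPU at boot
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Rebuild the initramfs so the blacklist takes effect at boot, then reboot
sudo update-initramfs -u
sudo reboot
```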