1. Quick Debug Checklist
Ubuntu 22.04.1
k8s 1.25.4 - via kubeadm
containerd.io 1.6.8 - held to avoid #432
1. Issue or feature description
Kubelets are configured with the CPU, memory, and topology managers enabled. Everything works initially, but at some point after a reboot and/or kernel upgrade on a GPU node the kubelet fails to start with "Memory states for the NUMA node and resource are different" and "Invalid state, please drain node and remove policy state file".
2. Steps to reproduce the issue
I use kubeadm to bring up the cluster with the CPU, memory, and topology managers enabled in the KubeletConfiguration. I install the network-operator and the gpu-operator via their Helm charts. I've run into this issue with and without the network-operator installed.
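For reference, a minimal KubeletConfiguration sketch with these managers enabled might look like the following. All values (reserved CPUs, memory sizes, policies) are illustrative assumptions, not taken from the report:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
memoryManagerPolicy: Static
topologyManagerPolicy: single-numa-node
reservedSystemCPUs: "0,1"
# reservedMemory must add up to kubeReserved + systemReserved + the hard
# eviction threshold for memory, per the Memory Manager requirements.
kubeReserved:
  memory: 500Mi
systemReserved:
  memory: 424Mi
evictionHard:
  memory.available: 100Mi
reservedMemory:
  - numaNode: 0
    limits:
      memory: 1024Mi
```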
3. Information to attach (optional if deemed irrelevant)
kubectl.txt
kubelet.log
I know there's a lot going on here and most of it is unrelated to the gpu-operator. I only have this problem on nodes with a GPU installed though. Let me know if you decide to dig into it and if I can help.
Anybody reading this with the same issue: try `rm /var/lib/kubelet/memory_manager_state` on the affected node. That at least gets the kubelet back up and running.
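A fuller recovery sequence, following the "please drain node" hint in the error, might look like this. The node name is a placeholder, and removing `cpu_manager_state` as well is my assumption, not something the report confirms is needed:

```shell
# Drain the node first so workloads are rescheduled elsewhere
kubectl drain gpu-node-1 --ignore-daemonsets --delete-emptydir-data

# On the affected node: stop the kubelet and remove the stale policy state
sudo systemctl stop kubelet
sudo rm /var/lib/kubelet/memory_manager_state
sudo rm /var/lib/kubelet/cpu_manager_state   # assumption: may also be stale
sudo systemctl start kubelet

# Re-enable scheduling once the kubelet is healthy again
kubectl uncordon gpu-node-1
```

The kubelet regenerates both state files on startup, so removing them is safe once the node is drained.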
Update:
I think this might be caused by memory reserved by the nouveau driver before the operator disables it. Blacklisting the nouveau driver seems to fix the issue.
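For anyone who wants to try the workaround above, the conventional way to blacklist nouveau on Ubuntu is a modprobe blacklist file plus an initramfs rebuild. The file name below is conventional, not taken from the report:

```shell
# Blacklist nouveau so it never claims the GPU at boot
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Rebuild the initramfs so the blacklist takes effect at boot, then reboot
sudo update-initramfs -u
sudo reboot
```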