
Possible incompatibility with cpumanager, memorymanager, or topologymanager. #455

Open
benlsheets opened this issue Dec 2, 2022 · 3 comments

Comments

@benlsheets

1. Quick Debug Checklist

Ubuntu 22.04.1
k8s 1.25.4 - via kubeadm
containerd.io 1.6.8 - held to avoid #432

1. Issue or feature description

Kubelets are configured with cpumanager, memorymanager, and topologymanager enabled. Everything works initially, but at some point following a reboot and/or kernel upgrade on a GPU node the kubelet fails to start with "Memory states for the NUMA node and resource are different" and "Invalid state, please drain node and remove policy state file".
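For context, this is a sketch of the KubeletConfiguration fields involved. The values here are illustrative assumptions, not the exact config from this cluster; note that with the Static memory manager policy, the per-NUMA reservedMemory must add up to systemReserved + kubeReserved + the hard eviction threshold, or the kubelet refuses to start.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
memoryManagerPolicy: Static
topologyManagerPolicy: single-numa-node
systemReserved:
  memory: 1Gi
kubeReserved:
  memory: 100Mi
evictionHard:
  memory.available: 100Mi
reservedMemory:
  # 1Gi + 100Mi + 100Mi = 1224Mi; must match the reservations above exactly.
  - numaNode: 0
    limits:
      memory: 1224Mi
```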

2. Steps to reproduce the issue

I use kubeadm to bring up the cluster with cpumanager, memorymanager, and topologymanager enabled in the KubeletConfiguration. I use the Helm chart to install the network-operator, and the Helm chart to install the gpu-operator. I've run into this issue both with and without the network-operator installed.
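Roughly, the setup looks like this (a sketch; the kubeadm config filename, namespaces, and Helm repo for the network-operator are assumptions, not the exact commands I ran):

```shell
# Bring up the cluster; the KubeletConfiguration (with the three
# managers enabled) is embedded in this kubeadm config file.
kubeadm init --config kubeadm-config.yaml

# Install the gpu-operator from NVIDIA's Helm repo.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace

# Optionally install the network-operator (the issue reproduces
# with or without it).
helm install --wait network-operator nvidia/network-operator \
  -n nvidia-network-operator --create-namespace
```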

3. Information to attach (optional if deemed irrelevant)

kubectl.txt
kubelet.log

I know there's a lot going on here, and most of it is unrelated to the gpu-operator. I only see this problem on nodes with a GPU installed, though. Let me know if you decide to dig into it and whether I can help.

Anybody reading this with the same issue: try removing /var/lib/kubelet/memory_manager_state on the affected node. That at least gets the kubelet back up and running.
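A sketch of that workaround (the node name is a placeholder; since this deletes the memory manager's checkpoint file, drain the node first, as the kubelet error message itself suggests):

```shell
# Drain the affected node before touching kubelet state.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# On the node: stop the kubelet, remove the stale state file, restart.
sudo systemctl stop kubelet
sudo rm /var/lib/kubelet/memory_manager_state
sudo systemctl start kubelet

# Re-admit workloads once the kubelet is healthy again.
kubectl uncordon <node-name>
```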

@shivamerla
Contributor

cc @klueska

@benlsheets
Author

Update:
I think this is caused by memory being reserved by the nouveau driver before the operator disables it. Blacklisting the nouveau driver seems to fix the issue.
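For anyone else hitting this, the usual way to blacklist nouveau on Ubuntu is a modprobe config fragment (a sketch; the filename is conventional, and an initramfs rebuild plus reboot is needed so the blacklist also applies during early boot):

```conf
# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
```

After writing this file, run `sudo update-initramfs -u` and reboot the node.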

@klueska
Contributor

klueska commented Jan 3, 2023

That seems like a reasonable explanation to me.
