gpu driver is in init state after rebooting the gpu node #566
Comments
@alloydm thanks for reporting this. When |
@alloydm there are a couple of ways this can be mitigated.
We will add a fix to avoid |
@shivamerla I am attaching the Kubernetes documentation on why this is happening. We are not hitting this issue during upgrades, because there is an option in the driver env to forcefully delete the user GPU pod. Can we have that force-delete env here too? |
1. Quick Debug Information
2. Issue or feature description
I have a Kubernetes cluster with the GPU Operator (23.3.2) installed on a Tesla P4 GPU node. I am running a Kubeflow-based Jupyter notebook that consumes the GPU on that node. The notebook pod (controlled by a StatefulSet) also has PersistentVolumeClaims attached to it.
Whenever the GPU node is rebooted, the driver-daemonset pod gets stuck in the init stage: the k8s-driver-manager container hangs while evicting the Kubeflow Jupyter notebook pod. Only when we forcefully delete the notebook pod does the driver daemonset proceed with execution:
kubectl delete pod juypter-nb --force --grace-period=0
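To avoid typing the force-delete by hand for each stuck pod, the workaround above could be sketched as a small helper. This is only a sketch under assumptions: it assumes the stuck pods show `Terminating` in the STATUS column of `kubectl get pods -A --no-headers`, and the function name `print_force_delete_cmds` and the node name `gpu-node-1` are hypothetical. It prints the commands rather than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: read `kubectl get pods -A --no-headers` output on stdin
# (columns: NAMESPACE NAME READY STATUS RESTARTS AGE) and print, without
# executing, a force-delete command for every pod whose STATUS is Terminating.
# Assumption: the pods blocking the driver manager appear as Terminating.
print_force_delete_cmds() {
  awk '$4 == "Terminating" {
    printf "kubectl delete pod %s -n %s --force --grace-period=0\n", $2, $1
  }'
}
```

A possible usage, restricted to the rebooted node (replace the hypothetical `gpu-node-1` with the real node name): `kubectl get pods -A --no-headers --field-selector spec.nodeName=gpu-node-1 | print_force_delete_cmds`. Piping the result to `sh` would execute the deletes; note that force deletion of StatefulSet pods skips the usual termination guarantees, so review the list before doing so.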
I have attached the environment variables that I set for the k8s-driver-manager container.
3. Steps to reproduce the issue
4. Information to attach (optional if deemed irrelevant)
kubectl get po -n gpu-operator
kubectl logs nvidia-driver-daemonset-stmk7 -n gpu-operator -f -c k8s-driver-manager
kubectl get pod -n admin