Failing to install nvidia drivers on a new GPU node on a fresh LTS Ubuntu 22.04 #504
Using docker image:
And
Using operator
Upgraded gpu-operator to the latest version
Shouldn't it be installed without compiling?
Swap is being used; around 12 GB of memory is used to compile. That's a lot, should it be like that? I know that it's not gpu-operator project code, just asking around. Looks like something is off there.
It is failing as well; here is the full log of the driver manager and the driver installer:
It keeps retrying that installation and always fails. I will try to install it on the latest non-LTS Ubuntu.
Ok, interesting, but with Ubuntu 22.10 it worked:
Even without driver installation, the driver was already there. I checked the Vultr logs from when the node was being created; they pre-install drivers before handing the node over.
There is now another problem, this time with the toolkit:
Logs from gpu-operator pod:
All other pods are failing to start due to this error:
They are just stuck in the PodInitializing state:
It seems that this is because the runtime toolkit is unsupported on Ubuntu 22.10 and only supports Ubuntu 22.04. But where is the error which states this and fails the runtime installation?
Ok, so one problem was found: it's that the docker image was trying to call
That helped to solve one problem; then, after all validations passed and all pods became green, I decided to restart all pods, and it failed again on the toolkit pod:
After a few containerd restarts and killing the toolkit pod, I managed to make it work... Very strange behavior...
Why is it shutting down before these lines?
Like here:
Did it fail to restart containerd? Why is there no error then? How is "Successfully signaled containerd" verified?
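To illustrate why a "successful" signal does not prove anything about the restart, here is a minimal Go sketch of signal-based restarting; the pid-file path and the SIGHUP signal are assumptions for the example, not necessarily what the toolkit actually sends. kill() returning success only means the signal was delivered, not that containerd reloaded or came back up:

```go
// Minimal sketch (not the toolkit's actual code) of signaling containerd by PID.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

func signalContainerd(pidFile string) error {
	// The pid-file path is an assumption for this example.
	data, err := os.ReadFile(pidFile)
	if err != nil {
		return fmt.Errorf("reading pid file: %w", err)
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return fmt.Errorf("parsing pid: %w", err)
	}
	// syscall.Kill returns nil as soon as the kernel accepts the signal;
	// it says nothing about whether containerd actually restarted.
	if err := syscall.Kill(pid, syscall.SIGHUP); err != nil {
		return fmt.Errorf("signaling containerd: %w", err)
	}
	fmt.Println("Successfully signaled containerd") // "success" here only means delivery
	return nil
}

func main() {
	if err := signalContainerd("/run/containerd/containerd.pid"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```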
Ok, so the linked issue is unrelated; it just can't restart containerd by sending
Node is using:
Ok, after digging into it, I found it in the sources: it seems that it is not selecting the systemd switch case, but is trying to signal containerd as if it were running as a standalone daemon, without an init system wrapper.
That's because it is defaulted here: https://github.com/NVIDIA/nvidia-container-toolkit/blob/main/tools/container/containerd/containerd.go#L49
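To make that concrete, here is a hypothetical sketch of such a switch; the mode names and the function are illustrative only, not copied from nvidia-container-toolkit. The point is that an empty or unknown mode falls into the signal path:

```go
// Hypothetical sketch of a restart-mode switch that defaults to signaling.
package main

import "fmt"

const (
	restartModeSystemd = "systemd"
	restartModeSignal  = "signal"
)

// restartContainerd picks a restart strategy based on the configured mode.
// If mode is empty or unrecognized, it falls into the default branch and
// treats containerd as a standalone daemon, even on a systemd-managed node.
func restartContainerd(mode string) {
	switch mode {
	case restartModeSystemd:
		fmt.Println(`restarting via "systemctl restart containerd"`)
	default:
		// covers restartModeSignal and the unset case
		fmt.Println("signaling the containerd process directly")
	}
}

func main() {
	// With no explicit mode configured, the default (signal) branch wins.
	restartContainerd("")
}
```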
I don't see any setting among the helm chart options to specify the method for the containerd restart.
I see that there is an env variable for that option, which is called
Yeah... When I added this env to the daemonset, it started working properly and without errors.
It would be much better if you tried both variants, or identified how containerd is started on that node. It shouldn't be that hard to identify: just query systemd for that service and its status; if both exist, then use systemd, otherwise pretend that there is no systemd (or no containerd systemd service) and restart it as a standalone daemon, as sketched below. Hope this helps everyone else who runs into the same problem in the future.
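A minimal sketch of that detection, assuming systemctl is reachable from wherever the check runs (inside a container it usually is not, unless the host rootfs is mounted or chrooted into); this illustrates the idea rather than the toolkit's implementation:

```go
// Sketch: decide between a systemd restart and direct signaling by asking
// systemd whether a containerd unit exists and is active.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// containerdManagedBySystemd returns true if `systemctl is-active containerd`
// reports "active". systemctl exits non-zero when the unit is missing or
// inactive (or when there is no systemd at all), which we treat as "no".
func containerdManagedBySystemd() bool {
	out, err := exec.Command("systemctl", "is-active", "containerd").Output()
	if err != nil {
		return false
	}
	return strings.TrimSpace(string(out)) == "active"
}

func main() {
	if containerdManagedBySystemd() {
		fmt.Println("restart containerd with: systemctl restart containerd")
	} else {
		fmt.Println("no systemd-managed containerd found; signal the daemon directly")
	}
}
```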
@denissabramovs even when run as a
@denissabramovs - Did you manage to install the driver using the operator successfully, or did you rely on the pre-installed driver on the node by disabling it? I think you disabled the driver via the operator, but I wanted to double check, as I am facing a similar issue. (Thank you for the detailed updates, it is definitely helping.)
I just ran into a similar problem. For me, the driver was not installed at all. Checking the labels, there was one that said:
Failing to install nvidia drivers on a new GPU node on a fresh LTS Ubuntu 22.04.
Logs are taken from the nvidia driver installation daemonset's pod nvidia-driver-daemonset-srf9k:
I'm more concerned about this:
I have checked; we have gcc installed on that machine, and it is exactly gcc 11.3.0:
and: