Failed to get sandbox runtime: no runtime for nvidia is configured #432
When I launch the nvidia/cuda image via the containerd CLI, it correctly detects and reports my NVIDIA GeForce video card, but for some reason the GPU is not visible inside pods deployed via Helm.
Can you run |
I was checking |
you can disable toolkit as well by editing |
Can you also paste logs of |
Nope, that didn't help. I updated it and the pod was removed, but it is still complaining about:
I have removed all pods to trigger everything from scratch.
Here are the errors from the systemd containerd logs:
Here is the updated, latest one:
At least containerd is no longer constantly restarting; it has already been up for 9 minutes:
All 3 systemd services are up and running on GPU node:
Sorry, missed your message. Here it is:
@denissabramovs this is a wild guess: are you using containerd 1.6.9? I believe we had problems with this version and the operator. We downgraded to containerd 1.6.8 and things started working again. |
Killed/re-scheduled all pods in gpu-operator namespace after downgrading containerd. |
Oh wow! @wjentner you were actually right: I re-enabled the above-mentioned toolkit and, after the downgrade, it finished without problems and all pods are up and running now!
Good thing I captured both logs, @shivamerla; adding them below.
These logs are from the failing toolkit:
These are from the successful toolkit:
I hope this helps to find and resolve the problem. The two logs do seem to differ after all.
Thanks @denissabramovs, I will check these out and try to reproduce with containerd 1.6.9.
If you are not able to reproduce it, please ping me and I'll try to reproduce it locally again. Then we could catch the issue and possibly put a patch together.
Issue diagnosed and workaround MR can be found here: |
kind upgraded its containerd version to 1.9, which triggered issues with gpu-operator (see issue NVIDIA/gpu-operator#432), so we pinned the kind version that ships containerd 1.8; this also fixes the GPU installation
@klueska thanks! When will this be released? I assume it has also been tested with containerd 1.6.10, which was released recently?
Thanks @cdesiniotis, I can confirm that it works with containerd 1.6.12 as well. |
It seems I have exactly the same issue on CentOS 7.9.2009: my nvidia-driver-daemonset is looping. It failed after
If I downgrade containerd to 1.6.8, everything is fixed.
There is another issue with containerd: if containerd is restarted (version 1.6.9 and above), most pods are restarted too. Together with the NVIDIA container toolkit pod, they end up in an endless restart loop: the toolkit tries to restart containerd, which restarts the toolkit and the driver, and everything loops again. There is a fix for containerd, but it may not have landed everywhere yet. @tuxtof, I think you are hitting exactly this issue.
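A minimal sketch for spotting the restart loop described above. It assumes the operator pods live in the `gpu-operator` namespace (as mentioned earlier in this thread) and that 5 restarts is a reasonable "looping" threshold, which is an arbitrary illustrative value. The helper name `flag_restart_loops` is hypothetical; it only filters `name<TAB>restartCount` lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "pod-name<TAB>restart-count" lines on stdin and
# print the names of pods whose restart count is >= the given threshold.
flag_restart_loops() {
  awk -F'\t' -v t="$1" '($2 + 0) >= (t + 0) { print $1 }'
}

# Usage on a cluster (not executed here; needs kubectl and cluster access).
# The jsonpath below only looks at the first container of each pod.
# kubectl get pods -n gpu-operator \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
#   | flag_restart_loops 5
```

If the toolkit and driver pods both keep showing up in the output while containerd itself keeps restarting, that would be consistent with the loop described in the comment above.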
thanks @xhejtman for linking the relevant issue. |
Thanks @xhejtman. So what is the situation? Does the GPU operator no longer work with containerd 1.6.9 and above?
I am no longer experiencing the issue once upgrading to containerd 1.6.15. Containerd 1.6.15 contains the fix to |
OK, I confirm that the freshly released Docker RPM of containerd 1.6.15 fixes the issue on my side too. Nice!
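Going only by the reports in this thread (1.6.8 works, 1.6.9 breaks, 1.6.15 works again), a node's containerd version can be checked against that range with a small sketch like the one below. The range boundaries are taken from the comments above, not from containerd release notes, and the function names are hypothetical. `sort -V` is assumed to be available (GNU coreutils).

```shell
#!/usr/bin/env bash
# Returns 0 when version $1 >= version $2, using version-aware sorting.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Returns 0 when the version falls in [1.6.9, 1.6.15), the range that
# commenters in this thread reported as problematic.
in_reported_range() {
  version_ge "$1" "1.6.9" && ! version_ge "$1" "1.6.15"
}

# Usage on a node (not executed here). Assumes the version is the third
# field of `containerd --version`, optionally prefixed with "v".
# ver="$(containerd --version | awk '{print $3}' | sed 's/^v//')"
# if in_reported_range "$ver"; then echo "containerd $ver is in the reported range"; fi
```

Note that a later comment reports a similar error on 1.6.24 with a different cause, so a version outside this range does not rule out a misconfigured runtime.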
I am currently having this issue with RHEL 8.8, rke2, containerd 1.6.24.
The following seems to function properly as long as runtime is default or set to runc, but if the runtime is set to nvidia, there is an error:
@msherm2 did you configure the container-toolkit correctly for RKE2 as documented here?
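One quick way to check whether the toolkit actually wrote an `nvidia` runtime into the containerd config is to look for a `runtimes.nvidia` section in the file. This is a rough sketch: the function name `has_nvidia_runtime` is hypothetical, the config path must be supplied by the caller (it differs between a host containerd and RKE2), and the check is a plain text match rather than a real TOML parse.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the given containerd config file contains
# an uncommented section header ending in "runtimes.nvidia]".
has_nvidia_runtime() {
  grep -v '^[[:space:]]*#' "$1" | grep -Fq 'runtimes.nvidia]'
}

# Usage (not executed here); the path is whatever CONTAINERD_CONFIG points to:
# if has_nvidia_runtime /etc/containerd/config.toml; then
#   echo "nvidia runtime section found"
# else
#   echo "no nvidia runtime section in this file"
# fi
```

If the section is missing from the file that the running containerd actually loads, the "no runtime for nvidia is configured" error in the issue title would be the expected result.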
@shivamerla yes, this is my Helm chart configuration:
Note: I have tested both files for CONTAINERD_CONFIG,
Update: I followed the instructions here to install containerd using this method, and I believe the critical part is enabling systemd cgroup. Since doing this, I am able to schedule the pods and workloads. |
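For reference, "enabling systemd cgroup" in containerd usually means setting `SystemdCgroup = true` under the runc runtime options in the config file. The sketch below only checks whether such a line is present; the function name `has_systemd_cgroup` is hypothetical, the file path is supplied by the caller, and the match does not verify which runtime section the key belongs to.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the config file contains an uncommented
# "SystemdCgroup = true" line (in any runtime's options section).
has_systemd_cgroup() {
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$1"
}

# Usage (not executed here):
# has_systemd_cgroup /etc/containerd/config.toml && echo "systemd cgroup driver enabled"
```

containerd needs to be restarted after the config file changes for the setting to take effect.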
Totally unrelated to the gpu operator, but this fixed my problem with getting the spin wasm shim working on a Rocky 8 cluster. Many thanks! |
Thanks! With sudo privileges, I generated the configuration via
I'm not using the GPU operator, since I already have the drivers and container toolkit installed on the host machine. Will keep monitoring for any intermittent pod sandbox crashes though.
Versions:
[/etc/containerd/config.toml]:
@msherm2 which containerd process did you end up using? The rke2 containerd or the node based containerd? Thanks. |
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
i2c_core and ipmi_msghandler loaded on the nodes?
(kubectl describe clusterpolicies --all-namespaces)
1. Issue or feature description
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
kubernetes pods status:
kubectl get pods --all-namespaces
kubernetes daemonset status:
kubectl get ds --all-namespaces
If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
Output of running a container on the GPU machine:
docker run -it alpine echo foo
Docker configuration file:
cat /etc/docker/daemon.json
Docker runtime configuration:
docker info | grep runtime
NVIDIA shared directory:
ls -la /run/nvidia
NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
NVIDIA driver directory:
ls -la /run/nvidia/driver
kubelet logs
journalctl -u kubelet > kubelet.logs
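The non-interactive commands from the checklist above can be gathered in one pass with a small wrapper. This is only a sketch: the function names and output file names are hypothetical, per-pod `describe`/`logs` calls and the `docker run` check are left out because they need specific pod names or a TTY, and any missing tool is recorded as skipped rather than treated as an error.

```shell
#!/usr/bin/env bash
# Run a command and capture stdout+stderr into a file. If the binary is
# missing, note that in the file instead of failing, so later steps still run.
run_to_file() {
  local out="$1"
  shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" >"$out" 2>&1 || true
  else
    echo "skipped: $1 not found" >"$out"
  fi
}

# Collect the checklist outputs into the given directory.
collect_diagnostics() {
  local dir="$1"
  mkdir -p "$dir"
  run_to_file "$dir/pods.txt" kubectl get pods --all-namespaces
  run_to_file "$dir/daemonsets.txt" kubectl get ds --all-namespaces
  run_to_file "$dir/docker-daemon.json" cat /etc/docker/daemon.json
  run_to_file "$dir/run-nvidia.txt" ls -la /run/nvidia
  run_to_file "$dir/toolkit.txt" ls -la /usr/local/nvidia/toolkit
  run_to_file "$dir/driver.txt" ls -la /run/nvidia/driver
  run_to_file "$dir/kubelet.logs" journalctl -u kubelet
}

# Usage (not executed here):
# collect_diagnostics ./gpu-operator-debug
```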
Driver folder is empty: