GPUs showing on some systems but not others #375
@SonGokussj4 Let's move the discussion about the missing GPUs here, please, so everyone in the main GPU thread isn't notified.
And then there was an even weirder error :-) on yet another machine.

user@frankenstein:~/beszel-agent$ nvidia-smi
Tue Jan 7 00:57:08 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2070 Off | 00000000:01:00.0 Off | N/A |
| 57% 23C P8 2W / 175W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
user@frankenstein:~/beszel-agent$ nvidia-smi -l 4 --query-gpu=index,name,temperature.gpu,memory.used,memory.total,utilization.gpu,power.draw --format=csv,noheader,nounits
0, NVIDIA GeForce RTX 2070, 23, 1, 8192, 0, 2.62
0, NVIDIA GeForce RTX 2070, 23, 1, 8192, 0, 2.78
user@frankenstein:~/beszel-agent$ docker compose up -d
[+] Running 0/0
⠋ Container beszel-agent Creating 0.0s
Error response from daemon: unknown or invalid runtime name: nvidia
$ docker --version
Docker version 24.0.7, build 24.0.7-0ubuntu2~22.04.1

EDIT

Solved by adding the nvidia runtime to Docker's daemon configuration (typically /etc/docker/daemon.json):

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

and restarting Docker:

sudo systemctl restart docker

Now it shows the GPU in the beszel hub just fine. (This post was just about a newly found issue; the machine from the previous post is still not showing the GPU.)
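For reference (an editor's addition, using only the standard docker CLI), a quick way to confirm the nvidia runtime got registered after the restart:

# "nvidia" should be listed alongside runc
$ docker info | grep -i runtimes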
Okay, to be clear: on your second (non-working) machine, you added GPU: "true" in your docker-compose.yml, right? Because in your example it is missing. It should look something like this:
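A minimal sketch along those lines, following the common beszel-agent GPU examples (the image tag, key, and port are placeholders; adjust to your setup and to the CUDA-enabled agent build discussed in #262):

services:
  beszel-agent:
    image: henrygd/beszel-agent:latest   # or your CUDA-enabled agent build from #262
    container_name: beszel-agent
    restart: unless-stopped
    network_mode: host
    runtime: nvidia                      # requires the nvidia runtime configured in Docker
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      PORT: 45876
      KEY: "ssh-ed25519 AAA...34gQE+"    # your hub's public key
      GPU: "true"                        # enable GPU monitoring in the agent
      NVIDIA_VISIBLE_DEVICES: all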
If that wasn't your problem:
Oh, @zachatrocity didn't have that ENV in his docker-compose example, so I didn't add it. And it doesn't seem to be needed, because on a few of our servers the GPU is showing as expected. I've tried adding the GPU env, but the problematic system is still not showing the GPU.

user@ais60 ~/beszel-agent $ docker compose exec beszel-agent /bin/sh
# env
NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-0
HOSTNAME=ais60
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PORT=45876
HOME=/root
GPU=true
CUDA_VERSION=12.0.0
NVIDIA_REQUIRE_CUDA=cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471
NVIDIA_DRIVER_CAPABILITIES=compute,utility
TERM=xterm
NV_CUDA_CUDART_VERSION=12.0.107-1
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NVARCH=x86_64
KEY=ssh-ed25519 AAAAC...34gQE+
PWD=/
NVIDIA_VISIBLE_DEVICES=all

Machine OS

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy

GPU drivers

$ nvidia-smi
Tue Jan 7 08:00:37 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce GTX 1070 Ti Off | 00000000:01:00.0 Off | N/A |
| 24% 35C P8 12W / 180W | 2MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce GTX 1070 Off | 00000000:03:00.0 Off | N/A |
| 0% 38C P8 11W / 151W | 2MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

Normal docker example

Yes, when I installed the agent with this command, it shows the GPUs just fine!

curl -sL https://raw.githubusercontent.com/henrygd/beszel/main/supplemental/scripts/install-agent.sh -o install-agent.sh && chmod +x install-agent.sh && ./install-agent.sh -p 45876 -k "ssh-ed25519 AAA...34gQE+"
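Since the agent installed via the install script works while the compose container doesn't, one extra check worth running on the problematic host (an editor's suggestion, not from the original post) is whether the agent container was actually started with the nvidia runtime:

# prints the runtime the container was created with; "runc" or empty means the nvidia runtime isn't being used
$ docker inspect --format '{{.HostConfig.Runtime}}' beszel-agent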
Yeah @Hilbsam, is that GPU env actually required?
@zachatrocity: According to #262 the env should be set. But yeah, I just tested it without the env and it works? @henrygd, did I misread something there? @SonGokussj4, can you please try the following:
The Docker GPU example should work, or try the NVIDIA example.
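For anyone following along, the usual NVIDIA Container Toolkit sanity check is a one-liner like this (a generic example, not taken from the linked pages):

# should print the same table as nvidia-smi on the host; an error here points at the container runtime setup
$ docker run --rm --gpus all ubuntu nvidia-smi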
If the binary version works, then it must be something with the container environment. Really strange, since it works on other systems. Definitely try testing the GPU "hello world" (the CUDA nbody sample container) on both the working and non-working systems.
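A narrower test that isolates the agent container specifically (an editor's note, assuming the agent image ships nvidia-smi, as the output later in this thread suggests):

# run nvidia-smi inside the already-running agent container
$ docker exec -it beszel-agent nvidia-smi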
Working system

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Ampere
GPU Device 0: "Ampere" with compute capability 8.9
> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4090]
131072 bodies, total time for 10 iterations: 75.295 ms
= 2281.683 billion interactions per second
= 45633.660 single-precision GFLOP/s at 20 flops per interaction

Non-working system

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Error: only 0 Devices available, 1 requested. Exiting.

Non-working system from inside

root@ais60:/# nvidia-smi
Failed to initialize NVML: Unknown Error

I'm starting to think this problem is related to some cgroups setting.
Oh hell yeah. Solved! Comment out the no-cgroups line:

$ sudo vim /etc/nvidia-container-runtime/config.toml
[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = true    # <---- FIX: Comment out this line

Then restart the Docker service:

$ sudo systemctl restart docker.service

Hello GPU world

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
...
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1
> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1070 Ti]
19456 bodies, total time for 10 iterations: 14.736 ms
= 256.872 billion interactions per second
= 5137.433 single-precision GFLOP/s at 20 flops per interaction

The beszel hub now correctly shows the GPU ;-) 🎊
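For anyone who prefers to apply that edit non-interactively, something like this should work (an editor's sketch, assuming the stock config path shown above; it keeps a .bak backup):

# comment out the no-cgroups line and restart Docker
$ sudo sed -i.bak 's/^no-cgroups = true/#no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
$ sudo systemctl restart docker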
Nice! Thanks for figuring that out! I'll mention this in the docs when we add Docker info to the GPU page.
Just a note about another problem I had on a different server. I could start the agent, but running the GPU "Hello world" gave me this:

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark -device=0
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

Resolved by reinstalling docker-ce:

$ sudo apt-get install --reinstall docker-ce
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
containerd.io docker-ce-cli libltdl7
Suggested packages:
aufs-tools cgroupfs-mount | cgroup-lite
The following packages will be REMOVED:
containerd docker.io runc
The following NEW packages will be installed:
containerd.io docker-ce docker-ce-cli libltdl7
0 upgraded, 4 newly installed, 3 to remove and 98 not upgraded.
Need to get 70.4 MB of archives.
After this operation, 11.6 MB disk space will be freed.
Do you want to continue? [Y/n] y
...

Now everything works.

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark -device=0
...
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "Pascal
> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 6GB]
10240 bodies, total time for 10 iterations: 8.063 ms
= 130.048 billion interactions per second
= 2600.965 single-precision GFLOP/s at 20 flops per interaction
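A quick diagnostic for this particular failure (an editor's note, not from the original post) is to check whether the driver library the hook complained about is actually registered on the host:

# libnvidia-ml.so.1 should show up here; if it doesn't, the NVIDIA driver / ldconfig cache is the problem rather than Docker
$ ldconfig -p | grep libnvidia-ml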
Maybe there should be some kind of log/info message pushed to the beszel hub to inform the user that, for example, nvidia-smi is not available, or about other problems. Anyway, thanks a lot for the debugging and for introducing the GPU hello world, I didn't know it existed :-)
Context: #262 (comment)