GPUs showing on some systems but not others #375

Closed
henrygd opened this issue Jan 6, 2025 · 12 comments
Labels
troubleshooting Maybe bug, maybe not

Comments

henrygd (Owner) commented Jan 6, 2025

Context: #262 (comment)

henrygd added the troubleshooting label Jan 6, 2025
henrygd (Owner) commented Jan 6, 2025

@SonGokussj4 Let's move the discussion about the missing GPUs here, please, so everyone in the main GPU thread isn't notified.

SonGokussj4 commented Jan 7, 2025

And then there was an even weirder error :-) on yet another machine.

user@frankenstein:~/beszel-agent$ nvidia-smi
Tue Jan  7 00:57:08 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070        Off | 00000000:01:00.0 Off |                  N/A |
| 57%   23C    P8               2W / 175W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
user@frankenstein:~/beszel-agent$ nvidia-smi -l 4 --query-gpu=index,name,temperature.gpu,memory.used,memory.total,utilization.gpu,power.draw --format=csv,noheader,nounits
0, NVIDIA GeForce RTX 2070, 23, 1, 8192, 0, 2.62
0, NVIDIA GeForce RTX 2070, 23, 1, 8192, 0, 2.78

user@frankenstein:~/beszel-agent$ docker compose up -d
[+] Running 0/0
 ⠋ Container beszel-agent  Creating                                                                                0.0s
Error response from daemon: unknown or invalid runtime name: nvidia

$ docker --version
Docker version 24.0.7, build 24.0.7-0ubuntu2~22.04.1

EDIT

/etc/docker/daemon.json was missing on the system.

Solved by creating /etc/docker/daemon.json with content:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

And restarting Docker:

sudo systemctl restart docker

Now the GPU shows in the Beszel hub just fine.
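
For anyone else hitting this: a quick way to confirm the nvidia runtime is registered after the restart (just a sketch):

$ docker info | grep -i runtimes

It should list nvidia among the available runtimes.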

(This post covers a newly found issue. The machine from the previous post is still not showing the GPU.)

Hilbsam commented Jan 7, 2025

Okay, to be clear: on your second (non-working) machine, you added GPU: "true" in your docker-compose.yml, right? Because it is missing in your example.

It should look something like this:

services:
  beszel-agent:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: beszel-agent
    restart: unless-stopped
    network_mode: host
    runtime: nvidia
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      PORT: 45876
      KEY: "ssh-ed25519 AAAAC3...aJ34gQE+"
      GPU: "true"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all  # <---- We have 1 or 2 or 3 GPUs in our servers
              capabilities:
                - gpu

If that wasn't your problem:

  • What OS and OS version is the machine running? I know that Windows with WSL2 can show some "nice" behaviors.
  • What GPU driver version do you have installed?
  • Does the normal Docker GPU example work on your machine? Link: https://docs.docker.com/desktop/features/gpu/ (see the command sketch below this list)
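
For convenience, the test from that page boils down to running the CUDA nbody sample with GPU access (the same command shown in the outputs further down):

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark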

SonGokussj4 commented Jan 7, 2025

Oh, @zachatrocity didn't have that env in his docker-compose example, so I didn't add it. And it doesn't seem to be needed, because on a few of our servers the GPU is showing as expected.

I've tried adding the GPU env, but the problematic system is still not showing the GPU.
I verified this by getting into the Docker container.

user@ais60 ~/beszel-agent $ docker compose exec beszel-agent /bin/sh
# env
NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-0
HOSTNAME=ais60
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PORT=45876
HOME=/root
GPU=true
CUDA_VERSION=12.0.0
NVIDIA_REQUIRE_CUDA=cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471
NVIDIA_DRIVER_CAPABILITIES=compute,utility
TERM=xterm
NV_CUDA_CUDART_VERSION=12.0.107-1
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NVARCH=x86_64
KEY=ssh-ed25519 AAAAC...34gQE+
PWD=/
NVIDIA_VISIBLE_DEVICES=all

Machine OS

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

GPU drivers

$ nvidia-smi
Tue Jan  7 08:00:37 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070 Ti     Off | 00000000:01:00.0 Off |                  N/A |
| 24%   35C    P8              12W / 180W |      2MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce GTX 1070        Off | 00000000:03:00.0 Off |                  N/A |
|  0%   38C    P8              11W / 151W |      2MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Normal Docker example

Yes, when I installed the agent with this command, it shows the GPUs just fine!

curl -sL https://raw.githubusercontent.com/henrygd/beszel/main/supplemental/scripts/install-agent.sh -o install-agent.sh && chmod +x install-agent.sh && ./install-agent.sh -p 45876 -k "ssh-ed25519 AAA...34gQE+"

zachatrocity commented

Yeah @Hilbsam, is that GPU: "true" env variable required? Because it's working without it on mine.

Hilbsam commented Jan 7, 2025

@zachatrocity: According to #262 the env should be set. But yeah, I just tested it without the env and it works? @henrygd, did I misread something there?

@SonGokussj4 Can you please try the following:

hilbsam@nexus:/mnt/hdd$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance) 
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1

> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 6GB]
10240 bodies, total time for 10 iterations: 7.947 ms
= 131.942 billion interactions per second
= 2638.835 single-precision GFLOP/s at 20 flops per interaction
  • Can you run nvidia-smi in the agent container? The output should look like the nvidia-smi output below:
root@nexus:/# nvidia-smi
Tue Jan  7 19:47:56 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off | 00000000:01:00.0 Off |                  N/A |
|  0%   43C    P8               8W / 200W |      0MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The Docker GPU example should work, or try the NVIDIA example sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi (LINK).
If that doesn't work, check the NVIDIA Container Toolkit install and/or try to update your driver. It may be that there was some change under the hood on NVIDIA's driver side.

henrygd (Owner) commented Jan 7, 2025

The GPU env var is no longer necessary. It was only used to opt-in during testing.

If the binary version works then it must be something with the container environment. Really strange since it works on other systems. Definitely try testing nvidia-smi from within the container if you haven't already.
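
For example, with the compose setup above, something like this should work (assuming the service is named beszel-agent):

$ docker compose exec beszel-agent nvidia-smi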

SonGokussj4 commented Jan 7, 2025

Working system

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 8.9

> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4090]
131072 bodies, total time for 10 iterations: 75.295 ms
= 2281.683 billion interactions per second
= 45633.660 single-precision GFLOP/s at 20 flops per interaction

Non-working system

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Error: only 0 Devices available, 1 requested.  Exiting.

Non-working system from inside

root@ais60:/# nvidia-smi
Failed to initialize NVML: Unknown Error

I'm starting to think this problem is related to some cgroups issues. I think we had that problem with some of the newly installed servers.
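
A quick way to check the container toolkit's cgroups setting (a sketch, added for reference):

$ grep -n no-cgroups /etc/nvidia-container-runtime/config.toml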

SonGokussj4 commented Jan 7, 2025

Oh hell yeah. Solved!

Comment out no-cgroups = true in /etc/nvidia-container-runtime/config.toml

$ sudo vim /etc/nvidia-container-runtime/config.toml

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = true  # <---- FIX: Comment out this line
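
For anyone scripting this, the same edit as a one-liner (a sketch; it assumes the default config path and that the line is currently uncommented):

$ sudo sed -i 's/^no-cgroups = true/#no-cgroups = true/' /etc/nvidia-container-runtime/config.toml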

Then restart the Docker service:

$ sudo systemctl restart docker.service

Hello GPU world:

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
...
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1

> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1070 Ti]
19456 bodies, total time for 10 iterations: 14.736 ms
= 256.872 billion interactions per second
= 5137.433 single-precision GFLOP/s at 20 flops per interaction

beszel-hub now correctly shows the GPU ;-) 🎊

henrygd (Owner) commented Jan 7, 2025

Nice! Thanks for figuring that out!

I'll mention this in the docs when we add Docker info to the GPU page.

henrygd closed this as completed Jan 7, 2025
SonGokussj4 commented

Just a note about another problem I had on another server.

I could start the agent, and nvidia-smi was available on the host system but not inside the container (the nvidia-smi command simply wasn't there).

And running the GPU "hello world" gave me this:

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark -device=0
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
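
(A quick host-side sanity check for this error, added for reference: the NVML library should be visible to the dynamic linker.)

$ ldconfig -p | grep libnvidia-ml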

Resolved by reinstalling docker-ce:

$ sudo apt-get install --reinstall docker-ce
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  containerd.io docker-ce-cli libltdl7
Suggested packages:
  aufs-tools cgroupfs-mount | cgroup-lite
The following packages will be REMOVED:
  containerd docker.io runc
The following NEW packages will be installed:
  containerd.io docker-ce docker-ce-cli libltdl7
0 upgraded, 4 newly installed, 3 to remove and 98 not upgraded.
Need to get 70.4 MB of archives.
After this operation, 11.6 MB disk space will be freed.
Do you want to continue? [Y/n] y
...

Now everything works.

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark -device=0
...
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "Pascal
> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 6GB]
10240 bodies, total time for 10 iterations: 8.063 ms
= 130.048 billion interactions per second
= 2600.965 single-precision GFLOP/s at 20 flops per interaction

SonGokussj4 commented

Maybe there should be some kind of log/info/message pushed to beszel-hub to inform the user that, for example, nvidia-smi is not available, or about other problems.

Anyway, thanks a lot for the debugging and for introducing the GPU hello world; I didn't know it existed :-)
I love beszel, looking forward to the future! :-)

Hilbsam mentioned this issue Jan 16, 2025