GPUs showing on some systems but not others #375

Closed
henrygd opened this issue Jan 6, 2025 · 12 comments
Labels
troubleshooting Maybe bug, maybe not

Comments

henrygd (Owner) commented Jan 6, 2025

Context: #262 (comment)

henrygd added the troubleshooting label Jan 6, 2025
henrygd (Owner) commented Jan 6, 2025

@SonGokussj4 Let's move the discussion about the missing GPUs here, please, so everyone in the main GPU thread isn't notified.

SonGokussj4 commented Jan 7, 2025

And then there was an even weirder error :-) on yet another machine.

user@frankenstein:~/beszel-agent$ nvidia-smi
Tue Jan  7 00:57:08 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2070        Off | 00000000:01:00.0 Off |                  N/A |
| 57%   23C    P8               2W / 175W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
user@frankenstein:~/beszel-agent$ nvidia-smi -l 4 --query-gpu=index,name,temperature.gpu,memory.used,memory.total,utilization.gpu,power.draw --format=csv,noheader,nounits
0, NVIDIA GeForce RTX 2070, 23, 1, 8192, 0, 2.62
0, NVIDIA GeForce RTX 2070, 23, 1, 8192, 0, 2.78

user@frankenstein:~/beszel-agent$ docker compose up -d
[+] Running 0/0
 ⠋ Container beszel-agent  Creating                                                                                0.0s
Error response from daemon: unknown or invalid runtime name: nvidia

$ docker --version
Docker version 24.0.7, build 24.0.7-0ubuntu2~22.04.1

EDIT

/etc/docker/daemon.json was missing on the system.

Solved by creating /etc/docker/daemon.json with content:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

And restarting Docker:

sudo systemctl restart docker

Now the GPU shows in the Beszel hub just fine.
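
For anyone else hitting this: a quick way to confirm the nvidia runtime is registered after the restart (just a sketch):

$ docker info | grep -i runtimes

It should list nvidia among the available runtimes.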

(This post covers a newly found issue. The machine from the previous post is still not showing the GPU.)

Hilbsam commented Jan 7, 2025

Okay, to be clear: on your second (non-working) machine, you added GPU: "true" in your docker-compose.yml, right? Because it is missing in your example.

It should look something like this:

services:
  beszel-agent:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: beszel-agent
    restart: unless-stopped
    network_mode: host
    runtime: nvidia
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    environment:
      PORT: 45876
      KEY: "ssh-ed25519 AAAAC3...aJ34gQE+"
      GPU: "true"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all  # <---- We have 1 or 2 or 3 GPUs in our servers
              capabilities:
                - gpu

If that wasn't your problem:

  • What OS and OS version is the machine running? I know that Windows with WSL2 can show some "nice" behaviors.
  • What GPU driver version do you have installed?
  • Does the normal Docker GPU example work on your machine? Link: https://docs.docker.com/desktop/features/gpu/ (see the command sketch below this list)
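
For convenience, the test from that page boils down to running the CUDA nbody sample with GPU access (the same command shown in the outputs further down):

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark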

SonGokussj4 commented Jan 7, 2025

Oh, @zachatrocity didn't have that env in his docker-compose example, so I didn't add it. And it doesn't seem to be needed, because on a few of our servers the GPU is showing as expected.

I've tried adding the GPU env, but the problematic system is still not showing the GPU.
I verified this by getting into the Docker container.

user@ais60 ~/beszel-agent $ docker compose exec beszel-agent /bin/sh
# env
NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-0
HOSTNAME=ais60
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PORT=45876
HOME=/root
GPU=true
CUDA_VERSION=12.0.0
NVIDIA_REQUIRE_CUDA=cuda>=12.0 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471
NVIDIA_DRIVER_CAPABILITIES=compute,utility
TERM=xterm
NV_CUDA_CUDART_VERSION=12.0.107-1
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NVARCH=x86_64
KEY=ssh-ed25519 AAAAC...34gQE+
PWD=/
NVIDIA_VISIBLE_DEVICES=all

Machine OS

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

GPU drivers

$ nvidia-smi
Tue Jan  7 08:00:37 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070 Ti     Off | 00000000:01:00.0 Off |                  N/A |
| 24%   35C    P8              12W / 180W |      2MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce GTX 1070        Off | 00000000:03:00.0 Off |                  N/A |
|  0%   38C    P8              11W / 151W |      2MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Normal Docker example

Yes, when I installed the agent with this command, it shows the GPUs just fine!

curl -sL https://raw.githubusercontent.com/henrygd/beszel/main/supplemental/scripts/install-agent.sh -o install-agent.sh && chmod +x install-agent.sh && ./install-agent.sh -p 45876 -k "ssh-ed25519 AAA...34gQE+"

zachatrocity commented

Yeah @Hilbsam, is that GPU: "true" env variable required? Because it's working without it on mine.

Hilbsam commented Jan 7, 2025

@zachatrocity: According to #262 the env should be set. But yeah, I just tested it without the env and it works? @henrygd, did I misread something there?

@SonGokussj4 Can you please try the following:

hilbsam@nexus:/mnt/hdd$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance) 
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1

> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 6GB]
10240 bodies, total time for 10 iterations: 7.947 ms
= 131.942 billion interactions per second
= 2638.835 single-precision GFLOP/s at 20 flops per interaction
  • Can you run nvidia-smi in the agent container? The output should look like the nvidia-smi output below:
root@nexus:/# nvidia-smi
Tue Jan  7 19:47:56 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off | 00000000:01:00.0 Off |                  N/A |
|  0%   43C    P8               8W / 200W |      0MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The Docker GPU example should work, or try the NVIDIA example sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi (LINK).
If that doesn't work, check the NVIDIA Container Toolkit install and/or try to update your driver. It may be that there was some change under the hood on NVIDIA's driver side.

henrygd (Owner) commented Jan 7, 2025

The GPU env var is no longer necessary. It was only used to opt-in during testing.

If the binary version works then it must be something with the container environment. Really strange since it works on other systems. Definitely try testing nvidia-smi from within the container if you haven't already.
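
For example, with the compose setup above, something like this should work (assuming the service is named beszel-agent):

$ docker compose exec beszel-agent nvidia-smi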

SonGokussj4 commented Jan 7, 2025

Working system

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 8.9

> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4090]
131072 bodies, total time for 10 iterations: 75.295 ms
= 2281.683 billion interactions per second
= 45633.660 single-precision GFLOP/s at 20 flops per interaction

Non-working system

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation)
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Error: only 0 Devices available, 1 requested.  Exiting.

Non-working system from inside

root@ais60:/# nvidia-smi
Failed to initialize NVML: Unknown Error

I'm starting to think this problem is related to some cgroups issues. I think we had that problem with some of the newly installed servers.
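
A quick way to check the container toolkit's cgroups setting (a sketch, added for reference):

$ grep -n no-cgroups /etc/nvidia-container-runtime/config.toml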

SonGokussj4 commented Jan 7, 2025

Oh hell yeah. Solved!

Comment out no-cgroups = true in /etc/nvidia-container-runtime/config.toml

$ sudo vim /etc/nvidia-container-runtime/config.toml

[nvidia-container-cli]
#debug = "/var/log/nvidia-container-toolkit.log"
environment = []
#ldcache = "/etc/ld.so.cache"
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
#no-cgroups = true  # <---- FIX: Comment out this line
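
For anyone scripting this, the same edit as a one-liner (a sketch; it assumes the default config path and that the line is currently uncommented):

$ sudo sed -i 's/^no-cgroups = true/#no-cgroups = true/' /etc/nvidia-container-runtime/config.toml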

Then restart the Docker service:

$ sudo systemctl restart docker.service

Hello GPU world:

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
...
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1

> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1070 Ti]
19456 bodies, total time for 10 iterations: 14.736 ms
= 256.872 billion interactions per second
= 5137.433 single-precision GFLOP/s at 20 flops per interaction

beszel-hub now correctly shows the GPU ;-) 🎊

henrygd (Owner) commented Jan 7, 2025

Nice! Thanks for figuring that out!

I'll mention this in the docs when we add Docker info to the GPU page.

henrygd closed this as completed Jan 7, 2025
SonGokussj4 commented

Just a note about another problem I had on another server.

I could start the agent, and nvidia-smi was available on the host system but not inside the container (the nvidia-smi command simply wasn't there).

And running the GPU "hello world" gave me this:

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark -device=0
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
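
(A quick host-side sanity check for this error, added for reference: the NVML library should be visible to the dynamic linker.)

$ ldconfig -p | grep libnvidia-ml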

Resolved by reinstalling docker-ce:

$ sudo apt-get install --reinstall docker-ce
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  containerd.io docker-ce-cli libltdl7
Suggested packages:
  aufs-tools cgroupfs-mount | cgroup-lite
The following packages will be REMOVED:
  containerd docker.io runc
The following NEW packages will be installed:
  containerd.io docker-ce docker-ce-cli libltdl7
0 upgraded, 4 newly installed, 3 to remove and 98 not upgraded.
Need to get 70.4 MB of archives.
After this operation, 11.6 MB disk space will be freed.
Do you want to continue? [Y/n] y
...

Now everything works.

$ docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark -device=0
...
> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "Pascal
> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 6GB]
10240 bodies, total time for 10 iterations: 8.063 ms
= 130.048 billion interactions per second
= 2600.965 single-precision GFLOP/s at 20 flops per interaction

SonGokussj4 commented

Maybe there should be some kind of log/info/message pushed to beszel-hub to inform the user that, for example, nvidia-smi is not available, or about other problems.

Anyway, thanks a lot for the debugging and for introducing the GPU hello world; I didn't know it existed :-)
I love beszel, looking forward to the future! :-)

Hilbsam mentioned this issue Jan 16, 2025