Operating GPUs is hard -- see Reliability and Operational Challenges by the Meta Llama team (2024).
Complexity grows with the number of GPUs, and existing tools often fail to manage it at scale, so large-scale GPU operations call for a new approach. GPUd is designed to address several key challenges in GPU management:
- Automated Error Detection: GPUd provides informational alerts that can surface bugs on a console before they become critical, reducing reliance on experienced technicians to identify errors.
- Simplified Workflows: By reexamining and simplifying systems before automation, GPUd helps overcome the complexity of scenarios to be automated.
- Modular Design: Each GPUd component handles a distinct and well-defined task. This approach allows for easy reuse and adaptation of key components across different GPU infrastructures.
- Efficient Diagnostics: GPUd provides clear distinctions between software errors (fixable by reboots) and hardware errors (requiring component replacement), as well as identifying errors that impact performance versus those that do not.
- Automated Verification: After hardware changes occur, GPUd runs automated verification processes to ensure system integrity.
- Comprehensive Monitoring: GPUd actively monitors GPUs and effectively manages AI/ML workloads to ensure GPU efficiency and reliability.
- Data Collection for Analysis: GPUd records raw metric data in a separate system for offline analysis, enabling weekly or monthly reports and more intricate calculations that are too complex to compute in real-time.
By addressing these challenges, GPUd simplifies GPU management, reduces human error, and improves overall system reliability and efficiency.
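The "Efficient Diagnostics" distinction above (software errors fixed by a reboot versus hardware errors that need a component replaced, and performance-impacting versus not) can be illustrated with a small triage function. This is only a sketch under stated assumptions: `ErrorEvent`, `Action`, `Triage`, and `triage` are hypothetical names made up for this example and are not GPUd's actual types or API.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """Remediation suggested for an error (hypothetical categories)."""
    REBOOT = "reboot"                      # software error: a reboot should fix it
    REPLACE_COMPONENT = "replace_component"  # hardware error: a part must be swapped


@dataclass(frozen=True)
class ErrorEvent:
    """A detected error. All fields are illustrative assumptions."""
    name: str
    is_hardware: bool
    impacts_performance: bool


@dataclass(frozen=True)
class Triage:
    """Result of triaging one error event."""
    action: Action
    urgent: bool  # True when the error degrades performance


def triage(event: ErrorEvent) -> Triage:
    """Map an error to a remediation and an urgency flag."""
    action = Action.REPLACE_COMPONENT if event.is_hardware else Action.REBOOT
    return Triage(action=action, urgent=event.impacts_performance)


if __name__ == "__main__":
    # Two made-up events, one of each kind.
    for ev in (
        ErrorEvent("driver_hang", is_hardware=False, impacts_performance=True),
        ErrorEvent("faulty_memory", is_hardware=True, impacts_performance=False),
    ):
        print(ev.name, triage(ev))
```

Keeping the two axes separate (what to do, and how urgently) mirrors the two distinctions the list describes, rather than folding them into a single severity score.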
- Lepton AI: Collect GPU metrics and run automated verification and alerts.
- Metrics: supports time-series metrics in a custom format, in addition to the Prometheus format.
- NVIDIA GPU errors: scans dmesg, NVML, and nvidia-smi to identify real-time and historical GPU errors.
- NVIDIA GPU ECC errors: queries nvidia-smi and NVML APIs.
- NVIDIA GPU clock: scans nvidia-smi and NVML for hardware slowdown.
- NVIDIA GPU utilization: GPU memory, GPU utilization, streaming multiprocessor (SM) occupancy, etc.
- NVIDIA GPU temperature: scans nvidia-smi and NVML for critical temperature thresholds and data.
- NVIDIA GPU power: scans nvidia-smi and NVML for current power draw and limits.
- NVIDIA GPU processes: uses NVML to list running processes.
- NVIDIA NVLink & NVSwitch: scans dmesg for any issues, NVML for status and errors.
- NVIDIA fabric manager: checks nvidia-fabricmanager unit status.
- NVIDIA InfiniBand: checks ibstat.
- NVIDIA direct RDMA (Remote Direct Memory Access): checks lsmod for the peermem module.
- CPU, OS, memory, disk, file descriptor usage monitoring.
- Regex-based dmesg streaming and scanning.
- Workload monitoring: supports containerd, docker, kubelet.
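As an illustration of the regex-based dmesg scanning listed above, the following sketch matches NVIDIA Xid messages in kernel log lines. It is a minimal example and not GPUd's implementation: the pattern assumes the common `NVRM: Xid (PCI:<address>): <code>, ...` message layout, and `XidEvent`, `scan_lines`, and the sample lines are hypothetical names and data made up for this example.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumed layout of an Xid message in the kernel log, e.g.
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
XID_PATTERN = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),"
)


class XidEvent(NamedTuple):
    pci_address: str  # PCI address of the reporting GPU
    xid: int          # numeric Xid code
    line: str         # full log line, kept for later inspection


def scan_lines(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield an XidEvent for every line that contains an Xid message."""
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            yield XidEvent(
                pci_address=match.group("pci"),
                xid=int(match.group("xid")),
                line=line.rstrip("\n"),
            )


# Made-up sample lines for demonstration only.
SAMPLE = [
    "[  100.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd",
    "[  123.456789] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.",
]

if __name__ == "__main__":
    for event in scan_lines(SAMPLE):
        print(event.pci_address, event.xid)
```

Because `scan_lines` accepts any iterable of strings, the same function works for historical scanning (lines read from a saved log) and for streaming (lines read from the standard output of a long-running `dmesg --follow` process).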
Many open source projects and studies informed and inspired this project:
- prometheus/node_exporter is a Prometheus metrics exporter for machine-level metrics.
- NVIDIA/dcgm-exporter is a Prometheus metrics exporter for NVIDIA GPU machines that integrates with NVIDIA DCGM.
GPUd complements both node_exporter and dcgm-exporter by focusing on ease of use and end-to-end solutions. GPUd is a single binary, whereas dcgm-exporter requires more than 500 MB of container images (as of August 2024). GPUd provides the critical metrics and health checks using NVML, while DCGM supports a much more comprehensive set of metrics.