Merge pull request #76 from mikemckiernan/vgpu-gsp
vGPU and GSP directories
mikemckiernan authored Jul 26, 2024
2 parents ae92e63 + 9dddfd9 commit 6e8bea6
Showing 1 changed file with 32 additions and 0 deletions.
32 changes: 32 additions & 0 deletions gpu-operator/release-notes.rst
@@ -123,7 +123,39 @@ Fixed Issues
Known Limitations
------------------

* NVIDIA vGPU Manager does not work correctly on nodes with GPUs that require Open Kernel module drivers and GPU System Processor (GSP) firmware.
  The logs for vGPU Device Manager pods include lines like the following example:

  .. code-block:: output

     time="2024-07-23T08:50:11Z" level=fatal msg="error setting VGPU config: no parent devices found for GPU at index '1'"
     time="2024-07-23T08:50:11Z" level=error msg="Failed to apply vGPU config: unable to apply config 'default': exit status 1"
     time="2024-07-23T08:50:11Z" level=info msg="Setting node label: nvidia.com/vgpu.config.state=failed"
     time="2024-07-23T08:50:11Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"

  The output of the ``kubectl exec -it nvidia-vgpu-manager-daemonset-xxxxx -n gpu-operator -- bash -c 'dmesg | grep -i nvrm'`` command
  resembles the following example:

  .. code-block:: output

     kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 550.90.05 Release Build (dvs-builder@U16-I1-N08-05-1)
     kernel: NVRM: RmFetchGspRmImages: No firmware image found
     kernel: NVRM: GPU 0000:ae:00.0: RmInitAdapter failed! (0x61:0x56:1697)
     kernel: NVRM: GPU 0000:ae:00.0: rm_init_adapter failed, device minor number 0

  The vGPU Manager pods do not mount the ``/sys/module/firmware_class/parameters/path`` and ``/lib/firmware``
  paths on the host, so the pods cannot copy the GSP firmware files to the host.

  As a workaround, you can add the following volume mounts to the vGPU Manager daemon set, for the ``nvidia-vgpu-manager-ctr`` container:

  .. code-block:: yaml

     - name: firmware-search-path
       mountPath: /sys/module/firmware_class/parameters/path
     - name: nv-firmware
       mountPath: /lib/firmware

  This issue is fixed in the next release of the GPU Operator.

* The ``1g.12gb`` MIG profile does not operate as expected on the NVIDIA GH200 GPU when the MIG configuration is set to ``all-balanced``.
* The GPU Driver container does not run on hosts that have a custom kernel with the SEV-SNP CPU feature
  because of the missing ``kernel-headers`` package within the container.
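
The symptoms of the vGPU Manager limitation above can be checked mechanically.
The following Python sketch is illustrative only and is not part of the GPU Operator:
the function names ``parse_logfmt`` and ``matches_known_limitation`` are hypothetical,
and the two signature strings are copied from the example output above.
It assumes that you have already saved the vGPU Device Manager pod logs and the ``dmesg`` output as text.

```python
import re

# Signatures copied from the example logs in the vGPU Manager limitation above.
VGPU_DEVICE_MANAGER_SIGNATURE = "no parent devices found for GPU at index"
GSP_FIRMWARE_SIGNATURE = "RmFetchGspRmImages: No firmware image found"

# Matches key=value pairs where the value is a double-quoted string or a bare token.
_LOGFMT_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Return the key/value pairs of one logfmt-style line as a dict.

    Surrounding double quotes are removed from quoted values.
    Backslash escapes inside a value are left as-is.
    """
    fields = {}
    for key, value in _LOGFMT_PAIR.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def matches_known_limitation(device_manager_log, dmesg_output):
    """Hypothetical helper: True only when both documented symptoms appear.

    device_manager_log: text of the vGPU Device Manager pod logs.
    dmesg_output: text of the NVRM lines from dmesg on the vGPU Manager pod.
    """
    device_manager_hit = any(
        VGPU_DEVICE_MANAGER_SIGNATURE in parse_logfmt(line).get("msg", "")
        for line in device_manager_log.splitlines()
    )
    firmware_hit = GSP_FIRMWARE_SIGNATURE in dmesg_output
    return device_manager_hit and firmware_hit
```

Requiring both signatures is a deliberately conservative choice for this sketch: either message on its own could have a different cause, so a match on only one of them returns ``False``.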