
Commit

Merge branch 'rn-23.9.0' into 'master'
RN for 23.9.0

See merge request nvidia/cloud-native/cnt-docs!337
mikemckiernan committed Oct 20, 2023
2 parents d4f8897 + b829a06 commit 2082bb4
Showing 6 changed files with 135 additions and 39 deletions.
40 changes: 19 additions & 21 deletions gpu-operator/life-cycle-policy.rst
@@ -132,6 +132,25 @@ Refer to :ref:`Upgrading the GPU Operator` for more information.
| Computing Manager
| for Kubernetes
* - v23.9.0
- | `535.104.12 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-104-12/index.html>`_ (default),
| `525.125.06 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-525-125-06/index.html>`_,
| `470.199.02 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-470-199-02/index.html>`_
- `v0.6.4 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`_
- `1.14.3 <https://github.com/NVIDIA/nvidia-container-toolkit/releases>`_
- `0.14.2 <https://github.com/NVIDIA/k8s-device-plugin/releases>`_
- `3.2.6-3.1.9 <https://github.com/NVIDIA/gpu-monitoring-tools/releases>`_
- v0.14.2
- `0.8.2 <https://github.com/NVIDIA/gpu-feature-discovery/releases>`_
- `0.5.5 <https://github.com/NVIDIA/mig-parted/tree/main/deployments/gpu-operator>`_
- `3.2.6-1 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`_
- v23.9.0
- `v1.2.3 <https://github.com/NVIDIA/kubevirt-gpu-device-plugin>`_
- v0.2.4
- `2.16.1 <https://github.com/NVIDIA/gds-nvidia-fs/releases>`_
- v0.1.2
- v0.1.1

* - v23.6.1
- | `535.104.12 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-104-12/index.html>`_ (recommended),
| `535.104.05 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-535-104-05/index.html>`_ (default),
@@ -265,27 +284,6 @@ Refer to :ref:`Upgrading the GPU Operator` for more information.
- N/A
- N/A

* - v22.9.1
- | `525.60.13 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-525-60-13/index.html>`_ (default),
| `515.86.01 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-515-86-01/index.html>`_,
| `510.108.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-510-108-03/index.html>`_,
| `470.161.03 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-470-161-03/index.html>`_,
| `450.216.04 <https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-450-216-04/index.html>`_
- `v0.5.1 <https://ngc.nvidia.com/catalog/containers/nvidia:cloud-native:k8s-driver-manager>`_
- `1.11.0 <https://github.com/NVIDIA/nvidia-container-toolkit/releases>`_
- `0.13.0 <https://github.com/NVIDIA/k8s-device-plugin/releases>`_
- `3.1.3-3.1.2 <https://github.com/NVIDIA/gpu-monitoring-tools/releases>`_
- v0.10.1
- `0.7.0 <https://github.com/NVIDIA/gpu-feature-discovery/releases>`_
- `0.5.0 <https://github.com/NVIDIA/mig-parted/tree/master/deployments/gpu-operator>`_
- `3.1.3-1 <https://docs.nvidia.com/datacenter/dcgm/latest/release-notes/changelog.html>`_
- v22.9.1
- `v1.2.1 <https://github.com/NVIDIA/kubevirt-gpu-device-plugin>`_
- v0.2.0
- `2.14.13 <https://github.com/NVIDIA/gds-nvidia-fs/releases>`_
- N/A
- N/A

.. note::

- Driver version could be different with NVIDIA vGPU, as it depends on the driver
22 changes: 10 additions & 12 deletions gpu-operator/platform-support.rst
@@ -241,23 +241,23 @@ The GPU Operator has been validated in the following scenarios:
| MicroK8s
* - Ubuntu 20.04 LTS
- 1.21---1.27
- 1.25---1.28
-
- 7.0 U3c, 8.0 U1
- 1.21---1.27
- 7.0 U3c, 8.0 U2
- 1.25---1.28
-
-

* - Ubuntu 22.04 LTS
- 1.21---1.27
- 1.25---1.28
-
-
-
-
- 1.26

* - CentOS 7
- 1.21---1.27
- 1.25---1.28
-
-
-
@@ -266,8 +266,7 @@ The GPU Operator has been validated in the following scenarios:

* - Red Hat Core OS
-
- | 4.9, 4.10, 4.11
| 4.12, 4.13
- | 4.9---4.14
-
-
-
@@ -277,10 +276,10 @@ The GPU Operator has been validated in the following scenarios:
| Enterprise
| Linux 8.4,
| 8.6, 8.7, 8.8
- 1.21---1.27
- 1.25---1.28
-
-
- 1.21---1.27
- 1.25---1.28
-
-

@@ -342,7 +341,7 @@ The GPU Operator has been validated in the following scenarios:
* - Ubuntu 20.04 LTS
- 1.21--1.27
-
- 7.0 U3c, 8.0 U1
- 7.0 U3c, 8.0 U2
- | 1.21, 1.22, 1.23,
| 1.24, 1.25
@@ -354,8 +353,7 @@ The GPU Operator has been validated in the following scenarios:

* - Red Hat Core OS
-
- | 4.9, 4.10, 4.11
| 4.12, 4.13
- 4.9---4.14
-
-

97 changes: 97 additions & 0 deletions gpu-operator/release-notes.rst
@@ -33,6 +33,103 @@ See the :ref:`GPU Operator Component Matrix` for a list of components included i

----

23.9.0
======

New Features
------------

* Added support for an NVIDIA driver custom resource definition that enables
running multiple GPU driver types and versions on the same cluster and adds
support for multiple operating system versions.
This feature is a technology preview.
Refer to :doc:`gpu-driver-configuration` for more information.

* Added support for additional Linux kernel variants for precompiled driver containers.

- driver:535-5.15.0-xxxx-nvidia-ubuntu22.04
- driver:535-5.15.0-xxxx-azure-ubuntu22.04
- driver:535-5.15.0-xxxx-aws-ubuntu22.04

Refer to the **Tags** tab of the `NVIDIA GPU Driver <https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver>`__
page in the NGC catalog to determine if a container for your kernel is built.
Refer to :doc:`precompiled-drivers` for information about using precompiled driver containers
and steps to build your own driver container.

* The API for the NVIDIA cluster policy custom resource definition is enhanced to include
the current state of the cluster policy.
When you view the cluster policy with a command like ``kubectl get cluster-policy``, the response
now includes a ``Status.Conditions`` field.

* Added support for the following software component versions:

- NVIDIA Data Center GPU Driver version 535.104.12
- NVIDIA Driver Manager for Kubernetes v0.6.4
- NVIDIA Container Toolkit v1.14.3
- NVIDIA Kubernetes Device Plugin v0.14.2
- NVIDIA DCGM Exporter 3.2.6-3.1.9
- NVIDIA GPU Feature Discovery for Kubernetes v0.8.2
- NVIDIA MIG Manager for Kubernetes v0.5.5
- NVIDIA Data Center GPU Manager (DCGM) v3.2.6-1
- NVIDIA KubeVirt GPU Device Plugin v1.2.3
- NVIDIA vGPU Device Manager v0.2.4
- NVIDIA Kata Manager for Kubernetes v0.1.2
- NVIDIA Confidential Computing Manager for Kubernetes v0.1.1
- Node Feature Discovery v0.14.2

Refer to the :ref:`GPU Operator Component Matrix`
on the platform support page.
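
The ``Status.Conditions`` field mentioned above can be read programmatically once the cluster policy is exported as JSON. The following is a minimal sketch, not the Operator's own tooling: the sample object, the condition types and reasons in it, and the helper name ``summarize_conditions`` are all assumptions made for illustration. Only the general Kubernetes convention of a ``status.conditions`` list with ``type`` and ``status`` keys is relied on.

```python
import json

# Hypothetical, trimmed cluster policy object. The condition types and
# reasons below are illustrative assumptions, not values from the release notes.
SAMPLE = """
{
  "kind": "ClusterPolicy",
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True", "reason": "Reconciled"},
      {"type": "Error", "status": "False", "reason": "Ready"}
    ]
  }
}
"""


def summarize_conditions(policy: dict) -> dict:
    """Map each condition type to True or False based on its status string."""
    conditions = policy.get("status", {}).get("conditions", [])
    return {c["type"]: c.get("status") == "True" for c in conditions}


summary = summarize_conditions(json.loads(SAMPLE))
print(summary)
```

A policy object with no ``status`` block yields an empty mapping, so callers can treat "no conditions reported yet" separately from "condition is false".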

Fixed Issues
------------

* Previously, if the ``RHEL_VERSION`` environment variable was set for the Operator, the variable was
propagated to the driver container and used in the ``--releasever`` argument to the ``dnf`` command.
With this release, you can specify the ``DNF_RELEASEVER`` environment variable for the driver container
to override the value of the ``--releasever`` argument.

* Previously, stale node feature and node feature topology objects could remain in the Kubernetes API
server after a node was deleted from the cluster.
The upgrade to Node Feature Discovery v0.14.2 includes a garbage collection enhancement that
ensures the objects are removed after a node is deleted.
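
The ``--releasever`` behavior described in the first fixed issue can be sketched as a small precedence rule. This is an illustration only, assuming that ``DNF_RELEASEVER`` wins when both variables are set and that ``RHEL_VERSION`` remains the fallback. The helper names ``resolve_releasever`` and ``releasever_args`` are hypothetical and are not part of the driver container.

```python
from typing import List, Mapping, Optional


def resolve_releasever(env: Mapping[str, str]) -> Optional[str]:
    """Return the release version to pass to dnf, or None if neither variable is set.

    Assumed precedence: DNF_RELEASEVER overrides RHEL_VERSION.
    """
    return env.get("DNF_RELEASEVER") or env.get("RHEL_VERSION") or None


def releasever_args(env: Mapping[str, str]) -> List[str]:
    """Build the --releasever argument list for a dnf invocation."""
    value = resolve_releasever(env)
    return [f"--releasever={value}"] if value else []


print(releasever_args({"RHEL_VERSION": "8.6", "DNF_RELEASEVER": "8.8"}))
```

When neither variable is set, the sketch adds no argument and leaves ``dnf`` to use its own default.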

Known Limitations
------------------

* The GPU Driver container does not run on hosts that have a custom kernel with the SEV-SNP CPU feature
because of the missing ``kernel-headers`` package within the container.
With a custom kernel, NVIDIA recommends pre-installing the NVIDIA drivers on the host if you want to
run traditional container workloads with NVIDIA GPUs.
* If you cordon a node while the GPU driver upgrade process is already in progress,
the Operator uncordons the node and upgrades the driver on the node.
You can determine if an upgrade is in progress by checking the node label
``nvidia.com/gpu-driver-upgrade-state != upgrade-done``.
* NVIDIA vGPU is incompatible with KubeVirt v0.58.0, v0.58.1, and v0.59.0, as well
as OpenShift Virtualization 4.12.0---4.12.2.
* Using NVIDIA vGPU on bare metal nodes and NVSwitch is not supported.
* When installing the Operator on Amazon EKS and using Kubernetes versions lower than
``1.25``, specify the ``--set psp.enabled=true`` Helm argument because EKS enables
pod security policy (PSP).
If you use Kubernetes version ``1.25`` or higher, do not specify the ``psp.enabled``
argument so that the default value, ``false``, is used.
* All worker nodes in the Kubernetes cluster must run the same operating system version to use the NVIDIA GPU Driver container.
Alternatively, if you pre-install the NVIDIA GPU Driver on the nodes, then you can run different operating systems.
The technology preview feature that provides :doc:`gpu-driver-configuration` is also an alternative.
* NVIDIA GPUDirect Storage (GDS) is not supported on systems with secure boot enabled.
* Driver Toolkit images are broken in Red Hat OpenShift version ``4.11.12``.
With that version, cluster-level entitlements must be enabled for the driver installation to succeed.
* The NVIDIA GPU Operator can deploy only a single NVIDIA GPU Driver type and version.
The NVIDIA vGPU and Data Center GPU Driver cannot be used within the same cluster.
The technology preview feature that provides :doc:`gpu-driver-configuration` is an alternative.
* The ``nouveau`` driver must be blacklisted when using NVIDIA vGPU.
Otherwise the driver fails to initialize the GPU with the error ``Failed to enable MSI-X`` in the system journal logs.
Additionally, all GPU operator pods become stuck in the ``Init`` state.
* When you use RHEL 8 with containerd as the runtime and SELinux is enabled (in either permissive or enforcing mode)
at the host level, containerd must also be configured for SELinux, for example by setting the ``enable_selinux=true``
configuration option.
Additionally, network-restricted environments are not supported.
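
The upgrade-state check from the cordon limitation above can be expressed as a small predicate over a node's labels. This is a minimal sketch under stated assumptions: the helper name ``upgrade_in_progress`` is hypothetical, the state value ``placeholder-state`` is invented for the example, and treating a node without the label as "not upgrading" is an assumption rather than documented behavior.

```python
from typing import Mapping

UPGRADE_STATE_LABEL = "nvidia.com/gpu-driver-upgrade-state"


def upgrade_in_progress(labels: Mapping[str, str]) -> bool:
    """Return True when the upgrade-state label is present and not upgrade-done."""
    state = labels.get(UPGRADE_STATE_LABEL)
    # Assumption: a node that does not carry the label is not being upgraded.
    return state is not None and state != "upgrade-done"


# "placeholder-state" is a made-up value standing in for any state other
# than upgrade-done.
print(upgrade_in_progress({UPGRADE_STATE_LABEL: "placeholder-state"}))
print(upgrade_in_progress({UPGRADE_STATE_LABEL: "upgrade-done"}))
```

Checking this predicate before cordoning a node avoids the case where the Operator uncordons it again partway through an upgrade.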

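The Amazon EKS limitation above amounts to a version comparison that decides whether to pass ``--set psp.enabled=true`` to Helm. The sketch below illustrates that rule only. The helper name ``psp_helm_args`` is hypothetical, and the parsing assumes a plain ``MAJOR.MINOR`` or ``vMAJOR.MINOR.PATCH`` string, so decorated minor versions such as ``24+`` are not handled.

```python
from typing import List


def psp_helm_args(kubernetes_version: str) -> List[str]:
    """Return the extra Helm arguments needed for the given Kubernetes version."""
    parts = kubernetes_version.lstrip("v").split(".")
    major, minor = int(parts[0]), int(parts[1])
    if (major, minor) < (1, 25):
        # Versions lower than 1.25: enable pod security policy (PSP) support.
        return ["--set", "psp.enabled=true"]
    # 1.25 or higher: omit the argument so the default value, false, is used.
    return []


print(psp_helm_args("1.24"))
print(psp_helm_args("v1.28.3"))
```

Comparing ``(major, minor)`` as a tuple of integers avoids the string-comparison trap where ``"1.9"`` sorts after ``"1.25"``.
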

23.6.1
======

6 changes: 3 additions & 3 deletions gpu-operator/versions.json
@@ -2,6 +2,9 @@
"latest": "23.6.1",
"versions":
[
{
"version": "23.9.0"
},
{
"version": "23.6.1"
},
@@ -16,9 +19,6 @@
},
{
"version": "22.9.2"
},
{
"version": "22.9.1"
}
]
}
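
Both ``versions.json`` files in this commit share one shape: a ``latest`` string and a ``versions`` array of objects with a ``version`` key. A consistency check over that shape might look like the following sketch. The helper name ``check_versions`` and the two rules it applies are assumptions for illustration, not checks that the documentation build is known to run.

```python
import json
from typing import List


def check_versions(doc: dict) -> List[str]:
    """Return a list of problems found in a versions.json document."""
    problems = []
    listed = [entry.get("version") for entry in doc.get("versions", [])]
    if doc.get("latest") not in listed:
        problems.append("latest is not in versions")
    if len(listed) != len(set(listed)):
        problems.append("duplicate version entries")
    return problems


# Content of openshift/versions.json after this commit, as shown in the diff.
OPENSHIFT = json.loads(
    '{"latest": "23.9.0", "versions": [{"version": "23.9.0"}, {"version": "23.6.1"}]}'
)
print(check_versions(OPENSHIFT))
```
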
5 changes: 4 additions & 1 deletion openshift/versions.json
@@ -1,7 +1,10 @@
{
"latest": "23.6.1",
"latest": "23.9.0",
"versions":
[
{
"version": "23.9.0"
},
{
"version": "23.6.1"
}
4 changes: 2 additions & 2 deletions repo.toml
@@ -139,7 +139,7 @@ output_format = "linkcheck"
docs_root = "${root}/gpu-operator"
project = "gpu-operator"
name = "NVIDIA GPU Operator"
version = "23.6.1"
version = "23.9.0"
copyright_start = 2020
sphinx_exclude_patterns = [
"troubleshootings.rst",
@@ -199,7 +199,7 @@ output_format = "linkcheck"
docs_root = "${root}/openshift"
project = "gpu-operator-openshift"
name = "NVIDIA GPU Operator on Red Hat OpenShift Container Platform"
version = "23.6.1"
version = "23.9.0"
copyright_start = 2020
sphinx_exclude_patterns = [
"get-entitlement.rst",
