Skip to content

Latest commit

 

History

History
1727 lines (1530 loc) · 55.5 KB

File metadata and controls

1727 lines (1530 loc) · 55.5 KB

OVN Kubernetes with Host Based Networking

In this configuration OVN Kubernetes is offloaded to the DPU and combined with NVIDIA Host Based Networking (HBN).

Prerequisites

The system is set up as described in the system prerequisites. The OVN Kubernetes with HBN case has the additional requirements:

DPU prerequisites

  • Bluefield 3 with 32GB of RAM

Software prerequisites

This guide uses the following tools which must be installed on the machine where the commands contained in this guide run..

  • kubectl
  • helm
  • envsubst

Network prerequisites

Control plane Nodes

  • Open vSwitch (OVS) packages installed - i.e. openvswitch-switch for Ubuntu 24.04
  • out-of-band management port should be configured as OVS bridge port with "bridge-uplink" OVS metadata This addresses a known issue.
  • DNS stub resolver should be disabled if using systemd resolvd

Worker Nodes

Kubernetes prerequisites

  • CNI not installed
  • kube-proxy not installed
  • coreDNS should be configured to run only on control plane nodes - e.g. using NodeAffinity. This addresses a known issue.
  • control plane setup is complete before starting this guide
  • worker nodes are not added until indicated by this guide

Control plane Nodes

  • Have the labels:
    • "k8s.ovn.org/zone-name": $KUBERNETES_NODE_NAME

Worker Nodes

  • Have the labels:
    • "k8s.ovn.org/dpu-host": ""
    • "k8s.ovn.org/zone-name": $KUBERNETES_NODE_NAME
  • Have the annotations:
    • "k8s.ovn.org/remote-zone-migrated": $KUBERNETES_NODE_NAME

Virtual functions

A number of virtual functions (VFs) will be created on hosts when provisioning DPUs. Certain of these VFs are marked for specific usage:

  • The first VF (vf0) is used by provisioning components.
  • The second VF (vf1) is used by ovn-kubernetes.
  • The remaining VFs are allocated by SR-IOV Device Plugin. Each pod using OVN Kubernetes in DPU mode as its primary CNI will have one of these VFs injected at Pod creation time.

Installation guide

0. Required variables

The following variables are required by this guide. A sensible default is provided where it makes sense, but many will be specific to the target infrastructure.

Commands in this guide are run in the same directory that contains this readme.

## IP Address for the Kubernetes API server of the target cluster on which DPF is installed.
## This should never include a scheme or a port.
## e.g. 10.10.10.10
export TARGETCLUSTER_API_SERVER_HOST=

## Port for the Kubernetes API server of the target cluster on which DPF is installed.
export TARGETCLUSTER_API_SERVER_PORT=6443

## IP address range for hosts in the target cluster on which DPF is installed.
## This is a CIDR in the form e.g. 10.10.10.0/24
export TARGETCLUSTER_NODE_CIDR=

## Virtual IP used by the load balancer for the DPU Cluster. Must be a reserved IP from the management subnet and not allocated by DHCP.
export DPUCLUSTER_VIP=

## DPU_P0 is the name of the first port of the DPU. This name must be the same on all worker nodes.
export DPU_P0=

## DPU_P0_VF1 is the name of the second Virtual Function (VF) of the first port of the DPU. This name must be the same on all worker nodes.
export DPU_P0_VF1=

## Interface on which the DPUCluster load balancer will listen. Should be the management interface of the control plane node.
export DPUCLUSTER_INTERFACE=

# IP address to the NFS server used as storage for the BFB.
export NFS_SERVER_IP=

# API key for accessing containers and helm charts from the NGC private repository.
# Note: This isn't technically required when using public images but is included here to demonstrate the secret flow in DPF when using images from a private registry.
export NGC_API_KEY=

## POD_CIDR is the CIDR used for pods in the target Kubernetes cluster.
export POD_CIDR=10.233.64.0/18

## SERVICE_CIDR is the CIDR used for services in the target Kubernetes cluster.
## This is a CIDR in the form e.g. 10.10.10.0/24
export SERVICE_CIDR=10.233.0.0/18 

## DPF_VERSION is the version of the DPF components which will be deployed in this guide.
export DPF_VERSION=v24.10.0

## URL to the BFB used in the `bfb.yaml` and linked by the DPUSet.
export BLUEFIELD_BITSTREAM="https://content.mellanox.com/BlueField/BFBs/Ubuntu22.04/bf-bundle-2.9.1-30_24.11_ubuntu-22.04_prod.bfb"

1. CNI installation

OVN Kubernetes is used as the primary CNI for the cluster. On worker nodes the primary CNI will be accelerated by offloading work to the DPU. On control plane nodes OVN Kubernetes will run without offloading.

Create the Namespace

kubectl create ns ovn-kubernetes

Docker login and create image pull secret

The image pull secret is required when using a private registry to host images and helm charts. If using a public registry this section can be ignored.

helm registry login nvcr.io --username \$oauthtoken --password $NGC_API_KEY

kubectl -n ovn-kubernetes create secret docker-registry dpf-pull-secret --docker-server=nvcr.io --docker-username="\$oauthtoken" --docker-password=$NGC_API_KEY

Install OVN Kubernetes from the helm chart

Install the OVN Kubernetes CNI components from the helm chart. A number of environment variables must be set before running this command.

envsubst < manifests/01-cni-installation/helm-values/ovn-kubernetes.yml | helm upgrade --install -n ovn-kubernetes ovn-kubernetes oci://ghcr.io/nvidia/ovn-kubernetes-chart --version $DPF_VERSION --values -
Expand for detailed helm values
tags:
  ovn-kubernetes-resource-injector: false
global:
  imagePullSecretName: "dpf-pull-secret"
k8sAPIServer: https://$TARGETCLUSTER_API_SERVER_HOST:$TARGETCLUSTER_API_SERVER_PORT
ovnkube-node-dpu-host:
  nodeMgmtPortNetdev: $DPU_P0_VF1
  gatewayOpts: --gateway-interface=$DPU_P0
## Note this CIDR is followed by a trailing /24 which informs OVN Kubernetes on how to split the CIDR per node.
podNetwork: $POD_CIDR/24
serviceNetwork: $SERVICE_CIDR
ovn-kubernetes-resource-injector:
  resourceName: nvidia.com/bf3-p0-vfs
dpuServiceAccountNamespace: dpf-operator-system

Verification

These verification commands may need to be run multiple times to ensure the condition is met.

Verify the CNI installation with:

## Ensure all nodes in the cluster are ready.
kubectl wait --for=condition=ready nodes --all
## Ensure all pods in the ovn-kubernetes namespace are ready.
kubectl wait --for=condition=ready --namespace ovn-kubernetes pods --all --timeout=300s

2. DPF Operator installation

Log in to private registries

The login and secret is required when using a private registry to host images and helm charts. If using a public registry this section can be ignored.

kubectl create namespace dpf-operator-system
kubectl -n dpf-operator-system create secret docker-registry dpf-pull-secret --docker-server=nvcr.io --docker-username="\$oauthtoken" --docker-password=$NGC_API_KEY
helm registry login nvcr.io --username \$oauthtoken --password $NGC_API_KEY

Install cert-manager

Cert manager is a prerequisite which is used to provide certificates for webhooks used by DPF and its dependencies.

helm repo add jetstack https://charts.jetstack.io --force-update
helm upgrade --install --create-namespace --namespace cert-manager cert-manager jetstack/cert-manager --version v1.16.1 -f ./manifests/02-dpf-operator-installation/helm-values/cert-manager.yml
Expand for detailed helm values
startupapicheck:
  enabled: false
crds:
  enabled: true
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/master
              operator: Exists
        - matchExpressions:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
tolerations:
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/master
cainjector:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
  tolerations:
    - operator: Exists
      effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
    - operator: Exists
      effect: NoSchedule
      key: node-role.kubernetes.io/master
webhook:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists
  tolerations:
    - operator: Exists
      effect: NoSchedule
      key: node-role.kubernetes.io/control-plane
    - operator: Exists
      effect: NoSchedule
      key: node-role.kubernetes.io/master

Install a CSI to back the DPUCluster etcd

In this guide the local-path-provisioner CSI from Rancher is used to back the etcd of the Kamaji based DPUCluster. This should be substituted for a reliable performant CNI to back etcd.

curl https://codeload.github.com/rancher/local-path-provisioner/tar.gz/v0.0.30 | tar -xz --strip=3 local-path-provisioner-0.0.30/deploy/chart/local-path-provisioner/
kubectl create ns local-path-provisioner
helm install -n local-path-provisioner local-path-provisioner ./local-path-provisioner --version 0.0.30 -f ./manifests/02-dpf-operator-installation/helm-values/local-path-provisioner.yml
Expand for detailed helm values
tolerations:
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/control-plane
  - operator: Exists
    effect: NoSchedule
    key: node-role.kubernetes.io/master

Create secrets and storage required by the DPF Operator

A number of environment variables must be set before running this command.

cat manifests/02-dpf-operator-installation/*.yaml | envsubst | kubectl apply -f - 

This deploys the following objects:

Secret for pulling images and helm charts
---
apiVersion: v1
kind: Secret
metadata:
  name: ngc-doca-oci-helm
  namespace: dpf-operator-system
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  name: nvstaging-doca-oci
  url: nvcr.io/nvstaging/doca
  type: helm
  ## Note `no_variable` here is used to ensure envsubst renders the correct username which is `$oauthtoken`
  username: $${no_variable}oauthtoken
  password: $NGC_API_KEY
---
apiVersion: v1
kind: Secret
metadata:
  name: ngc-doca-https-helm
  namespace: dpf-operator-system
  labels:
    argocd.argoproj.io/secret-type: repository
stringData:
  name: nvstaging-doca-https
  url: https://helm.ngc.nvidia.com/nvstaging/doca
  type: helm
  username: $${no_variable}oauthtoken
  password: $NGC_API_KEY
PersistentVolume and PersistentVolumeClaim for the provisioning controller
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: bfb-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  nfs: 
    path: /mnt/dpf_share/bfb
    server: $NFS_SERVER_IP
  persistentVolumeReclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bfb-pvc
  namespace: dpf-operator-system
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  volumeMode: Filesystem

Deploy the DPF Operator

A number of environment variables must be set before running this command.

envsubst < ./manifests/02-dpf-operator-installation/helm-values/dpf-operator.yml | helm upgrade --install -n dpf-operator-system dpf-operator oci://ghcr.io/nvidia/dpf-operator --version=$DPF_VERSION --values -
Expand for detailed helm values
imagePullSecrets:
  - name: dpf-pull-secret
kamaji-etcd:
  persistentVolumeClaim:
    storageClassName: local-path
node-feature-discovery:
  worker:
    extraEnvs:
      - name: "KUBERNETES_SERVICE_HOST"
        value: "$TARGETCLUSTER_API_SERVER_HOST"
      - name: "KUBERNETES_SERVICE_PORT"
        value: "$TARGETCLUSTER_API_SERVER_PORT"

Verification

These verification commands may need to be run multiple times to ensure the condition is met.

Verify the DPF Operator installation with:

## Ensure the DPF Operator deployment is available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-operator-controller-manager
## Ensure all pods in the DPF Operator system are ready.
kubectl wait --for=condition=ready --namespace dpf-operator-system pods --all

3. DPF System installation

This section involves creating the DPF system components and some basic infrastructure required for a functioning DPF-enabled cluster.

Deploy the DPF System components

A number of environment variables must be set before running this command.

kubectl create ns dpu-cplane-tenant1
cat manifests/03-dpf-system-installation/*.yaml | envsubst | kubectl apply -f - 

This will create the following objects:

DPF Operator to install the DPF System components
---
apiVersion: operator.dpu.nvidia.com/v1alpha1
kind: DPFOperatorConfig
metadata:
  name: dpfoperatorconfig
  namespace: dpf-operator-system
spec:
  imagePullSecrets:
  - dpf-pull-secret
  provisioningController:
    bfbPVCName: "bfb-pvc"
    dmsTimeout: 900
  kamajiClusterManager:
    disable: false
DPUCluster to serve as Kubernetes control plane for DPU nodes
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUCluster
metadata:
  name: dpu-cplane-tenant1
  namespace: dpu-cplane-tenant1
spec:
  type: kamaji
  maxNodes: 10
  version: v1.30.2
  clusterEndpoint:
    # deploy keepalived instances on the nodes that match the given nodeSelector.
    keepalived:
      # interface on which keepalived will listen. Should be the oob interface of the control plane node.
      interface: $DPUCLUSTER_INTERFACE
      # Virtual IP reserved for the DPU Cluster load balancer. Must not be allocatable by DHCP.
      vip: $DPUCLUSTER_VIP
      # virtualRouterID must be in range [1,255], make sure the given virtualRouterID does not duplicate with any existing keepalived process running on the host
      virtualRouterID: 126
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""

Verification

These verification commands may need to be run multiple times to ensure the condition is met.

Verify the DPF System with:

## Ensure the provisioning and DPUService controller manager deployments are available.
kubectl rollout status deployment --namespace dpf-operator-system dpf-provisioning-controller-manager dpuservice-controller-manager
## Ensure all other deployments in the DPF Operator system are Available.
kubectl rollout status deployment --namespace dpf-operator-system
## Ensure the DPUCluster is ready for nodes to join.
kubectl wait --for=condition=ready --namespace dpu-cplane-tenant1 dpucluster --all

4. Install components to enable accelerated CNI nodes

OVN Kubernetes will accelerate traffic by attaching a VF to each pod using the primary CNI. This VF is used to offload flows to the DPU. This section details the components needed to connect pods to the offloaded OVN Kubernetes CNI.

Install Multus and SRIOV Network Operator using NVIDIA Network Operator

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --force-update
helm upgrade --no-hooks --install --create-namespace --namespace nvidia-network-operator network-operator nvidia/network-operator --version 24.7.0 -f ./manifests/04-enable-accelerated-cni/helm-values/network-operator.yml
Expand for detailed helm values
nfd:
  enabled: false
  deployNodeFeatureRules: false
sriovNetworkOperator:
  enabled: true
sriov-network-operator:
  operator:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: node-role.kubernetes.io/master
                  operator: Exists
            - matchExpressions:
                - key: node-role.kubernetes.io/control-plane
                  operator: Exists
  crds:
    enabled: true
  sriovOperatorConfig:
    deploy: true
    configDaemonNodeSelector: null
operator:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
          - matchExpressions:
              - key: node-role.kubernetes.io/control-plane
                operator: Exists

Install the OVN Kubernetes resource injection webhook

The OVN Kubernetes resource injection webhook injected each pod scheduled to a worker node with a request for a VF and a Network Attachment Definition. This webhook is part of the same helm chart as the other components of the OVN Kubernetes CNI. Here it is installed by adjusting the existing helm installation to add the webhook component to the installation.

envsubst < manifests/04-enable-accelerated-cni/helm-values/ovn-kubernetes.yml | helm upgrade --install -n ovn-kubernetes ovn-kubernetes oci://ghcr.io/nvidia/ovn-kubernetes-chart --version $DPF_VERSION --values -
Expand for detailed helm values
tags:
  ## Enable the ovn-kubernetes-resource-injector
  ovn-kubernetes-resource-injector: true
global:
  imagePullSecretName: "dpf-pull-secret"
k8sAPIServer: https://$TARGETCLUSTER_API_SERVER_HOST:$TARGETCLUSTER_API_SERVER_PORT
ovnkube-node-dpu-host:
  nodeMgmtPortNetdev: $DPU_P0_VF1
  gatewayOpts: --gateway-interface=$DPU_P0
## Note this CIDR is followed by a trailing /24 which informs OVN Kubernetes on how to split the CIDR per node.
podNetwork: $POD_CIDR/24
serviceNetwork: $SERVICE_CIDR
ovn-kubernetes-resource-injector:
  resourceName: nvidia.com/bf3-p0-vfs
dpuServiceAccountNamespace: dpf-operator-system

Apply the NICClusterConfiguration and SriovNetworkNodePolicy

cat manifests/04-enable-accelerated-cni/*.yaml | envsubst | kubectl apply -f -

This will deploy the following objects:

NICClusterPolicy for the NVIDIA Network Operator
---
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  secondaryNetwork:
    multus:
      image: multus-cni
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v3.9.3
SriovNetworkNodePolicy for the SR-IOV Network Operator
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: bf3-p0-vfs
  namespace: nvidia-network-operator
spec:
  mtu: 1500
  nicSelector:
    deviceID: "a2dc"
    vendor: "15b3"
    pfNames:
    - $DPU_P0#2-45
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 46
  resourceName: bf3-p0-vfs
  isRdma: true
  externallyManaged: true
  deviceType: netdevice
  linkType: eth

Verification

These verification commands may need to be run multiple times to ensure the condition is met.

Verify the DPF System with:

## Ensure the provisioning and DPUService controller manager deployments are available.
kubectl wait --for=condition=Ready --namespace nvidia-network-operator pods --all
## Expect the following Daemonsets to be successfully rolled out.
kubectl rollout status daemonset --namespace nvidia-network-operator kube-multus-ds sriov-network-config-daemon sriov-device-plugin 
## Expect the network injector to be successfully rolled out.
kubectl rollout status deployment --namespace ovn-kubernetes ovn-kubernetes-ovn-kubernetes-resource-injector  

5. DPU Provisioning and Service Installation

In this step we deploy our DPUs and the services that will run on them. There are two ways to do this and that will be explained in the following sections 5.1 and 5.2.

Label the image pull secret

The image pull secret is required when using a private registry to host images and helm charts. If using a public registry this section can be ignored.

kubectl -n ovn-kubernetes label secret dpf-pull-secret dpu.nvidia.com/image-pull-secret=""

5.1. With User-Defined DPUSet and DPUService

In this mode the user is expected to create their own DPUSet and DPUService objects.

Create the DPF provisioning and DPUService objects

A number of environment variables must be set before running this command.

cat manifests/05.1-dpuservice-installation/*.yaml | envsubst | kubectl apply -f - 

This will deploy the following objects:

BFB to download Bluefield Bitstream to a shared volume
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: BFB
metadata:
  name: bf-bundle
  namespace: dpf-operator-system
spec:
  url: $BLUEFIELD_BITSTREAM
DPUSet to provision DPUs on worker nodes
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: DPUSet
metadata:
  name: dpuset
  namespace: dpf-operator-system
spec:
  nodeSelector:
    matchLabels:
      feature.node.kubernetes.io/dpu-enabled: "true"
  strategy:
    rollingUpdate:
      maxUnavailable: "10%"
    type: RollingUpdate
  dpuTemplate:
    spec:
      dpuFlavor: dpf-provisioning-hbn-ovn
      bfb:
        name: bf-bundle
      nodeEffect:
        taint:
          key: "dpu"
          value: "provisioning"
          effect: NoSchedule
      automaticNodeReboot: true
OVN DPUService to deploy OVN workloads to the DPUs
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUService
metadata:
  name: ovn-dpu
  namespace: dpf-operator-system 
spec:
  helmChart:
    source:
      repoURL: oci://ghcr.io/nvidia
      chart: ovn-kubernetes-chart
      version: $DPF_VERSION
    values:
      tags:
        ovn-kubernetes-resource-injector: false
        ovnkube-node-dpu: true
        ovnkube-node-dpu-host: false
        ovnkube-single-node-zone: false
        ovnkube-control-plane: false
      k8sAPIServer: https://$TARGETCLUSTER_API_SERVER_HOST:$TARGETCLUSTER_API_SERVER_PORT
      podNetwork: $POD_CIDR/24
      serviceNetwork: $SERVICE_CIDR
      global:
        gatewayOpts: "--gateway-interface=br-ovn --gateway-uplink-port=puplinkbrovn"
        imagePullSecretName: dpf-pull-secret
      ovnkube-node-dpu:
        kubernetesSecretName: "ovn-dpu" # user needs to populate based on DPUServiceCredentialRequest
        vtepCIDR: "10.0.120.0/22" # user needs to populate based on DPUServiceIPAM
        hostCIDR: $TARGETCLUSTER_NODE_CIDR
        ipamPool: "pool1" # user needs to populate based on DPUServiceIPAM
        ipamPoolType: "cidrpool" # user needs to populate based on DPUServiceIPAM
        ipamVTEPIPIndex: 0
        ipamPFIPIndex: 1
HBN DPUService to deploy HBN workloads to the DPUs
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUService
metadata:
  name: doca-hbn
  namespace: dpf-operator-system
spec:
  serviceID: doca-hbn
  interfaces:
  - p0-sf
  - p1-sf
  - app-sf
  serviceDaemonSet:
    annotations:
      k8s.v1.cni.cncf.io/networks: |-
        [
        {"name": "iprequest", "interface": "ip_lo", "cni-args": {"poolNames": ["loopback"], "poolType": "cidrpool"}},
        {"name": "iprequest", "interface": "ip_pf2dpu2", "cni-args": {"poolNames": ["pool1"], "poolType": "cidrpool", "allocateDefaultGateway": true}}
        ]
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 1.0.1
      chart: doca-hbn
    values:
      image:
        repository: nvcr.io/nvidia/doca/doca_hbn
        tag: 2.4.1-doca2.9.1
      resources:
        memory: 6Gi
        nvidia.com/bf_sf: 3
      configuration:
        perDPUValuesYAML: |
          - hostnamePattern: "*"
            values:
              bgp_peer_group: hbn
          - hostnamePattern: "worker1*"
            values:
              bgp_autonomous_system: 65101
          - hostnamePattern: "worker2*"
            values:
              bgp_autonomous_system: 65201
        startupYAMLJ2: |
          - header:
              model: BLUEFIELD
              nvue-api-version: nvue_v1
              rev-id: 1.0
              version: HBN 2.4.0
          - set:
              interface:
                lo:
                  ip:
                    address:
                      {{ ipaddresses.ip_lo.ip }}/32: {}
                  type: loopback
                p0_if,p1_if:
                  type: swp
                  link:
                    mtu: 9000
                pf2dpu2_if:
                  ip:
                    address:
                      {{ ipaddresses.ip_pf2dpu2.cidr }}: {}
                  type: swp
                  link:
                    mtu: 9000
              router:
                bgp:
                  autonomous-system: {{ config.bgp_autonomous_system }}
                  enable: on
                  graceful-restart:
                    mode: full
                  router-id: {{ ipaddresses.ip_lo.ip }}
              vrf:
                default:
                  router:
                    bgp:
                      address-family:
                        ipv4-unicast:
                          enable: on
                          redistribute:
                            connected:
                              enable: on
                        ipv6-unicast:
                          enable: on
                          redistribute:
                            connected:
                              enable: on
                      enable: on
                      neighbor:
                        p0_if:
                          peer-group: {{ config.bgp_peer_group }}
                          type: unnumbered
                        p1_if:
                          peer-group: {{ config.bgp_peer_group }}
                          type: unnumbered
                      path-selection:
                        multipath:
                          aspath-ignore: on
                      peer-group:
                        {{ config.bgp_peer_group }}:
                          remote-as: external
DOCA Telemetry Service DPUService to deploy DTS to the DPUs
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUService
metadata:
  name: doca-telemetry-service
  namespace: dpf-operator-system
spec:
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 0.2.3
      chart: doca-telemetry
Blueman DPUService to deploy Blueman to the DPUs
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUService
metadata:
  name: doca-blueman-service
  namespace: dpf-operator-system
spec:
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 1.0.5
      chart: doca-blueman
OVN DPUServiceCredentialRequest to allow cross cluster communication
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceCredentialRequest
metadata:
  name: ovn-dpu
  namespace: dpf-operator-system 
spec:
  serviceAccount:
    name: ovn-dpu
    namespace: dpf-operator-system 
  duration: 24h
  type: tokenFile
  secret:
    name: ovn-dpu
    namespace: dpf-operator-system 
  metadata:
    labels:
      dpu.nvidia.com/image-pull-secret: ""
DPUServiceInterfaces for physical ports on the DPU
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p0
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            uplink: "p0"
        spec:
          interfaceType: physical
          physical:
            interfaceName: p0
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p1
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            uplink: "p1"
        spec:
          interfaceType: physical
          physical:
            interfaceName: p1
OVN DPUServiceInterface to define the ports attached to OVN workloads on the DPU
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: ovn
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            port: ovn
        spec:
          interfaceType: ovn
HBN DPUServiceInterfaces to define the ports attached to HBN workloads on the DPU
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: app-sf 
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            svc.dpu.nvidia.com/interface: "app_sf"
            svc.dpu.nvidia.com/service: doca-hbn
        spec:
          interfaceType: service
          service:
            serviceID: doca-hbn
            network: mybrhbn
            ## NOTE: Interfaces inside the HBN pod must have the `_if` suffix due to a naming convention in HBN.
            interfaceName: pf2dpu2_if
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p0-sf
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            svc.dpu.nvidia.com/interface: "p0_sf"
            svc.dpu.nvidia.com/service: doca-hbn
        spec:
          interfaceType: service
          service:
            serviceID: doca-hbn
            network: mybrhbn
            ## NOTE: Interfaces inside the HBN pod must have the `_if` suffix due to a naming convention in HBN.
            interfaceName: p0_if
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p1-sf
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            svc.dpu.nvidia.com/interface: "p1_sf"
            svc.dpu.nvidia.com/service: doca-hbn
        spec:
          interfaceType: service
          service:
            serviceID: doca-hbn
            network: mybrhbn
            ## NOTE: Interfaces inside the HBN pod must have the `_if` suffix due to a naming convention in HBN.
            interfaceName: p1_if
DPUServiceFunctionChain to define the HBN-OVN ServiceFunctionChain
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceChain
metadata:
  name: hbn-to-fabric
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        spec:
          switches:
            - ports:
              - serviceInterface:
                  matchLabels:
                    uplink: p0
              - serviceInterface:
                  matchLabels:
                    svc.dpu.nvidia.com/service: doca-hbn
                    svc.dpu.nvidia.com/interface: "p0_sf"
            - ports:
              - serviceInterface:
                  matchLabels:
                    uplink: p1
              - serviceInterface:
                  matchLabels:
                    svc.dpu.nvidia.com/service: doca-hbn
                    svc.dpu.nvidia.com/interface: "p1_sf"
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceChain
metadata:
  name: ovn-to-hbn
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        spec:
          switches:
            - ports:
              - serviceInterface:
                  matchLabels:
                    svc.dpu.nvidia.com/service: doca-hbn
                    svc.dpu.nvidia.com/interface: "app_sf"
              - serviceInterface:
                  matchLabels:
                    port: ovn
DPUServiceIPAM to set up IP Address Management on the DPUCluster
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceIPAM
metadata:
  name: pool1
  namespace: dpf-operator-system
spec:
  ipv4Network:
    network: "10.0.120.0/22"
    gatewayIndex: 3
    prefixSize: 29
DPUServiceIPAM for the loopback interface in HBN
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceIPAM
metadata:
  name: loopback
  namespace: dpf-operator-system
spec:
  ipv4Network:
    network: "11.0.0.0/24"
    prefixSize: 32

5.2. With DPUDeployment

In this mode the user is expected to create a DPUDeployment object that reflects a set of DPUServices that should run on a set of DPUs.

If you want to learn more about DPUDeployments, feel free to check the DPUDeployment documentation.

Create the DPUDeployment, DPUServiceConfig, DPUServiceTemplate and other necessary objects

A number of environment variables must be set before running this command.

cat manifests/05.2-dpudeployment-installation/*.yaml | envsubst | kubectl apply -f - 

This will deploy the following objects:

BFB to download Bluefield Bitstream to a shared volume
---
apiVersion: provisioning.dpu.nvidia.com/v1alpha1
kind: BFB
metadata:
  name: bf-bundle
  namespace: dpf-operator-system
spec:
  url: $BLUEFIELD_BITSTREAM
DPUDeployment to provision DPUs on worker nodes
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUDeployment
metadata:
  name: ovn-hbn
  namespace: dpf-operator-system
spec:
  dpus:
    bfb: bf-bundle
    flavor: dpf-provisioning-hbn-ovn
    dpuSets:
    - nameSuffix: "dpuset1"
      nodeSelector:
        matchLabels:
          feature.node.kubernetes.io/dpu-enabled: "true"
  services:
    ovn:
      serviceTemplate: ovn
      serviceConfiguration: ovn
    hbn:
      serviceTemplate: hbn
      serviceConfiguration: hbn
    dts:
      serviceTemplate: dts
      serviceConfiguration: dts
    blueman:
      serviceTemplate: blueman
      serviceConfiguration: blueman
  serviceChains:
  - ports:
    - serviceInterface:
        matchLabels:
          uplink: p0
    - service:
        name: hbn
        interface: p0_if
  - ports:
    - serviceInterface:
        matchLabels:
          uplink: p1
    - service:
        name: hbn
        interface: p1_if
  - ports:
    - serviceInterface:
        matchLabels:
          port: ovn
    - service:
        name: hbn
        interface: pf2dpu2_if
OVN DPUServiceConfig and DPUServiceTemplate to deploy OVN workloads to the DPUs
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: ovn
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "ovn"
  serviceConfiguration:
    helmChart:
      values:
        k8sAPIServer: https://$TARGETCLUSTER_API_SERVER_HOST:$TARGETCLUSTER_API_SERVER_PORT
        podNetwork: $POD_CIDR/24
        serviceNetwork: $SERVICE_CIDR
        ovnkube-node-dpu:
          kubernetesSecretName: "ovn-dpu" # user needs to populate based on DPUServiceCredentialRequest
          vtepCIDR: "10.0.120.0/22" # user needs to populate based on DPUServiceIPAM
          hostCIDR: $TARGETCLUSTER_NODE_CIDR # user needs to populate
          ipamPool: "pool1" # user needs to populate based on DPUServiceIPAM
          ipamPoolType: "cidrpool" # user needs to populate based on DPUServiceIPAM
          ipamVTEPIPIndex: 0
          ipamPFIPIndex: 1
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: ovn
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "ovn"
  helmChart:
    source:
      repoURL: oci://ghcr.io/nvidia
      chart: ovn-kubernetes-chart
      version: $DPF_VERSION
    values:
      tags:
        ovn-kubernetes-resource-injector: false
        ovnkube-node-dpu: true
        ovnkube-node-dpu-host: false
        ovnkube-single-node-zone: false
        ovnkube-control-plane: false
      global:
        gatewayOpts: "--gateway-interface=br-ovn --gateway-uplink-port=puplinkbrovn"
        imagePullSecretName: dpf-pull-secret
HBN DPUServiceConfig and DPUServiceTemplate to deploy HBN workloads to the DPUs
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: hbn
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "hbn"
  serviceConfiguration:
    serviceDaemonSet:
      annotations:
        k8s.v1.cni.cncf.io/networks: |-
          [
          {"name": "iprequest", "interface": "ip_lo", "cni-args": {"poolNames": ["loopback"], "poolType": "cidrpool"}},
          {"name": "iprequest", "interface": "ip_pf2dpu2", "cni-args": {"poolNames": ["pool1"], "poolType": "cidrpool", "allocateDefaultGateway": true}}
          ]
    helmChart:
      values:
        configuration:
          perDPUValuesYAML: |
            - hostnamePattern: "*"
              values:
                bgp_peer_group: hbn
            - hostnamePattern: "worker1*"
              values:
                bgp_autonomous_system: 65101"
            - hostnamePattern: "worker2*"
              values:
                bgp_autonomous_system: 65201"
          startupYAMLJ2: |
            - header:
                model: BLUEFIELD
                nvue-api-version: nvue_v1
                rev-id: 1.0
                version: HBN 2.4.0
            - set:
                interface:
                  lo:
                    ip:
                      address:
                        {{ ipaddresses.ip_lo.ip }}/32: {}
                    type: loopback
                  p0_if,p1_if:
                    type: swp
                    link:
                      mtu: 9000
                  pf2dpu2_if:
                    ip:
                      address:
                        {{ ipaddresses.ip_pf2dpu2.cidr }}: {}
                    type: swp
                    link:
                      mtu: 9000
                router:
                  bgp:
                    autonomous-system: {{ config.bgp_autonomous_system }}
                    enable: on
                    graceful-restart:
                      mode: full
                    router-id: {{ ipaddresses.ip_lo.ip }}
                vrf:
                  default:
                    router:
                      bgp:
                        address-family:
                          ipv4-unicast:
                            enable: on
                            redistribute:
                              connected:
                                enable: on
                          ipv6-unicast:
                            enable: on
                            redistribute:
                              connected:
                                enable: on
                        enable: on
                        neighbor:
                          p0_if:
                            peer-group: {{ config.bgp_peer_group }}
                            type: unnumbered
                          p1_if:
                            peer-group: {{ config.bgp_peer_group }}
                            type: unnumbered
                        path-selection:
                          multipath:
                            aspath-ignore: on
                        peer-group:
                          {{ config.bgp_peer_group }}:
                            remote-as: external

  interfaces:
    ## NOTE: Interfaces inside the HBN pod must have the `_if` suffix due to a naming convention in HBN.
  - name: p0_if
    network: mybrhbn
  - name: p1_if
    network: mybrhbn
  - name: pf2dpu2_if
    network: mybrhbn
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: hbn
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "hbn"
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 1.0.1
      chart: doca-hbn
    values:
      image:
        repository: nvcr.io/nvidia/doca/doca_hbn
        tag: 2.4.1-doca2.9.1
      resources:
        memory: 6Gi
        nvidia.com/bf_sf: 3
DOCA Telemetry Service DPUServiceConfig and DPUServiceTemplate to deploy DTS to the DPUs
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: dts
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "dts"
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: dts
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "dts"
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 0.2.3
      chart: doca-telemetry
Blueman DPUServiceConfig and DPUServiceTemplate to deploy Blueman to the DPUs
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceConfiguration
metadata:
  name: blueman
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "blueman"
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceTemplate
metadata:
  name: blueman
  namespace: dpf-operator-system
spec:
  deploymentServiceName: "blueman"
  helmChart:
    source:
      repoURL: https://helm.ngc.nvidia.com/nvidia/doca
      version: 1.0.5
      chart: doca-blueman
OVN DPUServiceCredentialRequest to allow cross cluster communication
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceCredentialRequest
metadata:
  name: ovn-dpu
  namespace: dpf-operator-system 
spec:
  serviceAccount:
    name: ovn-dpu
    namespace: dpf-operator-system 
  duration: 24h
  type: tokenFile
  secret:
    name: ovn-dpu
    namespace: dpf-operator-system 
  metadata:
    labels:
      dpu.nvidia.com/image-pull-secret: ""
DPUServiceInterfaces for physical ports on the DPU
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p0
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            uplink: "p0"
        spec:
          interfaceType: physical
          physical:
            interfaceName: p0
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: p1
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            uplink: "p1"
        spec:
          interfaceType: physical
          physical:
            interfaceName: p1
OVN DPUServiceInterface to define the ports attached to OVN workloads on the DPU
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceInterface
metadata:
  name: ovn
  namespace: dpf-operator-system
spec:
  template:
    spec:
      template:
        metadata:
          labels:
            port: ovn
        spec:
          interfaceType: ovn
DPUServiceIPAM to set up IP Address Management on the DPUCluster
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceIPAM
metadata:
  name: pool1
  namespace: dpf-operator-system
spec:
  ipv4Network:
    network: "10.0.120.0/22"
    gatewayIndex: 3
    prefixSize: 29
DPUServiceIPAM for the loopback interface in HBN
---
apiVersion: svc.dpu.nvidia.com/v1alpha1
kind: DPUServiceIPAM
metadata:
  name: loopback
  namespace: dpf-operator-system
spec:
  ipv4Network:
    network: "11.0.0.0/24"
    prefixSize: 32

Verification

These verification commands, which are common to both the 5.1 DPUService and 5.2 DPUDeployment installations, may need to be run multiple times to ensure the condition is met.

Note that when using the DPUDeployment, the DPUService name will have the DPUDeployment name added as prefix. For example, ovn-hbn-doca-hbn. Use the correct name for the verification.

Verify the DPU and Service installation with:

## Ensure the DPUServices are created and have been reconciled.
kubectl wait --for=condition=ApplicationsReconciled --namespace dpf-operator-system  dpuservices doca-blueman-service doca-hbn doca-telemetry-service
## Ensure the DPUServiceIPAMs have been reconciled
kubectl wait --for=condition=DPUIPAMObjectReconciled --namespace dpf-operator-system dpuserviceipam --all
## Ensure the DPUServiceInterfaces have been reconciled
kubectl wait --for=condition=ServiceInterfaceSetReconciled --namespace dpf-operator-system dpuserviceinterface --all
## Ensure the DPUServiceChains have been reconciled
kubectl wait --for=condition=ServiceChainSetReconciled --namespace dpf-operator-system dpuservicechain --all

With DPUDeployment, verify the Service installation with:

## Ensure the DPUServices are created and have been reconciled.
kubectl wait --for=condition=ApplicationsReconciled --namespace dpf-operator-system  dpuservices ovn-hbn-doca-hbn

6. Test traffic

Add worker nodes to the cluster

At this point workers should be added to the cluster. Each worker node should be configured in line with the prerequisites and the specific OVN Kubernetes prerequisites.

As workers are added to the cluster DPUs will be provisioned and DPUServices will begin to be spun up.

Deploy test pods

kubectl apply -f manifests/06-test-traffic

HBN and OVN functionality can be tested by pinging between the pods and services deployed in the default namespace.

TODO: Add specific user commands to test traffic.

7. Deletion and clean up

For DPF deletion follows a specific order defined below. The OVN Kubernetes primary CNI can not be safely deleted from the cluster.

Delete the test pods

kubectl delete -f manifests/06-test-traffic --wait

Delete DPF CNI acceleration components

kubectl delete -f manifests/04-enable-accelerated-cni --wait
helm uninstall -n nvidia-network-operator network-operator --wait

## Run `helm install` with the original values to delete the OVN Kubernetes webhook.
## Note: Uninstalling OVN Kubernetes as primary CNI is not supported but this command must be run to remove the webhook and restore a functioning cluster.
envsubst < manifests/01-cni-installation/helm-values/ovn-kubernetes.yml | helm upgrade --install -n ovn-kubernetes ovn-kubernetes oci://ghcr.io/nvidia/ovn-kubernetes-chart --version $DPF_VERSION --values -

Delete the DPF Operator system and DPF Operator

First we have to delete some DPUServiceInterfaces. This is necessary because of a known issue during uninstallation.

kubectl delete -n dpf-operator-system dpuserviceinterface p0 p1 ovn --wait

Then we can delete the config and system namespace.

kubectl delete -n dpf-operator-system dpfoperatorconfig dpfoperatorconfig --wait
helm uninstall -n dpf-operator-system dpf-operator --wait

Delete DPF Operator dependencies

helm uninstall -n local-path-provisioner local-path-provisioner --wait 
kubectl delete ns local-path-provisioner --wait 
helm uninstall -n cert-manager cert-manager --wait 
kubectl -n dpf-operator-system delete secret dpf-pull-secret --wait
kubectl delete pv bfb-pv
kubectl delete namespace dpf-operator-system dpu-cplane-tenant1 cert-manager nvidia-network-operator --wait

Note: there can be a race condition with deleting the underlying Kamaji cluster which runs the DPU cluster control plane in this guide. If that happens it may be necessary to remove finalizers manually from DPUCluster and Datastore objects.

Limitations of DPF Setup

Host network pod services

The Kubelet process on the Kubernetes nodes use the OOB interface IP address to register in Kubernetes. This means that the nodes have the OOB IP addresses as node IP addresses. This means that pods using host networking have the OOB IP address of the hosts as pod IP address. However, that interface is not accelerated. This means that any component using the addresses of the pods using host networking will not benefit from hardware acceleration and high-speed ports.

For example, this means that when creating a Kubernetes NodePort service selecting pods using host networking, even if the user uses the high-speed IP of the host, the traffic will not be accelerated. In order to solve this, it is possible to create dedicated endpointSlices that contain the host high-speed port IP addresses instead of OOB port IP addresses. This way, the entire path to the pods will be accelerated and benefit from high performances, if the user uses the high speed IP address of the host with the nodePort port. This requires the workload running on the pod with host networking to also listen on the high-speed port IP address.