Note: feature gate names are case sensitive.
The snippet above assumes KubeVirt is installed in the kubevirt namespace. Change the namespace to suit your installation.
"},{"location":"cluster_admin/activating_feature_gates/#list-of-feature-gates","title":"List of feature gates","text":"
The list of feature gates (which evolves over time) can be checked directly in the source code.
"},{"location":"cluster_admin/annotations_and_labels/","title":"Annotations and labels","text":"
KubeVirt builds on and exposes a number of labels and annotations that either are used for internal implementation needs or expose useful information to API users. This page documents the labels and annotations that may be useful for regular API consumers. This page intentionally does not list labels and annotations that are merely part of internal implementation.
Note: Annotations and labels that are not specific to KubeVirt are also documented here.
This label marks resources that belong to KubeVirt. An optional value may indicate which specific KubeVirt component a resource belongs to. This label may be used to list all resources that belong to KubeVirt, for example, to uninstall it from a cluster.
This annotation is regularly updated by virt-handler to help determine if a particular node is alive and hence should be available for new virtual machine instance scheduling.
The KubeVirt VirtualMachineInstance API is implemented using a Kubernetes Custom Resource Definition (CRD). Because of this, KubeVirt is able to leverage a couple of features Kubernetes provides in order to perform validation checks on our API as objects are created and updated on the cluster.
"},{"location":"cluster_admin/api_validation/#how-api-validation-works","title":"How API Validation Works","text":""},{"location":"cluster_admin/api_validation/#crd-openapiv3-schema","title":"CRD OpenAPIv3 Schema","text":"
The KubeVirt API is registered with Kubernetes at install time through a series of CRD definitions. KubeVirt includes an OpenAPIv3 schema in these definitions which indicates to the Kubernetes Apiserver some very basic information about our API, such as what fields are required and what type of data is expected for each value.
This OpenAPIv3 schema validation is installed automatically and requires no action on the user's part to enable.
"},{"location":"cluster_admin/api_validation/#admission-control-webhooks","title":"Admission Control Webhooks","text":"
The OpenAPIv3 schema validation is limited. It only validates that the general structure of a KubeVirt object looks correct. It does not, however, verify that the contents of that object make sense.
With OpenAPIv3 validation alone, users can easily make simple mistakes (like not referencing a volume's name correctly with a disk) and the cluster will still accept the object. However, the VirtualMachineInstance will of course not start if these errors in the API exist. Ideally we'd like to catch configuration issues as early as possible and not allow an object to even be posted to the cluster if we can detect there's a problem with the object's Spec.
In order to perform this advanced validation, KubeVirt implements its own admission controller, which is registered with Kubernetes as an admission controller webhook at install time. As KubeVirt objects are posted to the cluster, the Kubernetes API server forwards creation requests to our webhook for validation before persisting the object into storage.
Note, however, that the KubeVirt admission controller only works if certain features are enabled on the cluster.
"},{"location":"cluster_admin/api_validation/#enabling-kubevirt-admission-controller-on-kubernetes","title":"Enabling KubeVirt Admission Controller on Kubernetes","text":"
When provisioning a new Kubernetes cluster, ensure that both the MutatingAdmissionWebhook and ValidatingAdmissionWebhook values are present in the Apiserver's --admission-control cli argument.
Below is an example of the --admission-control values we use during development
Note that the old --admission-control flag was deprecated in 1.10 and replaced with --enable-admission-plugins. MutatingAdmissionWebhook and ValidatingAdmissionWebhook are enabled by default.
"},{"location":"cluster_admin/api_validation/#enabling-kubevirt-admission-controller-on-okd","title":"Enabling KubeVirt Admission Controller on OKD","text":"
OKD also requires the admission control webhooks to be enabled at install time. The process is slightly different though. With OKD, we enable webhooks using an admission plugin.
These admission control plugins can be configured in openshift-ansible by setting the following value in the Ansible inventory file.
KubeVirt authorization is performed using Kubernetes's Role-Based Access Control system (RBAC). RBAC allows cluster admins to grant access to cluster resources by binding RBAC roles to users.
For example, an admin creates an RBAC role that represents the permissions required to create a VirtualMachineInstance. The admin can then bind that role to users in order to grant them the permissions required to launch a VirtualMachineInstance.
With RBAC roles, admins can grant users targeted access to various KubeVirt features.
The kubevirt.io:view ClusterRole gives users permissions to view all KubeVirt resources in the cluster. The permissions to create, delete, modify or access any KubeVirt resources beyond viewing the resource's spec are not included in this role. This means a user with this role could see that a VirtualMachineInstance is running, but could neither shut it down nor gain access to it via console/VNC.
The kubevirt.io:edit ClusterRole gives users permissions to modify all KubeVirt resources in the cluster. For example, a user with this role can create new VirtualMachineInstances, delete VirtualMachineInstances, and gain access to both console and VNC.
The kubevirt.io:admin ClusterRole grants users full permissions to all KubeVirt resources, including the ability to delete collections of resources.
The admin role also grants users access to view and modify the KubeVirt runtime config. This config exists within the KubeVirt Custom Resource under the configuration key, in the namespace in which the KubeVirt operator is running.
NOTE Users are only guaranteed the ability to modify the kubevirt runtime configuration if a ClusterRoleBinding is used. A RoleBinding will work to provide kubevirt CR access only if the RoleBinding targets the same namespace that the kubevirt CR exists in.
"},{"location":"cluster_admin/authorization/#binding-default-clusterroles-to-users","title":"Binding Default ClusterRoles to Users","text":"
The KubeVirt default ClusterRoles are granted to users by creating either a ClusterRoleBinding or RoleBinding object.
"},{"location":"cluster_admin/authorization/#binding-within-all-namespaces","title":"Binding within All Namespaces","text":"
With a ClusterRoleBinding, users receive the permissions granted by the role across all namespaces.
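As an illustration, a cluster-wide binding could look like the following minimal sketch. It uses the standard Kubernetes RBAC API; the binding name and the user name are hypothetical placeholders, not values taken from KubeVirt.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubevirt-view-binding   # hypothetical name
subjects:
- kind: User
  name: alice                   # hypothetical user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: kubevirt.io:view        # one of the default KubeVirt ClusterRoles
  apiGroup: rbac.authorization.k8s.io
```

Swapping the roleRef name for kubevirt.io:edit or kubevirt.io:admin grants the corresponding wider set of permissions.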
"},{"location":"cluster_admin/authorization/#binding-within-single-namespace","title":"Binding within Single Namespace","text":"
With a RoleBinding, users receive the permissions granted by the role only within a targeted namespace.
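A namespaced binding might be sketched as follows. The namespace, binding name and user name are hypothetical; note that a RoleBinding may reference a ClusterRole, in which case the permissions apply only inside the binding's namespace.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubevirt-edit-binding   # hypothetical name
  namespace: my-vms             # hypothetical namespace the permissions are limited to
subjects:
- kind: User
  name: alice                   # hypothetical user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: kubevirt.io:edit        # one of the default KubeVirt ClusterRoles
  apiGroup: rbac.authorization.k8s.io
```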
"},{"location":"cluster_admin/authorization/#extending-kubernetes-default-roles-with-kubevirt-permissions","title":"Extending Kubernetes Default Roles with KubeVirt permissions","text":"
The aggregated ClusterRole Kubernetes feature facilitates combining multiple ClusterRoles into a single aggregated ClusterRole. This feature is commonly used to extend the default Kubernetes roles with permissions to access custom resources that do not exist in the Kubernetes core.
In order to extend the default Kubernetes roles to provide permission to access KubeVirt resources, we need to add the following labels to the KubeVirt ClusterRoles.
By adding these labels, any user with a RoleBinding or ClusterRoleBinding involving one of the default Kubernetes roles will automatically gain access to the equivalent KubeVirt roles as well.
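The labels in question are the standard Kubernetes aggregation labels. The fragments below are a sketch of how the metadata of the three KubeVirt ClusterRoles would look once labelled; the rules of each role are omitted, and the pairing of each KubeVirt role with the equally named Kubernetes default role is an assumption based on the role names.

```yaml
# Metadata fragments only; the rules of each ClusterRole are unchanged.
metadata:
  name: kubevirt.io:admin
  labels:
    rbac.authorization.k8s.io/aggregate-to-admin: 'true'
---
metadata:
  name: kubevirt.io:edit
  labels:
    rbac.authorization.k8s.io/aggregate-to-edit: 'true'
---
metadata:
  name: kubevirt.io:view
  labels:
    rbac.authorization.k8s.io/aggregate-to-view: 'true'
```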
More information about aggregated cluster roles can be found in the Kubernetes RBAC documentation.
If the default KubeVirt ClusterRoles are not expressive enough, admins can create their own custom RBAC roles to grant users access to KubeVirt resources. RBAC roles are purely additive: there is no way to deny access, only to grant it.
Below is an example of what KubeVirt's default admin ClusterRole looks like. A custom RBAC role can be created by reducing the permissions in this example role.
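For illustration only, a reduced custom role might look like the sketch below. This is not the default admin ClusterRole; it is a hypothetical role that only allows viewing VirtualMachineInstances and opening their console or VNC session. The role name is a placeholder, and the resource names should be checked against the ClusterRoles shipped with your KubeVirt version.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: my-vmi-console-user     # hypothetical name
rules:
# Read-only access to the VirtualMachineInstance objects themselves
- apiGroups: ['kubevirt.io']
  resources: ['virtualmachineinstances']
  verbs: ['get', 'list', 'watch']
# Access to the console and VNC subresources
- apiGroups: ['subresources.kubevirt.io']
  resources: ['virtualmachineinstances/console', 'virtualmachineinstances/vnc']
  verbs: ['get']
```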
"},{"location":"cluster_admin/customize_components/","title":"Customize components","text":""},{"location":"cluster_admin/customize_components/#customize-kubevirt-components","title":"Customize KubeVirt Components","text":""},{"location":"cluster_admin/customize_components/#customize-components-using-patches","title":"Customize components using patches","text":"
If the patch created is invalid, KubeVirt will not be able to update or deploy the system. This is intended for special use cases and should not be used unless you know what you are doing.
Valid resource types are: Deployment, DaemonSet, Service, ValidatingWebhookConfiguration, MutatingWebhookConfiguration, APIService, and CertificateSecret. More information can be found in the API spec.
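A sketch of a KubeVirt CR carrying two such patches is shown here. It assumes the spec.customizeComponents.patches layout (resourceType, resourceName, patch, type) and the patch types json and strategic; verify both against the API spec of your KubeVirt version before use.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  customizeComponents:
    patches:
    # JSON patch: drop the livenessProbe of the first container
    - resourceType: Deployment
      resourceName: virt-controller
      patch: '[{"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}]'
      type: json
    # Strategic merge patch: add the annotation patch: true to the metadata
    - resourceType: Deployment
      resourceName: virt-controller
      patch: '{"metadata":{"annotations":{"patch":"true"}}}'
      type: strategic
```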
The above example will update the virt-controller deployment to have an annotation in its metadata that says patch: true and will remove the livenessProbe from the container definition.
If the flags are invalid or become invalid on update, the component will not be able to run.
When the customize flags option is used for a component, all of that component's default flags are removed and only the flags specified are used. The components whose flags can be changed are api, controller and handler. You can find out more details in the API spec.
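As a hedged sketch, flags for the api component could be set as follows. The spec.customizeComponents.flags layout and the flag v (log verbosity) are assumptions to be checked against the API spec; because all default flags are dropped, a real configuration would also need to list every default flag the component still requires.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  customizeComponents:
    flags:
      api:
        v: '5'   # assumed verbosity flag; defaults not listed here are removed
```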
"},{"location":"cluster_admin/device_status_on_Arm64/","title":"Device Status on Arm64","text":"
This page is based on https://github.com/kubevirt/kubevirt/issues/8916
| Devices | Description | Status on Arm64 |
| --- | --- | --- |
| DisableHotplug | | supported |
| Disks | sata/ virtio bus | support virtio bus |
| Watchdog | i6300esb | not supported |
| UseVirtioTransitional | virtio-transitional | supported |
| Interfaces | e1000/ virtio-net-device | support virtio-net-device |
| Inputs | tablet virtio/usb bus | supported |
| AutoattachPodInterface | connect to /net/tun (devices.kubevirt.io/tun) | supported |
| AutoattachGraphicsDevice | create a virtio-gpu device / vga device | support virtio-gpu |
| AutoattachMemBalloon | virtio-balloon-pci-non-transitional | supported |
| AutoattachInputDevice | auto add tablet | supported |
| Rng | virtio-rng-pci-non-transitional host:/dev/urandom | supported |
| BlockMultiQueue | \"driver\":\"virtio-blk-pci-non-transitional\",\"num-queues\":$cpu_number | supported |
| NetworkInterfaceMultiQueue | -netdev tap,fds=21:23:24:25,vhost=on,vhostfds=26:27:28:29,id=hostua-default # fd number equals to queue number | supported |
| GPUs | | not verified |
| Filesystems | virtiofs, vhost-user-fs-pci, need to enable featuregate: ExperimentalVirtiofsSupport | supported |
| ClientPassthrough | https://www.linaro.org/blog/kvm-pciemsi-passthrough-armarm64/ on x86_64, iommu need to be enabled | not verified |
| Sound | ich9/ ac97 | not supported |
| TPM | tpm-tis-device https://qemu.readthedocs.io/en/latest/specs/tpm.html | supported |
| Sriov | vfio-pci | not verified |
"},{"location":"cluster_admin/feature_gate_status_on_Arm64/","title":"Feature Gate Status on Arm64","text":"
This page is based on https://github.com/kubevirt/kubevirt/issues/9749. It records the feature gate status on the Arm64 platform. The statuses are explained below:
Supported: the feature gate is supported on the Arm64 platform.
Not supported yet: some dependencies of the feature gate do not support Arm64, so the feature is not supported for now. The dependencies may be supported in the future.
Not supported: the feature gate is not supported on Arm64.
Not verified: the feature has not been verified yet.
| FEATURE GATE | STATUS | NOTES |
| --- | --- | --- |
| ExpandDisksGate | Not supported yet | CDI is needed |
| CPUManager | Supported | Uses taskset to do CPU pinning; does not support kvm-hint-dedicated (this only works on the x86 platform) |
| NUMAFeatureGate | Not supported yet | Need to support Hugepage on Arm64 |
| IgnitionGate | Supported | This feature is only used for CoreOS/RhCOS |
| LiveMigrationGate | Supported | Verified live migration with masquerade network |
| SRIOVLiveMigrationGate | Not verified | Needs two identical machines and an SRIOV device |
| HypervStrictCheckGate | Not supported | Hyperv does not work on Arm64 |
| SidecarGate | Supported | |
| GPUGate | Not verified | Needs a GPU device |
| HostDevicesGate | Not verified | Needs a GPU or sound card |
| SnapshotGate | Supported | Needs snapshotter support https://github.com/kubernetes-csi/external-snapshotter |
| VMExportGate | Partially supported | Needs snapshotter support https://kubevirt.io/user-guide/operations/export_api/; supports exporting PVCs, does not support exporting DataVolumes and MemoryDump, which rely on CDI |
| HotplugVolumesGate | Not supported yet | Relies on DataVolume and CDI |
| HostDiskGate | Supported | |
| VirtIOFSGate | Supported | |
| MacvtapGate | Not supported yet | quay.io/kubevirt/macvtap-cni does not support Arm64, https://github.com/kubevirt/macvtap-cni#deployment |
| PasstGate | Supported | The VM has the same IP as the pod; a process is started for the network: /usr/bin/passt --runas 107 -e -t 8080 |
| DownwardMetricsFeatureGate | need more information | Used to let the guest get host information; failed on both Arm64 and x86_64. The block is successfully attached and the following information can be seen: -blockdev {\"driver\":\"file\",\"filename\":\"/var/run/kubevirt-private/downwardapi-disks/vhostmd0\",\"node-name\":\"libvirt-1-storage\",\"cache\":{\"direct\":true,\"no-flush\":false},\"auto-read-only\":true,\"discard\":\"unmap\"} But unable to get information via vm-dump-metrics: LIBMETRICS: read_mdisk(): Unable to read metrics disk; LIBMETRICS: get_virtio_metrics(): Unable to export metrics: open(/dev/virtio-ports/org.github.vhostmd.1) No such file or directory; LIBMETRICS: get_virtio_metrics(): Unable to read metrics |
| NonRootDeprecated | Supported | |
| NonRoot | Supported | |
| Root | Supported | |
| ClusterProfiler | Supported | |
| WorkloadEncryptionSEV | Not supported | SEV is only available on x86_64 |
| VSOCKGate | Supported | |
| HotplugNetworkIfacesGate | Not supported yet | Need to set up multus-cni and multus-dynamic-networks-controller: https://github.com/k8snetworkplumbingwg/multus-cni cat ./deployments/multus-daemonset-thick.yml \\| kubectl apply -f - ; https://github.com/k8snetworkplumbingwg/multus-dynamic-networks-controller kubectl apply -f manifests/dynamic-networks-controller.yaml. Currently, the image ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick does not support Arm64 servers. For more information please refer to https://github.com/k8snetworkplumbingwg/multus-cni/pull/1027. |
| CommonInstancetypesDeploymentGate | Not supported yet | Support of common-instancetypes instancetypes needs to be tested; common-instancetypes preferences for ARM workloads are still missing |
"},{"location":"cluster_admin/gitops/","title":"Managing KubeVirt with GitOps","text":"
The GitOps way uses Git repositories as a single source of truth to deliver infrastructure as code. Automation is employed to keep the desired and the live state of clusters in sync at all times. This means any change to a repository is automatically applied to one or more clusters while changes to a cluster will be automatically reverted to the state described in the single source of truth.
With GitOps, separating testing and production environments, improving the availability of applications and working with multi-cluster environments become considerably easier.
A few requirements need to be met before you can begin:
Kubernetes cluster or derivative (such as OpenShift) based on one of the latest three Kubernetes releases that are out at the time the KubeVirt release is made.
Kubernetes apiserver must have --allow-privileged=true in order to run KubeVirt's privileged DaemonSet.
KubeVirt is currently supported on the following container runtimes:
containerd
crio (with runv)
Other container runtimes, which do not use virtualization features, should work too. However, the mentioned ones are the main target.
"},{"location":"cluster_admin/installation/#integration-with-apparmor","title":"Integration with AppArmor","text":"
In most scenarios, KubeVirt can run normally on systems with AppArmor. However, there are several known use cases that may require additional user interaction.
On a system with AppArmor enabled, the locally installed profiles may block the execution of the KubeVirt privileged containers. That usually results in initialization failure of the virt-handler pod:
Here, the host AppArmor profile for libvirtd does not allow the execution of the /usr/libexec/qemu-kvm binary. In the future this will hopefully work out of the box (tracking issue), but until then there are a couple of possible workarounds.
The first (and simplest) one is to remove the libvirt package from the host: assuming the host is a dedicated Kubernetes node, you likely won't need it anyway.
If you actually need libvirt to be present on the host, then you can add the following rule to the AppArmor profile for libvirtd (usually /etc/apparmor.d/usr.sbin.libvirtd):
# vim /etc/apparmor.d/usr.sbin.libvirtd\n...\n/usr/libexec/qemu-kvm PUx,\n...\n# apparmor_parser -r /etc/apparmor.d/usr.sbin.libvirtd # or systemctl reload apparmor.service\n
The default AppArmor profile used by the container runtimes usually denies mount calls for workloads. That may prevent VMs with VirtIO-FS from running. This is a known issue. The current workaround is to run such a VM as unconfined by adding the following annotation to the VM or VMI object:
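A sketch of that annotation on a VMI is given below. It assumes the Kubernetes per-container AppArmor annotation and that the container running the guest in the virt-launcher pod is named compute; the VMI name is a placeholder.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: example-vmi   # hypothetical name
  annotations:
    # Run the compute container without an AppArmor profile
    container.apparmor.security.beta.kubernetes.io/compute: unconfined
# spec omitted for brevity
```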
Hardware with virtualization support is recommended. You can use virt-host-validate to ensure that your hosts are capable of running virtualization workloads:
$ virt-host-validate qemu\n QEMU: Checking for hardware virtualization : PASS\n QEMU: Checking if device /dev/kvm exists : PASS\n QEMU: Checking if device /dev/kvm is accessible : PASS\n QEMU: Checking if device /dev/vhost-net exists : PASS\n QEMU: Checking if device /dev/net/tun exists : PASS\n...\n
SELinux-enabled nodes need container-selinux installed. The minimum version is documented inside the kubevirt/kubevirt repository, in docs/getting-started.md, under \"SELinux support\".
For (older) release branches that don't specify a container-selinux version, version 2.170.0 or newer is recommended.
"},{"location":"cluster_admin/installation/#installing-kubevirt-on-kubernetes","title":"Installing KubeVirt on Kubernetes","text":"
KubeVirt can be installed using the KubeVirt operator, which manages the lifecycle of all the KubeVirt core components. Below is an example of how to install KubeVirt's latest official release. The operator supports deploying KubeVirt on both x86_64 and Arm64 platforms.
# Point at latest release\n$ export RELEASE=$(curl https://storage.googleapis.com/kubevirt-prow/release/kubevirt/kubevirt/stable.txt)\n# Deploy the KubeVirt operator\n$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-operator.yaml\n# Create the KubeVirt CR (instance deployment request) which triggers the actual installation\n$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-cr.yaml\n# wait until all KubeVirt components are up\n$ kubectl -n kubevirt wait kv kubevirt --for condition=Available\n
If hardware virtualization is not available, a software emulation fallback can be enabled by setting spec.configuration.developerConfiguration.useEmulation to true in the KubeVirt CR, as follows:
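A minimal KubeVirt CR with that setting could look like this sketch. It assumes the CR is named kubevirt and lives in the kubevirt namespace, as in the installation example above.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      useEmulation: true   # fall back to software emulation when /dev/kvm is unavailable
```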
Note: Prior to release v0.20.0 the condition for the kubectl wait command was named \"Ready\" instead of \"Available\"
Note: Prior to KubeVirt 0.34.2 a ConfigMap called kubevirt-config in the install namespace was used to configure KubeVirt. Since 0.34.2 this method is deprecated. If the ConfigMap exists, it still takes precedence over the configuration on the CR, but it will not receive future updates and you should migrate any custom configurations to spec.configuration on the KubeVirt CR.
All new components will be deployed under the kubevirt namespace:
Once privileges are granted, KubeVirt can be deployed as described above.
"},{"location":"cluster_admin/installation/#web-user-interface-on-okd","title":"Web user interface on OKD","text":"
No additional steps are required to extend OKD's web console for KubeVirt.
The virtualization extension is automatically enabled when KubeVirt deployment is detected.
"},{"location":"cluster_admin/installation/#from-service-catalog-as-an-apb","title":"From Service Catalog as an APB","text":"
You can find KubeVirt in the OKD Service Catalog and install it from there. In order to do that please follow the documentation in the KubeVirt APB repository.
"},{"location":"cluster_admin/installation/#installing-kubevirt-on-k3os","title":"Installing KubeVirt on k3OS","text":"
The following configuration needs to be added to all nodes prior to KubeVirt deployment:
k3os:\n  modules:\n  - kvm\n  - vhost_net\n
Once nodes are restarted with this configuration, KubeVirt can be deployed as described above.
"},{"location":"cluster_admin/installation/#installing-the-daily-developer-builds","title":"Installing the Daily Developer Builds","text":"
KubeVirt releases a developer build daily from the current main branch. One can see when the last release happened by looking at our nightly-build-jobs.
To install the latest developer build, run the following commands:
KubeVirt alone does not bring any additional network plugins; it just allows users to utilize them. If you want to attach your VMs to multiple networks (Multus CNI) or have full control over L2 (OVS CNI), you need to deploy the respective network plugins. For more information, refer to the OVS CNI installation guide.
Note: KubeVirt Ansible network playbook installs these plugins by default.
You can restrict the placement of the KubeVirt components across your cluster nodes by editing the KubeVirt CR:
The placement of the KubeVirt control plane components (virt-controller, virt-api) is governed by the .spec.infra.nodePlacement field in the KubeVirt CR.
The placement of the virt-handler DaemonSet pods (and consequently, the placement of the VM workloads scheduled to the cluster) is governed by the .spec.workloads.nodePlacement field in the KubeVirt CR.
For each of these .nodePlacement objects, the .affinity, .nodeSelector and .tolerations sub-fields can be configured. See the description in the API reference for further information about using these fields.
For example, to restrict the virt-controller and virt-api pods to only run on the control-plane nodes:
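One way to express this is sketched below, assuming the control-plane nodes carry the conventional node-role.kubernetes.io/control-plane label and the matching NoSchedule taint. Both are assumptions about the cluster, so adjust the selector and toleration to your environment.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  infra:
    nodePlacement:
      # Only schedule virt-controller and virt-api on control-plane nodes
      nodeSelector:
        node-role.kubernetes.io/control-plane: ''
      # Tolerate the taint commonly set on control-plane nodes
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
```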
"},{"location":"cluster_admin/ksm/#enabling-ksm-through-kubevirt-cr","title":"Enabling KSM through KubeVirt CR","text":"
KSM can be enabled on nodes through spec.configuration.ksmConfiguration in the KubeVirt CR. ksmConfiguration specifies the nodes on which KSM will be enabled by exposing a nodeLabelSelector. nodeLabelSelector is a LabelSelector that filters nodes by their labels: KSM is enabled on every node whose labels match the selector.
NOTE If nodeLabelSelector is nil KSM will not be enabled on any nodes. Empty nodeLabelSelector will enable KSM on every node.
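A sketch of such a configuration follows. The node label used in the selector is a hypothetical example; any label present on the target nodes works, and an empty selector ({}) would match every node.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    ksmConfiguration:
      nodeLabelSelector:
        matchLabels:
          example.com/ksm-candidate: 'true'   # hypothetical node label
```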
"},{"location":"cluster_admin/ksm/#annotation-and-restore-mechanism","title":"Annotation and restore mechanism","text":"
On those nodes where KubeVirt enables the KSM via configuration, an annotation will be added (kubevirt.io/ksm-handler-managed). This annotation is an internal record to keep track of which nodes are currently managed by virt-handler, so that it is possible to distinguish which nodes should be restored in case of future ksmConfiguration changes.
Let's imagine this scenario:
There are 3 nodes in the cluster and one of them (node01) has KSM externally enabled.
An admin patches the KubeVirt CR, adding a ksmConfiguration which enables KSM for node02 and node03.
After a while, an admin patches the KubeVirt CR again, deleting the ksmConfiguration.
Thanks to the annotation, virt-handler is able to disable KSM on only those nodes where it had itself enabled it (node02 and node03), leaving the others unchanged (node01).
KubeVirt can discover on which nodes KSM is enabled and will mark them with a special label (kubevirt.io/ksm-enabled) with value true. This label can be used to schedule VMs on nodes with or without KSM enabled.
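For instance, a VMI could be pinned to KSM-enabled nodes with a node selector on that label, as in this sketch. The VMI name is a placeholder and the rest of the spec is omitted.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: example-vmi   # hypothetical name
spec:
  # Only schedule onto nodes that KubeVirt has labelled as KSM-enabled
  nodeSelector:
    kubevirt.io/ksm-enabled: 'true'
  # domain, volumes and networks omitted for brevity
```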
Migration policies provide a new way of applying migration configurations to Virtual Machines. The policies can refine the KubeVirt CR's MigrationConfiguration, which sets the cluster-wide migration configurations. This way, the cluster-wide settings serve as a default that can be refined (i.e. changed, removed or added) by the migration policy.
Please bear in mind that migration policies are in version v1alpha1. This means that this API is not fully stable yet and that APIs may change in the future.
KubeVirt supports live migration of Virtual Machine workloads. Before migration policies were introduced, migration settings could be configured only at cluster-wide scope, by editing the KubeVirt CR's spec or, more specifically, its MigrationConfiguration.
Aspects of migration behaviour that can be customized include (among others) bandwidth, auto-convergence, post/pre-copy, the maximum number of parallel migrations, and timeouts.
Migration policies generalize the concept of defining migration configurations, so it would be possible to apply different configurations to specific groups of VMs.
This capability is useful wherever workloads need to be treated differently, for example because they have different priorities, must be segregated for security reasons, have different requirements, or are not migration-friendly and need help to converge.
Currently the MigrationPolicy spec will only include the following configurations from KubevirtCR's MigrationConfiguration (in the future more configurations that aren't part of Kubevirt CR are intended to be added):
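A sketch of a policy that sets these configurations is given below. The field names (allowAutoConverge, bandwidthPerMigration, completionTimeoutPerGiB, allowPostCopy) are assumed to mirror those of MigrationConfiguration; the policy name and all values are illustrative only.

```yaml
apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
metadata:
  name: example-migration-policy   # hypothetical name
spec:
  allowAutoConverge: true
  bandwidthPerMigration: 217Ki     # illustrative value
  completionTimeoutPerGiB: 23      # illustrative value, in seconds per GiB
  allowPostCopy: false
```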
All above fields are optional. When omitted, the configuration will be applied as defined in KubevirtCR's MigrationConfiguration. This way, KubevirtCR will serve as a configurable set of defaults for both VMs that are not bound to any MigrationPolicy and VMs that are bound to a MigrationPolicy that does not define all fields of the configurations.
"},{"location":"cluster_admin/migration_policies/#matching-policies-to-vms","title":"Matching Policies to VMs","text":"
Next in the spec are the selectors that define the group of VMs on which to apply the policy. The options to do so are the following.
This policy applies to the VMs in namespaces that have all the required labels:
apiVersion: migrations.kubevirt.io/v1alpha1\nkind: MigrationPolicy\nspec:\n  selectors:\n    namespaceSelector:\n      hpc-workloads: 'true' # Matches a key and a value\n
This policy applies for the VMs that have all the required labels:
apiVersion: migrations.kubevirt.io/v1alpha1\nkind: MigrationPolicy\nspec:\n  selectors:\n    virtualMachineInstanceSelector:\n      workload-type: db # Matches a key and a value\n
It is possible that multiple policies apply to the same VMI. In such cases, VMI labels take precedence over namespace labels. Defining two policies with exactly the same selectors is not allowed.
If multiple policies apply to the same VMI, the most detailed policy is applied, that is, the policy with the highest number of matching labels.
If several policies match a VMI with the same number of matching labels, they are sorted by the lexicographic order of the matching label keys, and the first one in this order is applied.
Before removing a Kubernetes node from the cluster, users will want to ensure that VirtualMachineInstances have been gracefully terminated before powering down the node. Since all VirtualMachineInstances are backed by a Pod, the recommended method of evicting VirtualMachineInstances is to use the kubectl drain command, or in the case of OKD the oc adm drain command.
"},{"location":"cluster_admin/node_maintenance/#evict-all-vms-from-a-node","title":"Evict all VMs from a Node","text":"
Select the node you'd like to evict VirtualMachineInstances from by identifying the node from the list of cluster nodes.
kubectl get nodes
The following command will gracefully terminate all VMs on a specific node. Replace <node-name> with the name of the node where the eviction should occur.
kubectl drain <node-name> --delete-local-data --ignore-daemonsets=true --force --pod-selector=kubevirt.io=virt-launcher
Below is a breakdown of why each argument passed to the drain command is required.
kubectl drain <node-name> is selecting a specific node as a target for the eviction
--delete-local-data is a required flag that is necessary for removing any pod that utilizes an emptyDir volume. The VirtualMachineInstance Pod does use emptyDir volumes, however the data in those volumes are ephemeral which means it is safe to delete after termination.
--ignore-daemonsets=true is a required flag because every node running a VirtualMachineInstance will also be running our helper DaemonSet called virt-handler. DaemonSets are not allowed to be evicted using kubectl drain. By default, if this command encounters a DaemonSet on the target node, the command will fail. This flag tells the command it is safe to proceed with the eviction and to just ignore DaemonSets.
--force is a required flag because VirtualMachineInstance pods are not owned by a ReplicaSet or DaemonSet controller. This means kubectl can't guarantee that the pods being terminated on the target node will get re-scheduled replacements placed elsewhere in the cluster after the pods are evicted. KubeVirt has its own controllers which manage the underlying VirtualMachineInstance pods. Each controller reacts differently to a VirtualMachineInstance being evicted. That behavior is outlined further down in this document.
--pod-selector=kubevirt.io=virt-launcher means only VirtualMachineInstance pods managed by KubeVirt will be removed from the node.
"},{"location":"cluster_admin/node_maintenance/#evict-all-vms-and-pods-from-a-node","title":"Evict all VMs and Pods from a Node","text":"
By removing the --pod-selector argument from the previous command, we can issue the eviction of all Pods on a node. This command ensures Pods associated with VMs as well as all other Pods are evicted from the target node.
"},{"location":"cluster_admin/node_maintenance/#evacuate-vmis-via-live-migration-from-a-node","title":"Evacuate VMIs via Live Migration from a Node","text":"
If the LiveMigration feature gate is enabled, it is possible to specify an evictionStrategy on VMIs so that they react to node eviction with a live migration. The following snippet on a VMI, or on the VMI template in a VM, ensures that the VMI is migrated during node eviction:
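A minimal sketch of such a VMI, assuming the LiveMigrate value of evictionStrategy; the VMI name is a placeholder and the rest of the spec is omitted.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: example-vmi   # hypothetical name
spec:
  # Migrate the VMI to another node instead of shutting it down on eviction
  evictionStrategy: LiveMigrate
  # domain, volumes and networks omitted for brevity
```

In a VirtualMachine, the same field would sit under spec.template.spec.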
Behind the scenes a PodDisruptionBudget is created for each VMI which has an evictionStrategy defined. This ensures that evictions are blocked on these VMIs and that we can guarantee that a VMI will be migrated instead of shut off.
Note Prior to v0.34 the drain process with live migrations was detached from the kubectl drain itself and required in addition specifying a special taint on the nodes: kubectl taint nodes foo kubevirt.io/drain=draining:NoSchedule. This is no longer needed. The taint will still be respected if provided but is obsolete.
"},{"location":"cluster_admin/node_maintenance/#re-enabling-a-node-after-eviction","title":"Re-enabling a Node after Eviction","text":"
The kubectl drain will result in the target node being marked as unschedulable. This means the node will not be eligible for running new VirtualMachineInstances or Pods.
If it is decided that the target node should become schedulable again, the following command must be run.
kubectl uncordon <node name>
or, in the case of OKD:
oc adm uncordon <node name>
"},{"location":"cluster_admin/node_maintenance/#shutting-down-a-node-after-eviction","title":"Shutting down a Node after Eviction","text":"
From KubeVirt's perspective, a node is safe to shut down once all VirtualMachineInstances have been evicted from the node. In a multi-use cluster where VirtualMachineInstances are being scheduled alongside other containerized workloads, it is up to the cluster admin to ensure all other pods have been safely evicted before powering down the node.
The eviction of any VirtualMachineInstance that is owned by a VirtualMachine set to running=true will result in the VirtualMachineInstance being re-scheduled to another node.
The VirtualMachineInstance in this case will be forced to power down and restart on another node. In the future once KubeVirt introduces live migration support, the VM will be able to seamlessly migrate to another node during eviction.
The eviction of VirtualMachineInstances owned by a VirtualMachineInstanceReplicaSet will result in the VirtualMachineInstanceReplicaSet scheduling replacements for the evicted VirtualMachineInstances on other nodes in the cluster.
Hotplug Network Interfaces are not supported on Arm64, because the image ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick does not support the Arm64 platform. For more information please refer to https://github.com/k8snetworkplumbingwg/multus-cni/pull/1027.
The Hugepages feature is not supported on Arm64. The hugepage mechanism differs between X86_64 and Arm64. Currently, KubeVirt is only verified on systems with a 4k page size.
"},{"location":"cluster_admin/operations_on_Arm64/#containerized-data-importer","title":"Containerized Data Importer","text":"
This project is not yet supported on Arm64, but support is planned.
Export API is partially supported on the Arm64 platform. As CDI is not supported yet, the export of DataVolumes and MemoryDump is not supported on Arm64.
Scheduling is the process of matching Pods/VMs to Nodes. By default, the scheduler used is kube-scheduler. Further details can be found at Kubernetes Scheduler Documentation.
Custom schedulers can be used if the default scheduler does not satisfy your needs. For instance, you might want to schedule VMs using a load aware scheduler such as Trimaran Schedulers.
"},{"location":"cluster_admin/scheduler/#creating-a-custom-scheduler","title":"Creating a Custom Scheduler","text":"
KubeVirt is compatible with custom schedulers. The configuration steps are described in the Official Kubernetes Documentation. Please note, the Kubernetes version KubeVirt is running on and the Kubernetes version used to build the custom scheduler have to match. To get the Kubernetes version KubeVirt is running on, you can run the following command:
Pay attention to the Server line. In this case, the Kubernetes version is v1.22.13. You have to check out the matching Kubernetes version and build the Kubernetes project:
$ cd kubernetes\n$ git checkout v1.22.13\n$ make\n
Then, you can follow the configuration steps described here. Additionally, the ClusterRole system:kube-scheduler needs permissions to use the verbs watch, list and get on StorageClasses.
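The additional permissions could be expressed as the rule below, appended to the rules list of the system:kube-scheduler ClusterRole. This is a sketch of the extra rule only, not the complete ClusterRole.

```yaml
# Fragment to append under "rules:" in the system:kube-scheduler ClusterRole
- apiGroups:
    - storage.k8s.io
  resources:
    - storageclasses
  verbs:
    - watch
    - list
    - get
```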
"},{"location":"cluster_admin/scheduler/#scheduling-vms-with-the-custom-scheduler","title":"Scheduling VMs with the Custom Scheduler","text":"
The second scheduler should be up and running. You can check it with:
$ kubectl get all -n kube-system\n
The deployment my-scheduler should be up and running if everything is set up properly. To launch the VM using the custom scheduler, you need to set the SchedulerName in the VM's spec to my-scheduler. Here is an example VM definition:
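A minimal sketch of such a VM, assuming a scheduler named my-scheduler is deployed. In the manifest the field is spelled schedulerName and sits in the VMI template spec. The VM name, memory request and container disk image are illustrative choices, not requirements.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-fedora
spec:
  runStrategy: Always
  template:
    spec:
      schedulerName: my-scheduler   # must match the name of the running custom scheduler
      domain:
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 1Gi             # illustrative value
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/kubevirt/fedora-cloud-container-disk-demo:latest   # illustrative image
```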
If the specified SchedulerName does not match any existing scheduler, the virt-launcher pod will stay in the Pending state until the specified scheduler can be found. You can check whether the VM has been scheduled using my-scheduler by checking the virt-launcher pod events associated with the VM. The pod should have been scheduled with my-scheduler.
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vm-fedora-dpc87 2/2 Running 0 24m\n\n$ kubectl describe pod virt-launcher-vm-fedora-dpc87\n[...] \nEvents:\n Type Reason Age From Message\n ---- ------ ---- ---- -------\n Normal Scheduled 21m my-scheduler Successfully assigned default/virt-launcher-vm-fedora-dpc87 to node01\n[...]\n
"},{"location":"cluster_admin/tekton_tasks/#manipulate-pvcs-with-libguestfs-tools","title":"Manipulate PVCs with libguestfs tools","text":"
disk-virt-customize - execute virt-customize commands in PVCs.
disk-virt-sysprep - execute virt-sysprep commands in PVCs.
"},{"location":"cluster_admin/tekton_tasks/#wait-for-virtual-machine-instance-status","title":"Wait for Virtual Machine Instance Status","text":"
wait-for-vmi-status - Waits for a VMI to be running.
"},{"location":"cluster_admin/tekton_tasks/#modify-windows-iso","title":"Modify Windows iso","text":"
modify-windows-iso-file - modifies windows iso (replaces prompt bootloader with no-prompt bootloader) and replaces original iso in PVC with updated one. This helps with automated installation of Windows in EFI boot mode. By default Windows in EFI boot mode uses a prompt bootloader, which will not continue with the boot process until a key is pressed. By replacing it with the non-prompt bootloader no key press is required to boot into the Windows installer.
All these Tasks can be used for creating Pipelines. We prepared example Pipelines which show what you can do with the KubeVirt Tasks.
Windows efi installer - This Pipeline will prepare a Windows 10/11/2k22 datavolume with virtio drivers installed. The user has to provide a working link to a Windows 10/11/2k22 iso file. The Pipeline is suitable for Windows versions that require EFI (e.g. Windows 10/11/2k22). More information about the Pipeline can be found here
Windows customize - This Pipeline will install a SQL server or a VS Code in a Windows VM. More information about the Pipeline can be found here
Note
If you define a different namespace for Pipelines and a different namespace for Tasks, you will have to create a cluster resolver object.
By default, example Pipelines create the resulting datavolume in the kubevirt-os-images namespace.
If you would like to create the resulting datavolume in a different namespace (by specifying the baseDvNamespace attribute in the Pipeline), additional RBAC permissions will be required (list of all required RBAC permissions can be found here).
If you would like to live migrate the VM while a given Pipeline is running, the following prerequisites must be met
KubeVirt has its own node daemon, called virt-handler. In addition to the usual k8s methods of detecting issues on nodes, the virt-handler daemon has its own heartbeat mechanism. This allows for fine-tuned error handling of VirtualMachineInstances.
When a VirtualMachineInstance gets scheduled, the scheduler only considers nodes where kubevirt.io/schedulable is true. This can be seen when looking at the corresponding pod of a VirtualMachineInstance:
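An abridged sketch of what such a pod looks like is given here. Only the nodeSelector entry is the point; the pod name is hypothetical and all other fields of the real pod are omitted.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: virt-launcher-vmi-example      # hypothetical pod name
spec:
  nodeSelector:
    kubevirt.io/schedulable: 'true'    # only nodes carrying this label are considered
  # containers and the remaining fields are omitted from this excerpt
```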
In case there is a communication issue or the host goes down, virt-handler can't update its labels and annotations anymore. Once the last kubevirt.io/heartbeat timestamp is older than five minutes, the KubeVirt node-controller kicks in and sets the kubevirt.io/schedulable label to false. As a consequence, no more VMIs will be scheduled to this node until virt-handler is connected again.
"},{"location":"cluster_admin/unresponsive_nodes/#deleting-stuck-vmis-when-virt-handler-is-unresponsive","title":"Deleting stuck VMIs when virt-handler is unresponsive","text":"
In cases where virt-handler has some issues but the node is in general fine, a VirtualMachineInstance can be deleted as usual via kubectl delete vmi <myvm>. Pods of a VirtualMachineInstance will be told by the cluster-controllers they should shut down. As soon as the Pod is gone, the VirtualMachineInstance will be moved to Failed state, if virt-handler did not manage to update its heartbeat in the meantime. If virt-handler could recover in the meantime, virt-handler will move the VirtualMachineInstance to Failed state instead of the cluster-controllers.
"},{"location":"cluster_admin/unresponsive_nodes/#deleting-stuck-vmis-when-the-whole-node-is-unresponsive","title":"Deleting stuck VMIs when the whole node is unresponsive","text":"
If the whole node is unresponsive, deleting a VirtualMachineInstance via kubectl delete vmi <myvmi> alone will never remove the VirtualMachineInstance. In this case all pods on the unresponsive node need to be force-deleted: First make sure that the node is really dead. Then delete all pods on the node via a force-delete: kubectl delete pod --force --grace-period=0 <mypod>.
As soon as the pod disappears and the heartbeat from virt-handler timed out, the VMIs will be moved to Failed state. If they were already marked for deletion they will simply disappear. If not, they can be deleted and will disappear almost immediately.
It takes up to five minutes until the KubeVirt cluster components can detect that virt-handler is unhealthy. During that time-frame it is possible that new VMIs are scheduled to the affected node. If virt-handler is not capable of connecting to these pods on the node, the pods will sooner or later go to failed state. As soon as the cluster finally detects the issue, the VMIs will be set to failed by the cluster.
"},{"location":"cluster_admin/updating_and_deletion/","title":"Updating and deletion","text":""},{"location":"cluster_admin/updating_and_deletion/#updating-kubevirt-control-plane","title":"Updating KubeVirt Control Plane","text":"
Zero downtime rolling updates are supported starting with release v0.17.0 onward. Updating from any release prior to the KubeVirt v0.17.0 release is not supported.
Note: Updating is only supported from N-1 to N release.
Updates are triggered in one of two ways.
By changing the imageTag value in the KubeVirt CR's spec.
For example, updating from v0.17.0-alpha.1 to v0.17.0 is as simple as patching the KubeVirt CR with the imageTag: v0.17.0 value. From there the KubeVirt operator will begin the process of rolling out the new version of KubeVirt. Existing VM/VMIs will remain uninterrupted both during and after the update succeeds.
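As a sketch, the relevant part of the KubeVirt CR after such a patch could look like this. It assumes the CR is named kubevirt and lives in the kubevirt namespace, as in the deletion commands on this page.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  imageTag: v0.17.0   # version the operator should roll out
```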
Or, by updating the kubevirt operator if no imageTag value is set.
When no imageTag value is set in the kubevirt CR, the system assumes that the version of KubeVirt is locked to the version of the operator. This means that updating the operator will result in the underlying KubeVirt installation being updated as well.
The first way provides a fine-grained approach where you have full control over what version of KubeVirt is installed independently of what version of the KubeVirt operator you might be running. The second approach allows you to lock both the operator and operand to the same version.
Newer KubeVirt may require additional or extended RBAC rules. In this case, the first update method may fail, because the virt-operator present in the cluster doesn't have these RBAC rules itself. In this case, you need to update the virt-operator first, and then proceed to update kubevirt. See this issue for more details.
Workload updates are supported as an opt-in feature starting with v0.39.0.
By default, when KubeVirt is updated this only involves the control plane components. Any existing VirtualMachineInstance (VMI) workloads that are running before an update occurs remain 100% untouched. The workloads continue to run and are not interrupted as part of the default update process.
It's important to note that these VMI workloads do involve components such as libvirt, qemu, and virt-launcher, which can optionally be updated during the KubeVirt update process as well. However that requires opting in to having virt-operator perform automated actions on workloads.
Opting in to VMI updates involves configuring the workloadUpdateStrategy field on the KubeVirt CR. This field controls the methods virt-operator will use when updating the VMI workload pods.
There are two methods supported.
LiveMigrate: Which results in VMIs being updated by live migrating the virtual machine guest into a new pod with all the updated components enabled.
Evict: Which results in the VMI's pod being shut down. If the VMI is controlled by a higher level VirtualMachine object with runStrategy: Always, then a new VMI will spin up in a new pod with updated components.
The least disruptive way to update VMI workloads is to use LiveMigrate. Any VMI workload that is not live migratable will be left untouched. If live migration is not enabled in the cluster, then the only option available for virt-operator managed VMI updates is the Evict method.
Example: Enabling VMI workload updates via LiveMigration
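A sketch of a KubeVirt CR with this setting, assuming the CR is named kubevirt in the kubevirt namespace:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  workloadUpdateStrategy:
    workloadUpdateMethods:
      - LiveMigrate   # update VMIs by live migrating them into new pods
```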
Example: Enabling VMI workload updates via Evict with batch tunings
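A sketch of the same CR using Evict. The batch size and interval values are illustrative and should be tuned to the cluster.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  workloadUpdateStrategy:
    workloadUpdateMethods:
      - Evict
    batchEvictionSize: 10           # illustrative: VMIs evicted per batch
    batchEvictionInterval: '1m0s'   # illustrative: pause between batches
```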
The batch tunings allow configuring how quickly VMIs are evicted. In large clusters, it's desirable to ensure that VMIs are evicted in batches in order to distribute load.
Example: Enabling VMI workload updates with both LiveMigrate and Evict
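A sketch listing both methods in the workloadUpdateStrategy of the KubeVirt CR (CR name and namespace assumed as elsewhere on this page):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  workloadUpdateStrategy:
    workloadUpdateMethods:
      - LiveMigrate   # used for every VMI that is live migratable
      - Evict         # used only for VMIs that cannot be live migrated
```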
When both LiveMigrate and Evict are specified, then any workloads which are live migratable will be guaranteed to be live migrated. Only workloads which are not live migratable will be evicted.
To delete KubeVirt, first delete the KubeVirt custom resource and then delete the KubeVirt operator.
$ export RELEASE=v0.17.0\n$ kubectl delete -n kubevirt kubevirt kubevirt --wait=true # --wait=true should anyway be default\n$ kubectl delete apiservices v1.subresources.kubevirt.io # this needs to be deleted to avoid stuck terminating namespaces\n$ kubectl delete mutatingwebhookconfigurations virt-api-mutator # not blocking but would be left over\n$ kubectl delete validatingwebhookconfigurations virt-operator-validator # not blocking but would be left over\n$ kubectl delete validatingwebhookconfigurations virt-api-validator # not blocking but would be left over\n$ kubectl delete -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-operator.yaml --wait=false\n
Note: If you deleted the operator first by mistake, the KV custom resource will get stuck in the Terminating state. To fix this, manually delete the finalizer from the resource.
Note: The apiservice and the webhookconfigurations need to be deleted manually due to a bug.
Currently, Node-labeller is partially supported on the Arm64 platform. It does not yet support parsing virsh_domcapabilities.xml and capabilities.xml, and extracting related information such as CPU features.
As Hugepages are a precondition of the NUMA feature, and Hugepages are not enabled on the Arm64 platform, the NUMA feature does not work on Arm64.
"},{"location":"cluster_admin/virtual_machines_on_Arm64/#disks-and-volumes","title":"Disks and Volumes","text":"
Arm64 only supports virtio and scsi disk bus types.
"},{"location":"cluster_admin/virtual_machines_on_Arm64/#interface-and-networks","title":"Interface and Networks","text":""},{"location":"cluster_admin/virtual_machines_on_Arm64/#macvlan","title":"macvlan","text":"
We do not support macvlan network because the project https://github.com/kubevirt/macvtap-cni does not support Arm64.
Support for redirection of client's USB device was introduced in release v0.44. This feature is not enabled by default. To enable it, add an empty clientPassthrough under devices, as such:
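A sketch of the relevant part of a VMI spec. Only the clientPassthrough key is significant; the rest of the VMI definition is omitted.

```yaml
spec:
  domain:
    devices:
      clientPassthrough: {}   # empty object enables client USB redirection
```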
This configuration currently adds 4 USB slots to the VMI that can only be used with virtctl.
There are two ways of redirecting the same USB devices: Either using its device's vendor and product information or the actual bus and device address information. In Linux, you can gather this info with lsusb, a redacted example below:
"},{"location":"compute/client_passthrough/#using-vendor-and-product","title":"Using Vendor and Product","text":"
Redirecting the Kingston storage device.
virtctl usbredir 0951:1666 vmi-name\n
"},{"location":"compute/client_passthrough/#using-bus-and-device-address","title":"Using Bus and Device address","text":"
Redirecting the integrated camera
virtctl usbredir 01-03 vmi-name\n
"},{"location":"compute/client_passthrough/#requirements-for-virtctl-usbredir","title":"Requirements for virtctl usbredir","text":"
The virtctl command uses an application called usbredirect to handle client's USB device by unplugging the device from the Client OS and channeling the communication between the device and the VMI.
The usbredirect binary comes from the usbredir project and is supported by most Linux distros. You can either fetch the latest release or MSI installer for Windows support.
Managing USB devices requires privileged access in most operating systems. The user running virtctl usbredir needs to be privileged or run it in a privileged manner (e.g. with sudo).
The CPU hotplug feature was introduced in KubeVirt v1.0, making it possible to configure the VM workload to allow for adding or removing virtual CPUs while the VM is running.
A virtual CPU (vCPU) is the CPU that is seen by the guest VM OS. A VM owner can manage the number of vCPUs from the VM spec template using the CPU topology fields (spec.template.spec.domain.cpu). The cpu object has the integers cores, sockets and threads, so the number of virtual CPUs is calculated by the following formula: cores * sockets * threads.
Before CPU hotplug was introduced, the VM owner could change these integers in the VM template while the VM was running, and they were staged until the next boot cycle. With CPU hotplug, it is possible to patch the sockets integer in the VM template and the change will take effect right away.
For each new socket that is hot-plugged, the number of new vCPUs seen by the guest is cores * threads, since the overall calculation of vCPUs is cores * sockets * threads.
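To make the arithmetic concrete, here is a sketch of the CPU topology in a VM template. The values are illustrative and the rest of the VM definition is omitted.

```yaml
spec:
  template:
    spec:
      domain:
        cpu:
          sockets: 4   # the value that is patched to hot-plug vCPUs
          cores: 1
          threads: 1
          # vCPUs seen by the guest: cores * sockets * threads = 1 * 4 * 1 = 4
          # after patching sockets to 5: 1 * 5 * 1 = 5, one new vCPU (cores * threads = 1)
```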
"},{"location":"compute/cpu_hotplug/#configure-the-workload-update-strategy","title":"Configure the workload update strategy","text":"
Current implementation of the hotplug process requires the VM to live-migrate. The migration will be triggered automatically by the workload updater. The workload update strategy in the KubeVirt CR must be configured with LiveMigrate, as follows:
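A sketch of the relevant fragment of the KubeVirt CR:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
spec:
  workloadUpdateStrategy:
    workloadUpdateMethods:
      - LiveMigrate
```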
"},{"location":"compute/cpu_hotplug/#configure-the-vm-rollout-strategy","title":"Configure the VM rollout strategy","text":"
Hotplug requires a VM rollout strategy of LiveUpdate, so that the changes made to the VM object propagate to the VMI without a restart. This is also done in the KubeVirt CR configuration:
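A sketch of the corresponding fragment of the KubeVirt CR. Depending on the KubeVirt version, an additional feature gate may be needed for this setting to take effect; that is not shown here.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
spec:
  configuration:
    vmRolloutStrategy: LiveUpdate   # propagate VM changes to the running VMI
```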
Let's assume we have a running VM with 4 vCPUs, which were configured with sockets:4 cores:1 threads:1. In the VMI status we can observe the current CPU topology the VM is running with:
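An abridged sketch of the VMI for this case. Only the status.currentCPUTopology block is the point; other fields are omitted, and the VMI name vm-cirros is taken from the migration listing on this page.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vm-cirros
status:
  currentCPUTopology:
    cores: 1
    sockets: 4
    threads: 1
```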
Note the condition HotVCPUChange, which indicates that the hotplug process is taking place. You can also see the VirtualMachineInstanceMigration object that was created for the VM in question:
NAME PHASE VMI\nkubevirt-workload-update-kflnl Running vm-cirros\n
When the hotplug process has completed, the currentCPUTopology will be updated with the new number of sockets and the migration is marked as successful.
vCPU hotplug is currently not supported on the Arm64 architecture.
Current hotplug implementation involves live-migration of the VM workload.
"},{"location":"compute/dedicated_cpu_resources/","title":"Dedicated CPU resources","text":"
Certain workloads that require predictable latency and enhanced performance during their execution benefit from obtaining dedicated CPU resources. KubeVirt, relying on the Kubernetes CPU manager, is able to pin a guest's vCPUs to the host's pCPUs.
"},{"location":"compute/dedicated_cpu_resources/#kubernetes-cpu-manager","title":"Kubernetes CPU manager","text":"
Kubernetes CPU manager is a mechanism that affects the scheduling of workloads, placing them on a host which can allocate Guaranteed resources and pinning certain Pod containers to host pCPUs, if the following requirements are met:
Pod's QoS is Guaranteed
resources requests and limits are equal
all containers in the Pod express CPU and memory requirements
Requested number of CPUs is an Integer
Additional information:
Enabling the CPU manager on Kubernetes
Enabling the CPU manager on OKD
Kubernetes blog explaining the feature
"},{"location":"compute/dedicated_cpu_resources/#requesting-dedicated-cpu-resources","title":"Requesting dedicated CPU resources","text":"
Setting spec.domain.cpu.dedicatedCpuPlacement to true in a VMI spec indicates the desire to allocate dedicated CPU resources to the VMI.
KubeVirt will verify that all the necessary conditions are met for the Kubernetes CPU manager to pin the virt-launcher container to dedicated host CPUs. Once virt-launcher is running, the VMI's vCPUs will be pinned to the pCPUs that have been dedicated to the virt-launcher container.
Expressing the desired amount of VMI's vCPUs can be done by either setting the guest topology in spec.domain.cpu (sockets, cores, threads) or spec.domain.resources.[requests/limits].cpu to a whole number integer ([1-9]+) indicating the number of vCPUs requested for the VMI. Number of vCPUs is counted as sockets * cores * threads or if spec.domain.cpu is empty then it takes value from spec.domain.resources.requests.cpu or spec.domain.resources.limits.cpu.
Note: Users should not specify both spec.domain.cpu and spec.domain.resources.[requests/limits].cpu
Note: spec.domain.resources.requests.cpu must be equal to spec.domain.resources.limits.cpu
Note: Multiple cpu-bound microbenchmarks show a significant performance advantage when using spec.domain.cpu.sockets instead of spec.domain.cpu.cores.
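A minimal sketch of a VMI requesting dedicated CPUs through the guest topology. The name and the memory limit are illustrative. With sockets: 2, cores: 1 and threads: 1 the VMI asks for 2 * 1 * 1 = 2 dedicated vCPUs.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-dedicated-cpu          # placeholder name
spec:
  domain:
    cpu:
      sockets: 2
      cores: 1
      threads: 1
      dedicatedCpuPlacement: true  # request pinning to dedicated host pCPUs
    resources:
      limits:
        memory: 2Gi                # illustrative value
    devices: {}
```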
"},{"location":"compute/dedicated_cpu_resources/#requesting-dedicated-cpu-for-qemu-emulator","title":"Requesting dedicated CPU for QEMU emulator","text":"
A number of QEMU threads, such as QEMU main event loop, async I/O operation completion, etc., also execute on the same physical CPUs as the VMI's vCPUs. This may affect the expected latency of a vCPU. In order to enhance the real-time support in KubeVirt and provide improved latency, KubeVirt will allocate an additional dedicated CPU, exclusively for the emulator thread, to which it will be pinned. This will effectively \"isolate\" the emulator thread from the vCPUs of the VMI. In case ioThreadsPolicy is set to auto IOThreads will also be \"isolated\" and placed on the same physical CPU as the QEMU emulator thread.
This functionality can be enabled by specifying isolateEmulatorThread: true inside the VMI spec's spec.domain.cpu section. Naturally, this setting has to be specified in combination with dedicatedCpuPlacement: true.
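A sketch of the relevant fragment of a VMI spec, with the rest of the definition omitted:

```yaml
spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true   # required for emulator thread isolation
      isolateEmulatorThread: true   # pin the emulator thread to its own dedicated CPU
```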
KubeVirt will then add one or two dedicated CPUs for the emulator threads, in a way that completes the total CPU count to be even.
"},{"location":"compute/dedicated_cpu_resources/#identifying-nodes-with-a-running-cpu-manager","title":"Identifying nodes with a running CPU manager","text":"
At this time, Kubernetes doesn't label the nodes that have the CPU manager running on them.
KubeVirt has a mechanism to identify which nodes have the CPU manager running and manually add a cpumanager=true label. This label will be removed when KubeVirt identifies that the CPU manager is no longer running on the node. This automatic identification should be viewed as a temporary workaround until Kubernetes provides the required functionality. Therefore, this feature has to be manually enabled by activating the CPUManager feature gate in the KubeVirt CR.
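A sketch of how the feature gate could be listed in the KubeVirt CR. Any feature gates already present in the list must be kept.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - CPUManager   # feature gate names are case sensitive
```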
When automatic identification is disabled, the cluster administrator may manually add the above label to all nodes where the CPU Manager is running.
Node labels can be viewed with: kubectl describe nodes
Administrators may manually label a missing node: kubectl label node [node_name] cpumanager=true
"},{"location":"compute/dedicated_cpu_resources/#sidecar-containers-and-cpu-allocation-overhead","title":"Sidecar containers and CPU allocation overhead","text":"
Note: In order to run sidecar containers, KubeVirt requires the Sidecar feature gate to be enabled in KubeVirt's CR.
According to the Kubernetes CPU manager model, for the Pod to reach the required QoS level Guaranteed, all containers in the Pod must express CPU and memory requirements. At this time, KubeVirt often uses a sidecar container to mount the VMI's registry disk. It also uses a sidecar container for its hooking mechanism. These additional resources can be viewed as an overhead and should be taken into account when calculating node capacity.
Note: The current defaults for the sidecar's resources are CPU: 200m and Memory: 64M. As the CPU resource is not expressed as a whole number, CPU manager will not attempt to pin the sidecar container to a host CPU.
KubeVirt provides a mechanism for assigning host devices to a virtual machine. This mechanism is generic and allows various types of PCI devices, such as accelerators (including GPUs) or any other devices attached to a PCI bus, to be assigned. It also allows Linux Mediated devices, such as pre-configured virtual GPUs to be assigned using the same mechanism.
"},{"location":"compute/host-devices/#host-preparation-for-pci-passthrough","title":"Host preparation for PCI Passthrough","text":"
Host Devices passthrough requires the virtualization extension and the IOMMU extension (Intel VT-d or AMD IOMMU) to be enabled in the BIOS.
To enable IOMMU, depending on the CPU type, a host should be booted with an additional kernel parameter, intel_iommu=on for Intel and amd_iommu=on for AMD.
Append these parameters to the end of the GRUB_CMDLINE_LINUX line in the grub configuration file.
The vfio-pci kernel module should be enabled on the host.
# modprobe vfio-pci\n
"},{"location":"compute/host-devices/#preparation-of-pci-devices-for-passthrough","title":"Preparation of PCI devices for passthrough","text":"
At this time, KubeVirt is only able to assign PCI devices that are using the vfio-pci driver. To prepare a specific device for device assignment, it should first be unbound from its original driver and bound to the vfio-pci driver.
"},{"location":"compute/host-devices/#preparation-of-mediated-devices-such-as-vgpu","title":"Preparation of mediated devices such as vGPU","text":"
In general, configuration of mediated devices (mdevs), such as vGPUs, should be done according to the vendor's directions. KubeVirt can now facilitate the creation of the mediated devices / vGPUs on the cluster nodes. This assumes that the required vendor driver is already installed on the nodes. See the Mediated devices and virtual GPUs page to learn more about this functionality.
Once the mdev is configured, KubeVirt will be able to discover and use it for device assignment.
Administrators can control which host devices are exposed and permitted to be used in the cluster. Permitted host devices need to be allowlisted in the KubeVirt CR by their vendor:product selector for PCI devices or by mediated device names.
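A sketch of such an allowlist in the KubeVirt CR. The field names come from the KubeVirt API; the resource names are illustrative labels chosen for this example, and the selectors assume an NVIDIA T4 GPU.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
spec:
  configuration:
    permittedHostDevices:
      pciHostDevices:
        - pciVendorSelector: '10DE:1EB8'              # vendor_id:product_id as reported by lspci
          resourceName: nvidia.com/TU104GL_Tesla_T4   # illustrative resource name
          externalResourceProvider: true              # allocation is left to an external device plugin
      mediatedDevices:
        - mdevNameSelector: 'GRID T4-2A'              # mediated device type name
          resourceName: nvidia.com/GRID_T4-2A         # illustrative resource name
```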
pciVendorSelector is a PCI vendor ID and product ID tuple in the form vendor_id:product_id. This tuple can identify specific types of devices on a host. For example, the identifier 10de:1eb8, shown above, can be found using lspci.
mdevNameSelector is a name of a Mediated device type that can identify specific types of Mediated devices on a host.
You can see what mediated types a given PCI device supports by examining the contents of /sys/bus/pci/devices/SLOT:BUS:DOMAIN.FUNCTION/mdev_supported_types/TYPE/name. For example, if you have an NVIDIA T4 GPU on your system, and you substitute in the SLOT, BUS, DOMAIN, and FUNCTION values that are correct for your system into the above path name, you will see that a TYPE of nvidia-226 contains the selector string GRID T4-2A in its name file.
Taking GRID T4-2A and specifying it as the mdevNameSelector allows KubeVirt to find a corresponding mediated device by matching it against /sys/class/mdev_bus/SLOT:BUS:DOMAIN.FUNCTION/$mdevUUID/mdev_type/name for some values of SLOT:BUS:DOMAIN.FUNCTION and $mdevUUID.
External providers: externalResourceProvider field indicates that this resource is being provided by an external device plugin. In this case, KubeVirt will only permit the usage of this device in the cluster but will leave the allocation and monitoring to an external device plugin.
"},{"location":"compute/host-devices/#starting-a-virtual-machine","title":"Starting a Virtual Machine","text":"
Host devices can be assigned to virtual machines via the gpus and hostDevices fields. The deviceNames can reference both PCI and Mediated device resource names.
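A sketch of the devices section of a VMI spec. The deviceName values are illustrative and must match resource names that are permitted in the KubeVirt CR; the name values are arbitrary local names.

```yaml
spec:
  domain:
    devices:
      gpus:
        - deviceName: nvidia.com/TU104GL_Tesla_T4   # illustrative PCI resource name
          name: gpu1
      hostDevices:
        - deviceName: nvidia.com/GRID_T4-2A         # illustrative mediated device resource name
          name: hostdevice1
```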
To pass through an NVMe device, the procedure is very similar to the GPU case. The device needs to be listed under permittedHostDevices and under hostDevices in the VM declaration.
Currently, the KubeVirt device plugin doesn't allow the user to select a specific device by specifying the address. Therefore, if multiple NVMe devices with the same vendor and product id exist in the cluster, they could be randomly assigned to a VM. If the devices are not on the same node, then the nodeSelector mitigates the issue.
Cluster admin privilege to edit the KubeVirt CR in order to:
Enable the HostDevices feature gate
Edit the permittedHostDevices configuration to expose node USB devices to the cluster
"},{"location":"compute/host-devices/#exposing-usb-devices","title":"Exposing USB Devices","text":"
In order to assign USB devices to your VMI, you'll need to expose those devices to the cluster under a resource name. The device allowlist can be edited in KubeVirt CR under configuration.permittedHostDevices.usb.
For this example, we will use the kubevirt.io/storage resource name for the device with vendor: \"46f4\" and product: \"0001\".
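A sketch of the corresponding KubeVirt CR fragment:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
spec:
  configuration:
    permittedHostDevices:
      usb:
        - resourceName: kubevirt.io/storage
          selectors:
            - vendor: '46f4'
              product: '0001'
```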
After adding the usb configuration under permittedHostDevices to the KubeVirt CR, KubeVirt's device-plugin will expose this resource name and you can use it in your VMI.
"},{"location":"compute/host-devices/#adding-usb-to-your-vm","title":"Adding USB to your VM","text":"
Now, in the VMI configuration, you can add the devices.hostDevices.deviceName and reference the resource name provided in the previous step, and also give it a local name, for example:
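A sketch of the relevant part of the VMI spec. The local name usb-storage is an arbitrary choice for this example.

```yaml
spec:
  domain:
    devices:
      hostDevices:
        - deviceName: kubevirt.io/storage   # resource name exposed in the KubeVirt CR
          name: usb-storage                 # local name, chosen freely
```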
You can find a working example, which uses QEMU's emulated USB storage, under examples/vmi-usb.yaml.
"},{"location":"compute/host-devices/#bundle-of-usb-devices","title":"Bundle of USB devices","text":"
You might be interested to redirect more than one USB device to a VMI, for example, a keyboard, a mouse and a smartcard device. The KubeVirt CR supports assigning multiple USB devices under the same resource name, so you could do:
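A sketch of such a bundle, assuming three devices. The resource name kubevirt.io/peripherals and all vendor and product IDs are hypothetical placeholders; substitute the values reported by lsusb for your devices.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
spec:
  configuration:
    permittedHostDevices:
      usb:
        - resourceName: kubevirt.io/peripherals   # hypothetical resource name
          selectors:
            - vendor: '045e'    # placeholder: keyboard
              product: '07a5'
            - vendor: '062a'    # placeholder: mouse
              product: '4102'
            - vendor: '072f'    # placeholder: smartcard reader
              product: 'b100'
```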
To enable hugepages on Kubernetes, check the official documentation.
To enable hugepages on OKD, check the official documentation.
"},{"location":"compute/hugepages/#pre-allocate-hugepages-on-a-node","title":"Pre-allocate hugepages on a node","text":"
To pre-allocate hugepages at boot time, you will need to specify hugepages under kernel boot parameters hugepagesz=2M hugepages=64 and restart your machine.
You can find more about hugepages under official documentation.
Live migration is a process during which a running Virtual Machine Instance moves to another compute node while the guest workload continues to run and remains accessible.
"},{"location":"compute/live_migration/#enabling-the-live-migration-support","title":"Enabling the live-migration support","text":"
Live migration is enabled by default in recent versions of KubeVirt. In versions prior to v0.56, it must be enabled through the feature gates: the feature gates field in the KubeVirt CR must be extended by adding LiveMigration to it.
Virtual machines using a PersistentVolumeClaim (PVC) must have a shared ReadWriteMany (RWX) access mode to be live migrated.
Live migration is not allowed with a pod network binding of bridge interface type.
Live migration requires ports 49152 and 49153 to be available in the virt-launcher pod. If these ports are explicitly specified in a masquerade interface, live migration will not function.
Live migration requires the virt-launcher pod's primary network interface to have the same name on both source and target pods.
"},{"location":"compute/live_migration/#initiate-live-migration","title":"Initiate live migration","text":"
Live migration is initiated by posting a VirtualMachineInstanceMigration (VMIM) object to the cluster. The example below starts a migration process for a virtual machine instance named vmi-fedora.
"},{"location":"compute/live_migration/#using-virtctl-to-initiate-live-migration","title":"Using virtctl to initiate live migration","text":"
Live migration can also be initiated using virtctl:
virtctl migrate vmi-fedora\n
"},{"location":"compute/live_migration/#migration-status-reporting","title":"Migration Status Reporting","text":""},{"location":"compute/live_migration/#condition-and-migration-method","title":"Condition and migration method","text":"
When a virtual machine instance starts, KubeVirt also calculates whether it is live migratable. The result is stored in VMI.status.conditions. The calculation can be based on multiple parameters of the VMI; at the moment, however, it is largely based on the access mode of the VMI volumes. Live migration is only permitted when the volume access mode is set to ReadWriteMany. Requests to migrate a non-LiveMigratable VMI will be rejected.
The reported Migration Method is also calculated during VMI start. BlockMigration indicates that some of the VMI disks require copying from the source to the destination. LiveMigration means that only the instance memory will be copied.
The migration progress status is reported in VMI.status. Most importantly, it indicates whether the migration has been Completed or if it Failed.
"},{"location":"compute/live_migration/#canceling-a-live-migration","title":"Canceling a live migration","text":"
Live migration can also be canceled by simply deleting the migration object. A successfully aborted migration will indicate that the abort has been requested (Abort Requested) and that it succeeded (Abort Status: Succeeded). The migration in this case will be Completed and Failed.
KubeVirt puts some limits in place, so that migrations don't overwhelm the cluster. By default, it is configured to only run 5 migrations in parallel with an additional limit of a maximum of 2 outbound migrations per node. Finally, every migration is limited to a bandwidth of 64MiB/s.
Bear in mind that most of these settings can be overridden and fine-tuned for a specific group of VMs. For more information, please see Migration Policies.
"},{"location":"compute/live_migration/#understanding-different-migration-strategies","title":"Understanding different migration strategies","text":"
Live migration is a complex process. During a migration, the source VM needs to transfer its whole state (mainly RAM) to the target VM. If there are enough resources available, such as network bandwidth and CPU power, migrations should converge nicely. If this is not the scenario, however, the migration might get stuck without an ability to progress.
The main factor that affects migrations from the guest perspective is its dirty rate, which is the rate at which the VM dirties memory. Guests with a high dirty rate lead to a race during migration. On the one hand, memory is transferred continuously to the target, and on the other, the same memory is dirtied again by the guest. In such scenarios, consider using more advanced migration strategies.
Let's explain the 3 supported migration strategies as of today.
Pre-copy is the default strategy. It should be used for most cases.
It works as follows:
The target VM is created, but the guest keeps running on the source VM.
The source starts sending chunks of VM state (mostly memory) to the target. This continues until all of the state has been transferred to the target.
The guest starts executing on the target VM.
The source VM is removed.
Pre-copy is the safest and fastest strategy for most cases. Furthermore, it can be easily cancelled, can utilize multithreading, and more. If there is no real reason to use another strategy, this is definitely the strategy to go with.
However, in some cases migrations might not converge easily. That is, by the time a chunk of source VM state is received by the target VM, it has already been mutated by the source VM (which is the VM the guest executes on). There are many reasons for migrations to fail to converge, such as a high dirty rate or scarce resources like network bandwidth and CPU. In such scenarios, see the alternative strategies below.
Post-copy migrations work as follows:
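The convergence problem described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions (constant bandwidth, constant dirty rate, and a hypothetical downtime_budget_mib below which the remaining state is small enough to switch over); it is not how QEMU or KubeVirt actually implement pre-copy.

```python
def precopy_rounds(memory_mib, bandwidth_mib_s, dirty_rate_mib_s,
                   downtime_budget_mib=64, max_rounds=30):
    '''Toy pre-copy model: returns the number of copy rounds needed to
    converge, or None if the migration does not converge in max_rounds.'''
    remaining = float(memory_mib)
    for round_no in range(1, max_rounds + 1):
        # Time needed to send everything that is currently outstanding.
        transfer_time_s = remaining / bandwidth_mib_s
        # While that data was in flight, the guest dirtied more memory,
        # which has to be sent again in the next round.
        remaining = dirty_rate_mib_s * transfer_time_s
        if remaining <= downtime_budget_mib:
            return round_no
    return None


# 8 GiB guest, 64 MiB/s link, guest dirtying 16 MiB/s: converges.
print(precopy_rounds(8192, 64, 16))
# Same guest dirtying memory faster than the link can carry it: never converges.
print(precopy_rounds(8192, 64, 80))
```

In this model the outstanding data shrinks by the ratio of dirty rate to bandwidth on every round, so a guest that dirties memory as fast as, or faster than, the link can transfer it never reaches the switch-over point.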
The target VM is created.
The guest starts running on the target VM.
The source starts sending chunks of VM state (mostly memory) to the target.
When the guest, running on the target VM, accesses memory:
If the memory exists on the target VM, the guest can access it.
Otherwise, the target VM asks for a chunk of memory from the source VM.
Once all of the memory state is updated at the target VM, the source VM is removed.
The main idea here is that the guest starts to run immediately on the target VM. This approach has advantages and disadvantages:
advantages:
The same memory chunk is never transferred twice. This is possible because, with post-copy, it doesn't matter that a page has been dirtied, since the guest is already running on the target VM.
This means that a high dirty-rate has much less effect.
Consumes less network bandwidth.
disadvantages:
When using post-copy, the VM state has no single source of truth. When the guest (running on the target VM) writes to memory, this memory is one part of the guest's state, but some other parts of it may still be up to date only at the source VM. This situation is generally dangerous, since, for example, if either the source or target VM crashes, the state cannot be recovered.
Slow warmup: when the guest starts executing, no memory is present at the target VM. Therefore, the guest has to wait for a lot of memory to be fetched in a short period of time.
Auto-converge is a technique to help pre-copy migrations converge faster without changing the core algorithm of how the migration works.
Since a high dirty-rate is usually the most significant factor preventing migrations from converging, auto-converge simply throttles the guest's CPU. If the migration converges fast enough, the guest's CPU is not throttled, or is throttled only negligibly. But if the migration does not converge fast enough, the CPU is throttled more and more as time goes on.
This technique dramatically increases the probability of the migration converging eventually.
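The effect of throttling can be sketched with a toy model. Everything here is an illustrative assumption: the throttle step, the rule for when to throttle more (whenever a round fails to at least halve the outstanding data), the 0.99 cap, and the idea that throttling the CPU by some fraction reduces the dirty rate by the same fraction. It is not the actual QEMU algorithm.

```python
def autoconverge_rounds(memory_mib, bandwidth_mib_s, dirty_rate_mib_s,
                        throttle_step=0.25, downtime_budget_mib=64,
                        max_rounds=30):
    '''Toy auto-converge model: returns (rounds, final_throttle) when the
    migration converges, or None if it does not within max_rounds.'''
    remaining = float(memory_mib)
    throttle = 0.0  # fraction of guest CPU time taken away
    for round_no in range(1, max_rounds + 1):
        # Assumption: a throttled guest dirties proportionally less memory.
        effective_dirty = dirty_rate_mib_s * (1.0 - throttle)
        transfer_time_s = remaining / bandwidth_mib_s
        new_remaining = effective_dirty * transfer_time_s
        if new_remaining <= downtime_budget_mib:
            return round_no, throttle
        # Not enough progress in this round: throttle the guest harder.
        if new_remaining > remaining / 2:
            throttle = min(throttle + throttle_step, 0.99)
        remaining = new_remaining
    return None


# A guest dirtying 80 MiB/s over a 64 MiB/s link only converges once throttled.
print(autoconverge_rounds(8192, 64, 80))
# With throttling disabled (step 0) the same migration never converges.
print(autoconverge_rounds(8192, 64, 80, throttle_step=0.0))
```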
"},{"location":"compute/live_migration/#using-a-different-network-for-migrations","title":"Using a different network for migrations","text":"
Live migrations can be configured to happen on a different network than the one Kubernetes is configured to use. That potentially allows for more determinism, control and/or bandwidth, depending on use-cases.
"},{"location":"compute/live_migration/#creating-a-migration-network-on-a-cluster","title":"Creating a migration network on a cluster","text":"
A separate physical network is required, meaning that every node on the cluster has to have at least 2 NICs, and the NICs that will be used for migrations need to be interconnected, i.e. all plugged into the same switch. The examples below assume that eth1 will be used for migrations.
It is also required for the Kubernetes cluster to have multus installed.
If the desired network doesn't include a DHCP server, then whereabouts will be needed as well.
Finally, a NetworkAttachmentDefinition needs to be created in the namespace where KubeVirt is installed. Here is an example:
"},{"location":"compute/live_migration/#configuring-kubevirt-to-migrate-vmis-over-that-network","title":"Configuring KubeVirt to migrate VMIs over that network","text":"
This is just a matter of adding the name of the NetworkAttachmentDefinition to the KubeVirt CR, like so:
That change will trigger a restart of the virt-handler pods, as they get connected to that new network.
From now on, migrations will happen over that network.
"},{"location":"compute/live_migration/#configuring-kubevirtci-for-testing-migration-networks","title":"Configuring KubeVirtCI for testing migration networks","text":"
Developers and people wanting to test the feature before deploying it on a real cluster might want to configure a dedicated migration network in KubeVirtCI.
KubeVirtCI can simply be configured to include a virtual secondary network, as well as automatically install multus and whereabouts. The following environment variables just have to be declared before running make cluster-up:
Depending on the type, the live migration process will copy virtual machine memory pages and disk blocks to the destination. During this process, non-locked pages and blocks are copied and then become free for the instance to use again. To achieve a successful migration, it is assumed that the instance will write to the free pages and blocks (pollute the pages) at a lower rate than these are being copied.
In some cases the virtual machine can have a high dirty-rate, which means it will write to different memory pages / disk blocks at a higher rate than these can be copied over. This situation will prevent the migration process from completing in a reasonable amount of time.
In this case, a timeout can be defined so that live migration will either be aborted or switched to post-copy mode (if it's enabled) if it is running for a long period of time.
The timeout is calculated based on the size of the VMI, that is, its memory and the ephemeral disks that need to be copied. The configurable parameter completionTimeoutPerGiB, which defaults to 800s, is the maximum amount of time per GiB of data allowed before the migration gets aborted or switched to post-copy mode. For example, with the default value, a VMI with 8GiB of memory will time out after 6400 seconds.
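The arithmetic can be written out as a small sketch. The function and parameter names are hypothetical, and simply adding memory and ephemeral disk sizes is an assumption about how the data size is accounted for.

```python
def completion_timeout_seconds(memory_gib, ephemeral_disk_gib=0,
                               completion_timeout_per_gib=800):
    '''Maximum migration duration before abort or switch to post-copy.'''
    # Assumption: the data to copy is memory plus ephemeral disks.
    data_gib = memory_gib + ephemeral_disk_gib
    return data_gib * completion_timeout_per_gib


# The example from the text: 8 GiB of memory with the default of 800s per GiB.
print(completion_timeout_seconds(8))
```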
Live migration will also be aborted when copying memory is found to make no progress. The time to wait for live migration to make progress in transferring data is configurable through the progressTimeout parameter, which defaults to 150s.
Note: While this increases performance it may allow MITM attacks. Be careful.
"},{"location":"compute/mediated_devices_configuration/","title":"Mediated devices and virtual GPUs","text":""},{"location":"compute/mediated_devices_configuration/#configuring-mediated-devices-and-virtual-gpus","title":"Configuring mediated devices and virtual GPUs","text":"
KubeVirt aims to facilitate the configuration of mediated devices on large clusters. Administrators can use the mediatedDevicesConfiguration API in the KubeVirt CR to create or remove mediated devices in a declarative way, by providing a list of the desired mediated device types that they expect to be configured in the cluster.
You can also include the nodeMediatedDeviceTypes option to provide a more specific configuration that targets a specific node or a group of nodes directly with a node selector. The nodeMediatedDeviceTypes option must be used in combination with mediatedDevicesTypes in order to override the global configuration set in the mediatedDevicesTypes section.
KubeVirt will use the provided configuration to automatically create the relevant mdev/vGPU devices on nodes that can support it.
Currently, a single mdev type per card will be configured. The maximum number of instances of the selected mdev type will be configured per card.
Note: Some vendors, such as NVIDIA, require a driver to be installed on the nodes to provide mediated devices, including vGPUs.
Example snippet of a KubeVirt CR configuration that includes both nodeMediatedDeviceTypes and mediatedDevicesTypes:
"},{"location":"compute/mediated_devices_configuration/#configuration-scenarios","title":"Configuration scenarios","text":""},{"location":"compute/mediated_devices_configuration/#example-large-cluster-with-multiple-cards-on-each-node","title":"Example: Large cluster with multiple cards on each node","text":"
On nodes with multiple cards that can support similar vGPU types, the relevant desired types will be created in a round-robin manner.
For example, considering the following KubeVirt CR configuration:
This cluster has nodes with two different PCIe cards:
Nodes with 3 Tesla T4 cards, where each card can support multiple device types:
nvidia-222
nvidia-223
nvidia-228
...
Nodes with 2 Tesla V100 cards, where each card can support multiple device types:
nvidia-105
nvidia-108
nvidia-217
nvidia-299
...
KubeVirt will then create the following devices:
Nodes with 3 Tesla T4 cards will be configured with:
16 vGPUs of type nvidia-222 on card 1
2 vGPUs of type nvidia-228 on card 2
16 vGPUs of type nvidia-222 on card 3
Nodes with 2 Tesla V100 cards will be configured with:
16 vGPUs of type nvidia-105 on card 1
2 vGPUs of type nvidia-108 on card 2
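The round-robin outcome above can be reproduced with a short sketch. It assumes that the desired list in the KubeVirt CR is nvidia-222, nvidia-228, nvidia-105, nvidia-108 (the CR is not reproduced here, so this list is inferred from the result) and that the desired types a card supports are cycled through in order. The function name is hypothetical and this is not KubeVirt's actual implementation.

```python
from itertools import cycle


def assign_types(desired_types, supported_types, card_count):
    '''Pick one mdev type per card, cycling through the desired types
    that the cards on this node support.'''
    usable = [t for t in desired_types if t in supported_types]
    if not usable:
        return []  # nothing from the desired list can be configured
    rotation = cycle(usable)
    return [next(rotation) for _ in range(card_count)]


# Assumed desired list, inferred from the devices listed above.
desired = ['nvidia-222', 'nvidia-228', 'nvidia-105', 'nvidia-108']
t4_supported = {'nvidia-222', 'nvidia-223', 'nvidia-228'}
v100_supported = {'nvidia-105', 'nvidia-108', 'nvidia-217', 'nvidia-299'}

print(assign_types(desired, t4_supported, 3))
print(assign_types(desired, v100_supported, 2))
```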
"},{"location":"compute/mediated_devices_configuration/#example-single-card-on-a-node-multiple-desired-vgpu-types-are-supported","title":"Example: Single card on a node, multiple desired vGPU types are supported","text":"
When nodes only have a single card, the first supported type from the list will be configured.
For example, consider the following list of desired types, where nvidia-223 and nvidia-224 are supported:
In this case, nvidia-223 will be configured on the node because it is the first supported type in the list."},{"location":"compute/mediated_devices_configuration/#overriding-configuration-on-a-specifc-node","title":"Overriding configuration on a specific node","text":"
To override the global configuration set by mediatedDevicesTypes, include the nodeMediatedDeviceTypes option, specifying the node selector and the mediatedDevicesTypes that you want to override for that node.
"},{"location":"compute/mediated_devices_configuration/#example-overriding-the-configuration-for-a-specific-node-in-a-large-cluster-with-multiple-cards-on-each-node","title":"Example: Overriding the configuration for a specific node in a large cluster with multiple cards on each node","text":"
In this example, the KubeVirt CR includes the nodeMediatedDeviceTypes option to override the global configuration specifically for node 2, which will only use the nvidia-234 type.
The cluster has two nodes that both have 3 Tesla T4 cards.
Each card can support a long list of types, including:
nvidia-222
nvidia-223
nvidia-224
nvidia-230
...
KubeVirt will then create the following devices:
Node 1
type nvidia-230 on card 1
type nvidia-223 on card 2
Node 2
type nvidia-234 on card 1 and card 2
Node 1 has been configured in a round-robin manner based on the global configuration but node 2 only uses the nvidia-234 that was specified for it.
"},{"location":"compute/mediated_devices_configuration/#updating-and-removing-vgpu-types","title":"Updating and Removing vGPU types","text":"
Changes made to the mediatedDevicesTypes section of the KubeVirt CR will trigger a re-evaluation of the configured mdevs/vGPU types on the cluster nodes.
Any change to the node labels that match the nodeMediatedDeviceTypes nodeSelector in the KubeVirt CR will trigger a similar re-evaluation.
Consequently, mediated devices will be reconfigured or entirely removed based on the updated configuration.
"},{"location":"compute/mediated_devices_configuration/#assigning-vgpumdev-to-a-virtual-machine","title":"Assigning vGPU/MDEV to a Virtual Machine","text":"
See the Host Devices Assignment to learn how to consume the newly created mediated devices/vGPUs.
KubeVirt now supports getting a VM memory dump for analysis purposes. The memory dump can be used to diagnose, identify and resolve issues in the VM. It typically provides information about the last state of the programs, applications and system before they were terminated or crashed.
Note: This memory dump is not used for saving VM state and resuming it later.
The memory dump process mounts a PVC to the virt-launcher in order to write the output to that PVC, hence the hotplug volumes feature gate must be enabled. To enable it, add HotplugVolumes to the feature gates field in the KubeVirt CR.
The size of the PVC must be big enough to hold the memory dump. The calculation is (VMMemorySize + 100Mi) * FileSystemOverhead, where VMMemorySize is the memory size, 100Mi is reserved space for the memory dump overhead and FileSystemOverhead is the value used to adjust the requested PVC size for the filesystem overhead. The PVC must also have a FileSystem volume mode.
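As a rough sketch of that calculation: the function name is hypothetical, and treating FileSystemOverhead as a plain multiplier (for example 1.055 for 5.5% overhead) is an assumption made for illustration. Check how your storage configuration expresses the overhead before relying on the numbers.

```python
import math

MI = 1024 * 1024  # bytes in one MiB


def memory_dump_pvc_bytes(vm_memory_bytes, fs_overhead_factor):
    '''Minimum PVC size per the formula
    (VMMemorySize + 100Mi) * FileSystemOverhead, rounded up to whole bytes.'''
    return math.ceil((vm_memory_bytes + 100 * MI) * fs_overhead_factor)


# A VM with 1 GiB of memory and an assumed overhead factor of 1.055.
size = memory_dump_pvc_bytes(1024 * MI, 1.055)
print(math.ceil(size / MI), 'MiB')
```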
By adding the --output flag, the memory will be dumped to the PVC and then downloaded to the given output path.
$ virtctl memory-dump get myvm --claim-name=memoryvolume --create-claim --output=memoryDump.dump.gz\n
For downloading the last memory dump from the PVC associated with the VM, without triggering another memory dump, use the memory dump download command.
During the process, the volumeStatus on the VMI will be updated with process information such as the attachment pod information and messages. If all goes well, once the process is completed the PVC is unmounted from the virt-launcher pod and the volumeStatus is deleted. A memory dump annotation will be added to the PVC with the memory dump file name.
"},{"location":"compute/memory_dump/#retriggering-the-memory-dump","title":"Retriggering the memory dump","text":"
Getting a new memory dump to the same PVC is possible without the need to use any flag:
$ virtctl memory-dump get my-vm\n
Note: Each memory-dump command will delete the previous dump in that PVC.
In order to get a memory dump to a different PVC you need to 'remove' the current memory-dump PVC and then do a new get with the new PVC name.
As mentioned, to remove the associated memory dump PVC you need to run a 'memory-dump remove' command. This will allow you to replace the current PVC and get the memory dump to a new one.
$ virtctl memory-dump remove my-vm\n
"},{"location":"compute/memory_dump/#handle-the-memory-dump","title":"Handle the memory dump","text":"
Once the memory dump process is completed, the PVC will hold the output. You can manage the dump in one of the following ways:
Download the memory dump.
Create a pod with troubleshooting tools that mounts the PVC, and inspect the dump within the pod.
Include the memory dump in the VM Snapshot (which will include both the memory dump and the disks) to save a snapshot of the VM at that point in time and inspect it when needed. (The VM Snapshot can be exported and downloaded.)
The output of the memory dump can be inspected with memory analysis tools, for example, Volatility3.
"},{"location":"compute/memory_hotplug/#configure-the-workload-update-strategy","title":"Configure the Workload Update Strategy","text":"
Configure LiveMigrate as workloadUpdateStrategy in the KubeVirt CR, since the current implementation of the hotplug process requires the VM to live-migrate.
"},{"location":"compute/memory_hotplug/#configure-the-vm-rollout-strategy","title":"Configure the VM rollout strategy","text":"
Finally, set the VM rollout strategy to LiveUpdate, so that the changes made to the VM object propagate to the VMI without a restart. This is also done in the KubeVirt CR configuration:
NOTE: If memory hotplug is enabled/disabled on an already running VM, a reboot is necessary for the changes to take effect.
More information can be found on the VM Rollout Strategies page.
"},{"location":"compute/memory_hotplug/#optional-set-a-cluster-wide-maximum-amount-of-memory","title":"[OPTIONAL] Set a cluster-wide maximum amount of memory","text":"
You can set the maximum amount of memory for the guest using a cluster level setting in the KubeVirt CR.
The VM-level configuration will take precedence over the cluster-wide one.
"},{"location":"compute/memory_hotplug/#memory-hotplug-in-action","title":"Memory Hotplug in Action","text":"
First we enable the VMLiveUpdateFeatures feature gate, set the rollout strategy to LiveUpdate and set LiveMigrate as workloadUpdateStrategy in the KubeVirt CR.
The Virtual Machine will automatically start and once booted it will report the currently available memory to the guest in the status.memory field inside the VMI.
$ kubectl get vmi vm-cirros -o json | jq .status.memory\n
After the hotplug request is processed and the Virtual Machine is live migrated, the new amount of memory should be available to the guest and visible in the VMI object.
$ kubectl get vmi vm-cirros -o json | jq .status.memory\n
Setting spec.nodeSelector requirements constrains the scheduler to only schedule VMs on nodes that contain the specified labels. In the following example, the VMI contains the labels cpu: slow and storage: fast:
Thus the scheduler will only schedule the VMI to nodes which contain these labels in their metadata. It works exactly like the Pod nodeSelector. See the Pod nodeSelector Documentation for more examples.
"},{"location":"compute/node_assignment/#affinity-and-anti-affinity","title":"Affinity and anti-affinity","text":"
The spec.affinity field allows specifying hard- and soft-affinity for VMs. It is possible to write matching rules against workloads (VMs and Pods) and Nodes. Since VMs are a workload type based on Pods, Pod-affinity affects VMs as well.
An example for podAffinity and podAntiAffinity may look like this:
Affinity and anti-affinity work exactly like Pod affinity. This includes podAffinity, podAntiAffinity, nodeAffinity and nodeAntiAffinity. See the Pod affinity and anti-affinity Documentation for more examples and details.
"},{"location":"compute/node_assignment/#taints-and-tolerations","title":"Taints and Tolerations","text":"
Affinity, as described above, is a property of VMs that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite - they allow a node to repel a set of VMs.
Taints and tolerations work together to ensure that VMs are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any VMs that do not tolerate the taints. Tolerations are applied to VMs, and allow (but do not require) the VMs to schedule onto nodes with matching taints.
You add a taint to a node using kubectl taint. For example,
"},{"location":"compute/node_assignment/#node-balancing-with-descheduler","title":"Node balancing with Descheduler","text":"
In some cases we might need to rebalance the cluster based on the current scheduling policy and load conditions. The Descheduler can find pods that violate, for example, scheduling decisions and evict them based on descheduler policies. KubeVirt VMs are handled as pods with local storage, so by default the descheduler will not evict them. This can be easily overridden by adding a special annotation to the VMI template in the VM:
This annotation allows the descheduler to evict the VM's pod, which can then be scheduled by the scheduler on a different node. A VirtualMachine will never restart or re-create a VirtualMachineInstance until the current instance of the VirtualMachineInstance is deleted from the cluster.
When the VM rollout strategy is set to LiveUpdate, changes to a VM's node selector or affinities will dynamically propagate to the VMI (unless the RestartRequired condition is set). Changes to tolerations will not dynamically propagate, and will trigger a RestartRequired condition if changed on a running VM.
Modifications of the node selector / affinities will only take effect on the next migration; the change alone will not trigger one.
KubeVirt does not yet support classical Memory Overcommit Management or Memory Ballooning. In other words VirtualMachineInstances can't give back memory they have allocated. However, a few other things can be tweaked to reduce the memory footprint and overcommit the per-VMI memory overhead.
"},{"location":"compute/node_overcommit/#remove-the-graphical-devices","title":"Remove the Graphical Devices","text":"
First, the safest option to reduce the memory footprint is to remove the graphical device from the VMI by setting spec.domain.devices.autoattachGraphicsDevice to false. See the video and graphics device documentation for further details and examples.
This will save a constant amount of 16MB per VirtualMachineInstance but also disable VNC access.
"},{"location":"compute/node_overcommit/#overcommit-the-guest-overhead","title":"Overcommit the Guest Overhead","text":"
Before you continue, make sure you are familiar with the Out of Resource Management of Kubernetes.
Every VirtualMachineInstance requests slightly more memory from Kubernetes than what was requested by the user for the Operating System. The additional memory is used for the per-VMI overhead consisting of our infrastructure which is wrapping the actual VirtualMachineInstance process.
In order to increase the VMI density on the node, it is possible to not request the additional overhead by setting spec.domain.resources.overcommitGuestOverhead to true:
This will work fine as long as most of the VirtualMachineInstances do not use all of the memory they requested. That is especially the case if you have short-lived VMIs. But if you have long-lived VirtualMachineInstances or run extremely memory-intensive tasks inside the VirtualMachineInstance, your VMIs will use all the memory they are granted sooner or later.
The third option is real memory overcommit on the VMI. In this scenario the VMI is explicitly told that it has more memory available than what is requested from the cluster by setting spec.domain.memory.guest to a value higher than spec.domain.resources.requests.memory.
The following definition requests 1024MB from the cluster but tells the VMI that it has 2048MB of memory available:
As long as there is enough free memory available on the node, the VMI can happily consume up to 2048MB. This VMI will get the Burstable resource class assigned by Kubernetes (see QoS classes in Kubernetes for more details). The same eviction rules as for Pods apply to the VMI in case the node comes under memory pressure.
Implicit memory overcommit is disabled by default. This means that when the memory request is not specified, it is set to match spec.domain.memory.guest. However, it can be enabled using spec.configuration.developerConfiguration.memoryOvercommit in the KubeVirt CR. For example, by setting memoryOvercommit: \"150\" we define that when the memory request is not explicitly set, it will be implicitly set to achieve a memory overcommit of 150%. For instance, when spec.domain.memory.guest: 3072M, the memory request is set to 2048M, if omitted. Note that the actual memory request depends on additional configuration options like OvercommitGuestOverhead.
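The relationship between guest memory, the overcommit percentage and the implicit request can be sketched as follows. The function name is hypothetical, and the sketch ignores rounding details and the additional options mentioned above.

```python
def implicit_memory_request(guest_memory, overcommit_percent):
    '''Memory request that yields the given overcommit percentage.
    guest_memory can be in any unit; the result uses the same unit.'''
    return guest_memory * 100 // overcommit_percent


# The example from the text: 3072M of guest memory at 150% overcommit.
print(implicit_memory_request(3072, 150))
# 100% means no overcommit: the request matches the guest memory.
print(implicit_memory_request(3072, 100))
```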
"},{"location":"compute/node_overcommit/#configuring-the-memory-pressure-behavior-of-nodes","title":"Configuring the memory pressure behavior of nodes","text":"
If the node gets under memory pressure, depending on the kubelet configuration the virtual machines may get killed by the OOM handler or by the kubelet itself. It is possible to tweak that behaviour based on the requirements of your VirtualMachineInstances by:
Configuring Soft Eviction Thresholds
Configuring Hard Eviction Thresholds
Requesting the right QoS class for VirtualMachineInstances
Note: Soft Eviction will effectively shut down VirtualMachineInstances. They are not paused, hibernated or migrated. Further, Soft Eviction is disabled by default.
If configured, VirtualMachineInstances get evicted once the available memory falls below the threshold specified via --eviction-soft, and the VirtualMachineInstance is given the chance to perform a shutdown of the VMI within a timespan specified via --eviction-max-pod-grace-period. The flag --eviction-soft-grace-period specifies for how long a soft eviction condition must be held before soft evictions are triggered.
If set properly according to the demands of the VMIs, overcommitting should only lead to soft evictions in rare cases for some VMIs. They may even get re-scheduled to the same node with less initial memory demand. For some workload types, this can be perfectly fine and lead to better overall memory-utilization.
"},{"location":"compute/node_overcommit/#configuring-hard-eviction-thresholds","title":"Configuring Hard Eviction Thresholds","text":"
Note: If unspecified, the kubelet will do hard evictions for Pods once memory.available falls below 100Mi.
Limits set via --eviction-hard will lead to immediate eviction of VirtualMachineInstances or Pods. This stops VMIs without a grace period and is comparable with power-loss on a real computer.
If the hard limit is hit, VMIs may from time to time simply be killed. They may be re-scheduled to the same node immediately again, since they start with less memory consumption again. This can be a simple option, if the memory threshold is only very seldom hit and the work performed by the VMIs is reproducible or it can be resumed from some checkpoints.
"},{"location":"compute/node_overcommit/#requesting-the-right-qos-class-for-virtualmachineinstances","title":"Requesting the right QoS Class for VirtualMachineInstances","text":"
Different QoS classes get assigned to Pods and VirtualMachineInstances based on the requests.memory and limits.memory. KubeVirt right now supports the QoS classes Burstable and Guaranteed. Burstable VMIs are evicted before Guaranteed VMIs.
This allows creating two classes of VMIs:
One type can have equal requests.memory and limits.memory set and therefore gets the Guaranteed class assigned. This one will not get evicted and should never run into memory issues, but is more demanding.
One type can have no limits.memory or a limits.memory which is greater than requests.memory and therefore gets the Burstable class assigned. These VMIs will be evicted first.
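The two classes can be told apart with a deliberately simplified check. It looks only at memory, as the description above does; Kubernetes itself also takes CPU requests and limits of every container in the pod into account, so treat this as an illustration rather than the real classification logic. The function name is hypothetical.

```python
def memory_qos_class(requests_memory_mib, limits_memory_mib=None):
    '''Simplified, memory-only view of the QoS class a VMI would get.'''
    if limits_memory_mib is not None and limits_memory_mib == requests_memory_mib:
        # Equal request and limit: never overcommitted, evicted last.
        return 'Guaranteed'
    # No limit, or a limit above the request: evicted first.
    return 'Burstable'


print(memory_qos_class(1024, 1024))
print(memory_qos_class(1024, 2048))
print(memory_qos_class(1024))
```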
"},{"location":"compute/node_overcommit/#setting-system-reserved-and-kubelet-reserved","title":"Setting --system-reserved and --kubelet-reserved","text":"
It may be important to reserve some memory for other daemons (not DaemonSets) which are running on the same node (ssh, dhcp servers, etc). The reservation can be done with the --system-reserved switch. Further, for the Kubelet and Docker a special flag called --kubelet-reserved exists.
The KSM (Kernel same-page merging) daemon can be started on the node. Depending on its tuning parameters, it can more or less aggressively try to merge identical pages between applications and VirtualMachineInstances. The more aggressively it is configured, the more CPU it will use itself, so the memory overcommit advantages come with a slight CPU performance hit.
Config file tuning allows changes to scanning frequency (how often will KSM activate) and aggressiveness (how many pages per second will it scan).
Note: This will definitely make sure that your VirtualMachines can't crash or get evicted from the node but it comes with the cost of pretty unpredictable performance once the node runs out of memory and the kubelet may not detect that it should evict Pods to increase the performance again.
Enabling swap is in general not recommended on Kubernetes right now. However, it can be useful in combination with KSM, since KSM merges identical pages over time. Swap allows the VMIs to successfully allocate memory which will then effectively never be used because of the later de-duplication done by KSM.
"},{"location":"compute/node_overcommit/#node-cpu-allocation-ratio","title":"Node CPU allocation ratio","text":"
KubeVirt runs Virtual Machines in a Kubernetes Pod. This pod requests a certain amount of CPU time from the host. On the other hand, the Virtual Machine is created with a certain number of vCPUs. The number of vCPUs does not necessarily correlate with the number of CPUs requested by the pod. Depending on the QoS of the pod, vCPUs can be scheduled on a variable number of physical CPUs; this depends on the available CPU resources on a node. When there are fewer CPUs available on the node than requested vCPUs, the vCPUs will be overcommitted.
By default, each pod requests 100m (100 millicores) of CPU time. The CPU requested on the pod sets the cgroups cpu.shares, which serves as a priority for the scheduler to provide CPU time for vCPUs in this pod. As the number of vCPUs increases, this will reduce the amount of CPU time each vCPU may get when competing with other processes on the node or other Virtual Machine Instances with a lower number of vCPUs.
The cpuAllocationRatio normalizes the amount of CPU time the pod requests based on the number of vCPUs: pod CPU request = number of vCPUs * 1/cpuAllocationRatio. When cpuAllocationRatio is set to 1, the full number of vCPUs will be requested for the pod.
Note: In Kubernetes, one full core is 1000 of CPU time More Information
Administrators can change this ratio by updating the KubeVirt CR
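A minimal sketch of such an update, assuming KubeVirt is installed in the kubevirt namespace with a CR named kubevirt; the value 10 is only an illustration:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt        # assumed CR name
  namespace: kubevirt   # assumed install namespace
spec:
  configuration:
    developerConfiguration:
      cpuAllocationRatio: 10   # illustrative value
```

With this value, a VMI with 4 vCPUs would request 4 * 1/10 = 0.4 CPU (400m) for its pod, following the formula above.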
NUMA support in KubeVirt is at this stage limited to a small set of special use-cases and will improve over time together with improvements made to Kubernetes.
In general, the goal is to map the host NUMA topology as efficiently as possible to the Virtual Machine topology to improve the performance.
The following NUMA mapping strategies can be used:
GuestMappingPassthrough will pass through the node NUMA topology to the guest. The topology is based on the dedicated CPUs which the VMI got assigned from the kubelet via the CPU Manager. It can be requested by setting spec.domain.cpu.numa.guestMappingPassthrough on the VMI.
Since KubeVirt does not know upfront which exclusive CPUs the VMI will get from the kubelet, there are some limitations:
Guests may see different NUMA topologies when being rescheduled.
The resulting NUMA topology may be asymmetrical.
The VMI may fail to start on the node if not enough hugepages are available on the assigned NUMA nodes.
While this NUMA modelling strategy has its limitations, aligning the guest's NUMA architecture with the node's can be critical for high-performance applications.
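As a rough sketch, a VMI requesting this strategy might contain the fragment below. The core count, page size and memory amount are illustrative assumptions; dedicated CPUs and hugepages are included because the limitations above refer to both.

```yaml
spec:
  domain:
    cpu:
      cores: 4                      # illustrative
      dedicatedCpuPlacement: true   # dedicated CPUs via the CPU Manager
      numa:
        guestMappingPassthrough: {}
    memory:
      hugepages:
        pageSize: 2Mi               # illustrative page size
    resources:
      requests:
        memory: 1Gi                 # illustrative, divisible by the page size
```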
It is possible to deploy Virtual Machines that run a real-time kernel and make use of libvirtd's guest CPU and memory optimizations that improve the overall latency. These changes mostly rely on settings already available in KubeVirt, as we will see shortly, but the VMI manifest now exposes two new settings that instruct KubeVirt to configure the generated libvirt XML with the recommended tuning settings for running real-time workloads.
To make use of the optimized settings, two new settings have been added to the VMI schema:
spec.domain.cpu.realtime: When defined, it instructs KubeVirt to configure the Linux scheduler for the VCPUS to run processes in FIFO scheduling policy (SCHED_FIFO) with priority 1. This setting guarantees that all processes running in the host will be executed with real-time priority.
spec.domain.cpu.realtime.mask: It defines which VCPUs assigned to the VM are used for real-time. If not defined, libvirt will configure all assigned VCPUS to run processes in FIFO scheduling with the highest priority (1).
A prerequisite for running real-time workloads is locking resources in the cluster to allow the real-time VM exclusive usage. This translates into one or more nodes that have been configured with a dedicated set of CPUs and that also provide support for NUMA with a free number of hugepages of 2Mi or 1Gi size (depending on the configuration in the VMI). Additionally, the node must be configured to allow the scheduler to run processes with real-time policy.
"},{"location":"compute/numa/#nodes-capable-of-running-real-time-workloads","title":"Nodes capable of running real-time workloads","text":"
When the KubeVirt pods are deployed on a node, they check whether the node is capable of running processes in real-time scheduling policy and, if so, label the node as real-time capable (kubevirt.io/realtime). If, on the other hand, the node is not able to deliver such capability, the label is not applied. To check which nodes are able to host real-time VM workloads, run this command:
$>kubectl get nodes -l kubevirt.io/realtime\nNAME STATUS ROLES AGE VERSION\nworker-0-0 Ready worker 12d v1.20.0+df9c838\n
Internally, the KubeVirt pod running on each node checks whether the kernel setting kernel.sched_rt_runtime_us equals -1, which allows processes to run in real-time scheduling policy for an unlimited amount of time.
"},{"location":"compute/numa/#configuring-a-vm-manifest","title":"Configuring a VM Manifest","text":"
Here is an example of a VM manifest that runs a custom fedora container disk configured to run with a real-time kernel. The settings have been configured for optimal efficiency.
CPU:
- model: host-passthrough to allow the guest to see the host CPU without masking any capability.
- dedicated CPU placement: The VM needs to have dedicated CPUs assigned to it. The Kubernetes CPU Manager takes care of this aspect.
- isolateEmulatorThread: to request an additional CPU to run the emulator on, thus avoiding the use of CPU cycles from the workload CPUs.
- ioThreadsPolicy: Set to auto to let the dedicated IO thread run on the same CPU as the emulator thread.
- NUMA: defining guestMappingPassthrough enables NUMA support for this VM.
- realtime: instructs the virt-handler to configure this VM for real-time workloads, such as configuring the VCPUS to use the FIFO scheduler policy and setting the priority to 1.
When a realtime mask of 0 is applied, KubeVirt will only set the first VCPU for real-time scheduler policy, leaving the remaining VCPUS to use the default scheduler policy. Other examples of valid masks are:
- 0-3: Use cores 0 to 3 for real-time scheduling, assuming that the VM has requested at least 4 cores.
- 0-3,^1: Use cores 0, 2 and 3 for real-time scheduling only, assuming that the VM has requested at least 4 cores.
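A mask that dedicates only the first VCPU to real-time scheduling could be written as in the sketch below. Only the spec.domain.cpu.realtime.mask field comes from the text above; the surrounding fragment is an illustrative assumption.

```yaml
spec:
  domain:
    cpu:
      dedicatedCpuPlacement: true
      realtime:
        mask: '0'   # only VCPU 0 uses the real-time scheduler policy
```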
Kubernetes provides additional NUMA components that may be relevant to your use-case but typically are not enabled by default. Please consult the Kubernetes documentation for details on configuration of these components.
Topology Manager provides optimizations related to CPU isolation, memory and device locality. It is useful, for example, where an SR-IOV network adaptor VF allocation needs to be aligned with a NUMA node.
Memory Manager is analogous to CPU Manager. It is useful, for example, where you want to align hugepage allocations with a NUMA node. It works in conjunction with Topology Manager.
The Memory Manager employs hint generation protocol to yield the most suitable NUMA affinity for a pod. The Memory Manager feeds the central manager (Topology Manager) with these affinity hints. Based on both the hints and Topology Manager policy, the pod is rejected or admitted to the node.
"},{"location":"compute/persistent_tpm_and_uefi_state/","title":"Persistent TPM and UEFI state","text":"
FEATURE STATE: KubeVirt v1.0.0
For both TPM and UEFI, libvirt supports persisting data created by a virtual machine as files on the virtualization host. In KubeVirt, the virtualization host is the virt-launcher pod, which is ephemeral (created on VM start and destroyed on VM stop). As of v1.0.0, KubeVirt supports using a PVC to persist those files. KubeVirt usually refers to that storage area as \"backend storage\".
KubeVirt automatically creates backend storage PVCs for VMs that need it. However, the admin must first enable the VMPersistentState feature gate, and tell KubeVirt which storage class to use by setting the vmStateStorageClass configuration parameter in the KubeVirt Custom Resource (CR). The storage class must support read-write-many (RWX) in filesystem mode (FS). Here's an example of KubeVirt CR that sets both:
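The following is a sketch of such a CR, assuming KubeVirt is installed in the kubevirt namespace. The storage class name is hypothetical and must be replaced with an RWX-capable filesystem storage class from your cluster.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt        # assumed CR name
  namespace: kubevirt   # assumed install namespace
spec:
  configuration:
    vmStateStorageClass: my-rwx-fs-class   # hypothetical storage class
    developerConfiguration:
      featureGates:
      - VMPersistentState
```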
As mentioned above, the backend storage PVC can only be created using a storage class that supports RWX FS. There is ongoing work to support block storage in future versions of KubeVirt.
Backend storage is currently incompatible with VM snapshot. It is planned to add snapshot support in the future.
"},{"location":"compute/persistent_tpm_and_uefi_state/#tpm-with-persistent-state","title":"TPM with persistent state","text":"
Since KubeVirt v0.53.0, a TPM device can be added to a VM (with just tpm: {}). However, the data stored in it does not persist across reboots. Support for persistence was added in v1.0.0 using a simple persistent boolean parameter that defaults to false, to preserve previous behavior. Backend storage must be configured before adding a persistent TPM to a VM. See above. Here's a portion of a VM definition that includes a persistent TPM:
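A minimal sketch of the relevant portion, shown for a VirtualMachine (the rest of the definition is omitted):

```yaml
spec:
  template:
    spec:
      domain:
        devices:
          tpm:
            persistent: true   # defaults to false when omitted
```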
The Microsoft Windows 11 installer requires the presence of a TPM device, even though it doesn't use it. Persistence is not required in this case, however.
Some disk encryption software has optional or mandatory TPM support. For example, BitLocker requires a persistent TPM device.
The TPM device exposed to the virtual machine is fully emulated (vTPM). The worker nodes do not need to have a TPM device.
When TPM persistence is enabled, the tpm-crb model is used (instead of tpm-tis for non-persistent vTPMs).
A virtual TPM does not provide the same security guarantees as a physical one.
"},{"location":"compute/persistent_tpm_and_uefi_state/#efi-with-persistent-vars","title":"EFI with persistent VARS","text":"
EFI support is handled by libvirt using OVMF. OVMF data usually consists of 2 files, CODE and VARS. VARS is where persistent data from the guest can be stored. When EFI persistence is enabled on a VM, the VARS file will be persisted inside the backend storage. Of course, backend storage must first be configured before enabling EFI persistence on a VM. See above. Here's a portion of a VM definition that includes a persistent EFI:
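A minimal sketch of the relevant portion, again shown for a VirtualMachine with the rest of the definition omitted:

```yaml
spec:
  template:
    spec:
      domain:
        firmware:
          bootloader:
            efi:
              persistent: true   # keep the VARS file in backend storage
```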
The boot entries/order can, and most likely will, get overridden by libvirt. This is to satisfy the VM specifications. Do not expect manual boot setting changes to persist.
"},{"location":"compute/resources_requests_and_limits/","title":"Resources requests and limits","text":"
In this document, we are talking about the resources values set on the virt-launcher compute container, referred to as \"the container\" below for simplicity.
Cluster admins can define a label selector in the KubeVirt CR. Once that label selector is defined, if the creation namespace matches the selector, all VM(I)s created in it will have a CPU limit set.
"},{"location":"compute/resources_requests_and_limits/#memory","title":"Memory","text":""},{"location":"compute/resources_requests_and_limits/#memory-requests-on-the-container","title":"Memory requests on the container","text":"
VM(I)s must specify a desired amount of memory, in either spec.domain.memory.guest or spec.domain.resources.requests.memory (ignoring hugepages, see the dedicated page). If both are set, the memory requests take precedence. A calculated amount of overhead will be added to it, forming the memory request value for the container.
"},{"location":"compute/resources_requests_and_limits/#memory-limits-on-the-container","title":"Memory limits on the container","text":"
By default, no memory limit is set on the container.
If auto memory limits is enabled (see next section), then the container will have a limit of 2x the requested memory.
Manually setting a memory limit on the VM(I) will set the same value on the container.
Memory limits have to be more than memory requests + overhead, otherwise the container will have memory requests > limits and be rejected by Kubernetes.
Memory usage bursts could lead to VM crashes when memory limits are set.
KubeVirt provides a feature gate (AutoResourceLimitsGate) to automatically set memory limits on VM(I)s. When this feature gate is enabled, memory limits will be added to the VMI if all the following conditions are true:
The namespace where the VMI will be created has a ResourceQuota containing memory limits.
The VMI has no manually set memory limits.
The VMI is not requesting dedicated CPU.
If all the previous conditions are true, the memory limits will be set to a value (2x) of the memory requests. This ratio can be adjusted, per namespace, by adding the annotation alpha.kubevirt.io/auto-memory-limits-ratio, with the desired custom value. For example, with alpha.kubevirt.io/auto-memory-limits-ratio: 1.2, the memory limits set will be equal to (1.2x) of the memory requests.
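As an illustration, the annotation could be set on a namespace as sketched below. The namespace name is hypothetical, and the value is quoted because annotation values are strings.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-vms   # hypothetical namespace
  annotations:
    alpha.kubevirt.io/auto-memory-limits-ratio: '1.2'
```

Under these assumptions, a VMI in this namespace with 1Gi of memory requests would get a memory limit of about 1.2Gi.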
VirtualMachines have a Running setting that determines whether or not there should be a guest running. Because KubeVirt will always immediately restart a VirtualMachineInstance for VirtualMachines with spec.running: true, a simple boolean is not always enough to fully describe desired behavior. For instance, there are cases when a user would like the ability to shut down a guest from inside the virtual machine. With spec.running: true, KubeVirt would immediately restart the VirtualMachineInstance.
To allow for greater variation of user states, the RunStrategy field has been introduced. This is mutually exclusive with Running as they have somewhat overlapping conditions. There are currently five RunStrategies defined:
Always: The system is tasked with keeping the VM in a running state. This is achieved by respawning a VirtualMachineInstance whenever the current one terminated in a controlled (e.g. shutdown from inside the guest) or uncontrolled (e.g. crash) way. This behavior is equal to spec.running: true.
RerunOnFailure: Similar to Always, except that the VM is only restarted if it terminated in an uncontrolled way (e.g. crash) and due to an infrastructure reason (i.e. the node crashed, the KVM related process OOMed). This allows a user to determine when the VM should be shut down by initiating the shut down inside the guest. Note: Guest sided crashes (i.e. BSOD) are not covered by this. In such cases liveness checks or the use of a watchdog can help.
Once: The VM will run once and not be restarted upon completion regardless if the completion is of phase Failure or Success.
Manual: The system will not automatically turn the VM on or off; instead, the user manually controls the VM status by issuing start, stop, and restart commands on the VirtualMachine subresource endpoints.
Halted: The system is asked to ensure that no VM is running. This is achieved by stopping any VirtualMachineInstance that is associated with the VM. If a guest is already running, it will be stopped. This behavior is equal to spec.running: false.
Note: RunStrategy and running are mutually exclusive, because they can be contradictory. The API server will reject VirtualMachine resources that define both.
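For illustration, a VirtualMachine that uses a RunStrategy instead of the running field might start like this sketch; the VM name is hypothetical and the VMI template is omitted.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: myvm   # hypothetical name
spec:
  # Set either runStrategy or running, never both.
  runStrategy: RerunOnFailure
  # template: (VMI template omitted for brevity)
```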
The start, stop and restart methods of virtctl will invoke their respective subresources of VirtualMachines. This can have an effect on the runStrategy of the VirtualMachine as below:
| RunStrategy | start | stop | restart |
| --- | --- | --- | --- |
| Always | - | Halted | Always |
| RerunOnFailure | RerunOnFailure | RerunOnFailure | RerunOnFailure |
| Manual | Manual | Manual | Manual |
| Halted | Always | - | - |
Table entries marked with - don't make sense, so won't have an effect on RunStrategy.
Fine-tuning different aspects of the hardware which are not device related (BIOS, mainboard, etc.) is sometimes necessary to allow guest operating systems to properly boot and reboot.
QEMU is able to work with two different classes of chipsets for x86_64, so-called machine types. The x86_64 chipsets are i440fx (also called pc) and q35. They are versioned based on qemu-system-${ARCH}, following the format pc-${machine_type}-${qemu_version}, e.g. pc-i440fx-2.10 and pc-q35-2.10.
KubeVirt defaults to QEMU's newest q35 machine type. If a custom machine type is desired, it is configurable through the following structure:
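A minimal sketch of that structure, reusing one of the machine type strings mentioned above as an example value:

```yaml
spec:
  domain:
    machine:
      type: pc-q35-2.10   # example value; pick a type your QEMU supports
```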
Enabling EFI automatically enables Secure Boot, unless the secureBoot field under efi is set to false. Secure Boot itself requires the SMM CPU feature to be enabled as above, which does not happen automatically, for security reasons.
In order to provide a consistent view on the virtualized hardware for the guest OS, the SMBIOS UUID can be set to a constant value via spec.firmware.uuid:
"},{"location":"compute/virtual_hardware/#labeling-nodes-with-cpu-models-and-cpu-features","title":"Labeling nodes with cpu models and cpu features","text":"
KubeVirt can create node selectors based on VM cpu models and features. With these node selectors, VMs will be scheduled on the nodes that support the matching VM cpu model and features.
To properly label the nodes, users can either use the KubeVirt node-labeller, which creates all necessary labels, or create the node labels themselves.
The KubeVirt node-labeller creates 3 types of labels: cpu models, cpu features and kvm info. It uses libvirt to get all supported cpu models and cpu features on the host and then creates labels from the cpu models.
The node-labeller supports a list of obsolete cpu models and a minimal baseline cpu model for features. Both can be set via the KubeVirt CR:
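As a sketch, both settings could look like the fragment below. The listed models and the baseline model are illustrative choices, not recommendations.

```yaml
spec:
  configuration:
    obsoleteCPUModels:
      '486': true       # illustrative entries
      pentium: true
    minCPUModel: Penryn   # illustrative baseline model
```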
Obsolete cpus will not be inserted in labels. If KubeVirt CR doesn't contain obsoleteCPUModels variable, Labeller sets default values (\"pentium, pentium2, pentium3, pentiumpro, coreduo, n270, core2duo, Conroe, athlon, phenom, kvm32, kvm64, qemu32 and qemu64\").
Users can change obsoleteCPUModels by adding or removing cpu models in the config map. KubeVirt then updates the nodes with the new labels.
For homogeneous clusters, or clusters without live migration enabled, it's possible to disable the node-labeller and avoid adding labels to the nodes by adding the following annotation to the nodes:
Note: If the CPU model isn't defined, the VM will get the CPU model closest to the one used on the node where the VM is running.
Note: CPU model is case sensitive.
Setting the CPU model is possible via spec.domain.cpu.model. The following VM will have a CPU with the Conroe model:
apiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  domain:\n    cpu:\n      # this sets the CPU model\n      model: Conroe\n...\n
You can check the list of available models here.
When the CPUNodeDiscovery feature gate is enabled and the VM has a cpu model, KubeVirt creates a node selector with the format cpu-model.node.kubevirt.io/<cpuModel>, e.g. cpu-model.node.kubevirt.io/Conroe. When the VM doesn't have a cpu model, no node selector is created.
"},{"location":"compute/virtual_hardware/#enabling-default-cluster-cpu-model","title":"Enabling default cluster cpu model","text":"
To enable the default cpu model, users may add the cpuModel field in the KubeVirt CR.
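A minimal sketch of this setting; the model name is only an example and should be one that the cluster nodes support:

```yaml
spec:
  configuration:
    cpuModel: EPYC   # illustrative model name
```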
The default CPU model is applied when a VMI doesn't have any cpu model. When a VMI has a cpu model set, the VMI's cpu model is preferred. When neither the default cpu model nor the VMI's cpu model is set, host-model will be used. The default cpu model can be changed while KubeVirt is running. When the CPUNodeDiscovery feature gate is enabled, KubeVirt creates a node selector with the default cpu model.
"},{"location":"compute/virtual_hardware/#cpu-model-special-cases","title":"CPU model special cases","text":"
As special cases, you can set spec.domain.cpu.model to:
host-passthrough to pass the CPU through from the node to the VM
metadata:\n  name: myvmi\nspec:\n  domain:\n    cpu:\n      # this passes the node CPU through to the VM\n      model: host-passthrough\n...\n
host-model to get a CPU on the VM that is close to the node's CPU
metadata:\n  name: myvmi\nspec:\n  domain:\n    cpu:\n      # this sets the VM CPU close to the node's CPU\n      model: host-model\n...\n
Setting CPU features is possible via spec.domain.cpu.features, which can contain zero or more CPU features:
metadata:\n  name: myvmi\nspec:\n  domain:\n    cpu:\n      # this sets the CPU features\n      features:\n      # this is the feature's name\n      - name: \"apic\"\n      # this is the feature's policy\n        policy: \"require\"\n...\n
Note: Policy attribute can either be omitted or contain one of the following policies: force, require, optional, disable, forbid.
Note: In case a policy is omitted for a feature, it will default to require.
Behaviour according to Policies:
All policies will be passed to libvirt during virtual machine creation.
In case the feature gate \"CPUNodeDiscovery\" is enabled and the policy is omitted or has \"require\" value, then the virtual machine could be scheduled only on nodes that support this feature.
In case the feature gate \"CPUNodeDiscovery\" is enabled and the policy has \"forbid\" value, then the virtual machine would not be scheduled on nodes that support this feature.
Full description about features and policies can be found here.
When the CPUNodeDiscovery feature gate is enabled, KubeVirt creates node selectors from cpu features with the format cpu-feature.node.kubevirt.io/<cpuFeature>, e.g. cpu-feature.node.kubevirt.io/apic. When the VM doesn't have any cpu feature, no node selector is created.
hpet is disabled, pit and rtc are configured to use a specific tickPolicy. Finally, hyperv is made available too.
See the Timer API Reference for all possible configuration options.
Note: Timer can be part of a machine type. Thus it may be necessary to explicitly disable them. We may in the future decide to add them via cluster-level defaulting, if they are part of a QEMU machine definition.
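A timer configuration matching the description above (hpet disabled, pit and rtc with a tickPolicy, hyperv enabled) might be sketched as follows. The specific tickPolicy values are illustrative assumptions.

```yaml
spec:
  domain:
    clock:
      utc: {}
      timer:
        hpet:
          present: false       # explicitly disable hpet
        pit:
          tickPolicy: delay    # illustrative policy
        rtc:
          tickPolicy: catchup  # illustrative policy
        hyperv: {}
```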
"},{"location":"compute/virtual_hardware/#random-number-generator-rng","title":"Random number generator (RNG)","text":"
You may want to use entropy collected by your cluster nodes inside your guest. KubeVirt allows adding a virtio RNG device to a virtual machine as follows.
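A minimal sketch of the relevant fragment of the VMI spec:

```yaml
spec:
  domain:
    devices:
      rng: {}   # adds a virtio RNG device with default settings
```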
For Linux guests, the virtio-rng kernel module should be loaded early in the boot process to acquire access to the entropy source. Other systems may require similar adjustments to work with the virtio RNG device.
Note: Some guest operating systems or user payloads may require the RNG device with enough entropy and may fail to boot without it. For example, fresh Fedora images with newer kernels (4.16.4+) may require the virtio RNG device to be present to boot to login.
"},{"location":"compute/virtual_hardware/#video-and-graphics-device","title":"Video and Graphics Device","text":"
By default a minimal Video and Graphics device configuration will be applied to the VirtualMachineInstance. The video device is vga compatible and comes with a memory size of 16 MB. This device allows connecting to the OS via vnc.
It is possible to not attach it by setting spec.domain.devices.autoattachGraphicsDevice to false:
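For illustration, the setting sits in the VMI spec like this:

```yaml
spec:
  domain:
    devices:
      autoattachGraphicsDevice: false   # no default video/graphics device
```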
KubeVirt supports a range of virtualization features which may be tweaked in order to allow non-Linux based operating systems to properly boot. Most noteworthy are:
acpi
apic
hyperv
A common feature configuration is shown by the following example:
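One possible configuration is sketched below. The selected hyperv enlightenments and the spinlocks value are illustrative assumptions, not a recommendation for every guest.

```yaml
spec:
  domain:
    features:
      acpi: {}
      apic: {}
      hyperv:
        relaxed: {}
        vapic: {}
        spinlocks:
          spinlocks: 8191   # illustrative retry count
```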
See the Features API Reference for all available features and configuration options.
"},{"location":"compute/virtual_hardware/#resources-requests-and-limits","title":"Resources Requests and Limits","text":"
An optional resource request can be specified by the users to allow the scheduler to make a better decision in finding the most suitable Node to place the VM.
Specifying CPU limits will determine the amount of cpu shares set on the control group the VM is running in, in other words, the amount of time the VM's CPUs can execute on the assigned resources when there is a competition for CPU resources.
For more information please refer to how Pods with resource limits are run.
Various VM resources, such as a video adapter, IOThreads, and supplementary system software, consume additional memory from the Node, beyond the requested memory intended for the guest OS consumption. In order to provide a better estimate for the scheduler, this memory overhead will be calculated and added to the requested memory.
Please see how Pods with resource requests are scheduled for additional information on resource requests and limits.
KubeVirt gives you the possibility to use hugepages as backing memory for your VM. You need to provide the desired amount of memory in resources.requests.memory and the size of hugepages to use in memory.hugepages.pageSize; for example, for the x86_64 architecture it can be 2Mi.
The hugepages size cannot be bigger than the requested memory.
The requested memory must be divisible by the hugepages size.
Hugepages use memfd by default. Memfd is supported from kernel >= 4.14. If you run on an older host (e.g. CentOS 7.9), it is required to disable memfd with the annotation kubevirt.io/memfd: \"false\" in the VMI metadata annotation.
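A sketch of a VMI that satisfies the constraints above; the name and the sizes are illustrative (64Mi is divisible by the 2Mi page size):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: myvmi   # hypothetical name
spec:
  domain:
    resources:
      requests:
        memory: 64Mi    # must be divisible by pageSize
    memory:
      hugepages:
        pageSize: 2Mi   # must not exceed the requested memory
```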
Kubevirt supports input devices. The only type which is supported is tablet. Tablet input device supports only virtio and usb bus. Bus can be empty. In that case, usb will be selected.
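As a sketch, a tablet input device on the virtio bus could be declared as follows; the device name is hypothetical:

```yaml
spec:
  domain:
    devices:
      inputs:
      - type: tablet
        bus: virtio    # or usb; usb is selected when bus is omitted
        name: tablet1  # hypothetical device name
```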
Right now KubeVirt uses virtio-serial for local guest-host communication. Currently it is used in KubeVirt by libvirt and qemu to communicate with the qemu-guest-agent. Virtio-serial can also be used by other agents, but it is a little bit cumbersome due to:
A small set of ports on the virtio-serial device
Low bandwidth
No socket based communication possible, which requires every agent to establish their own protocols, or work with translation layers like SLIP to be able to use protocols like gRPC for reliable communication.
No easy and supportable way to get a virtio-serial socket assigned and being able to access it without entering the virt-launcher pod.
Due to the point above, privileges are required for services.
With virtio-vsock we get support for easy guest-host communication which solves the above issues from a user/admin perspective.
NOTE: The /dev/vhost-vsock device is NOT NEEDED to connect or bind to a VSOCK socket.
To make the VSOCK feature secure, the following measures are put in place:
The whole VSOCK feature lives behind a feature gate.
By default the first 1024 ports of a vsock device are privileged. Services trying to bind to those require CAP_NET_BIND_SERVICE capability.
AF_VSOCK socket syscall gets blocked in containerd 1.7+ (containerd/containerd#7442). It is right now the responsibility of the vendor to ensure that the used CRI selects a default seccomp policy which blocks VSOCK socket calls in a similar way like it was done for containerd.
CIDs are assigned by virt-controller and are unique per Virtual Machine Instance to ensure that virt-handler has an easy way of tracking the identity without races. While this still allows virt-launcher to fake-use an assigned CID, it eliminates possible assignment races which attackers could make use-of to redirect VSOCK calls.
The purpose of this document is to explain how to install virtio drivers for Microsoft Windows running in a fully virtualized guest.
"},{"location":"compute/windows_virtio_drivers/#do-i-need-virtio-drivers","title":"Do I need virtio drivers?","text":"
Yes. Without the virtio drivers, you cannot use paravirtualized hardware properly. It will either not work or suffer a severe performance penalty.
For more information about VirtIO and paravirtualization, see VirtIO and paravirtualization
For more details on configuring your VirtIO driver please refer to Installing VirtIO driver on a new Windows virtual machine and Installing VirtIO driver on an existing Windows virtual machine.
"},{"location":"compute/windows_virtio_drivers/#which-drivers-i-need-to-install","title":"Which drivers I need to install?","text":"
There are usually up to 8 possible devices that are required to run Windows smoothly in a virtualized environment. KubeVirt currently supports only:
viostor, the block driver, applies to SCSI Controller in the Other devices group.
viorng, the entropy source driver, applies to PCI Device in the Other devices group.
NetKVM, the network driver, applies to Ethernet Controller in the Other devices group. Available only if a virtio NIC is configured.
Other virtio drivers that exist and might be supported in the future:
Balloon, the balloon driver, applies to PCI Device in the Other devices group
vioserial, the paravirtual serial driver, applies to PCI Simple Communications Controller in the Other devices group.
vioscsi, the SCSI block driver, applies to SCSI Controller in the Other devices group.
qemupciserial, the emulated PCI serial driver, applies to PCI Serial Port in the Other devices group.
qxl, the paravirtual video driver, applied to Microsoft Basic Display Adapter in the Display adapters group.
pvpanic, the paravirtual panic driver, applies to Unknown device in the Other devices group.
Note
Some drivers are required in the installation phase. When you are installing Windows onto virtio block storage, you have to provide an appropriate virtio driver. Namely, choose the viostor driver for your version of Microsoft Windows, e.g. do not install the XP driver when you run Windows 10.
Other drivers can be installed after the successful Windows installation. Again, please install only drivers matching your Windows version.
"},{"location":"compute/windows_virtio_drivers/#how-to-install-during-windows-install","title":"How to install during Windows install?","text":"
To install drivers before Windows starts its installation, make sure you have the virtio-win package attached to your VirtualMachine as a SATA CD-ROM. In the Windows installer, choose the advanced install and load driver. Then navigate to the loaded virtio CD-ROM and install either viostor or vioscsi, depending on which one you have set up.
"},{"location":"compute/windows_virtio_drivers/#how-to-install-after-windows-install","title":"How to install after Windows install?","text":"
After the Windows installation, please go to Device Manager. There you should see undetected devices in the \"available devices\" section. You can install the virtio drivers one by one, going through this list.
For more details on how to choose a proper driver and how to install the driver, please refer to the Windows Guest Virtual Machines on Red Hat Enterprise Linux 7.
"},{"location":"compute/windows_virtio_drivers/#how-to-obtain-virtio-drivers","title":"How to obtain virtio drivers?","text":"
The virtio Windows drivers are distributed in the form of a containerDisk, which can simply be mounted to the VirtualMachine. The container image containing the disk is located at: https://quay.io/repository/kubevirt/virtio-container-disk?tab=tags and the image can be pulled like any other container image:
However, pulling the image manually is not required; if it is not present, Kubernetes will download it when deploying the VirtualMachine.
"},{"location":"compute/windows_virtio_drivers/#attaching-to-virtualmachine","title":"Attaching to VirtualMachine","text":"
KubeVirt distributes virtio drivers for Microsoft Windows in the form of a container disk. The package contains the virtio drivers and the QEMU guest agent. The disk was tested on Microsoft Windows Server 2012. Supported Windows versions are XP and up.
The package is intended to be used as a CD-ROM attached to the virtual machine running Microsoft Windows. It can be used as a SATA CD-ROM during the install phase or to provide drivers in an existing Windows installation.
Attaching the virtio-win package can be done simply by adding a ContainerDisk to your VirtualMachine.
spec:\n  domain:\n    devices:\n      disks:\n        - name: virtiocontainerdisk\n          # Any other disk you want to use must go before virtioContainerDisk.\n          # KubeVirt boots from disks in the order they are defined.\n          # Therefore virtioContainerDisk must come after the bootable disk.\n          # The other option is to choose the boot order explicitly:\n          #  - https://kubevirt.io/api-reference/v0.13.2/definitions.html#_v1_disk\n          # NOTE: You either specify bootOrder explicitly or sort the items in\n          #       disks. You can not do both at the same time.\n          # bootOrder: 2\n          cdrom:\n            bus: sata\n  volumes:\n    - containerDisk:\n        image: quay.io/kubevirt/virtio-container-disk\n      name: virtiocontainerdisk\n
Once you are done installing virtio drivers, you can remove virtio container disk by simply removing the disk from yaml specification and restarting the VirtualMachine.
KubeVirt produces a lot of logging throughout its codebase. Some log entries have a verbosity level assigned to them. The verbosity level assigned to a log entry determines the minimum configured verbosity at which the entry is exposed.
In code, a log entry looks similar to: log.Log.V(verbosity).Infof(\"...\") where verbosity is the minimum verbosity level for this entry.
For example, if the verbosity of a log entry is 3, the entry is exposed only if the configured log verbosity is equal to or greater than 3; otherwise it is filtered out.
Currently, log verbosity can be defined per-component or per-node. The most up-to-date API is detailed here.
"},{"location":"debug_virt_stack/debug/#setting-verbosity-per-kubevirt-component","title":"Setting verbosity per KubeVirt component","text":"
One way of raising log verbosity is to manually determine it for the different components in KubeVirt CR:
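A sketch of per-component verbosity in the KubeVirt CR; the levels are arbitrary example values:

```yaml
spec:
  configuration:
    developerConfiguration:
      logVerbosity:
        virtLauncher: 2
        virtHandler: 3
        virtController: 4
        virtAPI: 5
        virtOperator: 6
```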
nodeVerbosity is essentially a map from string to int where the key is the node name and the value is the verbosity level. The verbosity level would be defined for all the different components in that node (e.g. virt-handler, virt-launcher, etc).
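A sketch of per-node verbosity; the node names are hypothetical and the levels are arbitrary example values:

```yaml
spec:
  configuration:
    developerConfiguration:
      logVerbosity:
        nodeVerbosity:
          node01: 4          # hypothetical node name
          otherNodeName: 6   # hypothetical node name
```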
"},{"location":"debug_virt_stack/debug/#how-to-retrieve-kubevirt-components-logs","title":"How to retrieve KubeVirt components' logs","text":"
In Kubernetes, logs are defined at the Pod level. Therefore, we first need to list the Pods of KubeVirt's core components. To do that, we can list the Pods in KubeVirt's install namespace.
Then, we can pick one of the pods and fetch its logs. For example:
$> kubectl logs -n <KubeVirt Install Namespace> virt-handler-2m86x | head -n8\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"set verbosity to 2\",\"pos\":\"virt-handler.go:453\",\"timestamp\":\"2022-04-17T08:58:37.373695Z\"}\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"set verbosity to 2\",\"pos\":\"virt-handler.go:453\",\"timestamp\":\"2022-04-17T08:58:37.373726Z\"}\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"setting rate limiter to 5 QPS and 10 Burst\",\"pos\":\"virt-handler.go:462\",\"timestamp\":\"2022-04-17T08:58:37.373782Z\"}\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"CPU features of a minimum baseline CPU model: map[apic:true clflush:true cmov:true cx16:true cx8:true de:true fpu:true fxsr:true lahf_lm:true lm:true mca:true mce:true mmx:true msr:true mtrr:true nx:true pae:true pat:true pge:true pni:true pse:true pse36:true sep:true sse:true sse2:true sse4.1:true ssse3:true syscall:true tsc:true]\",\"pos\":\"cpu_plugin.go:96\",\"timestamp\":\"2022-04-17T08:58:37.390221Z\"}\n{\"component\":\"virt-handler\",\"level\":\"warning\",\"msg\":\"host model mode is expected to contain only one model\",\"pos\":\"cpu_plugin.go:103\",\"timestamp\":\"2022-04-17T08:58:37.390263Z\"}\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"node-labeller is running\",\"pos\":\"node_labeller.go:94\",\"timestamp\":\"2022-04-17T08:58:37.391011Z\"}\n
Obviously, for both examples above, <KubeVirt Install Namespace> needs to be replaced with the actual namespace KubeVirt is installed in.
Using the cluster-profiler client tool, a developer can get the PProf profiling data for every component in the KubeVirt control plane. Here is a user guide:
"},{"location":"debug_virt_stack/launch-qemu-gdb/","title":"Launch QEMU with gdb and connect locally with gdb client","text":"
This guide is for cases where QEMU encounters very early failures and it is hard to synchronize with it at a later point in time.
"},{"location":"debug_virt_stack/launch-qemu-gdb/#image-creation-and-pvc-population","title":"Image creation and PVC population","text":"
This scenario is a slight variation of the guide about starting strace, hence some of the details on the image build and the PVC population are simply skipped and explained in the other section.
In this example, QEMU will be launched with gdbserver and later we will connect to it using a local gdb client.
In this scenario, we use an additional container image containing gdb and the same qemu binary as the target process to debug. This image will be run locally with podman.
In order to build this image, we need to identify the image of the virt-launcher container we want to debug. Based on the KubeVirt installation, the namespace and the name of the KubeVirt CR could vary. In this example, we'll assume that KubeVirt CR is called kubevirt and installed in the kubevirt namespace.
You can easily find out the right names in your cluster by searching with:
$ kubectl get kubevirt -A\nNAMESPACE NAME AGE PHASE\nkubevirt kubevirt 3h11m Deployed\n
The steps to build the image are:
Get the registry of the images of the KubeVirt installation:
Podman will replace the registry and tag arguments provided on the command line. In this way, we can specify the image registry and shasum for the KubeVirt version to debug.
"},{"location":"debug_virt_stack/launch-qemu-gdb/#run-the-vm-to-troubleshoot","title":"Run the VM to troubleshoot","text":"
For this example, we add an annotation to keep the virt-launcher pod running even if any errors occur:
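A sketch of the relevant metadata is shown here. The VMI name matches the one used in this example; the annotation key is written from memory of the KubeVirt sources and should be treated as an assumption to verify against your KubeVirt version.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-debug-tools
  annotations:
    # Assumed annotation key: keeps the virt-launcher pod around after a failure
    kubevirt.io/keep-launcher-alive-after-failure: 'true'
# spec omitted for brevity
```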
$ kubectl apply -f debug-vmi.yaml\nvirtualmachineinstance.kubevirt.io/vmi-debug-tools created\n$ kubectl get vmi\nNAME AGE PHASE IP NODENAME READY\nvmi-debug-tools 28s Scheduled node01 False\n$ kubectl get po\nNAME READY STATUS RESTARTS AGE\npopulate-pvc-dnxld 0/1 Completed 0 4m17s\nvirt-launcher-vmi-debug-tools-tfh28 4/4 Running 0 25s\n
The wrapping script starts gdbserver and exposes it on port 1234 inside the container. To connect remotely to gdbserver, we can use the command kubectl port-forward to expose the gdb port on our machine.
$ kubectl port-forward virt-launcher-vmi-debug-tools-tfh28 1234\nForwarding from 127.0.0.1:1234 -> 1234\nForwarding from [::1]:1234 -> 1234\n
Finally, we can start the gdb client in the container:
$ podman run -ti --network host gdb-client:latest\n$ gdb /usr/libexec/qemu-kvm -ex 'target remote localhost:1234'\nGNU gdb (GDB) Red Hat Enterprise Linux 10.2-12.el9\nCopyright (C) 2021 Free Software Foundation, Inc.\nLicense GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>\nThis is free software: you are free to change and redistribute it.\nThere is NO WARRANTY, to the extent permitted by law.\nType \"show copying\" and \"show warranty\" for details.\nThis GDB was configured as \"x86_64-redhat-linux-gnu\".\nType \"show configuration\" for configuration details.\nFor bug reporting instructions, please see:\n<https://www.gnu.org/software/gdb/bugs/>.\nFind the GDB manual and other documentation resources online at:\n <http://www.gnu.org/software/gdb/documentation/>.\n\nFor help, type \"help\".\n--Type <RET> for more, q to quit, c to continue without paging--\nType \"apropos word\" to search for commands related to \"word\"...\nReading symbols from /usr/libexec/qemu-kvm...\n\nReading symbols from /root/.cache/debuginfod_client/26221a84fabd219a68445ad0cc87283e881fda15/debuginfo...\nRemote debugging using localhost:1234\nReading /lib64/ld-linux-x86-64.so.2 from remote target...\nwarning: File transfers from remote targets can be slow. Use \"set sysroot\" to access files locally instead.\nReading /lib64/ld-linux-x86-64.so.2 from remote target...\nReading symbols from target:/lib64/ld-linux-x86-64.so.2...\nDownloading separate debug info for /system-supplied DSO at 0x7ffc10eff000...\n0x00007f1a70225e70 in _start () from target:/lib64/ld-linux-x86-64.so.2\n
For simplicity, we started podman with the option --network host. This way, the container is able to access any port mapped on the host.
"},{"location":"debug_virt_stack/launch-qemu-strace/","title":"Launch QEMU with strace","text":"
This guide explains how to launch QEMU with a debugging tool in the virt-launcher pod. This method can be useful to debug early failures or to start QEMU as a child of a debug tool that relies on ptrace. The second point is particularly relevant when a process is operating in a non-privileged environment since otherwise, it would need root access to be able to ptrace the process.
Ephemeral containers are among the emerging techniques to overcome the lack of debugging tools inside the original image. This solution does, however, come with a number of limitations. For example, it is possible to spawn a new container inside the same pod as the application to debug and share the same PID namespace. Though they share the same PID namespace, KubeVirt's usage of unprivileged containers makes it, for example, impossible to ptrace a running container. Therefore, this technique isn't appropriate for our needs.
For security and image size reasons, KubeVirt container images are based on distroless containers. These kinds of images are extremely beneficial for deployments, but they are challenging to troubleshoot because there is no package manager, which prevents the installation of additional tools on the fly.
Wrapping the QEMU binary in a script is one practical method for debugging QEMU launched by Libvirt. This script launches QEMU as a child process together with the debugging tool (such as strace or valgrind).
The final part that needs to be added is the configuration for Libvirt to use the wrapped script rather than calling the QEMU program directly.
It is possible to alter the generated XML with the help of KubeVirt sidecars. This allows us to use the wrapping script in place of the built-in emulator.
The primary concept behind this configuration is that all of the additional tools, scripts, and final output files will be stored in a PersistentVolumeClaim (PVC) that this guide refers to as debug-tools. The virt-launcher pod that we wish to debug will have this PVC attached to it.
In this guide, we'll apply the above concepts to debug QEMU inside virt-launcher using strace without needing to build a custom virt-launcher image.
You can see a full demo of this setup:
"},{"location":"debug_virt_stack/launch-qemu-strace/#how-to-bring-the-debug-tools-and-wrapping-script-into-distroless-containers","title":"How to bring the debug tools and wrapping script into distroless containers","text":"
This section provides an example of how to provide extra tools into the distroless container that will be supplied as a PVC using a Dockerfile. Although there are several ways to accomplish this, this covers a relatively simple technique. Alternatively, you could run a pod and manually populate the PVC by execing into the pod.
Dockerfile:
FROM quay.io/centos/centos:stream9 as build\n\nENV DIR /debug-tools\nRUN mkdir -p ${DIR}/logs\n\nRUN yum install --installroot=${DIR} -y strace && yum clean all\n\nCOPY ./wrap_qemu_strace.sh $DIR/wrap_qemu_strace.sh\nRUN chmod 0755 ${DIR}/wrap_qemu_strace.sh\nRUN chown 107:107 ${DIR}/wrap_qemu_strace.sh\nRUN chown 107:107 ${DIR}/logs\n
The directory debug-tools stores the content that will later be copied into the debug-tools PVC. We are essentially adding the missing utilities in the custom directory with yum install --installroot=${DIR}, and the parent image matches the parent image of virt-launcher.
The wrap_qemu_strace.sh is the wrapping script that will be used to launch QEMU with strace, similar to the example with valgrind.
It is important to set the dynamic library path LD_LIBRARY_PATH to the path where the PVC will be mounted in the virt-launcher container.
Then, you simply need to build the image and your debug setup is ready. The Dockerfile and the script wrap_qemu_strace.sh need to be in the same directory where you run the command.
$ podman build -t debug .\n
The second step is to populate the PVC. This can be easily achieved using a Kubernetes Job like:
The image referenced in the Job is the image we built in the previous step. Once the Job has been applied and has completed, the debug-tools PVC is ready to be used.
"},{"location":"debug_virt_stack/launch-qemu-strace/#how-to-start-qemu-launched-by-a-debugging-tool-eg-strace","title":"How to start qemu launched by a debugging tool (e.g strace)","text":"
This part is achieved by using ConfigMaps and a KubeVirt sidecar (more details in the section Using ConfigMap to run custom script).
The script that replaces the QEMU binary with the wrapping script in the XML is stored in the configmap my-config-map. This script will run as a hook, as explained in full in the documentation for the KubeVirt sidecar.
Once all the objects are created, we can finally run the guest to debug.
The VMI example is a simple VM instance declaration. The interesting parts are the annotations for the hook:
* image refers to the sidecar-shim image already built and shipped with KubeVirt
* pvc refers to the PVC populated with the debug setup. The name refers to the claim name, the volumePath is the path inside the sidecar container where the volume is mounted, while the sharedComputePath is the path of the same volume inside the compute container.
* configMap refers to the ConfigMap containing the script that modifies the XML to use the wrapping script
Once the VM is declared, the hook will modify the emulator section and Libvirt will call the wrapping script instead of QEMU directly.
"},{"location":"debug_virt_stack/launch-qemu-strace/#how-to-fetch-the-output","title":"How to fetch the output","text":"
The wrapping script configures strace to store the output in the PVC. In this way, it is possible to retrieve the output file at a later time, for example using an additional pod like:
"},{"location":"debug_virt_stack/logging/","title":"Control libvirt logging for each component","text":"
Generally, cluster admins can control the log verbosity of each KubeVirt component in the KubeVirt CR. For more details, please check the KubeVirt documentation.
Nonetheless, regular users can also adjust the qemu component logging to have a finer control over it. The annotation kubevirt.io/libvirt-log-filters enables you to modify each component's log level.
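As a sketch, the annotation could be set on a VMI as follows. The VMI name my-vmi is hypothetical, and the filter string is only an illustration that follows libvirt's log filter syntax (level:match, separated by spaces).

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: my-vmi            # hypothetical name
  annotations:
    # Illustrative filter: level 2 for the QEMU monitor, level 3 for everything else
    kubevirt.io/libvirt-log-filters: '2:qemu.qemu_monitor 3:*'
# spec omitted for brevity
```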
The annotation enables the filter from the container creation. However, in certain cases you might desire to change the logging level dynamically once the container and libvirt have already been started. In this case, virt-admin comes to the rescue.
Otherwise, if you prefer to redirect the output to a file and fetch it later, you can rely on kubectl cp to retrieve the file. In this case, we are saving the file in the /var/run/libvirt directory because the compute container has the permissions to write there.
Example:
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vmi-ephemeral-nqcld 3/3 Running 0 26m\n$ kubectl exec -ti virt-launcher-vmi-ephemeral-nqcld -- virt-admin -c virtqemud:///session daemon-log-outputs \"1:file:/var/run/libvirt/libvirtd.log\"\n$ kubectl cp virt-launcher-vmi-ephemeral-nqcld:/var/run/libvirt/libvirtd.log libvirt-kubevirt.log\ntar: Removing leading `/' from member names\n
"},{"location":"debug_virt_stack/privileged-node-debugging/","title":"Privileged debugging on the node","text":"
This article describes the scenarios in which you can create privileged pods and have root access to the cluster nodes.
With privileged pods, you may access devices in /dev, utilize host namespaces and ptrace processes that are running on the node, and use the hostPath volume to mount node directories in the container.
A quick way to verify if you are allowed to create privileged pods is to create a sample pod with the --dry-run=server option, like:
"},{"location":"debug_virt_stack/privileged-node-debugging/#build-the-container-image","title":"Build the container image","text":"
KubeVirt uses distroless containers, and those images don't have a package manager. For this reason, it isn't possible to use the image as a parent for installing additional packages.
In certain debugging scenarios, the tools require exactly the same binary to be available. However, if the debug tools are operating in a different container, this can be especially difficult as the filesystems of the containers are isolated.
This section will cover how to build a container image with the debug tools plus binaries of the KubeVirt version you want to debug.
Depending on your installation, the namespace and the name of the KubeVirt CR could vary. In this example, we'll assume that the KubeVirt CR is called kubevirt and installed in the kubevirt namespace. You can easily find out what it is called in your cluster by searching with kubectl get kubevirt -A. This is necessary as we need to retrieve the original virt-launcher image to have exactly the same QEMU binary we want to debug.
Get the registry of the images of the KubeVirt installation:
The privileged option is required to have access to almost all the resources on the node.
The nodeName ensures that the debugging pod will be scheduled on the desired node. In order to select the right node, you can use the -owide option with kubectl get po, which reports the node where each pod is running.
Example:
$ kubectl get pods -owide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\nlocal-volume-provisioner-4jtkb 1/1 Running 0 152m 10.244.196.129 node01 <none> <none>\nnode01-debug 1/1 Running 0 44m 192.168.66.101 node01 <none> <none>\nvirt-launcher-vmi-ephemeral-xg98p 3/3 Running 0 2m54s 10.244.196.148 node01 <none> 1/1\n
In the volumes section, you can specify the directories you want to be directly mounted in the debugging container. For example, /usr/lib/modules is particularly useful if you need to load some kernel modules.
Sharing the host PID namespace with the option hostPID allows you to see all the processes on the node and attach to them with tools like gdb and strace.
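The options described above can be sketched together in a single Pod manifest. This is an illustration under assumptions: the image reference is hypothetical and stands for the debug image built earlier, and the node name node01 is taken from the example output.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node01-debug
spec:
  nodeName: node01               # schedule the pod on the node to debug
  hostPID: true                  # share the host PID namespace
  containers:
  - name: debug
    image: registry.example.com/debug-tools:latest   # hypothetical image
    command: ['sleep', 'infinity']
    securityContext:
      privileged: true           # access to devices and host resources
    volumeMounts:
    - name: host
      mountPath: /host           # node filesystem
    - name: modules
      mountPath: /usr/lib/modules
  volumes:
  - name: host
    hostPath:
      path: /
  - name: modules
    hostPath:
      path: /usr/lib/modules
```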
exec-ing into the pod gives you a shell with privileged access to the node plus the tooling you installed into the image:
$ kubectl exec -ti node01-debug -- bash\n
The following examples assume you have already execed into the node01-debug pod.
"},{"location":"debug_virt_stack/privileged-node-debugging/#validating-the-host-for-virtualization","title":"Validating the host for virtualization","text":"
The tool virt-host-validate is a utility that validates whether the host is able to run a libvirt hypervisor. For example, it can be used to check whether a particular node is KVM capable.
Example:
$ virt-host-validate\n QEMU: Checking for hardware virtualization : PASS\n QEMU: Checking if device /dev/kvm exists : PASS\n QEMU: Checking if device /dev/kvm is accessible : PASS\n QEMU: Checking if device /dev/vhost-net exists : PASS\n QEMU: Checking if device /dev/net/tun exists : PASS\n QEMU: Checking for cgroup 'cpu' controller support : PASS\n QEMU: Checking for cgroup 'cpuacct' controller support : PASS\n QEMU: Checking for cgroup 'cpuset' controller support : PASS\n QEMU: Checking for cgroup 'memory' controller support : PASS\n QEMU: Checking for cgroup 'devices' controller support : PASS\n QEMU: Checking for cgroup 'blkio' controller support : PASS\n QEMU: Checking for device assignment IOMMU support : PASS\n QEMU: Checking if IOMMU is enabled by kernel : PASS\n QEMU: Checking for secure guest support : WARN (Unknown if this platform has Secure\n
"},{"location":"debug_virt_stack/privileged-node-debugging/#run-a-command-directly-on-the-node","title":"Run a command directly on the node","text":"
The debug container mounts the host filesystem under /host, as declared in its volumes section. This can be particularly useful if you want to access the node filesystem or execute a command directly on the host. However, the tool must already be present on the node.
# chroot /host\nsh-5.1# cat /etc/os-release\nNAME=\"CentOS Stream\"\nVERSION=\"9\"\nID=\"centos\"\nID_LIKE=\"rhel fedora\"\nVERSION_ID=\"9\"\nPLATFORM_ID=\"platform:el9\"\nPRETTY_NAME=\"CentOS Stream 9\"\nANSI_COLOR=\"0;31\"\nLOGO=\"fedora-logo-icon\"\nCPE_NAME=\"cpe:/o:centos:centos:9\"\nHOME_URL=\"https://centos.org/\"\nBUG_REPORT_URL=\"https://bugzilla.redhat.com/\"\nREDHAT_SUPPORT_PRODUCT=\"Red Hat Enterprise Linux 9\"\nREDHAT_SUPPORT_PRODUCT_VERSION=\"CentOS Stream\"\n
"},{"location":"debug_virt_stack/privileged-node-debugging/#attach-to-a-running-process-eg-strace-or-gdb","title":"Attach to a running process (e.g strace or gdb)","text":"
This requires the field hostPID: true. In this way, you are able to list all the processes running on the node.
"},{"location":"debug_virt_stack/privileged-node-debugging/#debugging-using-crictl","title":"Debugging using crictl","text":"
crictl is a CLI for CRI runtimes and can be particularly useful to troubleshoot container failures (for a more detailed guide, please refer to this Kubernetes article).
In this example, we'll concentrate on finding where libvirt creates its files and directories in the compute container of the virt-launcher pod.
"},{"location":"debug_virt_stack/virsh-commands/","title":"Execute virsh commands in virt-launcher pod","text":"
virsh is a powerful utility for checking and troubleshooting the VM state, and it is already installed in the compute container of the virt-launcher pod.
For example, it is possible to run any QMP command.
For a full list of QMP commands, please refer to the QEMU documentation.
Then, you can, for example, pause and then unpause the guest and check the triggered events:
$ virtctl pause vmi vmi-ephemeral\nVMI vmi-ephemeral was scheduled to pause\n$ virtctl unpause vmi vmi-ephemeral\nVMI vmi-ephemeral was scheduled to unpause\n
From the monitored events:
$ kubectl exec -ti virt-launcher-vmi-ephemeral-nqcld -- virsh qemu-monitor-event --pretty --loop\nevent STOP at 1698405797.422823 for domain 'default_vmi-ephemeral': <null>\nevent RESUME at 1698405823.162458 for domain 'default_vmi-ephemeral': <null>\n
In order to create unique DNS records per VirtualMachineInstance, it is possible to set spec.hostname and spec.subdomain. If a subdomain is set and a headless service whose name matches the subdomain exists, kube-dns will create unique DNS entries for every VirtualMachineInstance which matches the selector of the service. Have a look at the DNS for Services and Pods documentation for additional information.
The following example consists of a VirtualMachine and a headless Service which matches the labels and the subdomain of the VirtualMachineInstance:
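A minimal sketch of such a pair is shown here. The names vmi-fedora, myvmi and mysubdomain come from this example; the label expose: me and the port values are arbitrary assumptions, and the domain and volumes of the VM are left out.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vmi-fedora
spec:
  template:
    metadata:
      labels:
        expose: me               # assumed label, matched by the Service selector
    spec:
      hostname: myvmi
      subdomain: mysubdomain     # must match the Service name
      # domain, volumes, etc. omitted for brevity
---
apiVersion: v1
kind: Service
metadata:
  name: mysubdomain
spec:
  clusterIP: None                # headless Service
  selector:
    expose: me
  ports:
  - name: foo                    # arbitrary port, not used by the example
    port: 1234
    targetPort: 1234
```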
As a consequence, when we enter the VirtualMachineInstance via e.g. virtctl console vmi-fedora and ping myvmi.mysubdomain, we find a DNS entry for myvmi.mysubdomain.default.svc.cluster.local which points to 10.244.0.57, which is the IP of the VirtualMachineInstance (not of the Service):
[fedora@myvmi ~]$ ping myvmi.mysubdomain\nPING myvmi.mysubdomain.default.svc.cluster.local (10.244.0.57) 56(84) bytes of data.\n64 bytes from myvmi.mysubdomain.default.svc.cluster.local (10.244.0.57): icmp_seq=1 ttl=64 time=0.029 ms\n[fedora@myvmi ~]$ ip a\n2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000\n link/ether 0a:58:0a:f4:00:39 brd ff:ff:ff:ff:ff:ff\n inet 10.244.0.57/24 brd 10.244.0.255 scope global dynamic eth0\n valid_lft 86313556sec preferred_lft 86313556sec\n inet6 fe80::858:aff:fef4:39/64 scope link\n valid_lft forever preferred_lft forever\n
So spec.hostname and spec.subdomain get translated to a DNS A-record of the form <vmi.spec.hostname>.<vmi.spec.subdomain>.<vmi.metadata.namespace>.svc.cluster.local. If no spec.hostname is set, then we fall back to the VirtualMachineInstance name itself. The resulting DNS A-record then looks like this: <vmi.metadata.name>.<vmi.spec.subdomain>.<vmi.metadata.namespace>.svc.cluster.local.
Adding an interface to a KubeVirt Virtual Machine first requires an interface to be added to a running pod. This is not trivial, and has some requirements:
Multus Dynamic Networks Controller: this daemon will listen to annotation changes, and trigger Multus to configure a new attachment for this pod.
Multus CNI running as a thick plugin: this Multus version exposes an endpoint to create attachments for a given pod on demand.
Note: For older KubeVirt versions (from v1.1 until v1.3), the HotplugNICs feature gate must be enabled. From KubeVirt v1.4, the feature gate is not needed and should be removed if set.
"},{"location":"network/hotplug_interfaces/#adding-an-interface-to-a-running-vm","title":"Adding an interface to a running VM","text":"
First start a VM. You can refer to the following example:
You should configure a network attachment definition - where the pod interface configuration is held. The snippet below shows an example of a very simple one:
Please refer to the Multus documentation for more information.
Once the virtual machine is running, and the attachment configuration provisioned, the user can request the interface hotplug operation by editing the VM spec template and adding the desired interface and network:
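As a sketch, the edited part of the VM spec template could look like the following. The interface name dyniface1 and the NetworkAttachmentDefinition name new-fancy-net are hypothetical; the default interface stands for whatever the VM already had.

```yaml
# partial example - kept short for brevity
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
          - name: default
            masquerade: {}
          - name: dyniface1        # new interface (hypothetical name)
            bridge: {}
      networks:
      - name: default
        pod: {}
      - name: dyniface1            # new network, same name as the interface
        multus:
          networkName: new-fancy-net   # hypothetical NetworkAttachmentDefinition
```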
Note: The virtctl addinterface and removeinterface commands are no longer available. Interfaces are hotplugged and unplugged by editing the VM spec template.
The interface and network will be added to the corresponding VMI object as well by KubeVirt.
You can now check the VMI status for the presence of this new interface:
kubectl get vmi vm-fedora -ojsonpath=\"{ @.status.interfaces }\"\n
"},{"location":"network/hotplug_interfaces/#removing-an-interface-from-a-running-vm","title":"Removing an interface from a running VM","text":"
Following the example above, the user can request an interface unplug operation by editing the VM spec template and setting the desired interface state to absent:
The interface in the corresponding VMI object will be set with state 'absent' as well by KubeVirt.
Note: Existing VMs from version v0.59.0 and below do not support hot-unplugging interfaces.
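A sketch of the unplug request is shown here, assuming a previously hotplugged interface with the hypothetical name dyniface1. Only the interface entry changes; the matching network entry stays in place.

```yaml
# partial example - kept short for brevity
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
          - name: default
            masquerade: {}
          - name: dyniface1        # hypothetical name
            state: absent          # request the unplug
            bridge: {}
      networks:
      - name: default
        pod: {}
      - name: dyniface1
        multus:
          networkName: new-fancy-net   # hypothetical NetworkAttachmentDefinition
```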
"},{"location":"network/hotplug_interfaces/#migration-based-hotplug","title":"Migration based hotplug","text":"
If your cluster doesn't run Multus as a thick plugin together with the Multus Dynamic Networks Controller, it's possible to hotplug an interface by migrating the VM.
The actual attachment won't take place immediately, and the new interface will be available in the guest once the migration is completed.
"},{"location":"network/hotplug_interfaces/#add-new-interface","title":"Add new interface","text":"
Add the desired interface and network to the VM spec template:
Please refer to the Live Migration documentation for more information.
Once the migration is completed the VM will have the new interface attached.
Note: It is recommended to avoid performing migrations in parallel to a hotplug operation. It is safer to ensure the hotplug succeeded, or at least reached the VMI specification, before issuing a migration.
Please refer to the Live Migration documentation for more information.
Once the VM is migrated, the interface will not exist in the migration target pod.
Note: It is recommended to avoid performing migrations in parallel to an unplug operation. It is safer to ensure the unplug succeeded, or at least reached the VMI specification, before issuing a migration.
Please refer to the Live Migration documentation for more information.
Once the VM is migrated, the interface will not exist in the migration target pod. Because the Kubernetes device plugin API cannot allocate resources dynamically, the SR-IOV device plugin cannot allocate additional SR-IOV resources for KubeVirt to hotplug. Thus, SR-IOV interface hotplug is limited to migration based hotplug only, regardless of Multus \"thick\" version.
The hotplugged interfaces have model: virtio. This imposes several limitations: each interface will consume a PCI slot in the VM, and there is a maximum of 32 slots in total. Furthermore, other devices will also use these PCI slots (e.g. disks, guest-agent, etc).
KubeVirt reserves resources for 4 interfaces to allow later hotplug operations. The actual maximum amount of available resources depends on the machine type (e.g. q35 adds another PCI slot). For more information on maximum limits, see the libvirt documentation.
Yet, upon a VM restart, the hotplugged interface will become part of the standard networks; this mitigates the maximum hotplug interfaces (per machine type) limitation.
Note: The user can execute this command against a stopped VM - i.e. a VM without an associated VMI. When this happens, KubeVirt mutates the VM spec template on behalf of the user.
"},{"location":"network/interfaces_and_networks/","title":"Interfaces and Networks","text":"
Connecting a virtual machine to a network consists of two parts. First, networks are specified in spec.networks. Then, interfaces backed by the networks are added to the VM by specifying them in spec.domain.devices.interfaces.
Each interface must have a corresponding network with the same name.
An interface defines a virtual network interface of a virtual machine. A network specifies the backend of an interface and declares which logical or physical device it is connected to.
There are multiple ways of configuring an interface as well as a network.
All possible configuration options are available in the Interface API Reference and Network API Reference.
Networks are configured in the VM's spec.template.spec.networks. A network must have a unique name.
Each network should declare its type by defining one of the following fields:
| Type | Description |
|------|-------------|
| pod | Default Kubernetes network |
| multus | Secondary network provided using Multus, or primary network when Multus is defined as default |"},{"location":"network/interfaces_and_networks/#pod","title":"pod","text":"
Represents the default (aka primary) pod interface (typically eth0) configured by the cluster network solution that is present in each pod. The main advantage of this network type is that it is native to Kubernetes, allowing VMs to benefit from all network services provided by Kubernetes.
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          interfaces:\n            - name: default\n              masquerade: {}\n      networks:\n      - name: default\n        pod: {} # Stock pod network\n
Secondary networks in Kubernetes allow pods to connect to additional networks beyond the default network, enabling more complex network topologies. These secondary networks are supported by meta-plugins like Multus, which let each pod attach to multiple network interfaces. KubeVirt supports the connection of VMs to secondary networks using Multus. This assumes that Multus is installed across your cluster and a corresponding NetworkAttachmentDefinition CRD was created.
The following example defines a secondary network which uses the bridge CNI plugin, which will connect the VM to Linux bridge br10. Other CNI plugins such as ptp, bridge-cni or sriov-cni might be used as well. For their installation and usage refer to the respective project documentation.
First the NetworkAttachmentDefinition needs to be created. That is usually done by an administrator. Users can then reference the definition.
With the following definition, the VM will be connected to the default pod network and to the secondary bridge network, referencing the NetworkAttachmentDefinition shown above (in the same namespace).
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          interfaces:\n            - name: default\n              masquerade: {}\n            - name: bridge-net\n              bridge: {}\n      networks:\n      - name: default\n        pod: {} # Stock pod network\n      - name: bridge-net\n        multus: # Secondary multus network\n          networkName: linux-bridge-net-ipam # ref to NAD name\n
"},{"location":"network/interfaces_and_networks/#multus-as-primary-network-provider","title":"Multus as primary network provider","text":"
It is also possible to define a Multus network as the default pod network by setting the VM's spec.template.spec.networks.multus.default to true. See the Multus documentation for further information.
Note: A Multus default network and a pod network type are mutually exclusive.
The multus delegate chosen as default must return at least one IP address.
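As a sketch, a VM using a Multus network as its primary network could be declared as follows. The NetworkAttachmentDefinition name primary-net is hypothetical, and note that no pod network is listed alongside it.

```yaml
# partial example - kept short for brevity
apiVersion: kubevirt.io/v1
kind: VirtualMachine
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
          - name: default
            bridge: {}
      networks:
      - name: default
        multus:
          default: true              # use this Multus network as the pod's primary network
          networkName: primary-net   # hypothetical NetworkAttachmentDefinition
```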
Network interfaces are configured in spec.domain.devices.interfaces. They describe properties of virtual interfaces as \"seen\" inside guest instances. The same network may be connected to a virtual machine in multiple different ways, each with their own connectivity guarantees and characteristics.
Note: Networks and interfaces must have a one-to-one relationship.
The mandatory interface configuration includes:
- A name, which references a network name
- The name of a supported core network binding from the table below, or a reference to a network binding plugin
| Type | Description |
|------|-------------|
| bridge | Connect using a linux bridge |
| sriov | Connect using a passthrough SR-IOV VF via vfio |
| masquerade | Connect using nftables rules to NAT the traffic both egress and ingress |
Each interface may also have additional configuration fields that modify properties \"seen\" inside guest instances, as listed below:
Name Format Default value Description model One of: e1000, e1000e, ne2k_pci, pcnet, rtl8139, virtiovirtio NIC type. Note: Use e1000 model if your guest image doesn't ship with virtio drivers macAddress ff:ff:ff:ff:ff:ff or FF-FF-FF-FF-FF-FF MAC address as seen inside the guest system, for example: de:ad:00:00:be:af ports empty (i.e. all ports) Allow-list of ports to be forwarded to the virtual machine pciAddress 0000:81:00.1 Set network interface PCI address, for example: 0000:81:00.1
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          interfaces:\n            - name: default\n              model: e1000 # expose e1000 NIC to the guest\n              masquerade: {} # connect through a masquerade\n              ports:\n                - name: http\n                  port: 80 # allow only http traffic ingress\n      networks:\n        - name: default\n          pod: {}\n
Note: For secondary interfaces, when a MAC address is specified for a virtual machine interface, it is passed to the underlying CNI plugin which is, in turn, expected to configure the network provider to allow for this particular MAC. Not every plugin has native support for custom MAC addresses.
Note: For some CNI plugins without native support for custom MAC addresses, there is a workaround, which is to use the tuning CNI plugin to adjust pod interface MAC address. This can be used as follows:
Name | Format | Required | Description
name | | no | Name
port | 1 - 65535 | yes | Port to expose
protocol | TCP,UDP | no | Connection protocol
If spec.domain.devices.interfaces is omitted, the virtual machine is connected using the default pod network interface of bridge type. If you'd like to have a virtual machine instance without any network connectivity, you can use the autoattachPodInterface field as follows:
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          autoattachPodInterface: false\n
In bridge mode, virtual machines are connected to the network backend through a linux \"bridge\". The pod network IPv4 address (if one exists) is delegated to the virtual machine via DHCPv4. The virtual machine should be configured to use DHCP to acquire IPv4 addresses.
Note: If a specific MAC address is not configured in the virtual machine interface spec, the MAC address from the relevant pod interface is delegated to the virtual machine.
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          interfaces:\n            - name: red\n              bridge: {} # connect through a bridge\n      networks:\n        - name: red\n          multus:\n            networkName: red\n
At this time, bridge mode doesn't support additional configuration fields.
Note: Due to IPv4 address delegation, in bridge mode the pod doesn't have an IP address configured, which may introduce issues with third-party solutions that rely on it. For example, Istio may not work in this mode.
Note: An admin can forbid the bridge interface type on pod networks via a designated configuration flag. To do so, the admin should set the following option to false:
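A minimal sketch of that setting, assuming the KubeVirt CR is named kubevirt and lives in the kubevirt namespace (adjust both to your installation):
apiVersion: kubevirt.io/v1\nkind: KubeVirt\nmetadata:\n  name: kubevirt\n  namespace: kubevirt\nspec:\n  configuration:\n    network:\n      permitBridgeInterfaceOnPodNetwork: false # forbid the bridge binding on the pod network\n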
Note: Binding the pod network using the bridge interface type may cause issues. Besides the third-party issue mentioned in the note above, live migration is not allowed with a pod network binding of bridge interface type, and some CNI plugins might not allow a custom MAC address for your VM instances. If you think you may be affected by any of the issues mentioned above, consider changing the default interface type to masquerade and disabling the bridge type for the pod network, as shown in the example above.
In masquerade mode, KubeVirt allocates internal IP addresses to virtual machines and hides them behind NAT. All the traffic exiting virtual machines is \"source NAT'ed\" using pod IP addresses; thus, cluster workloads should use the pod's IP address to contact the VM over this interface. This IP address is reported in the VMI's status.interfaces. A guest operating system should be configured to use DHCP to acquire IPv4 addresses.
To allow the VM to live-migrate or hard restart (both cause the VM to run on a different pod, with a different IP address) and still be reachable, it should be exposed by a Kubernetes service.
To allow traffic on specific ports into virtual machines, the ports section of the interface in the template should be configured as follows. If the ports section is missing, all ports are forwarded into the VM.
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          interfaces:\n            - name: red\n              masquerade: {} # connect using masquerade mode\n              ports:\n                - port: 80 # allow incoming traffic on port 80 to get into the virtual machine\n      networks:\n        - name: red\n          pod: {}\n
Note: Masquerade is only allowed to connect to the pod network.
Note: The network CIDR can be configured in the pod network section using the vmNetworkCIDR attribute.
"},{"location":"network/interfaces_and_networks/#masquerade-ipv4-and-ipv6-dual-stack-support","title":"masquerade - IPv4 and IPv6 dual-stack support","text":"
masquerade mode can be used in IPv4 and IPv6 dual-stack clusters to provide a VM with IP connectivity over both protocols.
As with the IPv4 masquerade mode, the VM can be contacted using the pod's IP address - which will be in this case two IP addresses, one IPv4 and one IPv6. Outgoing traffic is also \"NAT'ed\" to the pod's respective IP address from the given family.
Unlike in IPv4, the configuration of the IPv6 address and the default route is not automatic; it should be configured via cloud init, as shown below:
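# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          interfaces:\n            - name: red\n              masquerade: {} # connect using masquerade mode\n              ports:\n                - port: 80 # allow incoming traffic on port 80 to get into the virtual machine\n      networks:\n        - name: red\n          pod: {}\n      volumes:\n        - name: cloudinitdisk # a matching disk entry is also required (omitted here)\n          cloudInitNoCloud:\n            networkData: |\n              version: 2\n              ethernets:\n                eth0:\n                  dhcp4: true\n                  addresses: [ fd10:0:2::2/120 ] # IPv6 address of the VM\n                  gateway6: fd10:0:2::1 # IPv6 default gateway\n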
Note: The IPv6 address for the VM and default gateway must be the ones shown above.
masquerade mode can be used in IPv6 single stack clusters to provide a VM with IPv6-only connectivity.
As with the IPv4 masquerade mode, the VM can be contacted using the pod's IP address - which will be in this case the IPv6 one. Outgoing traffic is also \"NAT'ed\" to the pod's respective IPv6 address.
As with the dual-stack cluster, the configuration of the IPv6 address and the default route is not automatic; it should be configured via cloud init, as shown in the dual-stack section.
Unlike the dual-stack cluster, which has a DHCP server for IPv4, the IPv6 single stack cluster has no DHCP server at all. Therefore, the VM won't have the search domains information and reaching a destination using its FQDN is not possible. Tracking issue - https://github.com/kubevirt/kubevirt/issues/7184
With the sriov core network binding, SR-IOV Virtual Functions' PCI devices are directly exposed to virtual machines. The SR-IOV device plugin and CNI can be used to manage SR-IOV devices in Kubernetes, making them available for KubeVirt to consume. The device is passed through into the guest operating system as a host device, using the vfio userspace interface, to maintain high networking performance.
"},{"location":"network/interfaces_and_networks/#how-to-expose-sr-iov-vfs-to-kubevirt","title":"How to expose SR-IOV VFs to KubeVirt","text":"
To simplify the procedure, use the SR-IOV network operator to deploy and configure SR-IOV components in your cluster. For details on how to use the operator, please refer to its documentation.
Note: KubeVirt relies on VFIO userspace driver to pass PCI devices into VM guest. Because of that, when configuring SR-IOV operator policies, make sure you define a pool of VF resources that uses deviceType: vfio-pci.
"},{"location":"network/interfaces_and_networks/#start-an-sr-iov-vm","title":"Start an SR-IOV VM","text":"
Assuming that sriov-device-plugin and sriov-cni are deployed on the cluster nodes, create a network-attachment-definition CR as shown here. The name of the CR should correspond with the reference in the VM networks spec (see example below).
Finally, to create a VM that will attach to the aforementioned Network, refer to the following VM spec:
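The following sketch shows such a VM spec. The network name sriov-net and the NetworkAttachmentDefinition reference default/sriov-net are placeholders; use the name of the CR created in the previous step.
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          interfaces:\n            - name: sriov-net\n              sriov: {} # pass an SR-IOV VF through to the guest\n      networks:\n        - name: sriov-net\n          multus:\n            networkName: default/sriov-net # placeholder <namespace>/<NAD name>\n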
Note: for some NICs (e.g. Mellanox), the kernel module needs to be installed in the guest VM.
Note: Placement on dedicated CPUs can only be achieved if the Kubernetes CPU manager is running on the SR-IOV capable workers. For further details please refer to the dedicated cpu resources documentation.
MAC spoofing refers to the ability to generate traffic with an arbitrary source MAC address. An attacker may use this option to generate attacks on the network.
In order to protect against such scenarios, it is possible to enable the mac-spoof-check support in CNI plugins that support it.
The pod primary network which is served by the cluster network provider is not covered by this documentation. Please refer to the relevant provider to check how to enable spoofing check. The following text refers to the secondary networks, served using multus.
There are two known CNI plugins that support mac-spoof-check:
sriov-cni: Through the spoofchk parameter.
bridge-cni: Through the macspoofchk parameter.
The configuration is done on the NetworkAttachmentDefinition by the operator, and any interface that refers to it will have this feature enabled.
Below is an example of using the bridge CNI with macspoofchk enabled:
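A minimal sketch of such a NetworkAttachmentDefinition follows. The names br-spoof-check and br10 are placeholders, and the node bridge br10 is assumed to exist already.
apiVersion: k8s.cni.cncf.io/v1\nkind: NetworkAttachmentDefinition\nmetadata:\n  name: br-spoof-check # placeholder name\nspec:\n  config: |\n    {\n      \"cniVersion\": \"0.3.1\",\n      \"name\": \"br-spoof-check\",\n      \"type\": \"bridge\",\n      \"bridge\": \"br10\",\n      \"macspoofchk\": true\n    }\n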
"},{"location":"network/interfaces_and_networks/#limitations-and-known-issues","title":"Limitations and known issues","text":""},{"location":"network/interfaces_and_networks/#invalid-cnis-for-secondary-networks","title":"Invalid CNIs for secondary networks","text":"
The following list of CNIs is known not to work for bridge interfaces - which are most common for secondary interfaces.
macvlan
ipvlan
The reason is similar: the bridge interface type moves the pod interface MAC address to the VM, leaving the pod interface with a different address. The aforementioned CNIs require the pod interface to have the original MAC address.
These issues are tracked individually:
macvlan
ipvlan
Feel free to discuss and / or propose fixes for them; we'd like to have these plugins as valid options on our ecosystem.
The bridge CNI supports mac-spoof-check through nftables, therefore the node must support nftables and have the nft binary deployed.
There are two methods for the MTU to be propagated to the guest interface.
Libvirt - for this the guest machine needs a sufficiently recent virtio network driver that understands the data passed into the guest via a PCI config register in the emulated device.
DHCP - for this the guest DHCP client should be able to read the MTU from the DHCP server response.
On Windows guests with non-virtio interfaces, the MTU has to be set manually using netsh or another tool, since the Windows DHCP client doesn't request/read the MTU.
The table below summarizes the MTU propagation to the guest.
Interface model | masquerade | bridge with CNI IP | bridge with no CNI IP | Windows
virtio | DHCP & libvirt | DHCP & libvirt | libvirt | libvirt
non-virtio | DHCP | DHCP | X | X
bridge with CNI IP - means the CNI gives IP to the pod interface and bridge binding is used to bind the pod interface to the guest.
Setting networkInterfaceMultiqueue to true enables the multi-queue functionality, increasing the number of vhost queues, for interfaces configured with a virtio model.
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          networkInterfaceMultiqueue: true\n
Users of a Virtual Machine with multiple vCPUs may benefit from increased network throughput and performance.
Currently, the number of queues is determined by the number of vCPUs of a VM. This is because multi-queue support optimizes RX interrupt affinity and TX queue selection in order to make a specific queue private to a specific vCPU.
Without enabling the feature, network performance does not scale as the number of vCPUs increases. Guests cannot transmit or retrieve packets in parallel, as virtio-net has only one TX and RX queue.
Virtio interfaces advertise a field named queueCount on their status.interfaces.interface entry. The queueCount field indicates how many queues were assigned to the interface. The queue count value is derived from the domain XML. If the number of queues can't be determined (e.g. for an interface that is reported by the guest agent only), the field is omitted.
NOTE: Although the virtio-net multiqueue feature provides a performance benefit, it has some limitations and therefore should not be unconditionally enabled.
"},{"location":"network/interfaces_and_networks/#some-known-limitations","title":"Some known limitations","text":"
Guest OS is limited to ~200 MSI vectors. Each NIC queue requires an MSI vector, as does any virtio device or assigned PCI device. Defining an instance with multiple virtio NICs and vCPUs might lead to hitting the guest MSI limit.
virtio-net multiqueue works well for incoming traffic, but can occasionally cause a performance degradation for outgoing traffic. Specifically, this may occur when sending packets under 1,500 bytes over a Transmission Control Protocol (TCP) stream.
Enabling virtio-net multiqueue increases the total network throughput, but in parallel it also increases the CPU consumption.
Enabling virtio-net multiqueue in the host QEMU config does not enable the functionality in the guest OS. The guest OS administrator needs to manually turn it on for each guest NIC that requires this feature, using ethtool.
MSI vectors would still be consumed (wasted), if multiqueue was enabled in the host, but has not been enabled in the guest OS by the administrator.
In case the number of vNICs in a guest instance is proportional to the number of vCPUs, enabling the multiqueue feature is less important.
Each virtio-net queue consumes 64 KiB of kernel memory for the vhost driver.
NOTE: Virtio-net multiqueue should be enabled in the guest OS manually, using ethtool. For example: ethtool -L <NIC> combined #num_of_queues
For more information, please refer to KVM/QEMU MultiQueue.
"},{"location":"network/istio_service_mesh/","title":"Istio service mesh","text":"
A service mesh makes it possible to monitor, visualize and control traffic between pods. KubeVirt supports running VMs as part of an Istio service mesh.
"},{"location":"network/istio_service_mesh/#create-a-virtualmachineinstance-with-enabled-istio-proxy-injecton","title":"Create a VirtualMachineInstance with enabled Istio proxy injecton","text":"
The example below specifies a VMI with masquerade network interface and sidecar.istio.io/inject annotation to register the VM to the service mesh.
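A minimal sketch of such a VMI is shown here. The name vmi-istio, the app label and the memory request are placeholders, and disks and volumes are omitted.
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: vmi-istio # placeholder name\n  labels:\n    app: vmi-istio\n  annotations:\n    sidecar.istio.io/inject: \"true\" # request Istio sidecar injection\nspec:\n  domain:\n    devices:\n      interfaces:\n        - name: default\n          masquerade: {}\n    resources:\n      requests:\n        memory: 1024M\n  networks:\n    - name: default\n      pod: {}\n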
Verify that the istio-proxy sidecar is deployed and able to synchronize with the Istio control plane using the istioctl proxy-status command. See the Istio Debugging Envoy and Istiod documentation section for more information about the proxy-status subcommand.
"},{"location":"network/istio_service_mesh/#troubleshooting","title":"Troubleshooting","text":""},{"location":"network/istio_service_mesh/#istio-sidecar-is-not-deployed","title":"Istio sidecar is not deployed","text":"
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vmi-istio-jnw6p 2/2 Running 0 37s\n\n$ kubectl get pods virt-launcher-vmi-istio-jnw6p -o jsonpath='{.spec.containers[*].name}'\ncompute volumecontainerdisk\n
Resolution: Make sure the istio-injection=enabled label is added to the target namespace. If the issue persists, consult the relevant part of the Istio documentation.
"},{"location":"network/istio_service_mesh/#istio-sidecar-is-not-ready","title":"Istio sidecar is not ready","text":"
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vmi-istio-lg5gp 2/3 Running 0 90s\n\n$ kubectl describe pod virt-launcher-vmi-istio-lg5gp\n ...\n Warning Unhealthy 2d8h (x3 over 2d8h) kubelet Readiness probe failed: Get \"http://10.244.186.222:15021/healthz/ready\": dial tcp 10.244.186.222:15021: connect: no route to host\n Warning Unhealthy 2d8h (x4 over 2d8h) kubelet Readiness probe failed: Get \"http://10.244.186.222:15021/healthz/ready\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n
Resolution: Make sure the sidecar.istio.io/inject: \"true\" annotation is defined in the created VMI and that masquerade or passt binding is used for pod network interface.
"},{"location":"network/istio_service_mesh/#virt-launcher-pod-for-vmi-is-stuck-at-initialization-phase","title":"Virt-launcher pod for VMI is stuck at initialization phase","text":"
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vmi-istio-44mws 0/3 Init:0/3 0 29s\n\n$ kubectl describe pod virt-launcher-vmi-istio-44mws\n ...\n Multus: [default/virt-launcher-vmi-istio-44mws]: error loading k8s delegates k8s args: TryLoadPodDelegates: error in getting k8s network for pod: GetNetworkDelegates: failed getting the delegate: getKubernetesDelegate: cannot find a network-attachment-definition (istio-cni) in namespace (default): network-attachment-definitions.k8s.cni.cncf.io \"istio-cni\" not found\n
Resolution: Make sure the istio-cni NetworkAttachmentDefinition (provided in the Prerequisites section) is created in the target namespace.
A modular plugin which integrates with Kubevirt to implement a network binding.
Limited Support: Kubevirt provides regular support for the network binding plugin infrastructure for plugin authors. However, individual network plugin bindings are subject to limited, best-effort support from the Kubevirt community.
Clusters with Kubevirt deployments that utilize a network binding plugin should contact the plugin vendor for support on any issue that may be encountered, be it network or other issue.
In order to request support from the Kubevirt core project and its community, please use a setup without any network binding plugin. The plugin examples listed below are an exception to this rule, as they are maintained by the Kubevirt network core maintainers.
In order for a VM to have access to external network(s), several layers need to be defined and configured, depending on the required connectivity characteristics.
These layers include:
Host connectivity: Network provider.
Host to Pod connectivity: CNI.
Pod to domain connectivity: Network Binding.
This guide focuses on the Network Binding portion.
The network bindings have been part of Kubevirt core API and codebase. With the increase of the number of network bindings added and frequent requests to tweak and change the existing network bindings, a decision has been made to create a network binding plugin infrastructure.
The plugin infrastructure provides means to compose a network binding plugin and integrate it into Kubevirt in a modular manner.
KubeVirt provides several network binding plugins as references. The following plugins are available:
Depending on the plugin, some components need to be deployed in the cluster. Not all network binding plugins require all these components, therefore these steps are optional.
Binding CNI plugin: When it is required to change the pod network stack (and a core domain-attachment is not a fit), a custom CNI plugin is composed to serve the network binding plugin.
This binary needs to be deployed on each node of the cluster, like any other CNI plugin.
The binary can be built from source or consumed from an existing artifact.
Note: The location of the CNI plugins binaries depends on the platform used and its configuration. A frequently used path for such binaries is /opt/cni/bin/.
Binding NetworkAttachmentDefinition: It references the binding CNI plugin, with optional configuration settings. The manifest needs to be deployed on the cluster at a namespace which is accessible by the VM and its pod.
Note: It is possible to deploy the NetworkAttachmentDefinition on the default namespace, where all other namespaces can access it. Nevertheless, it is recommended (for security reasons) to define the NetworkAttachmentDefinition in the same namespace the VM resides.
Multus: In order for the network binding CNI and the NetworkAttachmentDefinition to operate, Multus needs to be deployed on the cluster. For more information, check the Quickstart Installation Guide.
Sidecar image: When a core domain-attachment is not a fit, a sidecar is used to configure the vNIC domain configuration. In more complex scenarios, the sidecar also runs services like DHCP to deliver IP information to the guest.
The sidecar image is built and usually pushed to an image registry for consumption. Therefore, the cluster needs to have access to the image.
The image can be built from source and pushed to an accessible registry or used from a given registry that already contains it.
Feature Gate: The network binding plugin is currently (v1.1.0) in Alpha stage and is protected by a feature gate (FG) named NetworkBindingPlugins.
It is therefore necessary to set the FG in the Kubevirt CR.
Example (valid when the FG subtree is already defined):
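As a sketch, the relevant part of the KubeVirt CR would look like this, assuming the CR is named kubevirt in the kubevirt namespace (adjust both to your installation):
apiVersion: kubevirt.io/v1\nkind: KubeVirt\nmetadata:\n  name: kubevirt\n  namespace: kubevirt\nspec:\n  configuration:\n    developerConfiguration:\n      featureGates:\n        - NetworkBindingPlugins # the name is case sensitive\n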
In order to use a network binding plugin, the cluster admin needs to register the binding. Registration includes the addition of the binding name with all its parameters to the Kubevirt CR.
The following (optional) parameters are currently supported:
Use the format <namespace>/<name> to specify the NetworkAttachmentDefinition that defines the CNI plugin and the configuration the binding plugin uses. Used when the binding plugin needs to change the pod network namespace."},{"location":"network/network_binding_plugins/#sidecarimage","title":"sidecarImage","text":"
From: v1.1.0
Specify a container image in a registry. Used when the binding plugin needs to modify the domain vNIC configuration or when a service needs to be executed (e.g. DHCP server).
The Domain Attachment type is a pre-defined core kubevirt method to attach an interface to the domain.
Specify the name of a core domain attachment type. A possible alternative to a sidecar, to configure the domain vNIC.
Supported types:
tap (from v1.1.1): The domain configuration is set to use an existing tap device. It also supports existing macvtap devices.
When both the domainAttachmentType and sidecarImage are specified, the domain will first be configured according to the domainAttachmentType and then the sidecarImage may modify it.
Specify whether the network binding plugin supports migration. It is possible to specify a migration method.
Supported migration method types:
link-refresh (from v1.2.0): After migration, the guest NIC is deactivated and then activated again. This can be useful to renew the DHCP lease.
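Putting the parameters together, a registration in the KubeVirt CR might look like the sketch below. The binding name passt, the NetworkAttachmentDefinition reference and the image location are placeholders, not values required by KubeVirt.
apiVersion: kubevirt.io/v1\nkind: KubeVirt\nmetadata:\n  name: kubevirt\n  namespace: kubevirt\nspec:\n  configuration:\n    network:\n      binding:\n        passt: # binding name, referenced from the VM spec\n          networkAttachmentDefinition: default/netbindingpasst # placeholder <namespace>/<name>\n          sidecarImage: registry.example.com/network-passt-binding:latest # placeholder image\n          migration:\n            method: link-refresh\n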
Note: In some deployments the Kubevirt CR is controlled by an external controller (e.g. HCO). In such cases, make sure to configure the wrapper operator/controller so the changes will get preserved.
Some plugins may need additional resources to be added to the compute container of the virt-launcher pod.
It is possible to specify compute resource overhead that will be added to the compute container of virt-launcher pods derived from virtual machines using the plugin.
Note: At the moment, only memory overhead requests are supported.
Note: In some deployments the Kubevirt CR is controlled by an external controller (e.g. HCO). In such cases, make sure to configure the wrapper operator/controller so the changes will get preserved.
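As a sketch, a memory overhead request can be added to a registered binding as follows. The binding name passt and the sidecar image are placeholders.
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: KubeVirt\nspec:\n  configuration:\n    network:\n      binding:\n        passt:\n          sidecarImage: registry.example.com/network-passt-binding:latest # placeholder image\n          computeResourceOverhead:\n            requests:\n              memory: 500Mi # added to the compute container of each virt-launcher pod\n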
Every compute container in a virt-launcher pod derived from a VM using the passt network binding plugin will have an additional 500Mi memory overhead.
When configuring the VM/VMI network interface, the binding plugin name can be specified. If it exists in the KubeVirt CR, it will be used to set up the network interface.
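For example, the following sketch references a binding registered under the name passt (an assumed registration; use the name present in your KubeVirt CR). The interface name passtnet is a placeholder.
# partial example - kept short for brevity\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n  template:\n    spec:\n      domain:\n        devices:\n          interfaces:\n            - name: passtnet\n              binding:\n                name: passt # must match a binding registered in the KubeVirt CR\n      networks:\n        - name: passtnet\n          pod: {}\n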
Before creating NetworkPolicy objects, make sure you are using a networking solution which supports NetworkPolicy. Network isolation is controlled entirely by NetworkPolicy objects. By default, all vmis in a namespace are accessible from other vmis and network endpoints. To isolate one or more vmis in a project, you can create NetworkPolicy objects in that namespace to indicate the allowed incoming connections.
Note: vmis and pods are treated equally by network policies, since labels are passed through to the pods which contain the running vmi. In other words, labels on vmis can be matched by spec.podSelector on the policy.
"},{"location":"network/networkpolicy/#create-networkpolicy-to-deny-all-traffic","title":"Create NetworkPolicy to Deny All Traffic","text":"
To make a project \"deny by default\" add a NetworkPolicy object that matches all vmis but accepts no traffic.
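A minimal sketch using the standard Kubernetes NetworkPolicy API; the policy name deny-by-default is a placeholder.
apiVersion: networking.k8s.io/v1\nkind: NetworkPolicy\nmetadata:\n  name: deny-by-default\nspec:\n  podSelector: {} # matches every pod, and therefore every vmi, in the namespace\n  ingress: [] # no ingress rules, so no incoming traffic is accepted\n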
"},{"location":"network/networkpolicy/#create-networkpolicy-to-only-accept-connections-from-vmis-within-namespaces","title":"Create NetworkPolicy to only Accept connections from vmis within namespaces","text":"
To make vmis accept connections from other vmis in the same namespace, but reject all other connections from vmis in other namespaces:
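A minimal sketch using the standard Kubernetes NetworkPolicy API; the policy name allow-same-namespace is a placeholder.
apiVersion: networking.k8s.io/v1\nkind: NetworkPolicy\nmetadata:\n  name: allow-same-namespace\nspec:\n  podSelector: {}\n  ingress:\n    - from:\n        - podSelector: {} # any pod or vmi in this namespace\n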
"},{"location":"network/networkpolicy/#create-networkpolicy-to-only-allow-http-and-https-traffic","title":"Create NetworkPolicy to only allow HTTP and HTTPS traffic","text":"
To enable only HTTP and HTTPS access to the vmis, add a NetworkPolicy object similar to:
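A minimal sketch, assuming the workloads listen on the standard HTTP and HTTPS ports 80 and 443; the policy name allow-http-https is a placeholder.
apiVersion: networking.k8s.io/v1\nkind: NetworkPolicy\nmetadata:\n  name: allow-http-https\nspec:\n  podSelector: {}\n  ingress:\n    - ports:\n        - protocol: TCP\n          port: 80\n        - protocol: TCP\n          port: 443\n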
"},{"location":"network/networkpolicy/#create-networkpolicy-to-deny-traffic-by-labels","title":"Create NetworkPolicy to deny traffic by labels","text":"
To make a specific vmi with the label type: test reject all traffic from other vmis, create:
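A minimal sketch using the standard Kubernetes NetworkPolicy API; the policy name deny-by-label is a placeholder.
apiVersion: networking.k8s.io/v1\nkind: NetworkPolicy\nmetadata:\n  name: deny-by-label\nspec:\n  podSelector:\n    matchLabels:\n      type: test # selects only vmis carrying this label\n  ingress: [] # no ingress rules, so no incoming traffic is accepted\n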
Once a VirtualMachineInstance is started, you can create a Service object to connect to it. Currently, three types of service are supported: ClusterIP, NodePort and LoadBalancer. The default type is ClusterIP.
Note: Labels on a VirtualMachineInstance are passed through to the pod, so simply add your labels for service creation to the VirtualMachineInstance. From there on it works like exposing any other k8s resource, by referencing these labels in a service.
"},{"location":"network/service_objects/#expose-virtualmachineinstance-as-a-clusterip-service","title":"Expose VirtualMachineInstance as a ClusterIP Service","text":"
Given a VirtualMachineInstance with the label special: key:
Notes:
If --target-port is not set, it will take the same value as --port.
The cluster IP is usually allocated automatically, but it may also be forced to a specific value using the --cluster-ip flag (assuming the value is in the valid range and not taken).
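A Service manifest for this case can be sketched as follows. The service name vmiservice and port 27017 are taken from the query output on this page; forwarding to the guest SSH port 22 is an assumption based on the ssh command used to connect.
apiVersion: v1\nkind: Service\nmetadata:\n  name: vmiservice\nspec:\n  type: ClusterIP\n  selector:\n    special: key # label set on the VirtualMachineInstance\n  ports:\n    - protocol: TCP\n      port: 27017\n      targetPort: 22 # assumed guest SSH port\n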
Query the service object:
$ kubectl get service\nNAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE\nvmiservice ClusterIP 172.30.3.149 <none> 27017/TCP 2m\n
You can connect to the VirtualMachineInstance by service IP and service port inside the cluster network:
$ ssh cirros@172.30.3.149 -p 27017\n
"},{"location":"network/service_objects/#expose-virtualmachineinstance-as-a-nodeport-service","title":"Expose VirtualMachineInstance as a NodePort Service","text":"
Expose the SSH port (22) of a VirtualMachineInstance running on KubeVirt by creating a NodePort service:
Notes:
If --node-port is not set, its value will be allocated dynamically (in the range above 30000).
If the --node-port value is set, it must be unique across all services.
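A Service manifest for this case can be sketched as follows. The service name nodeport and the 27017:30000 port mapping are taken from the query output on this page; the selector label special: key is an assumption carried over from the ClusterIP example.
apiVersion: v1\nkind: Service\nmetadata:\n  name: nodeport\nspec:\n  type: NodePort\n  selector:\n    special: key # assumed label on the VirtualMachineInstance\n  ports:\n    - protocol: TCP\n      port: 27017\n      targetPort: 22 # guest SSH port\n      nodePort: 30000\n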
The service can be listed by querying for the service objects:
$ kubectl get service\nNAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE\nnodeport NodePort 172.30.232.73 <none> 27017:30000/TCP 5m\n
Connect to the VirtualMachineInstance by using a node IP and node port outside the cluster network:
$ ssh cirros@$NODE_IP -p 30000\n
"},{"location":"network/service_objects/#expose-virtualmachineinstance-as-a-loadbalancer-service","title":"Expose VirtualMachineInstance as a LoadBalancer Service","text":"
Expose the RDP port (3389) of a VirtualMachineInstance running on KubeVirt by creating a LoadBalancer service. Here is an example:
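A minimal sketch of such a Service follows. The service name lbsvc and the selector label special: key are placeholders; the cluster is assumed to have a load balancer implementation available.
apiVersion: v1\nkind: Service\nmetadata:\n  name: lbsvc # placeholder name\nspec:\n  type: LoadBalancer\n  selector:\n    special: key # assumed label on the VirtualMachineInstance\n  ports:\n    - protocol: TCP\n      port: 3389\n      targetPort: 3389 # guest RDP port\n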
With the macvtap binding plugin, virtual machines are directly exposed to the Kubernetes nodes' L2 network. This is achieved by 'extending' an existing network interface with a virtual device that has its own MAC address.
Its main benefits are:
Direct connection to the node NIC with no intermediate bridges.
"},{"location":"network/net_binding_plugins/macvtap/#functionality-support","title":"Functionality support","text":"Functionality Support Run without extra capabilities (on pod) Yes Migration support No IPAM support (on pod) No Primary network (pod network) No Secondary network Yes"},{"location":"network/net_binding_plugins/macvtap/#known-issues","title":"Known Issues","text":"
Live migration is not fully supported, see issue #5912
Warning: On KinD clusters, the user needs to adjust the cluster configuration, mounting dev of the running host onto the KinD nodes, because of a known issue.
In order to use macvtap, the following points need to be covered:
Deploy the CNI plugin binary on the nodes.
Deploy the Device Plugin daemon on the nodes.
Configure which node interfaces are exposed.
Define a NetworkAttachmentDefinition that points to the CNI plugin.
"},{"location":"network/net_binding_plugins/macvtap/#macvtap-cni-and-dp-deployment-on-nodes","title":"Macvtap CNI and DP deployment on nodes","text":"
To simplify the procedure, use the Cluster Network Addons Operator to deploy and configure the macvtap components in your cluster.
The aforementioned operator effectively deploys the macvtap cni and device plugin.
"},{"location":"network/net_binding_plugins/macvtap/#expose-node-interface-to-the-macvtap-device-plugin","title":"Expose node interface to the macvtap device plugin","text":"
There are two different alternatives to configure which host interfaces get exposed to the user, enabling them to create macvtap interfaces on top of:
select the host interfaces: indicates which host interfaces are exposed.
expose all interfaces: all interfaces of all hosts are exposed.
Both options are configured via the macvtap-deviceplugin-config ConfigMap, and more information on how to configure it can be found in the macvtap-cni repo.
This is a minimal example, in which the eth0 interface of the Kubernetes nodes is exposed, via the lowerDevice attribute.
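A sketch of such a ConfigMap, following the format used by the macvtap-cni project. The resource name dataplane, the mode and the capacity are illustrative values, and the DP_MACVTAP_CONF key should be verified against the macvtap-cni repo.
apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: macvtap-deviceplugin-config\ndata:\n  DP_MACVTAP_CONF: |\n    [\n      {\n        \"name\": \"dataplane\",\n        \"lowerDevice\": \"eth0\",\n        \"mode\": \"bridge\",\n        \"capacity\": 50\n      }\n    ]\n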
This step can be omitted, since the default configuration of the aforementioned ConfigMap is to expose all host interfaces (which is represented by the following configuration):
The object should be created either in the \"default\" namespace, which all other namespaces can access, or in the same namespace the VMs reside in.
The requested k8s.v1.cni.cncf.io/resourceName annotation must point to an exposed host interface (via the lowerDevice attribute, on the macvtap-deviceplugin-config ConfigMap).
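A sketch of a NetworkAttachmentDefinition for macvtap follows. The object name macvtapnetwork and the MTU are placeholders, and the resource name macvtap.network.kubevirt.io/eth0 assumes that the device plugin exposes the node's eth0 interface under that name.
apiVersion: k8s.cni.cncf.io/v1\nkind: NetworkAttachmentDefinition\nmetadata:\n  name: macvtapnetwork # placeholder name\n  annotations:\n    k8s.v1.cni.cncf.io/resourceName: macvtap.network.kubevirt.io/eth0 # assumed resource name\nspec:\n  config: |\n    {\n      \"cniVersion\": \"0.3.1\",\n      \"name\": \"macvtapnetwork\",\n      \"type\": \"macvtap\",\n      \"mtu\": 1500\n    }\n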
The binding plugin replaces the experimental core macvtap binding implementation (including its API).
Note: The network binding plugin infrastructure and the macvtap plugin specifically are in Alpha stage. Please use them with care, preferably on a non-production deployment.
The macvtap binding plugin consists of the following components:
Macvtap CNI plugin.
The plugin needs to:
Enable the network binding plugin framework FG.
Register the binding plugin on the Kubevirt CR.
Reference the network binding by name from the VM spec interface.
Note: The macvtap plugin has no FG of its own. It is up to the cluster admin to decide if the plugin is to be available in the cluster. The macvtap binding is still in evaluation; use it with care.
Plug A Simple Socket Transport is an enhanced alternative to SLIRP, providing user-space network connectivity.
passt is a universal tool which implements a translation layer between a Layer-2 network interface and native Layer-4 sockets (TCP, UDP, ICMP/ICMPv6 echo) on a host.
Its main benefits are:
Doesn't require extra network capabilities such as CAP_NET_RAW and CAP_NET_ADMIN.
Allows integration with service meshes (which expect applications to run locally) out of the box.
Supports IPv6 out of the box (in contrast to the existing bindings which require configuring IPv6 manually).
"},{"location":"network/net_binding_plugins/passt/#functionality-support","title":"Functionality support","text":"Functionality Support Migration support Yes Service Mesh support Yes Pod IP in guest Yes Custom CIDR in guest No Require extra capabilities (on pod) to operate No Primary network (pod network) Yes Secondary network No"},{"location":"network/net_binding_plugins/passt/#node-optimization-requirementsrecommendations","title":"Node optimization requirements/recommendations:","text":"
To get better performance the node should be configured with:
To run multiple passt VMs with no explicit ports, the node's fs.file-max should be increased (for a VM that forwards all IPv4 and IPv6 ports, for TCP and UDP, passt needs to create ~2^18 sockets):
sysctl -w fs.file-max=9223372036854775807\n
NOTE: To achieve optimal memory consumption with Passt binding, specify ports required for your workload. When no ports are explicitly specified, all ports are forwarded, leading to memory overhead of up to 800 Mi.
The binding plugin replaces the experimental core passt binding implementation (including its API).
Note: The network binding plugin infrastructure and the passt plugin specifically are in Alpha stage. Please use them with care, preferably on a non-production deployment.
The passt binding plugin consists of the following components:
Passt CNI plugin.
Sidecar image.
As described in the definition & flow section, the passt plugin needs to:
Deploy the CNI plugin binary on the nodes.
Define a NetworkAttachmentDefinition that points to the CNI plugin.
Ensure access to the sidecar image.
Enable the network binding plugin framework FG.
Register the binding plugin on the KubeVirt CR.
Reference the network binding by name from the VM spec interface.
And in detail:
"},{"location":"network/net_binding_plugins/passt/#passt-cni-deployment-on-nodes","title":"Passt CNI deployment on nodes","text":"
The CNI plugin binary can be retrieved directly from the kubevirt release assets (on GitHub) or built from its sources.
Note: The kubevirt project uses Bazel to build the binaries and container images. For more information on how to build the whole project, visit the developer getting started guide.
Once the binary is ready, you may rename it to a meaningful name (e.g. kubevirt-passt-binding). This name is used in the NetworkAttachmentDefinition configuration.
Copy the binary to each node in your cluster. The location of the CNI plugins may vary between platforms and versions. One common path is /opt/cni/bin/.
Note: The passt plugin has no feature gate (FG) of its own. It is up to the cluster admin to decide whether the plugin is available in the cluster. The passt binding is still under evaluation; use it with care.
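The registration and VM reference steps could look roughly like the fragments below. This is a sketch under stated assumptions: the feature gate name NetworkBindingPlugins, the NetworkAttachmentDefinition default/netbindingpasst, and the sidecar image location are placeholders to adapt. The explicit port list follows the memory recommendation given earlier for the passt binding.

```yaml
# Fragment 1: KubeVirt CR. Enables the framework FG and registers the passt binding.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt                # assumption: default install namespace
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - NetworkBindingPlugins      # assumption: name of the framework FG
    network:
      binding:
        passt:
          # namespace/name of the NetworkAttachmentDefinition (hypothetical)
          networkAttachmentDefinition: default/netbindingpasst
          # placeholder: replace with the sidecar image your cluster can pull
          sidecarImage: registry.example.com/kubevirt/network-passt-binding:latest
---
# Fragment 2: network-related part of a VM template spec.
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: passtnet         # hypothetical interface name
              binding:
                name: passt
              ports:                 # forward only what the workload needs
                - name: http
                  port: 80
                  protocol: TCP
      networks:
        - name: passtnet
          pod: {}                    # passt supports the primary (pod) network
```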
The clone.kubevirt.io API Group defines resources for cloning KubeVirt objects. Currently, the only supported cloning type is VirtualMachine, but more types are planned to be supported in the future (see future roadmap below).
Please bear in mind that the clone API is in version v1alpha1. This means that this API is not fully stable yet and that APIs may change in the future.
Under the hood, the clone API relies upon Snapshot & Restore APIs. Therefore, in order to be able to use the clone API, please see Snapshot & Restore prerequisites.
As noted above, the clone API relies on the Snapshot & Restore APIs under the hood, so the Snapshot & Restore user-guide page is a useful reference for more info.
The source and target indicate the source/target API group, kind and name. A few important notes:
Currently, the only supported kinds are VirtualMachine (of the kubevirt.io API group) and VirtualMachineSnapshot (of the snapshot.kubevirt.io API group), but more types are expected to be supported in the future. See \"future roadmap\" below for more info.
The target name is optional. If unspecified, the clone controller will generate a name for the target automatically.
The target and source must reside in the same namespace.
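As an illustration of the source and target fields, a clone object might look like the following. This is a minimal sketch: the object and VM names (testclone, vm-cirros, vm-clone-target) are hypothetical, and the API version follows the v1alpha1 status mentioned above.

```yaml
apiVersion: clone.kubevirt.io/v1alpha1
kind: VirtualMachineClone
metadata:
  name: testclone                # hypothetical name
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-cirros              # hypothetical existing VM in the same namespace
  target:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-clone-target        # optional; generated automatically when omitted
```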
These spec fields determine which labels / annotations are copied to the target and which are stripped away.
The filters are a list of strings. Each string represents a key that may exist at the source. Every source key that matches one of these values is copied to the cloned target. In addition, special regular-expression-like characters can be used:
Wildcard character (*) can be used to match anything. A wildcard can only be used at the end of the filter.
These filters are valid:
\"*\"
\"some/key*\"
These filters are invalid:
\"some/*/key\"
\"*/key\"
Negation character (!) can be used to avoid matching certain keys. Negation can only be used at the beginning of a filter. Note that negation and a wildcard can be used together.
These filters are valid:
\"!some/key\"
\"!some/*\"
These filters are invalid:
\"key!\"
\"some/!key\"
Setting label / annotation filters is optional. If unset, all labels / annotations will be copied as a default.
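The filter rules above could be combined as in the fragment below. It is a sketch that assumes the filters live under spec.labelFilters and spec.annotationFilters of the clone object; the keys themselves are made up for illustration.

```yaml
# Fragment of a VirtualMachineClone spec
spec:
  labelFilters:
    - '*'                  # copy every label ...
    - '!some/key*'         # ... except labels whose key starts with some/key
  annotationFilters:
    - 'another/key*'       # copy only annotations whose key starts with another/key
```

Single quotes are used because an unquoted leading * or ! has a special meaning in YAML.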
Some network CNIs such as Kube-OVN or OVN-Kubernetes inject network information into the annotations of a VM. When such a VM is cloned, the cloned VM will use the same network information. To avoid this you can use template labels and annotation filters.
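A template filter for this case might look like the fragment below. This is only a sketch: it assumes the template filters are set under spec.template, and the annotation prefix is a hypothetical stand-in for whatever keys your CNI injects.

```yaml
# Fragment of a VirtualMachineClone spec
spec:
  template:
    labelFilters:
      - '*'                        # keep all template labels
    annotationFilters:
      - '*'                        # keep all template annotations ...
      - '!ovn.kubernetes.io/*'     # ... except CNI-injected ones (hypothetical prefix)
```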
This field is used to explicitly replace MAC addresses for certain interfaces. The field is a string to string map; the keys represent interface names and the values represent the new MAC address for the clone target.
This field is optional. By default, all MAC addresses are stripped out. This suits situations where kube-mac-pool is deployed in the cluster, which would automatically assign the target a fresh, valid MAC address.
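As a sketch of the map described above, assuming the field is named newMacAddresses and that the VM has an interface called default (both the interface name and the address are illustrative):

```yaml
# Fragment of a VirtualMachineClone spec
spec:
  newMacAddresses:
    # key: interface name in the source VM, value: MAC address for the clone
    default: '02:00:00:00:00:01'   # quoted so YAML treats it as a string
```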
The clone API is in an early alpha version and may change dramatically. Many improvements and features are expected to be added; the most significant goals are:
Add more supported source types, such as VirtualMachineInstance.
Add cross-namespace clone support. This needs to be supported for snapshots / restores first.
"},{"location":"storage/clone_api/#using-clones-as-a-golden-vm-image","title":"Using clones as a \"golden VM image\"","text":"
One of the great things that could be accomplished with the clone API when the source is of kind VirtualMachineSnapshot is to create \"golden VM images\" (a.k.a. Templates / Bookmark VMs / etc). In other words, the following workflow would be available:
Create a golden image
Create a VM
Prepare a \"golden VM\" environment
This can mean different things in different contexts. For example, write files, install applications, apply configurations, or anything else.
Snapshot the VM
Delete the VM
Then, this \"golden image\" can be duplicated as many times as needed. To instantiate a VM from the snapshot:
Create a Clone object where the source would point to the previously taken snapshot
Create as many VMs as you need
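Instantiating a VM from the golden snapshot could be sketched as follows. The snapshot and VM names are hypothetical; creating several clone objects with different target names yields several VMs.

```yaml
apiVersion: clone.kubevirt.io/v1alpha1
kind: VirtualMachineClone
metadata:
  name: clone-from-golden          # hypothetical name
spec:
  source:
    apiGroup: snapshot.kubevirt.io
    kind: VirtualMachineSnapshot
    name: golden-snapshot          # hypothetical, the previously taken snapshot
  target:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-from-golden-1         # change per clone, or omit to auto-generate
```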
This feature is still under discussion and may be implemented differently than explained here.
"},{"location":"storage/containerized_data_importer/","title":"Containerized Data Importer","text":"
The Containerized Data Importer (CDI) project provides facilities for enabling Persistent Volume Claims (PVCs) to be used as disks for KubeVirt VMs by way of DataVolumes. The three main CDI use cases are:
Import a disk image from a web server or container registry to a DataVolume
Clone an existing PVC to a DataVolume
Upload a local disk image to a DataVolume
This document deals with the third use case. So you should have CDI installed in your cluster, a VM disk that you'd like to upload, and virtctl in your path.
CDI supports the raw and qcow2 image formats which are supported by qemu. See the qemu documentation for more details. Bootable ISO images can also be used and are treated like raw images. Images may be compressed with either the gz or xz format.
The example in this document uses this CirrOS image
virtctl has an image-upload command with the following options:
virtctl image-upload --help\nUpload a VM image to a DataVolume/PersistentVolumeClaim.\n\nUsage:\n virtctl image-upload [flags]\n\nExamples:\n # Upload a local disk image to a newly created DataVolume:\n virtctl image-upload dv fedora-dv --size=10Gi --image-path=/images/fedora30.qcow2\n\n # Upload a local disk image to an existing DataVolume\n virtctl image-upload dv fedora-dv --no-create --image-path=/images/fedora30.qcow2\n\n # Upload a local disk image to a newly created PersistentVolumeClaim\n virtctl image-upload pvc fedora-pvc --size=10Gi --image-path=/images/fedora30.qcow2\n\n # Upload a local disk image to a newly created PersistentVolumeClaim and label it with a default instance type and preference\n virtctl image-upload pvc fedora-pvc --size=10Gi --image-path=/images/fedora30.qcow2 --default-instancetype=n1.medium --default-preference=fedora\n\n # Upload a local disk image to an existing PersistentVolumeClaim\n virtctl image-upload pvc fedora-pvc --no-create --image-path=/images/fedora30.qcow2\n\n # Upload to a DataVolume with explicit URL to CDI Upload Proxy\n virtctl image-upload dv fedora-dv --uploadproxy-url=https://cdi-uploadproxy.mycluster.com --image-path=/images/fedora30.qcow2\n\n # Upload a local disk archive to a newly created DataVolume:\n virtctl image-upload dv fedora-dv --size=10Gi --archive-path=/images/fedora30.tar\n\nFlags:\n --access-mode string The access mode for the PVC.\n --archive-path string Path to the local archive.\n --default-instancetype string The default instance type to associate with the image.\n --default-instancetype-kind string The default instance type kind to associate with the image.\n --default-preference string The default preference to associate with the image.\n --default-preference-kind string The default preference kind to associate with the image.\n --force-bind Force bind the PVC, ignoring the WaitForFirstConsumer logic.\n -h, --help help for image-upload\n --image-path string Path to the local VM image.\n 
--insecure Allow insecure server connections when using HTTPS.\n --no-create Don't attempt to create a new DataVolume/PVC.\n --size string The size of the DataVolume to create (ex. 10Gi, 500Mi).\n --storage-class string The storage class for the PVC.\n --uploadproxy-url string The URL of the cdi-upload proxy service.\n --volume-mode string Specify the VolumeMode (block/filesystem) used to create the PVC. Default is the storageProfile default. For archive upload default is filesystem.\n --wait-secs uint Seconds to wait for upload pod to start. (default 300)\n\nUse \"virtctl options\" for a list of global command-line options (applies to all commands).\n
virtctl image-upload works by creating a DataVolume of the requested size, sending an UploadTokenRequest to the cdi-apiserver, and uploading the file to the cdi-uploadproxy.
virtctl image-upload dv cirros-vm-disk --size=500Mi --image-path=/home/mhenriks/images/cirros-0.4.0-x86_64-disk.img --uploadproxy-url=<url to upload proxy service>\n
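For orientation, the DataVolume that such a command creates might resemble the manifest below. This is a hedged sketch rather than the exact object virtctl generates: the access mode is an assumption, and the real object depends on the flags and the storage defaults of the cluster.

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: cirros-vm-disk
spec:
  source:
    upload: {}                 # the data arrives through the cdi-uploadproxy
  pvc:
    accessModes:
      - ReadWriteOnce          # assumption: see the --access-mode flag
    resources:
      requests:
        storage: 500Mi         # matches --size=500Mi
```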
"},{"location":"storage/containerized_data_importer/#addressing-certificate-issues-when-uploading-images","title":"Addressing Certificate Issues when Uploading Images","text":"
Issues with the certificates can be circumvented by using the --insecure flag to prevent the virtctl command from verifying the remote host. It is better to resolve certificate issues that prevent uploading images using the virtctl image-upload command and not use the --insecure flag.
The following are some common issues with certificates and some easy ways to fix them.
"},{"location":"storage/containerized_data_importer/#does-not-contain-any-ip-sans","title":"Does not contain any IP SANs","text":"
This issue happens when trying to upload images using an IP address instead of a resolvable name. For example, trying to upload to the IP address 192.168.39.32 at port 31001 would produce the following error.
virtctl image-upload dv f33 \\\n --size 5Gi \\\n --image-path Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \\\n --uploadproxy-url https://192.168.39.32:31001\n\nPVC default/f33 not found \nDataVolume default/f33 created\nWaiting for PVC f33 upload pod to be ready...\nPod now ready\nUploading data to https://192.168.39.32:31001\n\n 0 B / 193.89 MiB [-------------------------------------------------------] 0.00% 0s\n\nPost https://192.168.39.32:31001/v1beta1/upload: x509: cannot validate certificate for 192.168.39.32 because it doesn't contain any IP SANs\n
It is easily fixed by adding an entry to your local name resolution service. This could be a DNS server or the local hosts file. The URL used for the upload proxy should be changed to reflect the resolvable name.
The Subject and the Subject Alternative Name in the certificate contain valid names that can be used for resolution. Only one of these names needs to be resolvable. Use the openssl command to view the names of the cdi-uploadproxy service.
Adding the following entry to the /etc/hosts file, if it provides name resolution, should fix this issue. Any service that provides name resolution for the system could be used.
virtctl image-upload dv f33 \\\n --size 5Gi \\\n --image-path Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \\\n --uploadproxy-url https://cdi-uploadproxy:31001\n\nPVC default/f33 not found \nDataVolume default/f33 created\nWaiting for PVC f33 upload pod to be ready...\nPod now ready\nUploading data to https://cdi-uploadproxy:31001\n\n 193.89 MiB / 193.89 MiB [=============================================] 100.00% 1m38s\n\nUploading data completed successfully, waiting for processing to complete, you can hit ctrl-c without interrupting the progress\nProcessing completed successfully\nUploading Fedora-Cloud-Base-33-1.2.x86_64.raw.xz completed successfully\n
"},{"location":"storage/containerized_data_importer/#certificate-signed-by-unknown-authority","title":"Certificate Signed by Unknown Authority","text":"
This happens because the cdi-uploadproxy certificate is self-signed and the system does not trust the cdi-uploadproxy as a Certificate Authority.
virtctl image-upload dv f33 \\\n --size 5Gi \\\n --image-path Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \\\n --uploadproxy-url https://cdi-uploadproxy:31001\n\nPVC default/f33 not found \nDataVolume default/f33 created\nWaiting for PVC f33 upload pod to be ready...\nPod now ready\nUploading data to https://cdi-uploadproxy:31001\n\n 0 B / 193.89 MiB [-------------------------------------------------------] 0.00% 0s\n\nPost https://cdi-uploadproxy:31001/v1beta1/upload: x509: certificate signed by unknown authority\n
This can be fixed by adding the certificate to the system's trust store. Download the cdi-uploadproxy-server-cert.
virtctl image-upload dv f33 \\\n --size 5Gi \\\n --image-path Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \\\n --uploadproxy-url https://cdi-uploadproxy:31001\n\nPVC default/f33 not found \nDataVolume default/f33 created\nWaiting for PVC f33 upload pod to be ready...\nPod now ready\nUploading data to https://cdi-uploadproxy:31001\n\n 193.89 MiB / 193.89 MiB [=============================================] 100.00% 1m36s\n\nUploading data completed successfully, waiting for processing to complete, you can hit ctrl-c without interrupting the progress\nProcessing completed successfully\nUploading Fedora-Cloud-Base-33-1.2.x86_64.raw.xz completed successfully\n
"},{"location":"storage/containerized_data_importer/#setting-the-url-of-the-cdi-upload-proxy-service","title":"Setting the URL of the cdi-upload Proxy Service","text":"
Setting the URL for the cdi-upload proxy service allows the virtctl image-upload command to upload the images without specifying the --uploadproxy-url flag. Permanently setting the URL is done by patching the CDI configuration.
The following will set the default upload proxy to use port 31001 of cdi-uploadproxy. An IP address could also be used instead of the DNS name.
See the section Addressing Certificate Issues when Uploading for why cdi-uploadproxy was chosen and issues that can be encountered when using an IP address.
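Expressed declaratively, the patched CDI configuration could look like the fragment below. It is a sketch that assumes the CDI custom resource is named cdi and that the override field is spec.config.uploadProxyURLOverride; verify both against your installation before applying.

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  name: cdi                    # assumption: default name of the CDI CR
spec:
  config:
    # default URL used by virtctl image-upload when --uploadproxy-url is omitted
    uploadProxyURLOverride: https://cdi-uploadproxy:31001
```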
"},{"location":"storage/containerized_data_importer/#connect-to-virtualmachineinstance-console","title":"Connect to VirtualMachineInstance console","text":"
Use virtctl to connect to the newly created VirtualMachineInstance.
virtctl console cirros-vm\n
"},{"location":"storage/disks_and_volumes/","title":"Filesystems, Disks and Volumes","text":"
Making persistent storage in the cluster (volumes) accessible to VMs consists of three parts. First, volumes are specified in spec.volumes. Second, disks are added to the VM by specifying them in spec.domain.devices.disks. Finally, a reference to the specified volume is added to the disk specification by name.
Like all other vmi devices a spec.domain.devices.disks element has a mandatory name, and furthermore, the disk's name must reference the name of a volume inside spec.volumes.
A disk can be made accessible via four different types:
lun
disk
cdrom
filesystems
All possible configuration options are available in the Disk API Reference.
All types allow you to specify the bus attribute. The bus attribute determines how the disk will be presented to the guest operating system.
It is possible to reserve a LUN through the SCSI Persistent Reserve commands. In order to issue privileged SCSI ioctls, the VM requires activation of the persistent reservation flag:
Note: The persistent reservation feature enables an additional privileged component to be deployed together with virt-handler. Because this feature allows for sensitive security procedures, it is disabled by default and requires cluster administrator configuration.
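A minimal sketch of a LUN with the reservation flag set is shown below. It assumes the cluster administrator has already enabled the feature (assumed here to be the PersistentReservation feature gate) and that mypvc is backed by a SCSI-capable block volume.

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      disks:
        - name: mypvcdisk
          lun:
            reservation: true      # allow SCSI persistent reservation commands
  volumes:
    - name: mypvcdisk
      persistentVolumeClaim:
        claimName: mypvc
```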
A disk disk will expose the volume as an ordinary disk to the VM.
A minimal example which attaches a PersistentVolumeClaim named mypvc as a disk device to the VM:
metadata:\n name: testvmi-disk\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nspec:\n domain:\n resources:\n requests:\n memory: 64M\n devices:\n disks:\n - name: mypvcdisk\n # This makes it a disk\n disk: {}\n volumes:\n - name: mypvcdisk\n persistentVolumeClaim:\n claimName: mypvc\n
You can set the disk bus type, overriding the defaults, which in turn depend on the chipset the VM is configured to use:
metadata:\n name: testvmi-disk\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nspec:\n domain:\n resources:\n requests:\n memory: 64M\n devices:\n disks:\n - name: mypvcdisk\n # This makes it a disk\n disk:\n # This makes it exposed as /dev/vda, being the only and thus first\n # disk attached to the VM\n bus: virtio\n volumes:\n - name: mypvcdisk\n persistentVolumeClaim:\n claimName: mypvc\n
A cdrom disk will expose the volume as a cdrom drive to the VM. It is read-only by default.
A minimal example which attaches a PersistentVolumeClaim named mypvc as a cdrom device to the VM:
metadata:\n name: testvmi-cdrom\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nspec:\n domain:\n resources:\n requests:\n memory: 64M\n devices:\n disks:\n - name: mypvcdisk\n # This makes it a cdrom\n cdrom:\n # This makes the cdrom writeable\n readonly: false\n # This makes the cdrom be exposed as SATA device\n bus: sata\n volumes:\n - name: mypvcdisk\n persistentVolumeClaim:\n claimName: mypvc\n
A filesystem device will expose the volume as a filesystem to the VM. filesystems rely on virtiofs to make visible external filesystems to KubeVirt VMs. Further information about virtiofs can be found at the Official Virtiofs Site.
Compared with disk, filesystems allow changes in the source to be dynamically reflected in the volumes inside the VM. For instance, if a given configMap is shared with filesystems any change made on it will be reflected in the VMs. However, it is important to note that filesystems do not allow live migration.
Additionally, filesystem devices must be mounted inside the VM. This can be done through cloudInitNoCloud or by manually connecting to the VM shell and running the same command. The main point to understand is how the device tag identifies the new filesystem when mounting it with the mount -t virtiofs [device tag] [path] command. The tag is the name assigned to the filesystem in the VM spec, spec.domain.devices.filesystems.name. For instance, if a given VM spec sets spec.domain.devices.filesystems.name: foo, the command required inside the VM to mount the filesystem at /tmp/foo is mount -t virtiofs foo /tmp/foo.
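The relation between the filesystem name and the mount tag can be sketched as follows. The ConfigMap name app-config is hypothetical; any volume source that supports filesystems could take its place.

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      filesystems:
        - name: foo            # becomes the virtiofs device tag inside the guest
          virtiofs: {}
  volumes:
    - name: foo                # must match the filesystem name above
      configMap:
        name: app-config       # hypothetical ConfigMap
# Inside the guest: mount -t virtiofs foo /tmp/foo
```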
Note: As stated, filesystems rely on virtiofs, which requires Linux kernel support in the VM. To check whether the VM's Linux image has the required support, run modprobe virtiofs. If the command output is modprobe: FATAL: Module virtiofs not found, the Linux image of the VM does not support virtiofs. You can also check that the kernel version is at least 5.4 on any Linux distribution, or at least 4.18 on CentOS/RHEL, by running uname -r.
Refer to section Sharing Directories with VMs for usage examples of filesystems.
The error policy controls how the hypervisor should behave when an IO error occurs on a disk read or write. The default behaviour is to stop the guest and a Kubernetes event is generated. However, it is possible to change the value to either:
report: the error is reported in the guest
ignore: the error is ignored, but the read/write failure goes undetected
enospace: error when there isn't enough space on the disk
The error policy can be specified per disk or lun.
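A per-disk error policy could be set as in the fragment below. This sketch assumes the field is named errorPolicy and sits next to the disk type; the disk and claim names are illustrative.

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      disks:
        - name: mypvcdisk
          disk:
            bus: virtio
          errorPolicy: report    # report IO errors to the guest instead of stopping it
  volumes:
    - name: mypvcdisk
      persistentVolumeClaim:
        claimName: mypvc
```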
Allows attaching cloudInitNoCloud data-sources to the VM. If the VM contains a proper cloud-init setup, it will pick up the disk as a user-data source.
A simple example which attaches a Secret as a cloud-init disk datasource may look like this:
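A sketch of such a setup is shown below. The Secret name testsecret is hypothetical, and the Secret is assumed to carry the cloud-init user data (commonly under a userdata key).

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      disks:
        - name: mynoclouddisk
          disk: {}
  volumes:
    - name: mynoclouddisk
      cloudInitNoCloud:
        secretRef:
          name: testsecret     # hypothetical Secret holding the user data
```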
Allows attaching cloudInitConfigDrive data-sources to the VM. If the VM contains a proper cloud-init setup, it will pick up the disk as a user-data source.
A simple example which attaches a Secret as a cloud-init disk datasource may look like this:
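The ConfigDrive variant follows the same pattern with a different volume source. As before, testsecret is a hypothetical Secret assumed to hold the user data.

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      disks:
        - name: myconfigdrivedisk
          disk: {}
  volumes:
    - name: myconfigdrivedisk
      cloudInitConfigDrive:
        secretRef:
          name: testsecret     # hypothetical Secret holding the user data
```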
Allows connecting a PersistentVolumeClaim to a VM disk.
Use a PersistentVolumeClaim when the VirtualMachineInstance's disk needs to persist after the VM terminates. This allows for the VM's data to remain persistent between restarts.
A PersistentVolume can be in \"filesystem\" or \"block\" mode:
Filesystem: For KubeVirt to be able to consume the disk present on a PersistentVolume's filesystem, the disk must be named disk.img and be placed in the root path of the filesystem. Currently the disk is also required to be in raw format.
Important: The disk.img image file needs to be owned by the user-id 107 in order to avoid permission issues.
Note: If the disk.img image file has not been created manually before starting a VM then it will be created automatically with the PersistentVolumeClaim size. Since not every storage provisioner provides volumes with the exact usable amount of space as requested (e.g. due to filesystem overhead), KubeVirt tolerates up to 10% less available space. This can be configured with the developerConfiguration.pvcTolerateLessSpaceUpToPercent value in the KubeVirt CR (kubectl edit kubevirt kubevirt -n kubevirt).
Block: Use a block volume for consuming raw block devices. Note: you need to enable the BlockVolume feature gate.
A simple example which attaches a PersistentVolumeClaim as a disk may look like this:
"},{"location":"storage/disks_and_volumes/#thick-and-thin-volume-provisioning","title":"Thick and thin volume provisioning","text":"
Sparsification can make a disk thin-provisioned; in other words, it converts the freed space within the disk image into free space back on the host. The fstrim utility can be used on a mounted filesystem to discard the blocks not used by the filesystem. In order to be able to sparsify a disk inside the guest, the disk needs to be configured in the libvirt xml with the option discard=unmap. In KubeVirt, every disk is passed with this option enabled by default. It is possible to check whether the trim configuration is supported in the guest by running lsblk -D and checking the discard options supported on every disk.
However, in certain cases, such as preallocation or when the disk is thick provisioned, the option needs to be disabled. The disk's PVC has to be marked with an annotation that contains /storage.preallocation or /storage.thick-provisioned, and set to true. If the volume is preprovisioned using CDI and preallocation is enabled, then the PVC is automatically annotated with cdi.kubevirt.io/storage.preallocation: true and the discard passthrough option is disabled.
Example of a PVC definition with the annotation to disable discard passthrough:
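A sketch of such a PVC is given below. The annotation prefix user.custom.annotation, the storage class, and the size are illustrative; only the /storage.thick-provisioned suffix and the true value are taken from the description above.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc
  annotations:
    # any prefix works as long as the key contains /storage.thick-provisioned
    user.custom.annotation/storage.thick-provisioned: 'true'
spec:
  storageClassName: local      # hypothetical storage class
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

The value is quoted because Kubernetes annotation values must be strings.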
For some storage methods, Kubernetes may support expanding storage in-use (allowVolumeExpansion feature). KubeVirt can respond to it by making the additional storage available for the virtual machines. This feature is currently off by default, and requires enabling a feature gate. To enable it, add the ExpandDisks feature gate in the kubevirt object:
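Enabling the gate could look like the fragment below. It assumes KubeVirt is installed in the kubevirt namespace with a CR named kubevirt.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt          # assumption: default install namespace
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - ExpandDisks
```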
Enabling this feature does two things:
Notify the virtual machine about size changes.
If the disk is a Filesystem PVC, the matching file is expanded to the remaining size (while reserving some space for file system overhead).
To use an externally managed local block device from a host (e.g. /dev/sdb, zvol, LVM, etc.) in a VM directly, you would need a provisioner that supports block devices, such as OpenEBS LocalPV.
Alternatively, local volumes can be provisioned by hand. For example, the following PVC:
DataVolumes are a way to automate importing virtual machine disks onto PVCs during the virtual machine's launch flow. Without using a DataVolume, users have to prepare a PVC with a disk image before assigning it to a VM or VMI manifest. With a DataVolume, both the PVC creation and import is automated on behalf of the user.
"},{"location":"storage/disks_and_volumes/#datavolume-vm-behavior","title":"DataVolume VM Behavior","text":"
DataVolumes can be defined in the VM spec directly by adding the DataVolumes to the dataVolumeTemplates list. Below is an example.
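A sketch of such a VM is shown below. The names, the requested size, and the image URL are placeholders, and the VM is left halted so that it can be started once the import is understood to work.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-alpine-datavolume         # hypothetical name
spec:
  runStrategy: Halted
  template:
    spec:
      domain:
        devices:
          disks:
            - name: datavolumedisk1
              disk:
                bus: virtio
        resources:
          requests:
            memory: 64M
      volumes:
        - name: datavolumedisk1
          dataVolume:
            name: alpine-dv          # refers to the template below
  dataVolumeTemplates:
    - metadata:
        name: alpine-dv
      spec:
        source:
          http:
            url: http://example.com/images/alpine.iso   # placeholder URL
        pvc:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
```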
You can see that the DataVolume defined in the dataVolumeTemplates section has two parts: the source and the pvc.
The source part declares that there is a disk image living on an http server that we want to use as a volume for this VM. The pvc part declares the spec that should be used to create the PVC that hosts the source data.
When this VM manifest is posted to the cluster, as part of the launch flow a PVC will be created using the spec provided and the source data will be automatically imported into that PVC before the VM starts. When the VM is deleted, the storage provisioned by the DataVolume will automatically be deleted as well.
For a VMI object, DataVolumes can be referenced as a volume source for the VMI. When this is done, it is expected that the referenced DataVolume exists in the cluster. The VMI will consume the DataVolume, but the DataVolume's life-cycle will not be tied to the VMI.
Below is an example of a DataVolume being referenced by a VMI. It is expected that the DataVolume alpine-datavolume was created prior to posting the VMI manifest to the cluster. It is okay to post the VMI manifest to the cluster while the DataVolume is still having data imported. KubeVirt knows not to start the VMI until all referenced DataVolumes have finished their clone and import phases.
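A minimal sketch of such a VMI follows. Only the DataVolume name alpine-datavolume comes from the text above; the VMI and disk names are illustrative.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-alpine-datavolume      # hypothetical name
spec:
  domain:
    devices:
      disks:
        - name: disk1
          disk:
            bus: virtio
    resources:
      requests:
        memory: 64M
  volumes:
    - name: disk1
      dataVolume:
        name: alpine-datavolume    # must exist before the VMI can start
```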
A DataVolume is a custom resource provided by the Containerized Data Importer (CDI) project. KubeVirt integrates with CDI in order to provide users a workflow for dynamically creating PVCs and importing data into those PVCs.
In order to take advantage of the DataVolume volume source on a VM or VMI, CDI must be installed.
Installing CDI
Go to the CDI release page
Pick the latest stable release and post the corresponding cdi-controller-deployment.yaml manifest to your cluster.
An ephemeral volume is a local COW (copy on write) image that uses a network volume as a read-only backing store. With an ephemeral volume, the network backing store is never mutated. Instead all writes are stored on the ephemeral image which exists on local storage. KubeVirt dynamically generates the ephemeral images associated with a VM when the VM starts, and discards the ephemeral images when the VM stops.
Ephemeral volumes are useful in any scenario where disk persistence is not desired. The COW image is discarded when VM reaches a final state (e.g., succeeded, failed).
Currently, only PersistentVolumeClaim may be used as a backing store of the ephemeral volume.
Up-to-date information on supported backing stores can be found in the KubeVirt API.
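An ephemeral volume backed by a PVC could be declared as in this fragment. The claim name mypvc is an assumption carried over from the earlier disk examples.

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      disks:
        - name: mypvcdisk
          disk: {}
  volumes:
    - name: mypvcdisk
      ephemeral:
        persistentVolumeClaim:
          claimName: mypvc     # read-only backing store; writes go to a local COW image
```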
containerDisk was originally named registryDisk; please update your code where needed.
The containerDisk feature provides the ability to store and distribute VM disks in the container image registry. containerDisks can be assigned to VMs in the disks section of the VirtualMachineInstance spec.
No network shared storage devices are utilized by containerDisks. The disks are pulled from the container registry and reside on the local node hosting the VMs that consume the disks.
"},{"location":"storage/disks_and_volumes/#when-to-use-a-containerdisk","title":"When to use a containerDisk","text":"
containerDisks are ephemeral storage devices that can be assigned to any number of active VirtualMachineInstances. This makes them an ideal tool for users who want to replicate a large number of VM workloads that do not require persistent data. containerDisks are commonly used in conjunction with VirtualMachineInstanceReplicaSets.
"},{"location":"storage/disks_and_volumes/#when-not-to-use-a-containerdisk","title":"When Not to use a containerDisk","text":"
containerDisks are not a good solution for any workload that requires persistent root disks across VM restarts.
Users can inject a VirtualMachineInstance disk into a container image in a way that is consumable by the KubeVirt runtime. Disks must be placed into the /disk directory inside the container. Raw and qcow2 formats are supported. Qcow2 is recommended in order to reduce the container image's size. containerDisks can and should be based on scratch. No content except the image is required.
Note: Prior to kubevirt 0.20, the containerDisk image needed to have kubevirt/container-disk-v1alpha as base image.
Note: The containerDisk needs to be readable for the user with the UID 107 (qemu).
Example: Inject a local VirtualMachineInstance disk into a container image.
Note that a containerDisk is file-based and therefore cannot be attached as a lun device to the VM.
"},{"location":"storage/disks_and_volumes/#custom-disk-image-path","title":"Custom disk image path","text":"
containerDisk also allows disk images to be stored in any folder when required. The process is the same as before. The main difference is that, for a custom location, kubevirt does not scan for an image; it is your responsibility to provide the full path to the disk image. Providing the image path is optional. When no path is provided, kubevirt searches for disk images in the default location: /disk.
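A volume using a custom image path might be declared as follows. The image reference and the path are placeholders for illustration.

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      disks:
        - name: containerdisk
          disk: {}
  volumes:
    - name: containerdisk
      containerDisk:
        image: registry.example.com/vmidisks/fedora:latest   # placeholder image
        path: /custom-disk-path/fedora.qcow2                 # full path inside the image
```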
An emptyDisk works similarly to an emptyDir in Kubernetes. An extra sparse qcow2 disk will be allocated and it will live as long as the VM. Thus it will survive guest-side VM reboots, but not a VM re-creation. The disk capacity needs to be specified.
Example: Boot cirros with an extra emptyDisk with a size of 2GiB:
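The relevant part of such a VMI could be sketched like this. The cirros container disk image is a placeholder; only the 2Gi capacity comes from the text.

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      disks:
        - name: containerdisk
          disk:
            bus: virtio
        - name: emptydisk
          disk:
            bus: virtio
  volumes:
    - name: containerdisk
      containerDisk:
        image: registry.example.com/cirros-container-disk:latest   # placeholder image
    - name: emptydisk
      emptyDisk:
        capacity: 2Gi          # mandatory for an emptyDisk
```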
"},{"location":"storage/disks_and_volumes/#when-to-use-an-emptydisk","title":"When to use an emptyDisk","text":"
Ephemeral VMs very often come with read-only root images and limited tmpfs space. In many cases this is not enough to install application dependencies and provide enough disk space for the application data. While this data is not critical and thus can be lost, it is still needed for the application to function properly during its lifetime. This is where an emptyDisk can be useful. An emptyDisk is often used and mounted somewhere in /var/lib or /var/run.
A hostDisk volume type provides the ability to create or use a disk image located somewhere on a node. It works similarly to a hostPath in Kubernetes and provides two usage types:
DiskOrCreate if a disk image does not exist at a given location then create one
Disk a disk image must exist at a given location
Note: you need to enable the HostDisk feature gate.
Example: Create a 1Gi disk image located at /data/disk.img and attach it to a VM.
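A sketch of the corresponding spec fragment, assuming the HostDisk feature gate is already enabled:

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      disks:
        - name: host-disk
          disk:
            bus: virtio
  volumes:
    - name: host-disk
      hostDisk:
        capacity: 1Gi
        path: /data/disk.img
        type: DiskOrCreate     # create the image if it does not exist yet
```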
A configMap is a reference to a ConfigMap in Kubernetes. A configMap can be presented to the VM as disks or as a filesystem. Each method is described in the following sections and both have some advantages and disadvantages, e.g. disk does not support dynamic change propagation and filesystem does not support live migration. Therefore, depending on the use-case, one or the other may be more suitable.
"},{"location":"storage/disks_and_volumes/#as-a-disk","title":"As a disk","text":"
By using disk, an extra iso disk will be allocated which has to be mounted on a VM. To mount the configMap, users can use cloudInit and the disk's serial number. The name needs to be set to reference the created Kubernetes ConfigMap.
Note: Currently, ConfigMap updates are not propagated to the VMI. If a ConfigMap is updated, only a pod will be aware of the changes, not running VMIs.
Note: Due to a Kubernetes CRD issue, you cannot control the paths within the volume where ConfigMap keys are projected.
Example: Attach the configMap to a VM and use cloudInit to mount the iso disk:
"},{"location":"storage/disks_and_volumes/#as-a-filesystem","title":"As a filesystem","text":"
By using filesystem, configMaps are shared through virtiofs. In contrast with using disk for sharing configMaps, filesystem allows you to dynamically propagate changes on configMaps to VMIs (i.e. the VM does not need to be rebooted).
Note: Currently, VMIs cannot be live migrated since virtiofs does not support live migration.
To share a given configMap, the following VM definition could be used:
A secret is a reference to a Secret in Kubernetes. A secret can be presented to the VM as disks or as a filesystem. Each method is described in the following sections and both have some advantages and disadvantages, e.g. disk does not support dynamic change propagation and filesystem does not support live migration. Therefore, depending on the use-case, one or the other may be more suitable.
"},{"location":"storage/disks_and_volumes/#as-a-disk_1","title":"As a disk","text":"
By using disk, an extra iso disk will be allocated which has to be mounted on a VM. To mount the secret, users can use cloudInit and the disk's serial number. The secretName needs to be set to reference the created Kubernetes Secret.
Note: Currently, Secret update propagation is not supported. If a Secret is updated, only a pod will be aware of changes, not running VMIs.
Note: Due to a Kubernetes CRD issue, you cannot control the paths within the volume where Secret keys are projected.
Example: Attach the secret to a VM and use cloudInit to mount the iso disk:
"},{"location":"storage/disks_and_volumes/#as-a-filesystem_1","title":"As a filesystem","text":"
By using filesystem, secrets are shared through virtiofs. In contrast with using disk for sharing secrets, filesystem allows you to dynamically propagate changes on secrets to VMIs (i.e. the VM does not need to be rebooted).
Note: Currently, VMIs cannot be live migrated since virtiofs does not support live migration.
To share a given secret, the following VM definition could be used:
A serviceAccount volume references a Kubernetes ServiceAccount. A serviceAccount can be presented to the VM as disks or as a filesystem. Each method is described in the following sections and both have some advantages and disadvantages, e.g. disk does not support dynamic change propagation and filesystem does not support live migration. Therefore, depending on the use-case, one or the other may be more suitable.
"},{"location":"storage/disks_and_volumes/#as-a-disk_2","title":"As a disk","text":"
By using disk, a new iso disk will be allocated with the content of the service account (namespace, token and ca.crt), which needs to be mounted in the VM. For automatic mounting, see the configMap and secret examples above.
Note: Currently, ServiceAccount update propagation is not supported. If a ServiceAccount is updated, only a pod will be aware of changes, not running VMIs.
"},{"location":"storage/disks_and_volumes/#as-a-filesystem_2","title":"As a filesystem","text":"
By using filesystem, serviceAccounts are shared through virtiofs. In contrast with using disk for sharing serviceAccounts, filesystem allows you to dynamically propagate changes on serviceAccounts to VMIs (i.e. the VM does not need to be rebooted).
Note: Currently, VMIs cannot be live migrated since virtiofs does not support live migration.
To share a given serviceAccount, the following VM definition could be used:
downwardMetrics exposes a limited set of VM and host metrics to the guest. The format is compatible with vhostmd.
In some cases, a limited set of host and VM metrics is required so that third parties can diagnose performance issues on their appliances. One prominent example is SAP HANA.
In order to expose downwardMetrics to VMs, the methods disk and virtio-serial port are supported.
Note: The DownwardMetrics feature gate must be enabled to use the metrics. Available starting with KubeVirt v0.42.0.
This method uses a virtio-serial port to expose the metrics data to the VM. KubeVirt creates a port named /dev/virtio-ports/org.github.vhostmd.1 inside the VM, in which the Virtio Transport protocol is supported. downwardMetrics can be retrieved from this port. See vhostmd documentation under Virtio Transport for further information.
To expose the metrics using a virtio-serial port, a downwardMetrics device must be added (i.e., spec.domain.devices.downwardMetrics: {}).
vm-dump-metrics is useful as a standalone tool to verify the serial port is working and to inspect the metrics. However, applications that consume metrics will usually connect to the virtio-serial port themselves.
Note: The vm-dump-metrics tool provides the --virtio option for the case where the virtio-serial port is used. Refer to vm-dump-metrics --help for further information.
Libvirt has the ability to use IOThreads for dedicated disk access (for supported devices). These are dedicated event loop threads that perform block I/O requests and improve scalability on SMP systems. KubeVirt exposes this libvirt feature through the ioThreadsPolicy setting. Additionally, each Disk device exposes a dedicatedIOThread setting. This is a boolean that indicates the specified disk should be allocated an exclusive IOThread that will never be shared with other disks.
Currently valid policies are shared and auto. If ioThreadsPolicy is omitted entirely, use of IOThreads will be disabled. However, if any disk requests a dedicated IOThread, ioThreadsPolicy will be enabled and default to shared.
An ioThreadsPolicy of shared indicates that KubeVirt should use one thread that will be shared by all disk devices. This policy stems from the fact that a large number of IOThreads is generally not useful, because additional context switching is incurred for each thread.
Disks with dedicatedIOThread set to true will not use the shared thread, but will instead be allocated an exclusive thread. This is generally useful if a specific Disk is expected to have heavy I/O traffic, e.g. a database spindle.
auto IOThreads indicates that KubeVirt should use a pool of IOThreads and allocate disks to IOThreads in a round-robin fashion. The pool size is generally limited to twice the number of vCPUs allocated to the VM. This essentially attempts to dedicate disks to separate IOThreads, but only up to a reasonable limit. This comes into play, for instance, on systems with a large number of disks and a smaller number of CPUs.
As a caveat to the size of the IOThread pool, disks with dedicatedIOThread will always be guaranteed their own thread. This effectively diminishes the upper limit of the number of threads allocated to the rest of the disks. For example, a VM with 2 CPUs would normally use 4 IOThreads for all disks. However, if one disk had dedicatedIOThread set to true, then KubeVirt would only use 3 IOThreads for the shared pool.
There is always guaranteed to be at least one thread for disks that will use the shared IOThreads pool. Thus if a sufficiently large number of disks have dedicated IOThreads assigned, auto and shared policies would essentially result in the same layout.
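The thread arithmetic described above can be sketched in Python. This is only an illustration of the rules stated in this section, not KubeVirt's implementation; the function name and the shape of the return value are hypothetical.

```python
def iothread_layout(policy, vcpus, dedicated_disks):
    # Returns (pool_threads, dedicated_threads), where pool_threads is the
    # upper limit of threads available to disks without a dedicated IOThread.
    # Hypothetical helper that mirrors the prose, not a KubeVirt API.
    if policy is None:
        # Omitted policy: IOThreads are disabled, unless a disk requests a
        # dedicated IOThread, in which case the policy defaults to shared.
        if dedicated_disks == 0:
            return (0, 0)
        policy = 'shared'
    if policy == 'shared':
        # One thread is shared by every disk without a dedicated IOThread.
        return (1, dedicated_disks)
    if policy == 'auto':
        # The pool is limited to twice the vCPU count. Dedicated threads
        # reduce what is left for the pool, but one thread always remains.
        return (max(1, 2 * vcpus - dedicated_disks), dedicated_disks)
    raise ValueError('unknown ioThreadsPolicy: ' + str(policy))


# A VM with 2 vCPUs and one disk that requests a dedicated IOThread.
print(iothread_layout('auto', 2, 1))
```

With enough dedicated disks the pool shrinks to a single thread, which is why auto and shared end up with the same layout in that case.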
"},{"location":"storage/disks_and_volumes/#iothreads-with-dedicated-pinned-cpus","title":"IOThreads with Dedicated (pinned) CPUs","text":"
When a guest's vCPUs are pinned to a host's physical CPUs, it is also best to pin the IOThreads to specific CPUs to prevent them from floating between the CPUs. KubeVirt will automatically calculate and pin each IOThread to a CPU or a set of CPUs, depending on the ratio between them. If there are more IOThreads than CPUs, each IOThread will be pinned to a single CPU, in a round-robin fashion. Otherwise, when there are fewer IOThreads than CPUs, each IOThread will be pinned to a set of CPUs.
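A minimal Python sketch of the two pinning cases follows. The text does not say how the CPU sets are formed when there are fewer IOThreads than CPUs, so the contiguous, near-equal slices used here are an assumption, and pin_iothreads is a hypothetical helper rather than KubeVirt code.

```python
def pin_iothreads(num_iothreads, cpus):
    # cpus: list of host CPU ids available for pinning.
    # Returns a dict mapping IOThread index to a list of CPU ids.
    if num_iothreads <= 0 or not cpus:
        return {}
    pinning = {}
    if num_iothreads >= len(cpus):
        # At least as many IOThreads as CPUs: one CPU each, round-robin.
        for i in range(num_iothreads):
            pinning[i] = [cpus[i % len(cpus)]]
    else:
        # Fewer IOThreads than CPUs: each thread gets a set of CPUs.
        # ASSUMPTION: contiguous slices of near-equal size.
        base, extra = divmod(len(cpus), num_iothreads)
        start = 0
        for i in range(num_iothreads):
            size = base + (1 if i < extra else 0)
            pinning[i] = cpus[start:start + size]
            start += size
    return pinning


print(pin_iothreads(3, [4, 5]))           # more threads than CPUs
print(pin_iothreads(2, [0, 1, 2, 3, 4]))  # fewer threads than CPUs
```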
"},{"location":"storage/disks_and_volumes/#iothreads-with-qemu-emulator-thread-and-dedicated-pinned-cpus","title":"IOThreads with QEMU Emulator thread and Dedicated (pinned) CPUs","text":"
To further improve vCPU latency, KubeVirt can allocate an additional dedicated physical CPU, exclusively for the emulator thread, to which it will be pinned. This will effectively \"isolate\" the emulator thread from the vCPUs of the VMI. When ioThreadsPolicy is set to auto, IOThreads will also be \"isolated\" from the vCPUs and placed on the same physical CPU as the QEMU emulator thread.
This VM is identical to the first, except it requests auto IOThreads. emptydisk and emptydisk2 will still be allocated individual IOThreads, but the rest of the disks will be split across 2 separate IOThreads (twice the number of CPU cores is 4, and 2 of those threads are dedicated).
Block Multi-Queue is a framework for the Linux block layer that maps device I/O queries to multiple queues. This splits I/O processing up across multiple threads, and therefore multiple CPUs. libvirt recommends that the number of queues used should match the number of CPUs allocated for optimal performance.
This feature is enabled by the BlockMultiQueue setting under Devices:
Note: Due to the way KubeVirt implements CPU allocation, blockMultiQueue can only be used if a specific CPU allocation is requested. If a specific number of CPUs hasn't been allocated to a VirtualMachine, KubeVirt will use all CPUs on the node on a best-effort basis. In that case the amount of CPU allocated to a VM at the host level could change over time. If blockMultiQueue were to request a number of queues to match all the CPUs on a node, that could lead to over-allocation scenarios. To avoid this, KubeVirt enforces that a specific slice of CPU resources is requested in order to take advantage of this feature.
KubeVirt supports none, writeback, and writethrough KVM/QEMU cache modes.
none I/O from the guest is not cached on the host. Use this option for guests with large I/O requirements. This option is generally the best choice.
writeback I/O from the guest is cached on the host and written through to the physical media when the guest OS issues a flush.
writethrough I/O from the guest is cached on the host but must be written through to the physical medium before the write operation completes.
Important: none cache mode is set as default if the file system supports direct I/O, otherwise, writethrough is used.
Note: It is possible to force a specific cache mode. However, if none mode has been chosen and the file system does not support direct I/O, starting the VMI will return an error.
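The default and the error case described above amount to a small decision rule, sketched here in Python. The function name and the supports_direct_io parameter are hypothetical; the sketch only restates the rule and does not reflect how KubeVirt detects direct I/O support.

```python
VALID_CACHE_MODES = ('none', 'writeback', 'writethrough')


def effective_cache_mode(requested, supports_direct_io):
    # requested: cache mode set on the disk, or None when it is omitted.
    # supports_direct_io: whether the backing file system supports direct I/O.
    if requested is None:
        # Default: none when direct I/O is available, otherwise writethrough.
        return 'none' if supports_direct_io else 'writethrough'
    if requested not in VALID_CACHE_MODES:
        raise ValueError('unsupported cache mode: ' + str(requested))
    if requested == 'none' and not supports_direct_io:
        # Forcing none without direct I/O support is the error case.
        raise RuntimeError('cache mode none requires direct I/O support')
    return requested


print(effective_cache_mode(None, False))
```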
Shareable disks allow multiple VMs to share the same underlying storage. In order to use this feature, special care is required because this could lead to data corruption and the loss of important data. Shareable disks demand either data synchronization at the application level or the use of clustered filesystems. These advanced configurations are not within the scope of this documentation and are use-case specific.
If the shareable option is set, it indicates to libvirt/QEMU that the disk is going to be accessed by multiple VMs and not to create a lock for the writes.
In this example, we use Rook Ceph to dynamically provision the PVC.
We can now attempt to write a string from the first guest and then read the string from the second guest to test that the sharing is working.
$ virtctl console vm-block-1\n$ printf \"Test awesome shareable disks\" | sudo dd of=/dev/vdc bs=1 count=150 conv=notrunc\n28+0 records in\n28+0 records out\n28 bytes copied, 0.0264182 s, 1.1 kB/s\n# Log into the second guest\n$ virtctl console vm-block-2\n$ sudo dd if=/dev/vdc bs=1 count=150 conv=notrunc\nTest awesome shareable disks150+0 records in\n150+0 records out\n150 bytes copied, 0.136753 s, 1.1 kB/s\n
If you are using local devices or RWO PVCs, setting the affinity on the VMs that share the storage guarantees they will be scheduled on the same node. In the example, we set the affinity on the second VM using the label used on the first VM. If you are using shared storage with RWX PVCs, then the affinity rule is not necessary as the storage can be attached simultaneously on multiple nodes.
"},{"location":"storage/disks_and_volumes/#sharing-directories-with-vms","title":"Sharing Directories with VMs","text":"
Virtiofs makes external filesystems visible to KubeVirt VMs. Virtiofs is a shared file system that lets VMs access a directory tree on the host. Further details can be found at the Official Virtiofs Site.
"},{"location":"storage/disks_and_volumes/#non-privileged-and-privileged-sharing-modes","title":"Non-Privileged and Privileged Sharing Modes","text":"
KubeVirt supports two PVC sharing modes: non-privileged and privileged.
The non-privileged mode is enabled by default. This mode has the advantage of not requiring any administrative privileges for creating the VM. However, it has some limitations:
The virtiofsd daemon (the daemon in charge of sharing the PVC with the VM) will run with the QEMU UID/GID (107), and cannot switch between different UIDs/GIDs. Therefore, it will only have access to directories and files that UID/GID 107 has permission to access. Additionally, new files will always be created with QEMU's UID/GID, regardless of the UID/GID of the process within the guest.
Extended attributes are not supported.
To switch to the privileged mode, the feature gate ExperimentalVirtiofsSupport has to be enabled. Take into account that this mode requires privileges to run rootful containers.
"},{"location":"storage/disks_and_volumes/#configuration-inside-the-vm","title":"Configuration Inside the VM","text":"
The following configuration can be done using a startup script. See the cloudInitNoCloud section for more details. However, we can also do it manually by logging in to the VM and mounting it. Here are examples of how to mount it in Linux and Windows VMs:
Using hostPaths is also allowed. The following configuration example is shown for illustrative purposes. However, the PVC method is preferred since using hostPath is generally discouraged for security reasons.
"},{"location":"storage/disks_and_volumes/#configuration-inside-the-node","title":"Configuration Inside the Node","text":"
To share the directory with the VMs, we need to log in to the node, create the shared directory (if it does not already exist), and set the proper SELinux context label container_file_t to the shared directory. In this example we are going to share a new directory /mnt/data (if the desired directory is an existing one, you can skip the mkdir command):
Note: If you are attempting to share an existing directory, you must first check the SELinux context label with the command ls -Z <directory>. In the case that the label is not present or is not container_file_t you need to label it with the chcon command.
The updateVolumesStrategy field is used to specify the strategy for updating the volumes of a running VM. The following strategies are supported:
* Replacement: the updated volumes will replace the old ones when the VM restarts.
* Migration: updating the volumes triggers a storage migration from the old volumes to the new ones. More details about volume migration can be found in the volume migration documentation.
Volume migration depends on the VolumesUpdateStrategy feature gate, which in turn depends on the VMLiveUpdateFeatures feature gate and configuration.
It can be desirable to export a Virtual Machine and its related disks out of a cluster so you can import that Virtual Machine into another system or cluster. The Virtual Machine disks are the most prominent things you will want to export. The export API makes it possible to declaratively export Virtual Machine disks. It is also possible to export individual PVCs and their contents, for instance when you have created a memory dump from a VM or are using virtio-fs to have a Virtual Machine populate a PVC.
In order not to overload the Kubernetes API server, the data is transferred through a dedicated export proxy server. The proxy server can then be exposed to the outside world through a service associated with an Ingress/Route or NodePort. As an alternative, the port-forward flag can be used with the virtctl integration to bypass the need for an Ingress/Route.
VMExport support must be enabled in the feature gates to be available. To enable it, add VMExport to the feature gates field in the KubeVirt CR.
In order to securely export a Virtual Machine Disk, you must create a token that is used to authorize users accessing the export endpoint. This token must be stored in a secret in the same namespace as the Virtual Machine. The contents of the secret can be passed as a token header or parameter to the export URL. The name of the header or argument is x-kubevirt-export-token, with a value that matches the content of the secret. The secret can have any valid secret name in the namespace. We recommend you generate an alphanumeric token of at least 12 characters. The data key should be token. For example:
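One way to produce a token that meets the recommendation above is sketched below in Python, using the standard secrets module. The helper names and the default length of 16 are illustrative choices, not part of KubeVirt; only the x-kubevirt-export-token header name comes from the text.

```python
import secrets
import string

# Alphanumeric alphabet, as recommended for export tokens.
ALPHABET = string.ascii_letters + string.digits


def generate_export_token(length=16):
    # Hypothetical helper: builds a random alphanumeric token.
    if length < 12:
        raise ValueError('token should be at least 12 characters')
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))


def export_headers(token):
    # Header to send with requests against the export URL.
    return {'x-kubevirt-export-token': token}


token = generate_export_token()
print(len(token), token.isalnum())
```

The generated value would then be stored under the token data key of the secret.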
After you have created the token you can now create a VMExport CR that identifies the Virtual Machine you want to export. You can create a VMExport that looks like this:
The following volumes present in the VM will be exported:
PersistentVolumeClaims
DataVolumes
MemoryDump
All other volume types are not exported. To avoid the export of inconsistent data, a Virtual Machine can only be exported while it is powered off. Any active VM exports will be terminated if the Virtual Machine is started. To export data from a running Virtual Machine you must first create a Virtual Machine Snapshot (see below).
If the VM contains multiple volumes that can be exported, each volume will get its own URL links. If the VM contains no volumes that can be exported, the VMExport will go into a Skipped phase, and no export server is started.
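The selection rule above can be sketched in Python as a filter over the volume list of a VM. The key names persistentVolumeClaim, dataVolume and memoryDump are assumed to match the volume source fields of the VM spec, and both helpers are hypothetical illustrations rather than the export controller's code.

```python
# Volume sources that are exported; every other volume type is ignored.
# ASSUMPTION: these keys match the volume source fields in the VM spec.
EXPORTABLE_SOURCES = ('persistentVolumeClaim', 'dataVolume', 'memoryDump')


def exportable_volumes(volumes):
    # volumes: list of dicts, each with a name and one volume source key.
    return [
        vol['name']
        for vol in volumes
        if any(source in vol for source in EXPORTABLE_SOURCES)
    ]


def export_is_skipped(volumes):
    # With nothing to export, the VMExport goes into the Skipped phase.
    return len(exportable_volumes(volumes)) == 0


volumes = [
    {'name': 'rootdisk', 'dataVolume': {'name': 'example-dv'}},
    {'name': 'scratch', 'emptyDisk': {'capacity': '2Gi'}},
]
print(exportable_volumes(volumes), export_is_skipped(volumes))
```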
When you create a VMExport based on a Virtual Machine Snapshot, the controller will attempt to create PVCs from the volume snapshots contained in the Virtual Machine Snapshot. Once all the PVCs are ready, the export server will start and you can begin the export. If the Virtual Machine Snapshot contains multiple volumes that can be exported, each volume will get its own URL links. If the Virtual Machine Snapshot contains no volumes that can be exported, the VMExport will go into a Skipped phase, and no export server is started.
In this example the PVC name is example-pvc. Note the PVC doesn't need to contain a Virtual Machine Disk, it can contain any content, but the main use case is exporting Virtual Machine Disks. After you post this yaml to the cluster, a new export server is created in the same namespace as the PVC. If the source PVC is in use by another pod (such as the virt-launcher pod) then the export will remain pending until the PVC is no longer in use. If the exporter server is active and another pod starts using the PVC, the exporter server will be terminated until the PVC is not in use anymore.
"},{"location":"storage/export_api/#export-status-links","title":"Export status links","text":"
The VirtualMachineExport CR will contain a status with internal and external links to the export service. The internal links are only valid inside the cluster, and the external links are valid for external access through an Ingress or Route. The cert field will contain the CA that signed the certificate of the export server for internal links, or the CA that signed the Route or Ingress.
The following is an example of exporting a PVC that contains a KubeVirt disk image. The controller determines if the PVC contains a kubevirt disk by checking if there is a special annotation on the PVC, or if there is a DataVolume ownerReference on the PVC, or if the PVC has a volumeMode of block.
Archive content-type is automatically selected if we are unable to determine the PVC contains a KubeVirt disk. The archive will contain all the files that are in the PVC.
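The three checks can be written as a short Python predicate over a PVC object represented as a dict. The text does not name the special annotation, so it is passed in as a parameter; the function names and the returned labels are hypothetical, and this is a sketch of the described logic rather than the controller's implementation.

```python
def is_kubevirt_disk(pvc, marker_annotation):
    # pvc: dict shaped like a PersistentVolumeClaim object.
    # marker_annotation: key of the special annotation (not named in the
    # text, so the caller has to supply it).
    metadata = pvc.get('metadata') or {}
    if marker_annotation in (metadata.get('annotations') or {}):
        return True
    for owner in metadata.get('ownerReferences') or []:
        if owner.get('kind') == 'DataVolume':
            return True
    return (pvc.get('spec') or {}).get('volumeMode') == 'Block'


def content_kind(pvc, marker_annotation):
    # Falls back to archive when the PVC is not recognised as a disk.
    if is_kubevirt_disk(pvc, marker_annotation):
        return 'kubevirt disk'
    return 'archive'


pvc = {'metadata': {'name': 'example-pvc'}, 'spec': {'volumeMode': 'Filesystem'}}
print(content_kind(pvc, 'example.test/placeholder-annotation'))
```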
The VirtualMachine manifests can be retrieved by accessing the manifests in the VirtualMachineExport status. The all type will return the VirtualMachine manifest, any DataVolumes, and a configMap that contains the public CA certificate of the Ingress/Route of the external URL, or the CA of the export server of the internal URL. The auth-header-secret will be a secret that contains a Containerized Data Importer (CDI) compatible header. This header contains a text version of the export token.
Both internal and external links will contain a manifests field. If there are no external links, then there will not be any external manifests either. The virtualMachine manifests field is only available if the source is a VirtualMachine or VirtualMachineSnapshot. Exporting a PersistentVolumeClaim will not generate a Virtual Machine manifest.
Gzip. The raw KubeVirt disk image, gzipped to make transfers more efficient.
Dir. A directory listing, allowing you to find the files contained in the PVC.
Tar.gz. The contents of the PVC tarred and gzipped in a single file.
Raw and Gzip will be selected if the PVC is determined to be a KubeVirt disk. KubeVirt disks contain a single disk.img file (or are a block device). Dir will return a list of the files in the PVC. To download a specific file, replace /dir in the URL with the path and file name. For instance, if the PVC contains the file /example/data.txt, replace /dir with /example/data.txt to download just the data.txt file. Alternatively, use the tar.gz URL to get all the contents of the PVC in a single tar file.
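The /dir substitution can be sketched in Python as follows. The URL used in the example is made up for illustration, and file_url_from_dir_url is a hypothetical helper; real links come from the VirtualMachineExport status.

```python
def file_url_from_dir_url(dir_url, file_path):
    # Replaces the trailing /dir of a Dir link with the path of one file.
    suffix = '/dir'
    if not dir_url.endswith(suffix):
        raise ValueError('expected a URL ending in /dir')
    if not file_path.startswith('/'):
        file_path = '/' + file_path
    return dir_url[:-len(suffix)] + file_path


# Hypothetical Dir link, used only to show the substitution.
dir_url = 'https://export.example.test/volumes/example-disk/dir'
print(file_url_from_dir_url(dir_url, '/example/data.txt'))
```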
"},{"location":"storage/export_api/#internal-link-certificates","title":"Internal link certificates","text":"
The export server certificate is valid for 7 days after which it is rotated by deleting the export server pod and associated secret and generating a new one. If for whatever reason the export server pod dies, the associated secret is also automatically deleted and a new pod and secret are generated. The VirtualMachineExport object status will be automatically updated to reflect the new certificate.
"},{"location":"storage/export_api/#external-link-certificates","title":"External link certificates","text":"
The external link certificates are associated with the Ingress/Route that points to the service created by the KubeVirt operator. The CA that signed the Ingress/Route will be part of the certificates.
"},{"location":"storage/export_api/#ttl-time-to-live-for-an-export","title":"TTL (Time to live) for an Export","text":"
For various reasons (security being one), users should be able to specify a TTL for the VMExport objects that limits the lifetime of an export. This is done via the ttlDuration field, which accepts a Kubernetes duration and defaults to 2 hours when not specified.
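To make the effect of ttlDuration concrete, the Python sketch below computes when an export would reach the end of its lifetime. It assumes the TTL is counted from the creation time and only parses a small subset of duration strings (whole hours, minutes and seconds, such as 1h or 1h30m); both helpers are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Default lifetime when ttlDuration is not specified.
DEFAULT_TTL = timedelta(hours=2)
UNITS = {'h': 'hours', 'm': 'minutes', 's': 'seconds'}


def parse_ttl(value):
    # Parses a subset of duration strings: integer h, m and s components.
    if value is None:
        return DEFAULT_TTL
    parts = re.findall('([0-9]+)([hms])', value)
    if not parts or ''.join(n + u for n, u in parts) != value:
        raise ValueError('unsupported duration: ' + value)
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{UNITS[unit]: int(number)})
    return total


def expires_at(created, ttl_duration=None):
    # ASSUMPTION: the lifetime starts at the creation timestamp.
    return created + parse_ttl(ttl_duration)


created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(expires_at(created, '1h'))
```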
# Creates a VMExport object according to the specified flag.\n\n# The flag should either be:\n\n# --pvc, to specify the name of the pvc to export.\n# --snapshot, to specify the name of the VM snapshot to export.\n# --vm, to specify the name of the Virtual Machine to export.\n\n$ virtctl vmexport create name [flags]\n
# Downloads a volume from the defined VMExport object.\n\n# The main available flags are:\n\n# --output, mandatory flag to specify the output file.\n# --volume, optional flag to specify the name of the downloadable volume.\n# --vm|--snapshot|--pvc, if specified, are used to create the VMExport object assuming it doesn't exist. The name of the object to export has to be specified.\n# --format, optional flag to specify whether to download the file in compressed (default) or raw format.\n# --port-forward, optional flag to easily download the volume without the need of an ingress or route. Also, the local port can be optionally specified with the --local-port flag.\n\n$ virtctl vmexport download name [flags]\n
By default, the volume will be downloaded in compressed format. Users can specify the desired format (gzip or raw) by using the --format flag, as shown below:
# Downloads a volume from the defined VMExport object and, if necessary, decompresses it.\n$ virtctl vmexport download name --format=raw [flags]\n
"},{"location":"storage/export_api/#ttl-time-to-live","title":"TTL (Time to live)","text":"
A TTL can also be set when creating a VMExport via virtctl:
$ virtctl vmexport create name --ttl=1h\n
For more information about usage and examples:
$ virtctl vmexport --help\n\nExport a VM volume.\n\nUsage:\n virtctl vmexport [flags]\n\nExamples:\n # Create a VirtualMachineExport to export a volume from a virtual machine:\n virtctl vmexport create vm1-export --vm=vm1\n\n # Create a VirtualMachineExport to export a volume from a virtual machine snapshot\n virtctl vmexport create snap1-export --snapshot=snap1\n\n # Create a VirtualMachineExport to export a volume from a PVC\n virtctl vmexport create pvc1-export --pvc=pvc1\n\n # Delete a VirtualMachineExport resource\n virtctl vmexport delete snap1-export\n\n # Download a volume from an already existing VirtualMachineExport (--volume is optional when only one volume is available)\n virtctl vmexport download vm1-export --volume=volume1 --output=disk.img.gz\n\n # Create a VirtualMachineExport and download the requested volume from it\n virtctl vmexport download vm1-export --vm=vm1 --volume=volume1 --output=disk.img.gz\n\nFlags:\n -h, --help help for vmexport\n --insecure When used with the 'download' option, specifies that the http request should be insecure.\n --keep-vme When used with the 'download' option, specifies that the vmexport object should not be deleted after the download finishes.\n --output string Specifies the output path of the volume to be downloaded.\n --pvc string Sets PersistentVolumeClaim as vmexport kind and specifies the PVC name.\n --snapshot string Sets VirtualMachineSnapshot as vmexport kind and specifies the snapshot name.\n --vm string Sets VirtualMachine as vmexport kind and specifies the vm name.\n --volume string Specifies the volume to be downloaded.\n\nUse \"virtctl options\" for a list of global command-line options (applies to all commands).\n
"},{"location":"storage/export_api/#use-cases","title":"Use cases","text":""},{"location":"storage/export_api/#clone-vm-from-one-cluster-to-another-cluster","title":"Clone VM from one cluster to another cluster","text":"
If you want to transfer KubeVirt disk images from a source cluster to another target cluster, you can use the VMExport in the source to expose the disks and use Containerized Data Importer (CDI) in the target cluster to import the image into the target cluster. Let's assume we have an Ingress or Route in the source cluster that exposes the export proxy with the following example domain virt-exportproxy-example.example.com and we have a Virtual Machine in the source cluster with one disk, which looks like this:
This is a VM that has a DataVolume (DV) example-dv that is populated from a container disk and we want to export that disk to the target cluster. To export this VM we have to create a token that we can use in the target cluster to get access to the export, or we can let the export controller generate one for us. For example
Note that in this example we are in the example namespace in the source cluster, which is why the internal links domain ends with .example.svc. The external links are what will be visible outside of the source cluster, so we can use them when we import into the target cluster.
Now we are ready to import this disk into the target cluster. In order for CDI to import, we will need to provide appropriate yaml that contains the following:
- CA cert (as config map)
- The token needed to access the disk images in a CDI compatible format
- The VM yaml
- DataVolume yaml (optional if not part of the VM definition)
virtctl provides an additional argument to the download command called --manifest that will retrieve the appropriate information from the export server, and either save it to a file with the --output argument or write to standard out. By default this output will not contain the header secret as it contains the token in plaintext. To get the header secret you specify the --include-secret argument. The default output format is yaml but it is possible to get json output as well.
Assume there is a running VirtualMachineExport called example-export, the same namespace exists in the target cluster, and the kubeconfig of the target cluster is named kubeconfig-target. To clone the VM into the target cluster, run the following commands:
The first command generates the yaml and writes it to import.yaml. The second command applies the generated yaml to the target cluster. It is possible to combine the two commands writing to standard out with the first command, and piping it into the second command. Use this option if the export token should not be written to a file anywhere. This will create the VM in the target cluster, and provides CDI in the target cluster with everything required to import the disk images.
After the import completes you should be able to start the VM in the target cluster.
"},{"location":"storage/export_api/#download-a-vm-volume-locally-using-virtctl-vmexport","title":"Download a VM volume locally using virtctl vmexport","text":"
Several steps from the previous section can be simplified considerably by using the vmexport command.
Again, let's assume we have an Ingress or Route in our cluster that exposes the export proxy, and that we have a Virtual Machine in the cluster with one disk like this:
Once we meet these requirements, the process of downloading the volume locally can be accomplished by different means:
"},{"location":"storage/export_api/#performing-each-step-separately","title":"Performing each step separately","text":"
We can download the volume by performing every single step in a different command. We start by creating the export object:
# We use an arbitrary name for the VMExport object, but specify our VM name in the flag.\n\n$ virtctl vmexport create vmexportname --vm=example-vm\n
Then, we download the volume in the specified output:
# Since our virtual machine only has one volume, there's no need to specify the volume name with the --volume flag.\n\n# After the download, the VMExport object is deleted by default, so we are using the optional --keep-vme flag to keep it and delete it manually afterwards.\n\n$ virtctl vmexport download vmexportname --output=/tmp/disk.img --keep-vme\n
Lastly, we delete the VMExport object:
$ virtctl vmexport delete vmexportname\n
"},{"location":"storage/export_api/#performing-one-single-step","title":"Performing one single step","text":"
All the previous steps can be combined into a single command:
# Since we are using a create flag (--vm) with download, the command creates the object assuming the VMExport doesn't exist.\n\n# Also, since we are not using --keep-vme, the VMExport object is deleted after the download.\n\n$ virtctl vmexport download vmexportname --vm=example-vm --output=/tmp/disk.img\n
After the download finishes, we can find our disk in /tmp/disk.img.
"},{"location":"storage/guestfs/","title":"Usage of libguestfs-tools and virtctl guestfs","text":"
Libguestfs tools are a set of utilities for accessing and modifying VM disk images. The command virtctl guestfs helps to deploy an interactive container with the libguestfs-tools and the PVC attached to it. This command is particularly useful if the users need to modify, inspect or debug VM disks on a PVC.
$ virtctl guestfs -h\nCreate a pod with libguestfs-tools, mount the pvc and attach a shell to it. The pvc is mounted under the /disks directory inside the pod for filesystem-based pvcs, or as /dev/vda for block-based pvcs\n\nUsage:\n virtctl guestfs [flags]\n\nExamples:\n # Create a pod with libguestfs-tools, mount the pvc and attach a shell to it:\n virtctl guestfs <pvc-name>\n\nFlags:\n -h, --help help for guestfs\n --image string libguestfs-tools container image\n --kvm Use kvm for the libguestfs-tools container (default true)\n --pull-policy string pull policy for the libguestfs image (default \"IfNotPresent\")\n\nUse \"virtctl options\" for a list of global command-line options (applies to all commands).\n
By default virtctl guestfs sets up kvm for the interactive container. This considerably speeds up the execution of the libguestfs-tools since they use QEMU. If the cluster doesn't have any kvm supporting nodes, the user must disable kvm by setting the option --kvm=false. If not set, the libguestfs-tools pod will remain pending because it cannot be scheduled on any node.
The command automatically uses the image exposed by KubeVirt under the http endpoint /apis/subresources.kubevirt.io/<kubevirt-version>/guestfs, but it can be configured to use a custom image by using the option --image. Users can also override the pull policy of the image by setting the option --pull-policy.
The command checks whether the PVC is in use by another pod and fails if it is. However, once libguestfs-tools has started, the setup doesn't prevent a new pod from starting and using the same PVC. The user needs to verify that there are no active virtctl guestfs pods before starting a VM which accesses the same PVC.
Currently, virtctl guestfs supports only a single PVC. Future versions might support multiple PVCs attached to the interactive pod.
"},{"location":"storage/guestfs/#examples-and-use-cases","title":"Examples and use-cases","text":"
Generally, the user can take advantage of the virtctl guestfs command for all typical usage of libguestfs-tools. It is strongly recommended to consult the official documentation. This command simply aims to help in configuring the correct containerized environment in the Kubernetes cluster where KubeVirt is installed.
For all the examples, the user has to start the interactive container by referencing the PVC in the virtctl guestfs command. This will deploy the interactive pod and attach the stdin and stdout.
Example:
$ virtctl guestfs pvc-test\nUse image: registry:5000/kubevirt/libguestfs-tools@sha256:6644792751b2ba9442e06475a809448b37d02d1937dbd15ad8da4d424b5c87dd \nThe PVC has been mounted at /disk \nWaiting for container libguestfs still in pending, reason: ContainerCreating, message: \nWaiting for container libguestfs still in pending, reason: ContainerCreating, message: \nbash-5.0#\n
Once the libguestfs-tools pod has been deployed, the user can access the disk and execute the desired commands. Later, once the user has completed the operations on the disk, simply exit the container and the pod will be automatically terminated.
Inspect the disk filesystem to retrieve the version of the OS on the disk:
KubeVirt now supports hotplugging volumes into a running Virtual Machine Instance (VMI). The volume must be either a block volume or contain a disk image. When a VM that has hotplugged volumes is rebooted, the hotplugged volumes will be attached to the restarted VM. If the volumes are persisted, they will become part of the VM spec and will no longer be considered hotplugged. If they are not persisted, the volumes will be reattached as hotplugged volumes.
Hotplug volume support must be enabled through a feature gate: add HotplugVolumes to the feature gates field in the KubeVirt CR.
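As a minimal sketch of what enabling the gate might look like, the manifest below adds HotplugVolumes to the feature gates list. It assumes the KubeVirt CR is named kubevirt and lives in the kubevirt namespace; adjust both to your installation.

```yaml
# Sketch only: CR name and namespace are assumptions about the installation.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        # Feature gate names are case sensitive.
        - HotplugVolumes
```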
In order to hotplug a volume, you must first prepare a volume. This can be done by using a DataVolume (DV). In the example we will use a blank DV in order to add some extra storage to a running VMI.
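A blank DataVolume for this purpose could be sketched as follows. The name example-volume-hotplug and the 5Gi size are illustrative assumptions, and the cdi.kubevirt.io/v1beta1 API version assumes a reasonably recent CDI release.

```yaml
# Sketch only: name and size are hypothetical placeholders.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: example-volume-hotplug
spec:
  source:
    # An empty disk, to be formatted from inside the guest.
    blank: {}
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 5Gi
```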
In this example we are using ReadWriteOnce accessMode, and the default FileSystem volume mode. Volume hotplugging supports all combinations of block volume mode and ReadWriteMany/ReadWriteOnce/ReadOnlyMany accessModes, if your storage supports the combination."},{"location":"storage/hotplug_volumes/#addvolume","title":"Addvolume","text":"
Now let's assume we have started a VMI like the Fedora VMI in examples and the name of the VMI is 'vmi-fedora'. We can add the above blank volume to this running VMI by using the 'addvolume' command available with virtctl.
This will hotplug the volume into the running VMI, and set the serial of the disk to the volume name. In this example it is set to example-hotplug-volume.
The bus of a hotplugged disk is specified as scsi. Why is it not specified as virtio instead, like regular disks? The reason is a limitation of virtio disks: each disk uses a PCIe slot in the virtual machine and there is a maximum of 32 slots. This puts a low limit on the number of disks you can hotplug, especially given that other devices also need PCIe slots. Another issue is that these slots need to be reserved ahead of time, so if the number of hotplugged disks is not known in advance, it is impossible to reserve the required number of slots. To work around this issue, each VM has a virtio-scsi controller, which allows the use of a scsi bus for hotplugged disks. This controller allows for hotplugging of over 4 million disks. virtio-scsi is very close in performance to virtio.
The serial is visible inside the guest, so you can use it to identify the disk there. For instance, in Fedora the disk entry under /dev/disk/by-id contains the serial.
$ virtctl console vmi-fedora\n\nFedora 32 (Cloud Edition)\nKernel 5.6.6-300.fc32.x86_64 on an x86_64 (ttyS0)\n\nSSH host key: SHA256:c8ik1A9F4E7AxVrd6eE3vMNOcMcp6qBxsf8K30oC/C8 (ECDSA)\nSSH host key: SHA256:fOAKptNAH2NWGo2XhkaEtFHvOMfypv2t6KIPANev090 (ED25519)\neth0: 10.244.196.144 fe80::d8b7:51ff:fec4:7099\nvmi-fedora login:fedora\nPassword:fedora\n[fedora@vmi-fedora ~]$ ls /dev/disk/by-id\nscsi-0QEMU_QEMU_HARDDISK_1234567890\n[fedora@vmi-fedora ~]$ \n
As you can see the serial is part of the disk name, so you can uniquely identify it.
The format and length of serials are specified according to the libvirt documentation:
If present, this specify serial number of virtual hard drive. For example, it may look like <serial>WD-WMAP9A966149</serial>. Not supported for scsi-block devices, that is those using disk type 'block' using device 'lun' on bus 'scsi'. Since 0.7.1\n\n Note that depending on hypervisor and device type the serial number may be truncated silently. IDE/SATA devices are commonly limited to 20 characters. SCSI devices depending on hypervisor version are limited to 20, 36 or 247 characters.\n\n Hypervisors may also start rejecting overly long serials instead of truncating them in the future so it's advised to avoid the implicit truncation by testing the desired serial length range with the desired device and hypervisor combination.\n
"},{"location":"storage/hotplug_volumes/#supported-disk-types","title":"Supported Disk types","text":"
KubeVirt supports hotplugging disk devices of type disk and lun. As with other volumes, using type disk will expose the hotplugged volume as a regular disk, while using lun allows additional functionalities like the execution of iSCSI commands.
You can specify the desired type by using the --disk-type parameter, for example:
# Allowed values are lun and disk. If no option is specified, we use disk by default.\n$ virtctl addvolume vmi-fedora --volume-name=example-lun-hotplug --disk-type=lun\n
"},{"location":"storage/hotplug_volumes/#retain-hotplugged-volumes-after-restart","title":"Retain hotplugged volumes after restart","text":"
In many cases it is desirable to keep hotplugged volumes after a VM restart. It may also be desirable to be able to unplug these volumes after the restart. The persist option makes it impossible to unplug the disks after a restart. If you don't specify persist the default behaviour is to retain hotplugged volumes as hotplugged volumes after a VM restart. This makes the persist flag mostly obsolete unless you want to make a volume permanent on restart.
In some cases you want a hotplugged volume to become part of the standard disks after a restart of the VM, for instance if you added some permanent storage to the VM. We also assume that the running VMI has a matching VM that defines its specification. You can call the addvolume command with the --persist flag. This will update the VM domain disks section in addition to updating the VMI domain disks. This means that when you restart the VM, the disk is already defined in the VM, and thus in the new VMI.
VMI objects have a new status.VolumeStatus field. This is an array containing each disk, hotplugged or not. For example, after hotplugging the volume in the addvolume example, the VMI status will contain this:
Vda is the container disk that contains the Fedora OS, vdb is the cloudinit disk. As you can see those just contain the name and target used when assigning them to the VM. The target is the value passed to QEMU when specifying the disks. The value is unique for the VM and does NOT represent the naming inside the guest. For instance for a Windows Guest OS the target has no meaning. The same will be true for hotplugged volumes. The target is just a unique identifier meant for QEMU, inside the guest the disk can be assigned a different name.
The hotplugVolume has some extra information that regular volume statuses do not have. The attachPodName is the name of the pod that was used to attach the volume to the node the VMI is running on. If this pod is deleted it will also stop the VMI as we cannot guarantee the volume will remain attached to the node. The other fields are similar to conditions and indicate the status of the hot plug process. Once a Volume is ready it can be used by the VM.
Currently Live Migration is enabled for any VMI that has volumes hotplugged into it.
NOTE: There is a known issue where the migration may fail for VMIs with hotplugged block volumes if the target node uses CPU manager with static policy and runc prior to version v1.0.0.
KubeVirt leverages the VolumeSnapshot functionality of Kubernetes CSI drivers for capturing persistent VirtualMachine state. So, you should make sure that your VirtualMachine uses DataVolumes or PersistentVolumeClaims backed by a StorageClass that supports VolumeSnapshots and a VolumeSnapshotClass is properly configured for that StorageClass.
KubeVirt looks for Kubernetes Volume Snapshot related APIs/resources in the v1 version. To make sure that KubeVirt's snapshot controller is able to snapshot the VirtualMachine and referenced volumes as expected, Kubernetes Volume Snapshot APIs must be served from v1 version.
To list VolumeSnapshotClasses:
kubectl get volumesnapshotclass\n
Make sure the provisioner property of your StorageClass matches the driver property of the VolumeSnapshotClass.
Even if you have no VolumeSnapshotClasses in your cluster, VirtualMachineSnapshots are still useful: they will back up your VirtualMachine configuration.
Snapshot/Restore support must be enabled through a feature gate: add Snapshot to the feature gates field in the KubeVirt CR.
"},{"location":"storage/snapshot_restore_api/#snapshot-a-virtualmachine","title":"Snapshot a VirtualMachine","text":"
Snapshotting a VirtualMachine is supported for both online and offline VMs.
When snapshotting a running VM, the controller checks for the QEMU guest agent in the VM. If the agent is present, it freezes the VM filesystems before taking the snapshot and unfreezes them afterwards. Taking online snapshots with the guest agent is recommended for a more consistent snapshot; if the agent is not present, a best-effort snapshot is taken.
Note: To check whether your VM has a qemu-guest-agent, check for 'AgentConnected' in the VM status.
The vmSnapshot status indicates whether the snapshot was taken online and whether the guest agent participated.
Note: Online snapshots with hotplugged disks are supported, but only persistent hotplugged disks are included in the snapshot.
To snapshot a VirtualMachine named larry, apply the following yaml.
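A snapshot request for larry might look like the sketch below. The object name snap-larry is a hypothetical choice, and the snapshot.kubevirt.io/v1beta1 API version is an assumption about the installed KubeVirt release (older releases may only serve v1alpha1).

```yaml
# Sketch only: snap-larry is a placeholder name.
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: snap-larry
spec:
  source:
    # Reference to the VirtualMachine to snapshot.
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: larry
```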
You can check the vmSnapshot phase in the vmSnapshot status. It can be one of the following:
InProgress
Succeeded
Failed.
The vmSnapshot has a default deadline of 5 minutes. If the vmSnapshot has not successfully completed before the deadline, it will be marked as Failed. The VM will be unfrozen and the created snapshot content will be cleaned up if necessary. The vmSnapshot object will remain in Failed state until deleted by the user. To change the default deadline add 'FailureDeadline' to the VirtualMachineSnapshot spec with a new value. The allowed format is a duration string which is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"-1.5h\" or \"2h45m\".
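To illustrate the deadline override, the sketch below sets a one-minute deadline. It assumes the field is serialized as failureDeadline in the spec; the object name and the chosen value are illustrative only.

```yaml
# Sketch only: name and deadline value are placeholders.
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: snap-larry
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: larry
  # Duration string; overrides the default deadline of 5 minutes.
  failureDeadline: 1m
```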
Keep VirtualMachineSnapshots (and their corresponding VirtualMachineSnapshotContents) around as long as you may want to restore from them again.
Feel free to delete restore-larry as it is not needed once the restore is complete.
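For reference, a restore object such as restore-larry could be sketched as follows. It assumes a snapshot named snap-larry exists for the VirtualMachine larry; both the snapshot name and the API version are assumptions rather than values confirmed by this page.

```yaml
# Sketch only: snap-larry is a hypothetical snapshot name.
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineRestore
metadata:
  name: restore-larry
spec:
  target:
    # The VirtualMachine to restore into.
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: larry
  # The snapshot to restore from.
  virtualMachineSnapshotName: snap-larry
```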
"},{"location":"storage/volume_migration/","title":"Migration update volume strategy and volume migration","text":"
Storage migration is possible while the VM is running by using the update volume strategy. Storage migration can be useful in the cases where the users need to change the underlying storage, for example, if the storage class has been deprecated, or there is a new more performant driver available.
This feature doesn't handle the volume creation or cover migration between storage classes, but rather implements a basic API which can be used by overlaying tools to perform more advanced migration planning.
If Migration is specified as updateVolumesStrategy, KubeVirt will try to migrate the storage from the old volume set to the new one when the VirtualMachine spec is updated. The migration considers all changed volumes in a single update. A single update may contain modifications to more than one volume, but sequential changes to the volume set will be handled as separate migrations.
Updates are declarative and GitOps compatible. For example, a new version of the VM specification with the new volume set and the migration volume update strategy can be directly applied using kubectl apply or by interactively editing the VM with kubectl edit.
Example: Original VM with a datavolume and datavolume template:
The destination volume may be of a different type or size than the source. It is possible to migrate from and to a block volume as well as a filesystem volume. The destination volume should be equal to or larger than the source volume. However, the additional difference in the size of the destination volume is not instantly visible within the VM and must be manually resized because the guest is unaware of the migration.
The volume migration depends on the VolumeMigration and VolumesUpdateStrategy feature gates and the LiveMigrate workloadUpdateStrategy. To fully enable the feature, add the following to the KubeVirt CR:
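The settings named above could be combined as in the sketch below. The CR name and namespace are assumptions, and the vmRolloutStrategy line reflects what I believe recent releases also expect; treat it as an assumption and verify it against your KubeVirt version.

```yaml
# Sketch only: CR name and namespace depend on the installation.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - VolumesUpdateStrategy
        - VolumeMigration
    # Assumption: live rollout of VM spec changes is also required.
    vmRolloutStrategy: LiveUpdate
  workloadUpdateStrategy:
    workloadUpdateMethods:
      - LiveMigrate
```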
The volume migration progress can be monitored by watching the corresponding VirtualMachineInstanceMigration object using the label kubevirt.io/volume-update-in-progress: <vm-name>. Example:
Updating a datavolume that is referenced by a datavolume template requires special caution. The volumes section must include a reference to the name of the datavolume template. This means that the datavolume templates must either be entirely deleted or updated as well.
Example of updating the datavolume for the original VM in the first example:
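A hedged sketch of such an update is shown below: the datavolume template and the volume reference are changed together, and the Migration strategy is set. All names (vm-dv, dst-dv, datavolumedisk1) and sizes are hypothetical placeholders, not the values of the original example.

```yaml
# Sketch only: every name and size here is illustrative.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-dv
spec:
  runStrategy: Always
  # Ask KubeVirt to migrate the data instead of waiting for a restart.
  updateVolumesStrategy: Migration
  dataVolumeTemplates:
    # The template is renamed together with the volume that references it.
    - metadata:
        name: dst-dv
      spec:
        source:
          blank: {}
        storage:
          resources:
            requests:
              storage: 2Gi
  template:
    spec:
      domain:
        devices:
          disks:
            - disk:
                bus: virtio
              name: datavolumedisk1
        resources:
          requests:
            memory: 128Mi
      volumes:
        - dataVolume:
            # Must match the datavolume template name above.
            name: dst-dv
          name: datavolumedisk1
```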
Only certain types of disks and volumes are supported to be migrated. For an invalid type of volume the RestartRequired condition is set and volumes will be replaced upon VM restart. Currently, the volume migration is supported between PersistentVolumeClaims and Datavolumes. Additionally, volume migration is forbidden if the disk is:
shareable, since it cannot guarantee the data consistency with multiple writers
hotpluggable, this case isn't currently supported
filesystem, since virtiofs doesn't currently support live-migration
lun, originally the disk might support SCSI protocol but the destination PVC class does not. This case isn't currently supported.
Currently, KubeVirt only enables live migration between separate nodes. Volume migration relies on live migration; hence, live migrating storage on the same node is also not possible. Volume migration is possible between local storage, like between 2 PVCs with RWO access mode, but they need to be located on two different hosts.
"},{"location":"user_workloads/accessing_virtual_machines/","title":"Accessing Virtual Machines","text":""},{"location":"user_workloads/accessing_virtual_machines/#graphical-and-serial-console-access","title":"Graphical and Serial Console Access","text":"
Once a virtual machine is started you are able to connect to the consoles it exposes. Usually there are two types of consoles:
Serial Console
Graphical Console (VNC)
Note: You need to have virtctl installed to gain access to the VirtualMachineInstance.
"},{"location":"user_workloads/accessing_virtual_machines/#accessing-the-serial-console","title":"Accessing the Serial Console","text":"
The serial console of a virtual machine can be accessed by using the console command:
virtctl console testvm\n
"},{"location":"user_workloads/accessing_virtual_machines/#accessing-the-graphical-console-vnc","title":"Accessing the Graphical Console (VNC)","text":"
To access the graphical console of a virtual machine the VNC protocol is typically used. This requires remote-viewer to be installed. Once the tool is installed, you can access the graphical console using:
virtctl vnc testvm\n
If you only want to open a vnc-proxy without executing the remote-viewer command, it can be accomplished with:
virtctl vnc --proxy-only testvm\n
This would print the port number on your machine where you can manually connect using any VNC viewer.
If the connection fails, you can use the -v flag to get more verbose output from both virtctl and the remote-viewer tool to troubleshoot the problem.
virtctl vnc testvm -v 4\n
Note: If you are using virtctl via SSH on a remote machine, you need to forward the X session to your machine. Look up the -X and -Y flags of ssh if you are not familiar with that. As an alternative you can proxy the API server port with SSH to your machine (either directly or in combination with kubectl proxy).
A common operational pattern used when managing virtual machines is to inject SSH public keys into the virtual machines at boot. This allows automation tools (like Ansible) to provision the virtual machine. It also gives operators a way of gaining secure and passwordless access to a virtual machine.
KubeVirt provides multiple ways to inject SSH public keys into a virtual machine.
In general, these methods fall into two categories:
Static key injection, which places keys on the virtual machine the first time it is booted.
Dynamic key injection, which allows keys to be dynamically updated both at boot and during runtime.
Once an SSH public key is injected into the virtual machine, it can be accessed via virtctl.
"},{"location":"user_workloads/accessing_virtual_machines/#static-ssh-public-key-injection-via-cloud-init","title":"Static SSH public key injection via cloud-init","text":"
Users creating virtual machines can provide startup scripts to their virtual machines, allowing multiple customization operations.
One option for injecting public SSH keys into a VM is via cloud-init startup script. However, there are more flexible options available.
The virtual machine's access credential API allows statically injecting SSH public keys at startup time independently of the cloud-init user data by placing the SSH public key into a Kubernetes Secret. This allows keeping the application data in the cloud-init user data separate from the credentials used to access the virtual machine.
A Kubernetes Secret can be created from an SSH public key like this:
# Place SSH public key into a Secret\nkubectl create secret generic my-pub-key --from-file=key1=id_rsa.pub\n
The Secret containing the public key is then assigned to a virtual machine using the access credentials API with the noCloud propagation method.
KubeVirt injects the SSH public key into the virtual machine by using the generated cloud-init metadata instead of the user data. This separates the application user data and user credentials.
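A minimal sketch of a VirtualMachine using the noCloud propagation method follows. It assumes the Secret my-pub-key created above; the VM name, container disk image, and cloud-init user data are illustrative choices.

```yaml
# Sketch only: VM name, image and user data are placeholders.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: testvm
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          disks:
            - disk:
                bus: virtio
              name: containerdisk
            - disk:
                bus: virtio
              name: cloudinitdisk
        resources:
          requests:
            memory: 1024M
      accessCredentials:
        - sshPublicKey:
            source:
              secret:
                secretName: my-pub-key
            propagationMethod:
              # Keys are delivered through the generated cloud-init metadata.
              noCloud: {}
      volumes:
        - containerDisk:
            image: quay.io/containerdisks/fedora:latest
          name: containerdisk
        - cloudInitNoCloud:
            # The user data stays free of credentials.
            userData: |-
              #cloud-config
              password: fedora
              chpasswd: { expire: False }
          name: cloudinitdisk
```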
"},{"location":"user_workloads/accessing_virtual_machines/#dynamic-ssh-public-key-injection-via-qemu-guest-agent","title":"Dynamic SSH public key injection via qemu-guest-agent","text":"
KubeVirt allows the dynamic injection of SSH public keys into a VirtualMachine with the access credentials API.
Utilizing the qemuGuestAgent propagation method, configured Secrets are attached to a VirtualMachine when the VM is started. This allows for dynamic injection of SSH public keys at runtime by updating the attached Secrets.
Please note that new Secrets cannot be attached to a running VM: You must restart the VM to attach the new Secret.
Note: This requires the qemu-guest-agent to be installed within the guest.
Note: When using qemuGuestAgent propagation, the /home/$USER/.ssh/authorized_keys file will be owned by the guest agent. Changes to the file not made by the guest agent will be lost.
Note: More information about the motivation behind the access credentials API can be found in the pull request description that introduced the API.
In the example below the Secret containing the SSH public key is attached to the virtual machine via the access credentials API with the qemuGuestAgent propagation method. This allows updating the contents of the Secret at any time, which will result in the changes getting applied to the running virtual machine immediately. The Secret may also contain multiple SSH public keys.
# Place SSH public key into a secret\nkubectl create secret generic my-pub-key --from-file=key1=id_rsa.pub\n
Now reference this secret in the VirtualMachine spec with the access credentials API using qemuGuestAgent propagation.
# Create a VM referencing the Secret using propagation method qemuGuestAgent\nkubectl create -f - <<EOF\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nmetadata:\n name: testvm\nspec:\n runStrategy: Always\n template:\n spec:\n domain:\n devices:\n disks:\n - disk:\n bus: virtio\n name: containerdisk\n - disk:\n bus: virtio\n name: cloudinitdisk\n rng: {}\n resources:\n requests:\n memory: 1024M\n terminationGracePeriodSeconds: 0\n accessCredentials:\n - sshPublicKey:\n source:\n secret:\n secretName: my-pub-key\n propagationMethod:\n qemuGuestAgent:\n users:\n - fedora\n volumes:\n - containerDisk:\n image: quay.io/containerdisks/fedora:latest\n name: containerdisk\n - cloudInitNoCloud:\n userData: |-\n #cloud-config\n password: fedora\n chpasswd: { expire: False }\n # Disable SELinux for now, so qemu-guest-agent can write the authorized_keys file\n # The selinux-policy is too restrictive currently, see open bugs:\n # - https://bugzilla.redhat.com/show_bug.cgi?id=1917024\n # - https://bugzilla.redhat.com/show_bug.cgi?id=2028762\n # - https://bugzilla.redhat.com/show_bug.cgi?id=2057310\n bootcmd:\n - setenforce 0\n name: cloudinitdisk\nEOF\n
"},{"location":"user_workloads/accessing_virtual_machines/#accessing-the-vmi-using-virtctl","title":"Accessing the VMI using virtctl","text":"
The user can create a websocket backed network tunnel to a port inside the instance by using the virtualmachineinstances/portforward subresource of the VirtualMachineInstance.
One use-case for this subresource is to forward SSH traffic into the VirtualMachineInstance either from the CLI or a web-UI.
To connect to a VirtualMachineInstance from your local machine, virtctl provides a lightweight SSH client with the ssh command, that uses port forwarding. Refer to the command's help for more details.
virtctl ssh\n
To transfer files from or to a VirtualMachineInstance, virtctl also provides a lightweight SCP client with the scp command. Its usage is similar to the ssh command. Refer to the command's help for more details.
virtctl scp\n
"},{"location":"user_workloads/accessing_virtual_machines/#using-virtctl-as-proxy","title":"Using virtctl as proxy","text":"
If you prefer to use your local OpenSSH client, there are two ways of doing that in combination with virtctl.
Note: Most of this applies to the virtctl scp command too.
The virtctl ssh command has a --local-ssh option. With this option virtctl wraps the local OpenSSH client transparently to the user. The executed SSH command can be viewed by increasing the verbosity (-v 3).
virtctl ssh --local-ssh -v 3 testvm\n
The virtctl port-forward command provides an option to tunnel a single port to your local stdout/stdin. This allows the command to be used in combination with the OpenSSH client's ProxyCommand option.
This allows you to simply call ssh user@vmi/testvmi.mynamespace and your SSH config and virtctl will do the rest. Using this method it becomes easy to set up different identities for different namespaces inside your SSH config.
This feature can also be used with Ansible to automate configuration of virtual machines running on KubeVirt. You can put the snippet above into its own file (e.g. ~/.ssh/virtctl-proxy-config) and add the following lines to your .ansible.cfg:
Note that all port forwarding traffic will be sent over the Kubernetes control plane. A high amount of connections and traffic can increase pressure on the API server. If you regularly need a high amount of connections and traffic consider using a dedicated Kubernetes Service instead.
"},{"location":"user_workloads/accessing_virtual_machines/#rbac-permissions-for-consolevncssh-access","title":"RBAC permissions for Console/VNC/SSH access","text":""},{"location":"user_workloads/accessing_virtual_machines/#using-default-rbac-cluster-roles","title":"Using default RBAC cluster roles","text":"
Every KubeVirt installation starting with version v0.5.1 ships a set of default RBAC cluster roles that can be used to grant users access to VirtualMachineInstances.
The kubevirt.io:admin and kubevirt.io:edit cluster roles have console, VNC and port-forwarding (used for SSH) access permissions built into them. Users bound to either of these roles can use virtctl to access the console, VNC and SSH.
The default KubeVirt cluster roles grant access to more than just the console, VNC and port-forwarding. The ClusterRole below demonstrates how to craft a custom role, that only allows access to the console, VNC and port-forwarding.
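Such a restricted role could be sketched as follows. The role name is a hypothetical choice, and the verbs reflect my understanding of how these subresources are accessed (get for console and VNC, update for port-forwarding); verify them against your KubeVirt version.

```yaml
# Sketch only: the role name is a placeholder.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: allow-console-vnc-port-forward-access
rules:
  - apiGroups:
      - subresources.kubevirt.io
    resources:
      - virtualmachineinstances/console
      - virtualmachineinstances/vnc
    verbs:
      - get
  - apiGroups:
      - subresources.kubevirt.io
    resources:
      - virtualmachineinstances/portforward
    verbs:
      - update
```

Binding this role to a user with a RoleBinding limits the grant to a single namespace, while a ClusterRoleBinding applies it cluster-wide.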
KubeVirt does not come with a UI; it only extends the Kubernetes API with virtualization functionality.
"},{"location":"user_workloads/boot_from_external_source/","title":"Booting From External Source","text":"
When installing a new guest virtual machine OS, it is often useful to boot directly from a kernel and initrd stored in the host physical machine OS, allowing command line arguments to be passed directly to the installer.
Booting from an external source is supported in KubeVirt starting from version v0.42.0-rc.0. This enables the capability to define a Virtual Machine that will use a custom kernel / initrd binary, with possible custom arguments, during its boot process.
The binaries are provided through a container image. The container is pulled from the container registry and resides on the local node hosting the VMs.
Some use cases for this may be:
For a kernel developer it may be very convenient to launch VMs that are defined to boot from the latest kernel binary that is often being changed.
Initrd can be set with files that need to reside in memory during the VM's whole life-cycle.
initrdPath and kernelPath define the path for the binaries inside the container.
Kernel and Initrd binaries must be owned by qemu user & group.
To change ownership: chown qemu:qemu <binary> when <binary> is the binary file.
kernelArgs can only be provided if a kernel binary is provided (i.e. kernelPath not defined). These arguments will be passed to the default kernel the VM boots from.
imagePullSecret and imagePullPolicy are optional.
If imagePullPolicy is Always and the container image is updated, the VM will boot into the new kernel when it restarts.
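The fields described in the notes above fit together roughly as in this sketch. The VM name, container image reference, binary paths and kernel arguments are hypothetical placeholders; only the field names come from the notes.

```yaml
# Sketch only: image, paths and kernel arguments are placeholders.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ext-kernel-boot-vm
spec:
  runStrategy: Manual
  template:
    spec:
      domain:
        devices: {}
        firmware:
          kernelBoot:
            container:
              # Hypothetical image holding the kernel and initrd binaries.
              image: registry.example.com/kernel-initrd-binaries:latest
              # Paths of the binaries inside the container.
              kernelPath: /boot/vmlinuz-virt
              initrdPath: /boot/initramfs-virt
              # Optional; Always picks up an updated image on VM restart.
              imagePullPolicy: Always
            kernelArgs: console=ttyS0
        resources:
          requests:
            memory: 1Gi
```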
All KubeVirt system-components expose Prometheus metrics at their /metrics REST endpoint.
You can consult the complete and up-to-date metric list at kubevirt/monitoring.
"},{"location":"user_workloads/component_monitoring/#custom-service-discovery","title":"Custom Service Discovery","text":"
Prometheus supports service discovery based on Pods and Endpoints out of the box. Both can be used to discover KubeVirt services.
All Pods which expose metrics are labeled with prometheus.kubevirt.io and contain a port-definition which is called metrics. In the KubeVirt release-manifests, the default metrics port is 8443.
The above labels and port information are collected by a Service called kubevirt-prometheus-metrics. Kubernetes automatically creates a corresponding Endpoint with the same name:
By watching this endpoint for added and removed IPs to subsets.addresses and appending the metrics port from subsets.ports, it is possible to always get a complete list of ready-to-be-scraped Prometheus targets.
"},{"location":"user_workloads/component_monitoring/#integrating-with-the-prometheus-operator","title":"Integrating with the prometheus-operator","text":"
The prometheus-operator can make use of the kubevirt-prometheus-metrics service to automatically create the appropriate Prometheus config.
KubeVirt's virt-operator checks if the ServiceMonitor custom resource exists when creating an install strategy for deployment. KubeVirt will automatically create a ServiceMonitor resource in the monitorNamespace, as well as an appropriate role and rolebinding in KubeVirt's namespace.
Three settings are exposed in the KubeVirt custom resource to direct KubeVirt to create these resources correctly:
monitorNamespace: The namespace that prometheus-operator runs in. Defaults to openshift-monitoring.
monitorAccount: The serviceAccount that prometheus-operator runs with. Defaults to prometheus-k8s.
serviceMonitorNamespace: The namespace that the serviceMonitor runs in. Defaults to monitorNamespace.
Please note that if you decide to set serviceMonitorNamespace, then this namespace must be included in the serviceMonitorNamespaceSelector field of the Prometheus spec.
If the prometheus-operator for a given deployment uses these defaults, then these values can be omitted.
An example of the KubeVirt resource depicting these default values:
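A sketch of a KubeVirt resource with these defaults written out explicitly might look like this. The CR name is an assumption about the installation; serviceMonitorNamespace is left out because it defaults to monitorNamespace.

```yaml
# Sketch only: the values below are the documented defaults.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
spec:
  # Namespace that prometheus-operator runs in.
  monitorNamespace: openshift-monitoring
  # ServiceAccount that prometheus-operator runs with.
  monitorAccount: prometheus-k8s
```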
"},{"location":"user_workloads/component_monitoring/#integrating-with-the-okd-cluster-monitoring-operator","title":"Integrating with the OKD cluster-monitoring-operator","text":"
After the cluster-monitoring-operator is up and running, KubeVirt will detect the existence of the ServiceMonitor resource. Because the definition contains the openshift.io/cluster-monitoring label, it will automatically be picked up by the cluster monitor.
"},{"location":"user_workloads/component_monitoring/#metrics-about-virtual-machines","title":"Metrics about Virtual Machines","text":"
The endpoints report metrics related to the runtime behaviour of the Virtual Machines. All the relevant metrics are prefixed with kubevirt_vmi.
The metrics have labels that make it possible to map them to the VMI objects they refer to. At minimum, the labels will expose node, name and namespace of the related VMI object.
Please note the domain label in the above example. This label is deprecated and it will be removed in a future release. You should identify the VMI using the node, namespace, name labels instead.
"},{"location":"user_workloads/component_monitoring/#important-queries","title":"Important Queries","text":""},{"location":"user_workloads/component_monitoring/#detecting-connection-issues-for-the-rest-client","title":"Detecting connection issues for the REST client","text":"
Use the following query to get a counter for all REST calls which indicate connection issues:
rest_client_requests_total{code=\"<error>\"}\n
If this counter is continuously increasing, it is an indicator that the corresponding KubeVirt component has general issues connecting to the apiserver.
The virtctl subcommand create vm allows easy creation of VirtualMachine manifests from the command line. By default it leverages instance types and preferences, including their inference (see Specifying or inferring instance types and preferences), and provides several flags to control details of the created virtual machine.
For example there are flags to specify the name or run strategy of a virtual machine or flags to add volumes to a virtual machine. Instance types and preferences can either be specified directly or it is possible to let KubeVirt infer those from the volume used to boot the virtual machine.
For a full set of flags and their description use the following command:
virtctl create vm -h\n
"},{"location":"user_workloads/creating_vms/#creating-virtualmachines-on-a-cluster","title":"Creating VirtualMachines on a cluster","text":"
The output of virtctl create vm can be piped into kubectl to directly create a VirtualMachine on a cluster, e.g.:
# Create a VM with name my-vm on the cluster\nvirtctl create vm --name my-vm | kubectl create -f -\nvirtualmachine.kubevirt.io/my-vm created\n
The virtctl subcommand create instancetype allows easy creation of an instance type manifest from the command line. The command also provides several flags that can be used to create your desired manifest.
There are two required flags that need to be specified: the number of vCPUs and the amount of memory to be requested. Additionally, there are several optional flags that can be used, such as specifying a list of GPUs for passthrough, choosing the desired IOThreadsPolicy, or simply providing the name of our instance type.
By default, the command creates the cluster-wide resource. If the user wants to create the namespaced version, they need to provide the namespaced flag. The namespace name can be specified by using the namespace flag.
For a complete list of flags and their descriptions, use the following command:
The virtctl subcommand create preference allows easy creation of a preference manifest from the command line. This command serves as a starting point to create the basic structure of a manifest, as it does not allow specifying all of the options that are supported in preferences.
The current set of flags allows us, for example, to specify the preferred CPU topology, machine type or a storage class.
By default, the command creates the cluster-wide resource. If the user wants to create the namespaced version, they need to provide the namespaced flag. The namespace name can be specified by using the namespace flag.
For a complete list of flags and their descriptions, use the following command:
"},{"location":"user_workloads/creating_vms/#specifying-or-inferring-instance-types-and-preferences","title":"Specifying or inferring instance types and preferences","text":"
Instance types and preferences can be specified with the appropriate flags, e.g.:
virtctl create vm --instancetype my-instancetype --preference my-preference\n
The type of the instance type or preference (namespaced or cluster scope) can be controlled by prefixing the instance type or preference name with the corresponding CRD name, e.g.:
# Using a cluster scoped instancetype and a namespaced preference\nvirtctl create vm \\\n --instancetype virtualmachineclusterinstancetype/my-instancetype \\\n --preference virtualmachinepreference/my-preference\n
If a prefix was not supplied the cluster scoped resources will be used by default.
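The prefix rule can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not virtctl's actual parsing code; the function name and its parameters are hypothetical.

```python
def parse_reference(value, cluster_kind, namespaced_kind):
    '''Split an optional kind prefix from a value given as kind/name.'''
    if '/' not in value:
        # No prefix: the cluster scoped resource is assumed.
        return cluster_kind, value
    kind, name = value.split('/', 1)
    if kind not in (cluster_kind, namespaced_kind):
        raise ValueError('unknown kind prefix: ' + kind)
    return kind, name


print(parse_reference('my-instancetype',
                      'virtualmachineclusterinstancetype',
                      'virtualmachineinstancetype'))
print(parse_reference('virtualmachinepreference/my-preference',
                      'virtualmachineclusterpreference',
                      'virtualmachinepreference'))
```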
To explicitly infer instance types and/or preferences from the volume used to boot the virtual machine add the following flags:
virtctl create vm --infer-instancetype --infer-preference\n
The implicit default is to always try inferring an instancetype and preference from the boot volume. This feature makes use of the IgnoreInferFromVolumeFailure policy, which suppresses failures on inference of instancetypes and preferences. If one of the above switches was provided explicitly, then the RejectInferFromVolumeFailure policy is used instead. This way users are made aware of potential issues during the virtual machine creation.
To infer an instancetype or preference from a volume other than the one used to boot the virtual machine, use the --infer-instancetype-from and --infer-preference-from flags to specify any of the virtual machine's volumes.
# This virtual machine will boot from volume-a, but the instancetype and\n# preference are inferred from volume-b.\nvirtctl create vm \\\n --volume-import=type:pvc,src:my-ns/my-pvc-a,name:volume-a \\\n --volume-import=type:pvc,src:my-ns/my-pvc-b,name:volume-b \\\n --infer-instancetype-from volume-b \\\n --infer-preference-from volume-b\n
"},{"location":"user_workloads/creating_vms/#boot-order-of-added-volumes","title":"Boot order of added volumes","text":"
Please note that volumes of different kinds currently have the following fixed boot order regardless of the order their flags were specified on the command line:
ContainerDisk
DataSource
Cloned PVC
Directly used PVC
If multiple volumes of the same kind were specified their order is determined by the order in which their flags were specified.
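The ordering rule above amounts to a stable sort by volume kind. The following Python sketch illustrates it; the kind labels and volume names are hypothetical placeholders, not virtctl identifiers.

```python
# Fixed order of volume kinds, as listed above.
KIND_ORDER = ['containerdisk', 'datasource', 'clone-pvc', 'pvc']


def boot_order(volumes):
    '''Sort (kind, name) pairs by kind; ties keep command line order.'''
    # sorted() is stable, so volumes of the same kind stay in input order.
    return sorted(volumes, key=lambda volume: KIND_ORDER.index(volume[0]))


flags = [('pvc', 'data'), ('datasource', 'ds-b'),
         ('containerdisk', 'os'), ('datasource', 'ds-a')]
print(boot_order(flags))
```

Here the ContainerDisk comes first even though its flag was given third, while the two DataSources keep the order in which they were passed.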
"},{"location":"user_workloads/creating_vms/#specifying-cloud-init-user-data","title":"Specifying cloud-init user data","text":"
To pass cloud-init user data to virtctl, it needs to be encoded as a base64 string. Here is an example of how to do it:
# Put your cloud-init user data into a file.\n# This will add an authorized key to the default user.\n# To get the default username read the documentation for the cloud image\n$ cat cloud-init.txt\n#cloud-config\nssh_authorized_keys:\n  - ssh-rsa AAAA...\n\n# Base64 encode the contents of the file without line wraps and store it in a variable\n$ CLOUD_INIT_USERDATA=$(base64 -w 0 cloud-init.txt)\n\n# Show the contents of the variable\n$ echo $CLOUD_INIT_USERDATA\nI2Nsb3VkLWNvbmZpZwpzc2hfYXV0aG9yaXplZF9rZXlzOgogIC0gc3NoLXJzYSBBQUFBLi4uCg==\n
You can now use this variable as an argument to the --cloud-init-user-data flag:
virtctl create vm --cloud-init-user-data $CLOUD_INIT_USERDATA\n
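If the base64 tool is not at hand, the same encoding can be produced with Python's standard library. This is a minimal sketch that assumes the user data shown in the example above; the variable names are arbitrary.

```python
import base64

# Same user data as in the cloud-init.txt example.
user_data = '''#cloud-config
ssh_authorized_keys:
  - ssh-rsa AAAA...
'''

# b64encode never inserts line breaks, which matches base64 -w 0.
encoded = base64.b64encode(user_data.encode('utf-8')).decode('ascii')
print(encoded)
```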
Create a manifest for a VirtualMachine with a random name:
virtctl create vm\n
Create a manifest for a VirtualMachine with a specified name and RunStrategy Always
virtctl create vm --name=my-vm --run-strategy=Always\n
Create a manifest for a VirtualMachine with a specified VirtualMachineClusterInstancetype
virtctl create vm --instancetype=my-instancetype\n
Create a manifest for a VirtualMachine with a specified VirtualMachineInstancetype (namespaced)
virtctl create vm --instancetype=virtualmachineinstancetype/my-instancetype\n
Create a manifest for a VirtualMachine with a specified VirtualMachineClusterPreference
virtctl create vm --preference=my-preference\n
Create a manifest for a VirtualMachine with a specified VirtualMachinePreference (namespaced)
virtctl create vm --preference=virtualmachinepreference/my-preference\n
Create a manifest for a VirtualMachine with an ephemeral containerdisk volume
virtctl create vm --volume-containerdisk=src:my.registry/my-image:my-tag\n
Create a manifest for a VirtualMachine with a cloned DataSource in namespace and specified size
virtctl create vm --volume-datasource=src:my-ns/my-ds,size:50Gi\n
Create a manifest for a VirtualMachine with a cloned DataSource and inferred instancetype and preference
virtctl create vm --volume-datasource=src:my-annotated-ds --infer-instancetype --infer-preference\n
Create a manifest for a VirtualMachine with a cloned PVC
virtctl create vm --volume-clone-pvc=my-ns/my-pvc\n
Create a manifest for a VirtualMachine with a directly used PVC
virtctl create vm --volume-pvc=my-pvc\n
Create a manifest for a VirtualMachine with a cloned DataSource and a blank volume
virtctl create vm --volume-datasource=src:my-ns/my-ds --volume-blank=size:50Gi\n
Create a manifest for a VirtualMachine with a specified VirtualMachineCluster{Instancetype,Preference} and cloned DataSource
virtctl create vm --instancetype=my-instancetype --preference=my-preference --volume-datasource=src:my-ds\n
Create a manifest for a VirtualMachine with a specified VirtualMachineCluster{Instancetype,Preference} and two cloned DataSources (flag can be provided multiple times)
virtctl create vm --instancetype=my-instancetype --preference=my-preference --volume-datasource=src:my-ds1 --volume-datasource=src:my-ds2\n
Create a manifest for a VirtualMachine with a specified VirtualMachineCluster{Instancetype,Preference} and directly used PVC
virtctl create vm --instancetype=my-instancetype --preference=my-preference --volume-pvc=my-pvc\n
The kubevirt/common-instancetypes project provides a set of instancetypes and preferences to help create KubeVirt VirtualMachines.
Beginning with the 1.1 release of KubeVirt, cluster-wide resources can be deployed directly through KubeVirt, without another operator. This allows deployment of a set of default instancetypes and preferences alongside KubeVirt.
"},{"location":"user_workloads/deploy_common_instancetypes/#enable-automatic-deployment-of-common-instancetypes","title":"Enable automatic deployment of common-instancetypes","text":"
To enable the deployment of cluster-wide common-instancetypes through the KubeVirt virt-operator, the CommonInstancetypesDeploymentGate feature gate needs to be enabled.
For customization purposes or to install namespaced resources, common-instancetypes can also be deployed by hand.
To install all resources provided by the kubevirt/common-instancetypes project without further customizations, simply apply with kustomize enabled (-k flag):
Guest Agent (GA) is an optional component that can run inside of Virtual Machines. The GA provides plenty of additional runtime information about the running operating system (OS). More technical detail about available GA commands is available here.
"},{"location":"user_workloads/guest_agent_information/#guest-agent-info-in-virtual-machine-status","title":"Guest Agent info in Virtual Machine status","text":"
GA presence in the Virtual Machine is signaled with a condition in the VirtualMachineInstance status. The condition indicates that the GA is connected and can be used.
When the Guest Agent is not present in the Virtual Machine, the Guest Agent information is not shown. No error is reported because the Guest Agent is an optional component.
The infoSource field indicates where the info is gathered from. Valid values:
domain: the info is based on the domain spec
guest-agent: the info is based on Guest Agent report
domain, guest-agent: the info is based on both the domain spec and the Guest Agent report
"},{"location":"user_workloads/guest_agent_information/#guest-agent-info-available-through-the-api","title":"Guest Agent info available through the API","text":"
The data shown in the VirtualMachineInstance status are a subset of the information available. The rest of the data is available via the REST API exposed in the Kubernetes kube-api server.
There are three new subresources added to the VirtualMachineInstance object:
- guestosinfo\n- userlist\n- filesystemlist\n
The whole GA data is returned via guestosinfo subresource available behind the API endpoint.
"},{"location":"user_workloads/guest_operating_system_information/#use-with-presets","title":"Use with presets","text":"
A VirtualMachineInstancePreset representing an operating system with a kubevirt.io/os label can be applied to any VirtualMachineInstance that has a matching kubevirt.io/os label.
Default presets for the OS identifiers above are included in the current release.
"},{"location":"user_workloads/guest_operating_system_information/#windows-server-2012r2-virtualmachineinstancepreset-example","title":"Windows Server 2012R2 VirtualMachineInstancePreset Example","text":"
KubeVirt supports many so-called \"HyperV enlightenments\", which are optimizations for Windows guests. Some of these optimizations may require up-to-date host kernel support to work properly or to deliver the maximum performance gains.
KubeVirt can perform extra checks on the hosts before running Hyper-V enabled VMs, to make sure the host has no known issues with Hyper-V support and properly exposes all the required features, so that optimal performance can be expected. These checks are disabled by default for backward compatibility and because they depend on node-feature-discovery and on extra configuration.
To enable strict host checking, the user may expand the featureGates field in the KubeVirt CR by adding the HypervStrictCheck to it.
In KubeVirt, a Hook Sidecar container is a sidecar container (a secondary container that runs along with the main application container within the same Pod) used to apply customizations before the Virtual Machine is initialized. This ability is provided since configurable elements in the VMI specification do not cover all of the libvirt domain XML elements.
The sidecar containers communicate with the main container over a socket with a gRPC protocol. There are two main sidecar hooks:
onDefineDomain: This hook helps to customize libvirt's XML and return the new XML over gRPC for the VM creation.
preCloudInitIso: This hook helps to customize the cloud-init configuration. It operates on and returns JSON formatted cloud-init data.
To run a VM with custom modifications, the sidecar-shim-image takes care of implementing the communication with the main container.
The image contains the sidecar-shim binary built using sidecar_shim.go, which should be kept as the entrypoint of the container. This binary searches $PATH for binaries named after the hook names (e.g. onDefineDomain and preCloudInitIso) and runs them. Users must provide the necessary arguments as command line options (flags).
In the case of onDefineDomain, the arguments are the VMI information as a JSON string (e.g. --vmi vmiJSON) and the current domain XML (e.g. --domain domainXML). It outputs the modified domain XML on the standard output.
In the case of preCloudInitIso, the arguments are the VMI information as a JSON string (e.g. --vmi vmiJSON) and the CloudInitData (e.g. --cloud-init cloudInitJSON). It outputs the modified CloudInitData (as JSON) on the standard output.
Shell or python scripts can be used as alternatives to the binary, by making them available at the expected location (/usr/bin/onDefineDomain or /usr/bin/preCloudInitIso depending upon the hook).
A prebuilt image named sidecar-shim capable of running Shell or Python scripts is shipped as part of KubeVirt releases.
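To make the calling convention more concrete, here is a minimal Python sketch of an onDefineDomain style script that reads the --vmi and --domain flags described above and returns modified domain XML. The helper names, the sample XML and the manufacturer value are hypothetical, and the sketch assumes the domain already contains a sysinfo element with a baseBoard child.

```python
import argparse
import xml.etree.ElementTree as ET


def set_baseboard_manufacturer(domain_xml, manufacturer):
    '''Return the domain XML with a baseBoard manufacturer entry added.'''
    root = ET.fromstring(domain_xml)
    # Assumes the domain already has a sysinfo/baseBoard element.
    base_board = root.find('sysinfo').find('baseBoard')
    entry = ET.Element('entry', {'name': 'manufacturer'})
    entry.text = manufacturer
    base_board.insert(0, entry)
    return ET.tostring(root, encoding='unicode')


def run(argv):
    '''Parse the flags passed to the hook and return the new XML.'''
    parser = argparse.ArgumentParser()
    parser.add_argument('--vmi')
    parser.add_argument('--domain')
    args = parser.parse_args(argv)
    return set_baseboard_manufacturer(args.domain, 'Example Vendor')


# Stand-in arguments; a real hook would pass sys.argv[1:] instead.
sample = '<domain><sysinfo><baseBoard/></sysinfo></domain>'
print(run(['--vmi', '{}', '--domain', sample]))
```

Parsing the flags by name rather than by position keeps the script working if the order of the arguments ever differs.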
"},{"location":"user_workloads/hook-sidecar/#go-python-shell-pick-any-one","title":"Go, Python, Shell - pick any one","text":"
Although a binary doesn't strictly need to be generated from Go code, and a script doesn't strictly need to be one among Shell or Python, for the purpose of this guide, we will use those as examples.
Example Go code modifying the SMBIOS system information can be found in the KubeVirt repo. The binary generated from this code, when available under /usr/bin/onDefineDomain in the sidecar-shim-image, is run right before VMI creation, and the baseboard manufacturer value is modified to reflect what is provided in the smbios.vm.kubevirt.io/baseBoardManufacturer annotation in the VMI spec.
"},{"location":"user_workloads/hook-sidecar/#shell-or-python-script","title":"Shell or Python script","text":"
If you prefer writing a shell or python script instead of a Go program, create a Kubernetes ConfigMap and use annotations to make sure the script is run before the VMI creation. The flow is as follows:
Create a ConfigMap containing the shell or python script you want to run
Create a VMI containing the annotation hooks.kubevirt.io/hookSidecars and mention the ConfigMap information in it.
In this case a predefined image can be used to handle the communication with the main container.
"},{"location":"user_workloads/hook-sidecar/#configmap-with-shell-script","title":"ConfigMap with shell script","text":"
"},{"location":"user_workloads/hook-sidecar/#configmap-with-python-script","title":"ConfigMap with python script","text":"
apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: my-config-map\ndata:\n  my_script.sh: |\n    #!/usr/bin/env python3\n\n    import xml.etree.ElementTree as ET\n    import sys\n\n    def main(s):\n        # write to a temporary file\n        f = open(\"/tmp/orig.xml\", \"w\")\n        f.write(s)\n        f.close()\n\n        # parse xml from file\n        xml = ET.parse(\"/tmp/orig.xml\")\n        # get the root element\n        root = xml.getroot()\n        # find the baseBoard element\n        baseBoard = root.find(\"sysinfo\").find(\"baseBoard\")\n\n        # prepare new element to be inserted into the xml definition\n        element = ET.Element(\"entry\", {\"name\": \"manufacturer\"})\n        element.text = \"Radical Edward\"\n        # insert the element\n        baseBoard.insert(0, element)\n\n        # write to a new file\n        xml.write(\"/tmp/new.xml\")\n        # print file contents to stdout\n        f = open(\"/tmp/new.xml\")\n        print(f.read())\n        f.close()\n\n    if __name__ == \"__main__\":\n        main(sys.argv[4])\n
After creating one of the above ConfigMaps, create the VMI using the manifest in this example. Of importance here is the ConfigMap information stored in the annotations:
The name field indicates the name of the ConfigMap on the cluster which contains the script you want to execute. The key field indicates the key in the ConfigMap which contains the script to be executed. Finally, hookPath indicates the path where you want the script to be mounted. It can be either /usr/bin/onDefineDomain or /usr/bin/preCloudInitIso, depending on the hook you want to execute. An optional value can be specified with the \"image\" key if a custom image is needed; if omitted, the default sidecar-shim image built together with the other KubeVirt images is used. Unless overridden with a custom value, the default sidecar-shim image is also updated along with the other images, as described in Updating KubeVirt Workloads.
Whether you used the Go binary or a Shell/Python script from the above examples, you would be able to see the newly created VMI have the modified baseboard manufacturer information. After creating the VMI, verify that it is in the Running state, and connect to its console and see if the desired changes to baseboard manufacturer get reflected:
# Once the VM is ready, connect to its display and login using name and password \"fedora\"\ncluster/virtctl.sh vnc vmi-with-sidecar-hook-configmap\n\n# Check whether the base board manufacturer value was successfully overwritten\nsudo dmidecode -s baseboard-manufacturer\n
"},{"location":"user_workloads/instancetypes/","title":"Instance types and preferences","text":"
FEATURE STATE:
instancetype.kubevirt.io/v1alpha1 (Experimental) as of the v0.56.0 KubeVirt release
instancetype.kubevirt.io/v1alpha2 (Experimental) as of the v0.58.0 KubeVirt release
instancetype.kubevirt.io/v1beta1 as of the v1.0.0 KubeVirt release
KubeVirt's VirtualMachine API contains many advanced options for tuning the performance of a VM that go beyond what typical users need to be aware of. Users have previously been unable to simply define the storage/network they want assigned to their VM and then declare in broad terms what quality of resources and kind of performance characteristics they need for their VM.
Instance types and preferences provide a way to define a set of resource, performance and other runtime characteristics, allowing users to reuse these definitions across multiple VirtualMachines.
KubeVirt provides two CRDs for instance types, a cluster wide VirtualMachineClusterInstancetype and a namespaced VirtualMachineInstancetype. These CRDs encapsulate the following resource related characteristics of a VirtualMachine through a shared VirtualMachineInstancetypeSpec:
CPU : Required number of vCPUs presented to the guest
Memory : Required amount of memory presented to the guest
GPUs : Optional list of vGPUs to passthrough
HostDevices : Optional list of HostDevices to passthrough
IOThreadsPolicy : Optional IOThreadsPolicy to be used
LaunchSecurity: Optional LaunchSecurity to be used
Anything provided within an instance type cannot be overridden within the VirtualMachine. For example, as CPU and Memory are both required attributes of an instance type, if a user makes any requests for CPU or Memory resources within the underlying VirtualMachine, the instance type will conflict and the request will be rejected during creation.
KubeVirt also provides two further preference based CRDs, again a cluster wide VirtualMachineClusterPreference and a namespaced VirtualMachinePreference. These CRDs encapsulate the preferred value of any remaining attributes of a VirtualMachine required to run a given workload, again through a shared VirtualMachinePreferenceSpec.
Unlike instance types, preferences only represent the preferred values and as such, they can be overridden by values in the VirtualMachine provided by the user.
In the example shown below, a user has provided a VirtualMachine with a disk bus already defined within a DiskTarget and has also selected a set of preferences with DevicePreference and preferredDiskBus, so the user's original choice within the VirtualMachine and DiskTarget is used:
$ kubectl apply -f - << EOF\n---\napiVersion: instancetype.kubevirt.io/v1beta1\nkind: VirtualMachinePreference\nmetadata:\n  name: example-preference-disk-virtio\nspec:\n  devices:\n    preferredDiskBus: virtio\n---\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nmetadata:\n  name: example-preference-user-override\nspec:\n  preference:\n    kind: VirtualMachinePreference\n    name: example-preference-disk-virtio\n  runStrategy: Halted\n  template:\n    spec:\n      domain:\n        memory:\n          guest: 128Mi\n        devices:\n          disks:\n          - disk:\n              bus: sata\n            name: containerdisk\n          - disk: {}\n            name: cloudinitdisk\n        resources: {}\n      terminationGracePeriodSeconds: 0\n      volumes:\n      - containerDisk:\n          image: registry:5000/kubevirt/cirros-container-disk-demo:devel\n        name: containerdisk\n      - cloudInitNoCloud:\n          userData: |\n            #!/bin/sh\n\n            echo 'printed from cloud-init userdata'\n        name: cloudinitdisk\nEOF\nvirtualmachinepreference.instancetype.kubevirt.io/example-preference-disk-virtio created\nvirtualmachine.kubevirt.io/example-preference-user-override configured\n\n\n$ virtctl start example-preference-user-override\nVM example-preference-user-override was scheduled to start\n\n# We can see the original request from the user within the VirtualMachine lists `containerdisk` with a `SATA` bus\n$ kubectl get vms/example-preference-user-override -o json | jq .spec.template.spec.domain.devices.disks\n[\n  {\n    \"disk\": {\n      \"bus\": \"sata\"\n    },\n    \"name\": \"containerdisk\"\n  },\n  {\n    \"disk\": {},\n    \"name\": \"cloudinitdisk\"\n  }\n]\n\n# This is still the case in the VirtualMachineInstance with the remaining disk using the `preferredDiskBus` from the preference of `virtio`\n$ kubectl get vmis/example-preference-user-override -o json | jq .spec.domain.devices.disks\n[\n  {\n    \"disk\": {\n      \"bus\": \"sata\"\n    },\n    \"name\": \"containerdisk\"\n  },\n  {\n    \"disk\": {\n      \"bus\": \"virtio\"\n    },\n    \"name\": \"cloudinitdisk\"\n  }\n]\n
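The override behaviour seen in this example can be sketched as a small Python function. This is only an illustration of the rule that user supplied values win over preferences, not KubeVirt's implementation; the function name is hypothetical.

```python
def apply_preferred_disk_bus(disks, preferred_bus):
    '''Fill in the bus only for disks where the user did not set one.'''
    result = []
    for disk in disks:
        target = dict(disk.get('disk', {}))
        # A user supplied bus always wins over the preference.
        target.setdefault('bus', preferred_bus)
        result.append({'disk': target, 'name': disk['name']})
    return result


vm_disks = [
    {'disk': {'bus': 'sata'}, 'name': 'containerdisk'},
    {'disk': {}, 'name': 'cloudinitdisk'},
]
print(apply_preferred_disk_bus(vm_disks, 'virtio'))
```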
A preference can optionally include a PreferredCPUTopology that defines how the guest visible CPU topology of the VirtualMachineInstance is constructed from vCPUs supplied by an instance type.
The allowed values for PreferredCPUTopology include:
sockets (default) - Provides vCPUs as sockets to the guest
cores - Provides vCPUs as cores to the guest
threads - Provides vCPUs as threads to the guest
spread - Spreads vCPUs across sockets and cores by default. See the following SpreadOptions section for more details.
any - Provides vCPUs as sockets to the guest, this is also used to express that any allocation of vCPUs is required by the preference. Useful when defining a preference that isn't used alongside an instance type.
Note that support for the original preferSockets, preferCores, preferThreads and preferSpread values for PreferredCPUTopology is deprecated as of v1.4.0 ahead of removal in a future release.
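As a rough Python sketch of the simpler values, the function below maps a vCPU count onto a guest topology. It assumes that the dimensions not selected by the preference are set to 1, and it leaves out spread because that depends on SpreadOptions; the function name is hypothetical.

```python
def guest_topology(vcpus, preferred='sockets'):
    '''Map a vCPU count onto sockets, cores and threads.'''
    topology = {'sockets': 1, 'cores': 1, 'threads': 1}
    if preferred in ('sockets', 'any'):
        topology['sockets'] = vcpus
    elif preferred in ('cores', 'threads'):
        topology[preferred] = vcpus
    else:
        # spread depends on SpreadOptions and is left out of this sketch.
        raise ValueError('unsupported topology: ' + preferred)
    return topology


print(guest_topology(2))
print(guest_topology(4, 'cores'))
```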
When spread is provided as the value of PreferredCPUTopology we can further customize how vCPUs are spread across the guest visible CPU topology using SpreadOptions:
The previous instance type and preference CRDs are matched to a given VirtualMachine through the use of a matcher. Each matcher consists of the following:
Name (string): Name of the resource being referenced
Kind (string): Optional, defaults to the cluster wide CRD kinds of VirtualMachineClusterInstancetype or VirtualMachineClusterPreference if not provided
RevisionName (string) : Optional, name of a ControllerRevision containing a copy of the VirtualMachineInstancetypeSpec or VirtualMachinePreferenceSpec taken when the VirtualMachine is first created. See the Versioning section below for more details on how and why this is captured.
InferFromVolume (string): Optional, see the Inferring defaults from a Volume section below for more details.
"},{"location":"user_workloads/instancetypes/#creating-instancetypes-preferences-and-virtualmachines","title":"Creating InstanceTypes, Preferences and VirtualMachines","text":"
It is possible to streamline the creation of instance types, preferences, and virtual machines with the usage of the virtctl command-line tool. To read more about it, please see the Creating VirtualMachines.
Versioning of these resources is required to ensure the eventual VirtualMachineInstance created when starting a VirtualMachine does not change between restarts if any referenced instance type or set of preferences are updated during the lifetime of the VirtualMachine.
This is currently achieved by using ControllerRevision to retain a copy of the VirtualMachineInstancetype or VirtualMachinePreference at the time the VirtualMachine is created. A reference to these ControllerRevisions is then retained in the InstancetypeMatcher and PreferenceMatcher within the VirtualMachine for future use.
Users can opt in to moving to a newer generation of an instance type or preference by removing the referenced revisionName from the appropriate matcher within the VirtualMachine object. This will result in fresh ControllerRevisions being captured and used.
The following example creates a VirtualMachine using an initial version of the csmall instance type before increasing the number of vCPUs provided by the instance type:
In order for this change to be picked up within the VirtualMachine, we need to stop the running VirtualMachine and clear the revisionName referenced by the InstancetypeMatcher:
As you can see above, the InstancetypeMatcher now references a new ControllerRevision containing generation 2 of the instance type. We can now start the VirtualMachine again and see the new number of vCPUs being used by the VirtualMachineInstance:
$ virtctl start vm-cirros-csmall\nVM vm-cirros-csmall was scheduled to start\n\n$ kubectl get vmi/vm-cirros-csmall -o json | jq .spec.domain.cpu\n{\n \"cores\": 1,\n \"model\": \"host-model\",\n \"sockets\": 2,\n \"threads\": 1\n}\n
The inferFromVolume attribute of both the InstancetypeMatcher and PreferenceMatcher allows a user to request that defaults are inferred from a volume. When requested, KubeVirt will look for the following labels on the underlying PVC, DataSource or DataVolume to determine the default name and kind:
instancetype.kubevirt.io/default-instancetype
instancetype.kubevirt.io/default-instancetype-kind (optional, defaults to VirtualMachineClusterInstancetype)
instancetype.kubevirt.io/default-preference
instancetype.kubevirt.io/default-preference-kind (optional, defaults to VirtualMachineClusterPreference)
These values are then written into the appropriate matcher by the mutation webhook and used during validation before the VirtualMachine is formally accepted.
The validation can be controlled by the value provided to inferFromVolumeFailurePolicy in either the InstancetypeMatcher or PreferenceMatcher of a VirtualMachine.
The default value of Reject will cause the request to be rejected on failure to find the referenced Volume or labels on an underlying resource.
If Ignore was provided, the respective InstancetypeMatcher or PreferenceMatcher will be cleared on a failure instead.
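The label lookup and the two failure policies can be summarised in a short Python sketch for the instance type case. It is an illustration of the behaviour described above rather than the webhook's code; the function name and the csmall label value are only examples.

```python
PREFIX = 'instancetype.kubevirt.io/'


def infer_instancetype(labels, failure_policy='Reject'):
    '''Build an instance type matcher from volume labels, or handle failure.'''
    name = labels.get(PREFIX + 'default-instancetype')
    if name is None:
        if failure_policy == 'Ignore':
            # The matcher is cleared instead of failing the request.
            return None
        raise ValueError('no default instance type label found')
    kind = labels.get(PREFIX + 'default-instancetype-kind',
                      'VirtualMachineClusterInstancetype')
    return {'name': name, 'kind': kind}


print(infer_instancetype({PREFIX + 'default-instancetype': 'csmall'}))
print(infer_instancetype({}, failure_policy='Ignore'))
```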
Various examples are available within the kubevirt repo under /examples. The following uses an example VirtualMachine provided by the containerdisk/fedora repo and replaces much of the DomainSpec with the equivalent instance type and preferences:
This version captured complete VirtualMachine{Instancetype,ClusterInstancetype,Preference,ClusterPreference} objects within the created ControllerRevisions
This version is backwardly compatible with instancetype.kubevirt.io/v1alpha1.
The following instance type attribute has been added:
Spec.Memory.OvercommitPercent
The following preference attributes have been added:
Spec.CPU.PreferredCPUFeatures
Spec.Devices.PreferredInterfaceMasquerade
Spec.PreferredSubdomain
Spec.PreferredTerminationGracePeriodSeconds
Spec.Requirements
This version is backwardly compatible with instancetype.kubevirt.io/v1alpha1 and instancetype.kubevirt.io/v1alpha2 objects, no modifications are required to existing VirtualMachine{Instancetype,ClusterInstancetype,Preference,ClusterPreference} or ControllerRevisions.
As with the migration to kubevirt.io/v1, it is recommended that previous users of instancetype.kubevirt.io/v1alpha1 or instancetype.kubevirt.io/v1alpha2 use kube-storage-version-migrator to upgrade any stored objects to instancetype.kubevirt.io/v1beta1.
Every VirtualMachineInstance represents a single virtual machine instance. In general, the management of VirtualMachineInstances is kept similar to how Pods are managed: every VM that is defined in the cluster is expected to be running, just like Pods. Deleting a VirtualMachineInstance is equivalent to shutting it down, which again mirrors how Pods behave.
"},{"location":"user_workloads/lifecycle/#launching-a-virtual-machine","title":"Launching a virtual machine","text":"
In order to start a VirtualMachineInstance, you just need to create a VirtualMachineInstance object using kubectl:
Note: Stopping a VirtualMachineInstance implies that it will be deleted from the cluster. You will not be able to start this VirtualMachineInstance object again.
"},{"location":"user_workloads/lifecycle/#starting-and-stopping-a-virtual-machine","title":"Starting and stopping a virtual machine","text":"
Virtual machines, in contrast to VirtualMachineInstances, have a running state. Thus, on a VirtualMachine you can define whether it should be running or not. VirtualMachineInstances, if they are defined in the cluster, are always running and consuming resources.
virtctl is used in order to start and stop a VirtualMachine:
$ virtctl start my-vm\n$ virtctl stop my-vm\n
Note: You can force stop a VM (which is like pulling the power cord, with all its implications like data inconsistencies or [in the worst case] data loss) by
$ virtctl stop my-vm --grace-period 0 --force\n
"},{"location":"user_workloads/lifecycle/#pausing-and-unpausing-a-virtual-machine","title":"Pausing and unpausing a virtual machine","text":"
Note: Pausing in this context refers to libvirt's virDomainSuspend command: \"The process is frozen without further access to CPU resources and I/O but the memory used by the domain at the hypervisor level will stay allocated\"
To pause a virtual machine, you need the virtctl command line tool. Its pause command works on either VirtualMachines or VirtualMachineInstances:
$ virtctl pause vm testvm\n# OR\n$ virtctl pause vmi testvm\n
Paused VMIs have a Paused condition in their status:
$ kubectl get vmi testvm -o=jsonpath='{.status.conditions[?(@.type==\"Paused\")].message}'\nVMI was paused by user\n
Unpausing works similarly to pausing:
$ virtctl unpause vm testvm\n# OR\n$ virtctl unpause vmi testvm\n
"},{"location":"user_workloads/liveness_and_readiness_probes/","title":"Liveness and Readiness Probes","text":"
Liveness and Readiness Probes can be configured for VirtualMachineInstances in much the same way as they are configured for containers.
Liveness Probes will effectively stop the VirtualMachineInstance if they fail, which will allow higher level controllers, like VirtualMachine or VirtualMachineInstanceReplicaSet to spawn new instances, which will hopefully be responsive again.
Readiness Probes are an indicator for Services and Endpoints if the VirtualMachineInstance is ready to receive traffic from Services. If Readiness Probes fail, the VirtualMachineInstance will be removed from the Endpoints which back services until the probe recovers.
Watchdogs focus on ensuring that an Operating System is still responsive. They complement the probes which are more workload centric. Watchdogs require kernel support from the guest and additional tooling like the commonly used watchdog binary.
Exec probes are Liveness or Readiness probes specifically intended for VMs. These probes run a command inside the VM and determine the VM ready/live state based on its success. For running commands inside the VMs, the qemu-guest-agent package is used. A command supplied to an exec probe will be wrapped by virt-probe in the operator and forwarded to the guest.
"},{"location":"user_workloads/liveness_and_readiness_probes/#define-a-http-liveness-probe","title":"Define a HTTP Liveness Probe","text":"
The following VirtualMachineInstance configures a HTTP Liveness Probe via spec.livenessProbe.httpGet, which will query port 1500 of the VirtualMachineInstance, after an initial delay of 120 seconds. The VirtualMachineInstance itself installs and runs a minimal HTTP server on port 1500 via cloud-init.
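A minimal sketch of such a manifest is shown below. The probe port and the initial delay follow the description above. The VMI name, the container disk image, the periodSeconds and timeoutSeconds values, and the python3-based HTTP server started through cloud-init are assumptions chosen for illustration, not the original example.

```shell
# Write a VMI manifest with an HTTP liveness probe on port 1500.
# Assumptions: name, image, probe timings (other than the 120s delay)
# and the python3 HTTP server are illustrative choices.
cat << 'END' > vmi-http-probe.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-http-probe
spec:
  terminationGracePeriodSeconds: 0
  domain:
    resources:
      requests:
        memory: 1024M
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
      - name: cloudinitdisk
        disk:
          bus: virtio
  livenessProbe:
    initialDelaySeconds: 120
    periodSeconds: 20
    timeoutSeconds: 10
    httpGet:
      port: 1500
  volumes:
  - name: containerdisk
    containerDisk:
      image: quay.io/containerdisks/fedora:latest
  - name: cloudinitdisk
    cloudInitNoCloud:
      userData: |
        #cloud-config
        password: fedora
        chpasswd: { expire: False }
        runcmd:
          - systemd-run --unit=httpserver python3 -m http.server 1500
END

# Submit it with: kubectl create -f vmi-http-probe.yaml
```

If the HTTP server inside the guest stops answering, the probe fails and the VirtualMachineInstance is stopped, as described at the top of this page.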
"},{"location":"user_workloads/liveness_and_readiness_probes/#define-a-tcp-liveness-probe","title":"Define a TCP Liveness Probe","text":"
The following VirtualMachineInstance configures a TCP Liveness Probe via spec.livenessProbe.tcpSocket, which will query port 1500 of the VirtualMachineInstance, after an initial delay of 120 seconds. The VirtualMachineInstance itself installs and runs a minimal HTTP server on port 1500 via cloud-init.
Note that in the case of Readiness Probes, it is also possible to set a failureThreshold and a successThreshold to only flip between ready and non-ready state if the probe succeeded or failed multiple times.
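As a sketch of how the thresholds combine with a TCP probe, the manifest below uses a readinessProbe with tcpSocket on port 1500. The threshold values of 3, the VMI name, the image and the python3-based server are assumptions for illustration only.

```shell
# Write a VMI manifest with a TCP readiness probe that needs three
# consecutive results before the ready state flips.
# Assumptions: name, image, timings and threshold values are illustrative.
cat << 'END' > vmi-tcp-readiness.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-tcp-readiness
spec:
  terminationGracePeriodSeconds: 0
  domain:
    resources:
      requests:
        memory: 1024M
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
      - name: cloudinitdisk
        disk:
          bus: virtio
  readinessProbe:
    initialDelaySeconds: 120
    periodSeconds: 20
    timeoutSeconds: 10
    failureThreshold: 3
    successThreshold: 3
    tcpSocket:
      port: 1500
  volumes:
  - name: containerdisk
    containerDisk:
      image: quay.io/containerdisks/fedora:latest
  - name: cloudinitdisk
    cloudInitNoCloud:
      userData: |
        #cloud-config
        password: fedora
        chpasswd: { expire: False }
        runcmd:
          - systemd-run --unit=httpserver python3 -m http.server 1500
END

# Submit it with: kubectl create -f vmi-tcp-readiness.yaml
```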
Some context is needed to understand the limitations imposed by a dual-stack network configuration on readiness - or liveness - probes. Users must be fully aware that a dual-stack configuration is currently only available when using a masquerade binding type. Furthermore, it must be recalled that accessing a VM using masquerade binding type is performed via the pod IP address; in dual-stack mode, both IPv4 and IPv6 addresses can be used to reach the VM.
Dual-stack networking configurations have a limitation when using HTTP / TCP probes: you cannot probe the VMI by its IPv6 address. The reason is that the host field of both the HTTP and TCP probe actions defaults to the pod's IP address, which is currently always the IPv4 address.
Since the pod's IP address is not known before creating the VMI, it is not possible to pre-provision the probe's host field.
"},{"location":"user_workloads/liveness_and_readiness_probes/#defining-a-watchdog","title":"Defining a Watchdog","text":"
A watchdog is a more VM-centric approach that focuses on the responsiveness of the Operating System. One can configure the i6300esb watchdog device:
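A minimal sketch of a VirtualMachineInstance with such a device follows. The watchdog stanza under spec.domain.devices is the relevant part; the VMI name, the watchdog name mywatchdog, the memory request and the image are assumptions for illustration.

```shell
# Write a VMI manifest that attaches an i6300esb watchdog device.
# Assumptions: names, memory size and image are illustrative.
cat << 'END' > vmi-watchdog.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-with-watchdog
spec:
  domain:
    resources:
      requests:
        memory: 1024M
    devices:
      watchdog:
        name: mywatchdog
        i6300esb:
          action: poweroff
      disks:
      - name: containerdisk
        disk:
          bus: virtio
  volumes:
  - name: containerdisk
    containerDisk:
      image: quay.io/containerdisks/fedora:latest
END

# Submit it with: kubectl create -f vmi-watchdog.yaml
```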
The example above configures it with the poweroff action. It defines what will happen if the OS can't respond anymore. Other possible actions are reset and shutdown. The VM in this example will have the device exposed as /dev/watchdog. This device can then be used by the watchdog binary. For example, if root runs the watchdog binary inside the VM with a heartbeat interval of two seconds and a timeout of four seconds, the watchdog will send a heartbeat every two seconds to /dev/watchdog, and after four seconds without a heartbeat the defined action will be executed. In this case, that is a hard poweroff.
Guest-Agent probes are based on the qemu-guest-agent guest-ping command, which pings the guest and returns an error if the guest is not up and running. To define this in a VM spec, specify guestAgentPing: {} in the VM's spec.template.spec.readinessProbe. virt-controller will translate this into a corresponding command wrapped by virt-probe.
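The following sketch shows where guestAgentPing sits in a VirtualMachine manifest. The VM name, the image, and the delay, period and threshold values are assumptions for illustration; the guest must have qemu-guest-agent running for the probe to succeed.

```shell
# Write a VM manifest with a guest-agent ping readiness probe.
# Assumptions: name, image and probe timings are illustrative.
cat << 'END' > vm-guest-agent-ping.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-guest-agent-ping
spec:
  running: true
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 1024M
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
      readinessProbe:
        guestAgentPing: {}
        initialDelaySeconds: 120
        periodSeconds: 10
        failureThreshold: 10
      volumes:
      - name: containerdisk
        containerDisk:
          image: quay.io/containerdisks/fedora:latest
END

# Submit it with: kubectl create -f vm-guest-agent-ping.yaml
```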
Note: You can define only one type of guest-agent probe, i.e. either an exec probe or a ping probe.
Important: If the qemu-guest-agent is not installed and enabled inside the VM, the probe will fail. Many images don't enable the agent by default so make sure you either run one that does or enable it.
Make sure to provide enough delay and failureThreshold for the VM and the agent to be online.
In the following example, the Fedora image has qemu-guest-agent available by default. Nevertheless, in case qemu-guest-agent is not installed, cloud-init installs and enables it. Cloud-init also assigns the proper SELinux context, i.e. virt_qemu_ga_exec_t, to the /tmp/healthy.txt file. Otherwise, SELinux will deny the attempts to open the /tmp/healthy.txt file, causing the probe to fail.
Note that if SELinux is not installed in your container disk image, the chcon command should be removed from the VM manifest. Otherwise, the chcon command will fail.
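The setup described above can be sketched as follows. The probe command, the /tmp/healthy.txt file and the virt_qemu_ga_exec_t context come from the text; the VM name, the image, the password and the probe timings are assumptions for illustration.

```shell
# Write a VM manifest with a guest-agent exec readiness probe.
# cloud-init installs and enables qemu-guest-agent, creates the file the
# probe reads, and labels it so SELinux lets the agent open it.
# Assumptions: name, image, password and probe timings are illustrative.
cat << 'END' > vm-exec-probe.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-exec-probe
spec:
  running: true
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 1024M
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
          - name: cloudinitdisk
            disk:
              bus: virtio
      readinessProbe:
        exec:
          command:
          - cat
          - /tmp/healthy.txt
        initialDelaySeconds: 20
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 10
      terminationGracePeriodSeconds: 180
      volumes:
      - name: containerdisk
        containerDisk:
          image: quay.io/containerdisks/fedora:latest
      - name: cloudinitdisk
        cloudInitNoCloud:
          userData: |
            #cloud-config
            password: fedora
            chpasswd: { expire: False }
            packages:
              - qemu-guest-agent
            runcmd:
              - touch /tmp/healthy.txt
              - chcon -t virt_qemu_ga_exec_t /tmp/healthy.txt
              - systemctl enable --now qemu-guest-agent
END

# Submit it with: kubectl create -f vm-exec-probe.yaml
```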
The .status.ready field will switch to true indicating that probes are returning successfully:
A VirtualMachinePool tries to ensure that a specified number of VirtualMachine replicas and their respective VirtualMachineInstances are in the ready state at any time. In other words, a VirtualMachinePool makes sure that a VirtualMachine or a set of VirtualMachines is always up and ready.
No state is kept and no guarantees are made about the maximum number of VirtualMachineInstance replicas running at any time. For example, the VirtualMachinePool may decide to create new replicas if possibly still running VMs are entering an unknown state.
The VirtualMachinePool allows us to specify a VirtualMachineTemplate in spec.virtualMachineTemplate. It consists of ObjectMetadata in spec.virtualMachineTemplate.metadata, and a VirtualMachineSpec in spec.virtualMachineTemplate.spec. The specification of the virtual machine is equal to the specification of the virtual machine in the VirtualMachine workload.
spec.replicas can be used to specify how many replicas are wanted. If unspecified, the default value is 1. This value can be updated anytime. The controller will react to the changes.
spec.selector is used by the controller to keep track of managed virtual machines. The selector specified there must be able to match the virtual machine labels as specified in spec.virtualMachineTemplate.metadata.labels. If the selector does not match these labels, or they are empty, the controller will simply do nothing except log an error. The user is responsible for avoiding the creation of other virtual machines or VirtualMachinePools which may conflict with the selector and the template labels.
"},{"location":"user_workloads/pool/#creating-a-virtualmachinepool","title":"Creating a VirtualMachinePool","text":"
VirtualMachinePool is part of the KubeVirt API pool.kubevirt.io/v1alpha1.
The example below shows how to create a simple VirtualMachinePool:
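A manifest along these lines can be reconstructed from the kubectl describe output shown further down. It is a sketch: the field values are taken from that output, and nothing else about the original example is assumed.

```shell
# Write a VirtualMachinePool manifest with three replicas.
# The selector must match the labels in virtualMachineTemplate.metadata.labels.
cat << 'END' > vm-pool-cirros.yaml
apiVersion: pool.kubevirt.io/v1alpha1
kind: VirtualMachinePool
metadata:
  name: vm-pool-cirros
spec:
  replicas: 3
  selector:
    matchLabels:
      kubevirt.io/vmpool: vm-pool-cirros
  virtualMachineTemplate:
    metadata:
      labels:
        kubevirt.io/vmpool: vm-pool-cirros
    spec:
      running: true
      template:
        metadata:
          labels:
            kubevirt.io/vmpool: vm-pool-cirros
        spec:
          domain:
            devices:
              disks:
              - name: containerdisk
                disk:
                  bus: virtio
            resources:
              requests:
                memory: 128Mi
          terminationGracePeriodSeconds: 0
          volumes:
          - name: containerdisk
            containerDisk:
              image: kubevirt/cirros-container-disk-demo:latest
END
```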
Saving this manifest into vm-pool-cirros.yaml and submitting it to Kubernetes will create three virtual machines based on the template.
$ kubectl create -f vm-pool-cirros.yaml\nvirtualmachinepool.pool.kubevirt.io/vm-pool-cirros created\n$ kubectl describe vmpool vm-pool-cirros\nName: vm-pool-cirros\nNamespace: default\nLabels: <none>\nAnnotations: <none>\nAPI Version: pool.kubevirt.io/v1alpha1\nKind: VirtualMachinePool\nMetadata:\n Creation Timestamp: 2023-02-09T18:30:08Z\n Generation: 1\n Manager: kubectl-create\n Operation: Update\n Time: 2023-02-09T18:30:08Z\n API Version: pool.kubevirt.io/v1alpha1\n Fields Type: FieldsV1\n fieldsV1:\n f:status:\n .:\n f:labelSelector:\n f:readyReplicas:\n f:replicas:\n Manager: virt-controller\n Operation: Update\n Subresource: status\n Time: 2023-02-09T18:30:44Z\n Resource Version: 6606\n UID: ba51daf4-f99f-433c-89e5-93f39bc9989d\nSpec:\n Replicas: 3\n Selector:\n Match Labels:\n kubevirt.io/vmpool: vm-pool-cirros\n Virtual Machine Template:\n Metadata:\n Creation Timestamp: <nil>\n Labels:\n kubevirt.io/vmpool: vm-pool-cirros\n Spec:\n Running: true\n Template:\n Metadata:\n Creation Timestamp: <nil>\n Labels:\n kubevirt.io/vmpool: vm-pool-cirros\n Spec:\n Domain:\n Devices:\n Disks:\n Disk:\n Bus: virtio\n Name: containerdisk\n Resources:\n Requests:\n Memory: 128Mi\n Termination Grace Period Seconds: 0\n Volumes:\n Container Disk:\n Image: kubevirt/cirros-container-disk-demo:latest\n Name: containerdisk\nStatus:\n Label Selector: kubevirt.io/vmpool=vm-pool-cirros\n Ready Replicas: 2\n Replicas: 3\nEvents:\n Type Reason Age From Message\n ---- ------ ---- ---- -------\n Normal SuccessfulCreate 17s virtualmachinepool-controller Created VM default/vm-pool-cirros-0\n Normal SuccessfulCreate 17s virtualmachinepool-controller Created VM default/vm-pool-cirros-2\n Normal SuccessfulCreate 17s virtualmachinepool-controller Created VM default/vm-pool-cirros-1\n
Replicas is 3 and Ready Replicas is 2. This means that at the moment when showing the status, three Virtual Machines were already created, but only two are running and ready.
"},{"location":"user_workloads/pool/#scaling-via-the-scale-subresource","title":"Scaling via the Scale Subresource","text":"
Note: This requires KubeVirt 0.59 or newer.
The VirtualMachinePool supports the scale subresource. As a consequence, it can be scaled via kubectl, for example with kubectl scale vmpool vm-pool-cirros --replicas 5.
"},{"location":"user_workloads/pool/#removing-a-virtualmachine-from-virtualmachinepool","title":"Removing a VirtualMachine from VirtualMachinePool","text":"
It is also possible to remove a VirtualMachine from its VirtualMachinePool.
In this scenario, the ownerReferences need to be removed from the VirtualMachine. This can be achieved with either kubectl edit or kubectl patch. Using kubectl patch, it would look like:
kubectl patch vm vm-pool-cirros-0 --type merge --patch '{\"metadata\":{\"ownerReferences\":null}}'\n
Note: You may want to update your VirtualMachine labels as well to avoid impact on selectors.
"},{"location":"user_workloads/pool/#using-the-horizontal-pod-autoscaler","title":"Using the Horizontal Pod Autoscaler","text":"
Note: This requires KubeVirt 0.59 or newer.
The HorizontalPodAutoscaler (HPA) can be used with a VirtualMachinePool. Simply reference it in the spec of the autoscaler:
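A sketch of such an autoscaler is shown below, using the autoscaling/v1 API and targeting the vm-pool-cirros pool from the earlier example. The replica bounds and the CPU utilization target are assumptions for illustration.

```shell
# Write an HPA manifest whose scaleTargetRef points at a VirtualMachinePool.
# Assumptions: min/max replicas and the CPU target are illustrative.
cat << 'END' > vm-pool-cirros-hpa.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: vm-pool-cirros
spec:
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
  scaleTargetRef:
    apiVersion: pool.kubevirt.io/v1alpha1
    kind: VirtualMachinePool
    name: vm-pool-cirros
END

# Submit it with: kubectl create -f vm-pool-cirros-hpa.yaml
```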
"},{"location":"user_workloads/pool/#exposing-a-virtualmachinepool-as-a-service","title":"Exposing a VirtualMachinePool as a Service","text":"
A VirtualMachinePool may be exposed as a service. When this is done, one of the VirtualMachine replicas will be picked for the actual delivery of the service.
For example, exposing SSH port (22) as a ClusterIP service:
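A sketch of a Service manifest for this case follows. The ports match the description on this page and the selector reuses the kubevirt.io/vmpool label from the pool example; the Service name is an assumption.

```shell
# Write a ClusterIP Service that listens on 2222 and forwards to port 22
# of the VMs selected by the pool label.
# Assumption: the Service name is illustrative.
cat << 'END' > vm-pool-cirros-ssh.yaml
apiVersion: v1
kind: Service
metadata:
  name: vm-pool-cirros-ssh
spec:
  type: ClusterIP
  selector:
    kubevirt.io/vmpool: vm-pool-cirros
  ports:
  - protocol: TCP
    port: 2222
    targetPort: 22
END
```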
Saving this manifest into vm-pool-cirros-ssh.yaml and submitting it to Kubernetes will create the ClusterIP service listening on port 2222 and forwarding to port 22.
Using DataVolumeTemplates within a spec.virtualMachineTemplate.spec will result in the creation of unique persistent storage for each VM within a VMPool. The DataVolumeTemplate name will have the VM's sequential postfix appended to it when the VM is created from the spec.virtualMachineTemplate.spec.dataVolumeTemplates. This makes each VM a completely unique stateful workload.
"},{"location":"user_workloads/pool/#using-unique-cloudinit-and-configmap-volumes-with-virtualmachinepools","title":"Using Unique CloudInit and ConfigMap Volumes with VirtualMachinePools","text":"
By default, any secret or configMap references in a spec.virtualMachineTemplate.spec.template Volume section will be used as is, without any modification to the naming. This means that if you specify a secret in a CloudInitNoCloud volume, every VM instance spawned from the VirtualMachinePool with this volume will get the exact same secret for its cloud-init user data.
This default behavior can be modified by setting the AppendPostfixToSecretReferences and AppendPostfixToConfigMapReferences booleans to true on the VMPool spec. When these booleans are enabled, references to secret and configMap names will have the VM's sequential postfix appended to the secret and configmap name. This allows someone to pre-generate unique per VM secret and configMap data for a VirtualMachinePool ahead of time in a way that will be predictably assigned to VMs within the VirtualMachinePool.
VirtualMachineInstancePresets are deprecated as of the v0.57.0 release and will be removed in a future release.
Users should instead look to use Instancetypes and preferences as a replacement.
VirtualMachineInstancePresets are an extension to general VirtualMachineInstance configuration behaving much like PodPresets from Kubernetes. When a VirtualMachineInstance is created, any applicable VirtualMachineInstancePresets will be applied to the existing spec for the VirtualMachineInstance. This allows for re-use of common settings that should apply to multiple VirtualMachineInstances.
"},{"location":"user_workloads/presets/#create-a-virtualmachineinstancepreset","title":"Create a VirtualMachineInstancePreset","text":"
You can describe a VirtualMachineInstancePreset in a YAML file. For example, the vmi-preset.yaml file below describes a VirtualMachineInstancePreset that requests a VirtualMachineInstance be created with a resource request for 64M of RAM.
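A sketch of what vmi-preset.yaml could look like is given below. The 64M memory request comes from the description above; the preset name and the kubevirt.io/size label used in the selector are assumptions for illustration.

```shell
# Write a VirtualMachineInstancePreset that requests 64M of RAM for
# every VMI carrying the selected label.
# Assumptions: the preset name and the selector label are illustrative.
cat << 'END' > vmi-preset.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstancePreset
metadata:
  name: small-qemu
spec:
  selector:
    matchLabels:
      kubevirt.io/size: small
  domain:
    resources:
      requests:
        memory: 64M
END

# Submit it with: kubectl create -f vmi-preset.yaml
```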
As with most Kubernetes resources, a VirtualMachineInstancePreset requires apiVersion, kind and metadata fields.
VirtualMachineInstancePresets also need a spec section. While not technically required to satisfy syntax, it is strongly recommended to include a Selector in the spec section; otherwise, a VirtualMachineInstancePreset will match all VirtualMachineInstances in a namespace.
KubeVirt uses Kubernetes Labels and Selectors to determine which VirtualMachineInstancePresets apply to a given VirtualMachineInstance, similarly to how PodPresets work in Kubernetes. If a setting from a VirtualMachineInstancePreset is applied to a VirtualMachineInstance, the VirtualMachineInstance will be marked with an Annotation upon completion.
Any domain structure can be listed in the spec of a VirtualMachineInstancePreset, e.g. Clock, Features, Memory, CPU, or Devices such as network interfaces. All elements of the spec section of a VirtualMachineInstancePreset will be applied to the VirtualMachineInstance.
Once a VirtualMachineInstancePreset is successfully applied to a VirtualMachineInstance, the VirtualMachineInstance will be marked with an annotation to indicate that it was applied. If a conflict occurs while a VirtualMachineInstancePreset is being applied, that portion of the VirtualMachineInstancePreset will be skipped.
Any valid Label can be matched against, but as a rule of thumb it is suggested to use os/shortname, e.g. kubevirt.io/os: rhel7.
"},{"location":"user_workloads/presets/#updating-a-virtualmachineinstancepreset","title":"Updating a VirtualMachineInstancePreset","text":"
If a VirtualMachineInstancePreset is modified, changes will not be applied to existing VirtualMachineInstances. This applies to both the Selector indicating which VirtualMachineInstances should be matched, and also the Domain section which lists the settings that should be applied to a VirtualMachine.
VirtualMachineInstancePresets use a similar conflict resolution strategy to Kubernetes PodPresets. If a portion of the domain spec is present in both a VirtualMachineInstance and a VirtualMachineInstancePreset and both resources have the identical information, then creation of the VirtualMachineInstance will continue normally. If however there is a difference between the resources, an Event will be created indicating which DomainSpec element of which VirtualMachineInstancePreset was overridden. For example: If both the VirtualMachineInstance and VirtualMachineInstancePreset define a CPU, but use a different number of Cores, KubeVirt will note the difference.
If any settings from the VirtualMachineInstancePreset were successfully applied, the VirtualMachineInstance will be annotated.
In the event that there is a difference between the Domains of a VirtualMachineInstance and VirtualMachineInstancePreset, KubeVirt will create an Event. kubectl get events can be used to show all Events. For example:
$ kubectl get events\n ....\n Events:\n FirstSeen LastSeen Count From SubobjectPath Reason Message\n 2m 2m 1 myvmi.1515bbb8d397f258 VirtualMachineInstance Warning Conflict virtualmachineinstance-preset-controller Unable to apply VirtualMachineInstancePreset 'example-preset': spec.cpu: &{6} != &{4}\n
When multiple VirtualMachineInstancePresets match a particular VirtualMachineInstance, if they specify the same settings within a Domain, those settings must match. If two VirtualMachineInstancePresets have conflicting settings (e.g. for the number of CPU cores requested), an error will occur, and the VirtualMachineInstance will enter the Failed state, and a Warning event will be emitted explaining which settings of which VirtualMachineInstancePresets were problematic.
The main use case for VirtualMachineInstancePresets is to create re-usable settings that can be applied across various machines. Multiple methods are available to match the labels of a VirtualMachineInstance using selectors.
matchLabels: Each VirtualMachineInstance can use a specific label shared by all instances.
matchExpressions: Logical operators for sets can be used to match multiple labels.
Using matchLabels, the label used in the VirtualMachineInstancePreset must match one of the labels of the VirtualMachineInstance:
Since VirtualMachineInstancePresets use Selectors that indicate which VirtualMachineInstances their settings should apply to, there needs to exist a mechanism by which VirtualMachineInstances can opt out of VirtualMachineInstancePresets altogether. This is done using an annotation:
This is an example of a merge conflict. In this case, the VirtualMachineInstance and the VirtualMachineInstancePreset request a different number of CPUs.
"},{"location":"user_workloads/presets/#matching-multiple-virtualmachineinstances-using-matchlabels","title":"Matching Multiple VirtualMachineInstances Using MatchLabels","text":"
These VirtualMachineInstances have multiple labels, one that is unique and one that is shared.
Note: This example breaks from the convention of using os-shortname as a Label for demonstration purposes.
"},{"location":"user_workloads/presets/#matching-multiple-virtualmachineinstances-using-matchexpressions","title":"Matching Multiple VirtualMachineInstances Using MatchExpressions","text":"
This VirtualMachineInstancePreset has a matchExpression that will match two labels: kubevirt.io/os: win10 and kubevirt.io/os: win7.
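A sketch of a preset with such a matchExpression is shown below. The label key and the two values come from the sentence above; the preset name and the memory request are assumptions for illustration.

```shell
# Write a preset whose selector matches VMIs labelled with
# kubevirt.io/os set to either win10 or win7.
# Assumptions: the preset name and the memory request are illustrative.
cat << 'END' > vmi-preset-windows.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstancePreset
metadata:
  name: windows-preset
spec:
  selector:
    matchExpressions:
    - key: kubevirt.io/os
      operator: In
      values:
      - win10
      - win7
  domain:
    resources:
      requests:
        memory: 128M
END

# Submit it with: kubectl create -f vmi-preset-windows.yaml
```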
A VirtualMachineInstanceReplicaSet tries to ensure that a specified number of VirtualMachineInstance replicas are running at any time. In other words, a VirtualMachineInstanceReplicaSet makes sure that a VirtualMachineInstance or a homogeneous set of VirtualMachineInstances is always up and ready. It is very similar to a Kubernetes ReplicaSet.
No state is kept and no guarantees are made about the maximum number of VirtualMachineInstance replicas that are up at any time. For example, the VirtualMachineInstanceReplicaSet may decide to create new replicas if possibly still running VMs are entering an unknown state.
The VirtualMachineInstanceReplicaSet allows us to specify a VirtualMachineInstanceTemplate in spec.template. It consists of ObjectMetadata in spec.template.metadata, and a VirtualMachineInstanceSpec in spec.template.spec. The specification of the virtual machine is equal to the specification of the virtual machine in the VirtualMachineInstance workload.
spec.replicas can be used to specify how many replicas are wanted. If unspecified, the default value is 1. This value can be updated anytime. The controller will react to the changes.
spec.selector is used by the controller to keep track of managed virtual machines. The selector specified there must be able to match the virtual machine labels as specified in spec.template.metadata.labels. If the selector does not match these labels, or they are empty, the controller will simply do nothing except log an error. The user is responsible for not creating other virtual machines or VirtualMachineInstanceReplicaSets which conflict with the selector and the template labels.
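The fields described above fit together as in the following sketch. The replica set name myvmirs matches the scale example later on this page; the myvmi label, the memory request and the image are assumptions for illustration.

```shell
# Write a VirtualMachineInstanceReplicaSet manifest with three replicas.
# spec.selector.matchLabels must match spec.template.metadata.labels.
# Assumptions: label, memory size and image are illustrative.
cat << 'END' > vmi-replicaset-cirros.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceReplicaSet
metadata:
  name: myvmirs
spec:
  replicas: 3
  selector:
    matchLabels:
      myvmi: myvmi
  template:
    metadata:
      labels:
        myvmi: myvmi
    spec:
      domain:
        resources:
          requests:
            memory: 64M
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
      volumes:
      - name: containerdisk
        containerDisk:
          image: kubevirt/cirros-container-disk-demo:latest
END

# Submit it with: kubectl create -f vmi-replicaset-cirros.yaml
```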
"},{"location":"user_workloads/replicaset/#exposing-a-virtualmachineinstancereplicaset-as-a-service","title":"Exposing a VirtualMachineInstanceReplicaSet as a Service","text":"
A VirtualMachineInstanceReplicaSet can be exposed as a service. When this is done, one of the VirtualMachineInstance replicas will be picked for the actual delivery of the service.
For example, exposing SSH port (22) as a ClusterIP service using virtctl on a VirtualMachineInstanceReplicaSet:
All service exposure options that apply to a VirtualMachineInstance apply to a VirtualMachineInstanceReplicaSet. See Exposing VirtualMachineInstance for more details.
"},{"location":"user_workloads/replicaset/#when-to-use-a-virtualmachineinstancereplicaset","title":"When to use a VirtualMachineInstanceReplicaSet","text":"
Note: The base assumption is that referenced disks are read-only or that the VMIs are writing internally to a tmpfs. The most obvious volume sources for VirtualMachineInstanceReplicaSets which KubeVirt supports are referenced below. If other types are used data corruption is possible.
Using VirtualMachineInstanceReplicaSet is the right choice when one wants many identical VMs and does not care about maintaining any disk state after the VMs are terminated.
Volume types which work well in combination with a VirtualMachineInstanceReplicaSet are:
cloudInitNoCloud
ephemeral
containerDisk
emptyDisk
configMap
secret
any other type, if the VMI writes internally to a tmpfs
This use-case involves small and fast booting VMs with little provisioning performed during initialization.
In this scenario, migrations are not important. Redistributing VM workloads between Nodes can be achieved simply by deleting managed VirtualMachineInstances which are running on an overloaded Node. The eviction of such a VirtualMachineInstance can happen by directly deleting the VirtualMachineInstance (KubeVirt aware workload redistribution) or by deleting the corresponding Pod in which the Virtual Machine runs (Only Kubernetes aware workload redistribution).
In this use-case one has big and slow booting VMs, and complex or resource intensive provisioning is done during boot. More specifically, the timespan between the creation of a new VM and it entering the ready state is long.
In this scenario, one still does not care about the state, but since re-provisioning VMs is expensive, migrations are important. Workload redistribution between Nodes can be achieved by migrating VirtualMachineInstances to different Nodes. A workload redistributor needs to be aware of KubeVirt and create migrations, instead of evicting VirtualMachineInstances by deletion.
Note: The simplest way to have a migratable ephemeral VirtualMachineInstance is to use local storage based on ContainerDisks in combination with a file-based backing store. However, migratable backing store support has not officially landed yet in KubeVirt and is untested.
Replicas is 3 and Ready Replicas is 2. This means that at the moment when showing the status, three Virtual Machines were already created, but only two are running and ready.
"},{"location":"user_workloads/replicaset/#scaling-via-the-scale-subresource","title":"Scaling via the Scale Subresource","text":"
Note: This requires the CustomResourceSubresources feature gate to be enabled for Kubernetes clusters prior to 1.11.
The VirtualMachineInstanceReplicaSet supports the scale subresource. As a consequence it is possible to scale it via kubectl:
$ kubectl scale vmirs myvmirs --replicas 5\n
"},{"location":"user_workloads/replicaset/#using-the-horizontal-pod-autoscaler","title":"Using the Horizontal Pod Autoscaler","text":"
Note: This requires a Kubernetes cluster at version 1.11 or newer.
The HorizontalPodAutoscaler (HPA) can be used with a VirtualMachineInstanceReplicaSet. Simply reference it in the spec of the autoscaler:
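A sketch of such an autoscaler is shown below, using the autoscaling/v1 API and targeting the myvmirs replica set from the scale example. The HPA name, the replica bounds and the CPU utilization target are assumptions for illustration.

```shell
# Write an HPA manifest whose scaleTargetRef points at a
# VirtualMachineInstanceReplicaSet.
# Assumptions: HPA name, min/max replicas and CPU target are illustrative.
cat << 'END' > myvmirs-hpa.yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: myhpa
spec:
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
  scaleTargetRef:
    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstanceReplicaSet
    name: myvmirs
END

# Submit it with: kubectl create -f myvmirs-hpa.yaml
```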
KubeVirt supports the ability to assign a startup script to a VirtualMachineInstance, which is executed automatically when the VM initializes.
These scripts are commonly used to automate injection of users and SSH keys into VMs in order to provide remote access to the machine. For example, a startup script can be used to inject credentials into a VM that allows an Ansible job running on a remote host to access and provision the VM.
Startup scripts are not limited to any specific use case though. They can be used to run any arbitrary script in a VM on boot.
cloud-init is a widely adopted project used for early initialization of a VM. Used by cloud providers such as AWS and GCP, cloud-init has established itself as the de facto method of providing startup scripts to VMs.
Cloud-init documentation can be found here: Cloud-init Documentation.
KubeVirt supports cloud-init's NoCloud and ConfigDrive datasources which involve injecting startup scripts into a VM instance through the use of an ephemeral disk. VMs with the cloud-init package installed will detect the ephemeral disk and execute custom userdata scripts at boot.
Ignition is an alternative to cloud-init which allows for configuring the VM disk on first boot. You can find the Ignition documentation here. You can also find a comparison between cloud-init and Ignition here.
Ignition can be used with KubeVirt by using the cloudInitConfigDrive volume.
We need to make sure the base VM does not restart, which can be done by setting the VM run strategy to RerunOnFailure.
VM runStrategy:
spec:\n  runStrategy: RerunOnFailure\n
More information can be found here:
Sysprep Process Overview
Sysprep (Generalize) a Windows installation
Note
It is important that no answer file is detected when the Sysprep Tool is triggered, because Windows Setup searches for answer files at the beginning of each configuration pass and caches them. If that happens, the OS will simply use the cached answer file when it starts, ignoring the one we provide through the Sysprep API. More information can be found here.
Provide an answer file named autounattend.xml on an attached medium. The answer file can be provided in a ConfigMap or a Secret with the key autounattend.xml.
The configuration file can be generated with Windows SIM or it can be specified manually according to the information found here:
Answer files (unattend.xml)
Answer File Reference
Answer File Components Reference
Note
There are also many easy to find online tools available for creating an answer file.
KubeVirt supports the cloud-init NoCloud and ConfigDrive data sources which involve injecting startup scripts through the use of a disk attached to the VM.
In order to assign a custom userdata script to a VirtualMachineInstance using this method, users must define a disk and a volume for the NoCloud or ConfigDrive datasource in the VirtualMachineInstance's spec.
Under most circumstances users should stick to the NoCloud data source, as it is the simplest cloud-init data source. Only if NoCloud is not supported by the cloud-init implementation (e.g. coreos-cloudinit) should users switch the data source to ConfigDrive.
Switching the cloud-init data source to ConfigDrive is as easy as changing the volume type in the VirtualMachineInstance's spec from cloudInitNoCloud to cloudInitConfigDrive.
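As a sketch of that switch, the manifest below is identical in shape to a NoCloud one except for the volume type. The VMI name, the image and the placeholder key are assumptions for illustration.

```shell
# Write a VMI manifest that delivers user data through the ConfigDrive
# data source: only the volume type differs from a NoCloud manifest.
# Assumptions: name, image and the placeholder SSH key are illustrative.
cat << 'END' > my-vmi-configdrive.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: myvmi-configdrive
spec:
  terminationGracePeriodSeconds: 5
  domain:
    resources:
      requests:
        memory: 64M
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
      - name: cloudinitdisk
        disk:
          bus: virtio
  volumes:
  - name: containerdisk
    containerDisk:
      image: kubevirt/cirros-container-disk-demo:latest
  - name: cloudinitdisk
    cloudInitConfigDrive:
      userData: |
        #cloud-config
        ssh_authorized_keys:
          - ssh-rsa AAAAB3NzaK8L93bWxnyp test@test.com
END

# Submit it with: kubectl create -f my-vmi-configdrive.yaml
```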
Note: The MAC address of the secondary interface should be predefined and identical in the network interface and the cloud-init networkData.
See the examples below for more complete cloud-init examples.
"},{"location":"user_workloads/startup_scripts/#cloud-init-user-data-as-clear-text","title":"Cloud-init user-data as clear text","text":"
In the example below, an SSH key is stored in the cloudInitNoCloud Volume's userData field as clear text. There is a corresponding disks entry that references the cloud-init volume and assigns it to the VM's device.
# Create a VM manifest with the SSH key stored in\n# a cloudInitNoCloud volume's userData field.\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userData: |\n          #cloud-config\n          ssh_authorized_keys:\n            - ssh-rsa AAAAB3NzaK8L93bWxnyp test@test.com\n\nEND\n\n# Post the Virtual Machine spec to KubeVirt.\n\nkubectl create -f my-vmi.yaml\n
"},{"location":"user_workloads/startup_scripts/#cloud-init-user-data-as-base64-string","title":"Cloud-init user-data as base64 string","text":"
In the example below, a simple bash script is base64 encoded and stored in the cloudInitNoCloud Volume's userDataBase64 field. There is a corresponding disks entry that references the cloud-init volume and assigns it to the VM's device.
Users also have the option of storing the startup script in a Kubernetes Secret and referencing the Secret in the VM's spec. Examples further down in the document illustrate how that is done.
# Create a simple startup script\n\ncat << END > startup-script.sh\n#!/bin/bash\necho \"Hi from startup script!\"\nEND\n\n# Create a VM manifest with the startup script base64 encoded into\n# a cloudInitNoCloud volume's userDataBase64 field.\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userDataBase64: $(cat startup-script.sh | base64 -w0)\nEND\n\n# Post the Virtual Machine spec to KubeVirt.\n\nkubectl create -f my-vmi.yaml\n
"},{"location":"user_workloads/startup_scripts/#cloud-init-userdata-as-k8s-secret","title":"Cloud-init UserData as k8s Secret","text":"
Users who wish to not store the cloud-init userdata directly in the VirtualMachineInstance spec have the option to store the userdata into a Kubernetes Secret and reference that Secret in the spec.
Multiple VirtualMachineInstance specs can reference the same Kubernetes Secret containing cloud-init userdata.
Below is an example of how to create a Kubernetes Secret containing a startup script and reference that Secret in the VM's spec.
# Create a simple startup script\n\ncat << END > startup-script.sh\n#!/bin/bash\necho \"Hi from startup script!\"\nEND\n\n# Store the startup script in a Kubernetes Secret\nkubectl create secret generic my-vmi-secret --from-file=userdata=startup-script.sh\n\n# Create a VM manifest and reference the Secret's name in the cloudInitNoCloud\n# Volume's secretRef field\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-registry-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        secretRef:\n          name: my-vmi-secret\nEND\n\n# Post the VM\nkubectl create -f my-vmi.yaml\n
"},{"location":"user_workloads/startup_scripts/#injecting-ssh-keys-with-cloud-inits-cloud-config","title":"Injecting SSH keys with Cloud-init's Cloud-config","text":"
In the examples so far, the cloud-init userdata script has been a bash script. Cloud-init has its own configuration that can handle some common tasks such as user creation and SSH key injection.
More cloud-config examples can be found here: Cloud-init Examples
Below is an example of using cloud-config to inject an SSH key for the default user (fedora in this case) of a Fedora Atomic disk image.
# Create the cloud-init cloud-config userdata.\ncat << END > startup-script\n#cloud-config\npassword: atomic\nchpasswd: { expire: False }\nssh_pwauth: False\nssh_authorized_keys:\n  - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6zdgFiLr1uAK7PdcchDd+LseA5fEOcxCCt7TLlr7Mx6h8jUg+G+8L9JBNZuDzTZSF0dR7qwzdBBQjorAnZTmY3BhsKcFr8Gt4KMGrS6r3DNmGruP8GORvegdWZuXgASKVpXeI7nCIjRJwAaK1x+eGHwAWO9Z8ohcboHbLyffOoSZDSIuk2kRIc47+ENRjg0T6x2VRsqX27g6j4DfPKQZGk0zvXkZaYtr1e2tZgqTBWqZUloMJK8miQq6MktCKAS4VtPk0k7teQX57OGwD6D7uo4b+Cl8aYAAwhn0hc0C2USfbuVHgq88ESo2/+NwV4SQcl3sxCW21yGIjAGt4Hy7J fedora@localhost.localdomain\nEND\n\n# Create the VM spec\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: sshvmi\nspec:\n  terminationGracePeriodSeconds: 0\n  domain:\n    resources:\n      requests:\n        memory: 1024M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          dev: vda\n      - name: cloudinitdisk\n        disk:\n          dev: vdb\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/fedora-atomic-registry-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userDataBase64: $(cat startup-script | base64 -w0)\nEND\n\n# Post the VirtualMachineInstance spec to KubeVirt.\nkubectl create -f my-vmi.yaml\n\n# Connect to VM with passwordless SSH key\nssh -i <insert private key here> fedora@<insert ip here>\n
"},{"location":"user_workloads/startup_scripts/#inject-ssh-key-using-a-custom-shell-script","title":"Inject SSH key using a Custom Shell Script","text":"
Depending on the boot image in use, users may have a mixed experience using cloud-init's cloud-config to create users and inject SSH keys.
Below is an example of creating a user and injecting SSH keys for that user using a script instead of cloud-config.
# Create the startup script. The quoted 'END' delimiter keeps $NEW_USER and\n# $SSH_PUB_KEY unexpanded until the script runs inside the guest.\ncat << 'END' > startup-script.sh\n#!/bin/bash\nexport NEW_USER=\"foo\"\nexport SSH_PUB_KEY=\"ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6zdgFiLr1uAK7PdcchDd+LseA5fEOcxCCt7TLlr7Mx6h8jUg+G+8L9JBNZuDzTZSF0dR7qwzdBBQjorAnZTmY3BhsKcFr8Gt4KMGrS6r3DNmGruP8GORvegdWZuXgASKVpXeI7nCIjRJwAaK1x+eGHwAWO9Z8ohcboHbLyffOoSZDSIuk2kRIc47+ENRjg0T6x2VRsqX27g6j4DfPKQZGk0zvXkZaYtr1e2tZgqTBWqZUloMJK8miQq6MktCKAS4VtPk0k7teQX57OGwD6D7uo4b+Cl8aYAAwhn0hc0C2USfbuVHgq88ESo2/+NwV4SQcl3sxCW21yGIjAGt4Hy7J $NEW_USER@localhost.localdomain\"\n\nsudo adduser -U -m $NEW_USER\necho \"$NEW_USER:atomic\" | chpasswd\nsudo mkdir /home/$NEW_USER/.ssh\n# Write the key through tee so that the file is created with root privileges.\necho \"$SSH_PUB_KEY\" | sudo tee /home/$NEW_USER/.ssh/authorized_keys > /dev/null\nsudo chown -R ${NEW_USER}: /home/$NEW_USER/.ssh\nEND\n\n# Create the VM spec\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: sshvmi\nspec:\n  terminationGracePeriodSeconds: 0\n  domain:\n    resources:\n      requests:\n        memory: 1024M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/fedora-atomic-registry-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userDataBase64: $(cat startup-script.sh | base64 -w0)\nEND\n\n# Post the VirtualMachineInstance spec to KubeVirt.\nkubectl create -f my-vmi.yaml\n\n# Connect to VM with passwordless SSH key\nssh -i <insert private key here> foo@<insert ip here>\n
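The startup script above relies on shell variables that must stay unexpanded until the script runs inside the guest. The following is a minimal sketch of how heredoc delimiter quoting affects that; the file names expanded.sh and literal.sh and the value outer are hypothetical and exist only for this illustration.

```shell
# A variable set in the shell that writes the files (hypothetical value).
NEW_USER=outer

# Unquoted delimiter: the current shell expands $NEW_USER while writing the file.
cat << END > expanded.sh
echo $NEW_USER
END

# Quoted delimiter: the text is written verbatim, so $NEW_USER is only
# expanded later, when the generated script itself runs.
cat << 'END' > literal.sh
echo $NEW_USER
END

cat expanded.sh   # prints: echo outer
cat literal.sh    # prints: echo $NEW_USER
```

The same distinction explains why the manifest heredocs in these examples use an unquoted delimiter: there the base64 command substitution is meant to be expanded while the manifest is written.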
A cloud-init network version 1 configuration can be set to configure the network at boot.
Cloud-init user-data must be set for cloud-init to parse network-config even if it is just the user-data config header:
#cloud-config\n
"},{"location":"user_workloads/startup_scripts/#cloud-init-network-config-as-clear-text","title":"Cloud-init network-config as clear text","text":"
In the example below, a simple cloud-init network-config is stored in the cloudInitNoCloud Volume's networkData field as clear text. There is a corresponding disks entry that references the cloud-init volume and assigns it to the VM's device.
# Create a VM manifest with the network-config in\n# a cloudInitNoCloud volume's networkData field.\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userData: \"#cloud-config\"\n        networkData: |\n          network:\n            version: 1\n            config:\n            - type: physical\n              name: eth0\n              subnets:\n              - type: dhcp\n\nEND\n\n# Post the Virtual Machine spec to KubeVirt.\n\nkubectl create -f my-vmi.yaml\n
"},{"location":"user_workloads/startup_scripts/#cloud-init-network-config-as-base64-string","title":"Cloud-init network-config as base64 string","text":"
In the example below, a simple network-config is base64 encoded and stored in the cloudInitNoCloud Volume's networkDataBase64 field. There is a corresponding disks entry that references the cloud-init volume and assigns it to the VM's device.
Users also have the option of storing the network-config in a Kubernetes Secret and referencing the Secret in the VM's spec. Examples further down in the document illustrate how that is done.
# Create a simple network-config\n\ncat << END > network-config\nnetwork:\n  version: 1\n  config:\n  - type: physical\n    name: eth0\n    subnets:\n    - type: dhcp\nEND\n\n# Create a VM manifest with the networkData base64 encoded into\n# a cloudInitNoCloud volume's networkDataBase64 field.\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userData: \"#cloud-config\"\n        networkDataBase64: $(cat network-config | base64 -w0)\nEND\n\n# Post the Virtual Machine spec to KubeVirt.\n\nkubectl create -f my-vmi.yaml\n
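Before embedding an encoded network-config in a manifest, it can help to confirm that the encoding is a single line and decodes back to the original. The following is a minimal sketch of such a check; the output file names network-config.b64 and network-config.decoded are hypothetical, and the -w0 flag assumes GNU coreutils base64, as in the example above.

```shell
# Recreate the network-config used in the example above.
cat << 'END' > network-config
network:
  version: 1
  config:
  - type: physical
    name: eth0
    subnets:
    - type: dhcp
END

# Encode without line wrapping, as the networkDataBase64 field expects one line.
base64 -w0 < network-config > network-config.b64

# Decode again and compare byte for byte with the original file.
base64 -d < network-config.b64 > network-config.decoded
cmp network-config network-config.decoded && echo 'round trip OK'
```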
"},{"location":"user_workloads/startup_scripts/#cloud-init-network-config-as-k8s-secret","title":"Cloud-init network-config as k8s Secret","text":"
Users who do not wish to store the cloud-init network-config directly in the VirtualMachineInstance spec can store it in a Kubernetes Secret and reference that Secret in the spec.
Multiple VirtualMachineInstance specs can reference the same Kubernetes Secret containing cloud-init network-config.
Below is an example of how to create a Kubernetes Secret containing a network-config and reference that Secret in the VM's spec.
# Create a simple network-config\n\ncat << END > network-config\nnetwork:\n  version: 1\n  config:\n  - type: physical\n    name: eth0\n    subnets:\n    - type: dhcp\nEND\n\n# Store the network-config in a Kubernetes Secret\nkubectl create secret generic my-vmi-secret --from-file=networkdata=network-config\n\n# Create a VM manifest and reference the Secret's name in the cloudInitNoCloud\n# Volume's networkDataSecretRef field\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-registry-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userData: \"#cloud-config\"\n        networkDataSecretRef:\n          name: my-vmi-secret\nEND\n\n# Post the VM\nkubectl create -f my-vmi.yaml\n
Depending on the operating system distribution in use, cloud-init output is often printed to the console output on boot up. When developing userdata scripts, users can connect to the VM's console during boot up to debug.
Example of connecting to console using virtctl:
virtctl console <name of vmi>\n
"},{"location":"user_workloads/startup_scripts/#device-role-tagging","title":"Device Role Tagging","text":"
KubeVirt provides a mechanism for users to tag devices such as network interfaces with a specific role. The tag is matched to the hardware address of the device, and this mapping is exposed to the guest OS via cloud-init.
This additional metadata helps guest OS users with multiple network interfaces identify the devices that have a specific role, such as a network device dedicated to a specific service or a disk intended to be used by a specific application (database, web cache, etc.).
This functionality already exists in platforms such as OpenStack. KubeVirt will provide the data in a similar format, known to users and services like cloud-init.
"},{"location":"user_workloads/startup_scripts/#sysprep-examples","title":"Sysprep Examples","text":""},{"location":"user_workloads/startup_scripts/#sysprep-in-a-configmap","title":"Sysprep in a ConfigMap","text":"
In the example below, a ConfigMap containing an autounattend.xml file is used to modify a Windows ISO image downloaded from Microsoft. The result is a base Windows installation with the virtio drivers installed and all the commands in post-install.ps1 executed. For the manifests below to work, a win10-iso DataVolume must exist.
"},{"location":"user_workloads/startup_scripts/#launching-a-vm-from-template","title":"Launching a VM from template","text":"
In the example above, once the sysprep command has been executed in post-install.ps1 and the VM is in the shutdown state, a new VM can be launched from the base win10-template with the additional changes specified in the unattend.xml in sysprep-config below. The new VM can take up to 5 minutes to reach the running state, because Windows goes through OOBE setup in the background with the customizations specified in that unattend.xml file.
By deploying KubeVirt on top of OpenShift the user can benefit from the OpenShift Template functionality.
"},{"location":"user_workloads/templates/#virtual-machine-templates","title":"Virtual machine templates","text":""},{"location":"user_workloads/templates/#what-is-a-virtual-machine-template","title":"What is a virtual machine template?","text":"
The KubeVirt project provides a set of templates for creating VMs that handle common usage scenarios. These templates combine a few key factors, which can be further customized and processed to produce a Virtual Machine object. The key factors that define a template are:
Workload: Most virtual machines should use the server or desktop workload for maximum flexibility; the highperformance workload trades some of this flexibility for better performance.
Guest Operating System (OS): This ensures that the emulated hardware is compatible with the guest OS. It also helps maximize the stability of the VM and enables performance optimizations.
Size (flavor): Defines the amount of resources (CPU, memory) to allocate to the VM.
More documentation is available in the common templates subproject
"},{"location":"user_workloads/templates/#accessing-the-virtual-machine-templates","title":"Accessing the virtual machine templates","text":"
If you installed KubeVirt using a supported method, you should find the common templates preinstalled in the cluster. Should you want to upgrade the templates, or install them from scratch, you can use one of the supported releases.
You can edit the fields of the templates which define the amount of resources which the VMs will receive.
Each template can list a different set of fields that are to be considered editable. The fields are used as hints for the user interface, and also for other components in the cluster.
The editable fields are taken from annotations in the template. Here is a snippet presenting a couple of the most commonly found editable fields:
Each entry in the editable field list must be a jsonpath. The jsonpath root is the objects: element of the template. The actually editable field is the last entry (the \"leaf\") of the path. For example, the following minimal snippet highlights the fields which you can edit:
objects:\n  spec:\n    template:\n      spec:\n        domain:\n          cpu:\n            sockets:\n              VALUE # this is editable\n            cores:\n              VALUE # this is editable\n            threads:\n              VALUE # this is editable\n          resources:\n            requests:\n              memory:\n                VALUE # this is editable\n
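To make the leaf rule concrete, the sketch below extracts the editable field from a path by taking its last dot-separated component. The helper name leaf_of and the two sample paths are hypothetical; they only assume paths rooted at the objects: element, as described above.

```shell
# Hypothetical helper: print the leaf (the last dot-separated component) of a path.
leaf_of() {
  # Strip the longest prefix that ends in a dot.
  echo ${1##*.}
}

leaf_of '/objects[0].spec.template.spec.domain.cpu.sockets'                 # prints: sockets
leaf_of '/objects[0].spec.template.spec.domain.resources.requests.memory'   # prints: memory
```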
"},{"location":"user_workloads/templates/#relationship-between-templates-and-vms","title":"Relationship between templates and VMs","text":"
Once processed the templates produce VM objects to be used in the cluster. The VMs produced from templates will have a vm.kubevirt.io/template label, whose value will be the name of the parent template, for example fedora-desktop-medium:
In addition, these VMs can include an optional label vm.kubevirt.io/template-namespace, whose value will be the namespace of the parent template, for example:
Please note that after the generation step VM and template objects have no relationship with each other besides the aforementioned label. Changes in templates do not automatically affect VMs or vice versa.
The templates provided by the KubeVirt project follow a set of conventions and annotations that augment the basic features of OpenShift templates. You can customize your KubeVirt-provided templates by editing these annotations, or you can add them to your existing templates to make them consumable by the KubeVirt services.
Here's a description of the kubevirt annotations. Unless otherwise specified, the following keys are meant to be top-level entries of the template metadata, like
All the following annotations are prefixed with defaults.template.kubevirt.io, which is omitted below for brevity. So the actual annotations you should use will look like
Unless otherwise specified, all annotations are meant to be safe defaults, both for performance and compatibility, and hints for the CNV-aware UI and tooling.
The default values for network, nic, volume, disk are meant to be the name of a section later in the document that the UI will find and consume to find the default values for the corresponding types. For example, considering the annotation defaults.template.kubevirt.io/disk: my-disk: we assume that later in the document there is an element called my-disk that the UI can use to find the data it needs. The names themselves don't matter as long as they are legal for Kubernetes and consistent with the content of the document.
VMs can be created through the OpenShift Cluster Console UI. This UI supports creating VMs from templates and supports template features such as flavors and workload profiles. To create a VM from a template, choose WorkLoads in the left panel, then choose Virtualization, press the blue \"Create Virtual Machine\" button, and choose \"Create from wizard\". The \"Create Virtual Machine\" window should then appear.
The common-templates subproject provides official, ready-made templates. You can also create templates by hand. You can find an example below, in the \"Example template\" section.
Note that the template above defines free parameters (NAME, SRC_PVC_NAME, SRC_PVC_NAMESPACE, CLOUD_USER_PASSWORD) and that the NAME parameter has no default value specified.
An OpenShift template has to be converted into a JSON file via the oc process command, which also allows you to set the template parameters.
A complete example can be found in the KubeVirt repository.
The command above creates a Kubernetes object according to the specification given by the template (in this example, an instance of the VirtualMachine object).
It is possible to list the available parameters using the following command:
$ oc process -f dist/templates/fedora-desktop-large.yaml --parameters\nNAME                  DESCRIPTION                                           GENERATOR    VALUE\nNAME                  VM name                                               expression   fedora-[a-z0-9]{16}\nSRC_PVC_NAME          Name of the PVC to clone                                           fedora\nSRC_PVC_NAMESPACE     Namespace of the source PVC                                        kubevirt-os-images\nCLOUD_USER_PASSWORD   Randomized password for the cloud-init user fedora   expression   [a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}\n
"},{"location":"user_workloads/templates/#starting-virtual-machine-from-the-created-object","title":"Starting virtual machine from the created object","text":"
The created object is now a regular VirtualMachine object and can be controlled through the Kubernetes API resources. The preferred way to do this from within the OpenShift environment is to use the oc patch command.
Do not forget about the virtctl tool. In practice, using it can be more convenient than using the Kubernetes API directly. Example:
$ virtctl start testvm\nVM testvm was scheduled to start\n
As soon as the VM starts, a new object of type VirtualMachineInstance is created. It has the same name as the VirtualMachine. Example (not the full output, which is too long):
"},{"location":"user_workloads/templates/#cloud-init-script-and-parameters","title":"Cloud-init script and parameters","text":"
KubeVirt VM templates, just like KubeVirt VM/VMI YAML configs, support cloud-init scripts.
"},{"location":"user_workloads/templates/#hack-use-pre-downloaded-image","title":"Hack - use pre-downloaded image","text":"
KubeVirt VM templates, just like KubeVirt VM/VMI YAML configs, can use a pre-downloaded VM image, which can be useful especially for debugging, development, and testing. No special parameters are required in the VM template or VM/VMI YAML config. The main idea is to create a Kubernetes PersistentVolume and PersistentVolumeClaim corresponding to an existing image in the file system. Example:
KubeVirt VM templates use dataVolumeTemplates. Before using DataVolumes, CDI has to be installed in the cluster. After that, the source DataVolume can be created.
You can follow Virtual Machine Lifecycle Guide for further reference.
"},{"location":"user_workloads/virtctl_client_tool/","title":"Download and Install the virtctl Command Line Interface","text":""},{"location":"user_workloads/virtctl_client_tool/#download-the-virtctl-client-tool","title":"Download the virtctl client tool","text":"
Basic VirtualMachineInstance operations can be performed with the stock kubectl utility. However, the virtctl binary utility is required to use advanced features such as:
Serial and graphical console access
It also provides convenience commands for:
Starting and stopping VirtualMachineInstances
Live migrating VirtualMachineInstances and canceling live migrations
Uploading virtual machine disk images
There are two ways to get it:
the most recent version of the tool can be retrieved from the official release page
it can be installed as a kubectl plugin using krew
This example uses a fedora cloud image in combination with cloud-init and an ephemeral empty disk with a capacity of 2Gi. For the sake of simplicity, the volume sources in this example are ephemeral and don't require a provisioner in your cluster.
In KubeVirt, the VM rollout strategy defines how changes to a VM object affect a running guest. In other words, it defines when and how changes to a VM object get propagated to its corresponding VMI object.
There are currently 2 rollout strategies: LiveUpdate and Stage. Only 1 can be specified and the default is Stage.
As long as the VMLiveUpdateFeatures feature gate is not enabled, the VM rollout strategy is ignored and defaults to \"Stage\". The feature gate is set in the KubeVirt custom resource (CR) as follows:
The LiveUpdate VM rollout strategy tries to propagate VM object changes to running VMIs as soon as possible. For example, changing the number of CPU sockets will trigger a CPU hotplug.
Enable the LiveUpdate VM rollout strategy in the KubeVirt CR:
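As a hedged sketch of both settings in one place, the merge patch below enables the VMLiveUpdateFeatures feature gate and selects the LiveUpdate rollout strategy. The file name rollout-patch.yaml is arbitrary, and the CR name kubevirt and namespace kubevirt in the commented command are assumptions that depend on your installation. Note that a merge patch replaces the whole featureGates list, so any gates that are already enabled would need to be listed as well.

```shell
# Write a merge patch for the KubeVirt CR (hypothetical file name).
cat << 'END' > rollout-patch.yaml
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - VMLiveUpdateFeatures
    vmRolloutStrategy: LiveUpdate
END

# Apply the patch to the KubeVirt CR. The CR name and namespace are assumptions;
# the command is left commented out so the sketch runs without a cluster.
# kubectl patch kubevirt kubevirt -n kubevirt --type merge --patch-file rollout-patch.yaml
```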
Any change made to a VM object when the rollout strategy is Stage will trigger the RestartRequired VM condition. When the rollout strategy is LiveUpdate, only non-propagatable changes will trigger the condition.
Once the RestartRequired condition is set on a VM object, no further changes can be propagated, even if the strategy is set to LiveUpdate. Changes will become effective on next reboot, and the condition will be removed.
The current implementation has the following limitations:
Once the RestartRequired condition is set, the only way to get rid of it is to restart the VM. In the future, we plan on implementing a way to get rid of it by reverting the VM template spec to its last non-RestartRequired state.
Cluster defaults are excluded from this logic. It means that changing a cluster-wide setting that impacts VM specs will not be live-updated, regardless of the rollout strategy.
The RestartRequired condition comes with a message stating what kind of change triggered the condition (CPU/memory/other). That message pertains only to the first change that triggered the condition. Additional changes that would usually trigger the condition will just get staged and no additional RestartRequired condition will be added.
The purpose of this document is to explain how to install virtio drivers for Microsoft Windows running in a fully virtualized guest.
"},{"location":"user_workloads/windows_virtio_drivers/#do-i-need-virtio-drivers","title":"Do I need virtio drivers?","text":"
Yes. Without the virtio drivers, you cannot use paravirtualized hardware properly. It will either not work or incur a severe performance penalty.
For more information about VirtIO and paravirtualization, see VirtIO and paravirtualization
For more details on configuring your VirtIO driver please refer to Installing VirtIO driver on a new Windows virtual machine and Installing VirtIO driver on an existing Windows virtual machine.
"},{"location":"user_workloads/windows_virtio_drivers/#which-drivers-i-need-to-install","title":"Which drivers I need to install?","text":"
There are usually up to 8 possible devices that are required to run Windows smoothly in a virtualized environment. KubeVirt currently supports only:
viostor, the block driver, applies to SCSI Controller in the Other devices group.
viorng, the entropy source driver, applies to PCI Device in the Other devices group.
NetKVM, the network driver, applies to Ethernet Controller in the Other devices group. Available only if a virtio NIC is configured.
Other virtio drivers that exist and might be supported in the future:
Balloon, the balloon driver, applies to PCI Device in the Other devices group.
vioserial, the paravirtual serial driver, applies to PCI Simple Communications Controller in the Other devices group.
vioscsi, the SCSI block driver, applies to SCSI Controller in the Other devices group.
qemupciserial, the emulated PCI serial driver, applies to PCI Serial Port in the Other devices group.
qxl, the paravirtual video driver, applies to Microsoft Basic Display Adapter in the Display adapters group.
pvpanic, the paravirtual panic driver, applies to Unknown device in the Other devices group.
Note
Some drivers are required in the installation phase. When you are installing Windows onto virtio block storage, you have to provide an appropriate virtio driver. Namely, choose the viostor driver for your version of Microsoft Windows; e.g. do not install the XP driver when you run Windows 10.
Other drivers can be installed after the Windows installation has completed. Again, please install only drivers matching your Windows version.
"},{"location":"user_workloads/windows_virtio_drivers/#how-to-install-during-windows-install","title":"How to install during Windows install?","text":"
To install drivers before Windows starts its installation, make sure the virtio-win package is attached to your VirtualMachine as a SATA CD-ROM. In the Windows installer, choose the advanced install and load a driver. Then navigate to the loaded virtio CD-ROM and install either viostor or vioscsi, depending on which one you have set up.
Step by step screenshots:
"},{"location":"user_workloads/windows_virtio_drivers/#how-to-install-after-windows-install","title":"How to install after Windows install?","text":"
After Windows is installed, go to Device Manager. There you should see undetected devices in the \"available devices\" section. You can install the virtio drivers one by one by going through this list.
For more details on how to choose a proper driver and how to install the driver, please refer to the Windows Guest Virtual Machines on Red Hat Enterprise Linux 7.
"},{"location":"user_workloads/windows_virtio_drivers/#how-to-obtain-virtio-drivers","title":"How to obtain virtio drivers?","text":"
The virtio Windows drivers are distributed in the form of a containerDisk, which can simply be mounted to the VirtualMachine. The container image containing the disk is located at https://quay.io/repository/kubevirt/virtio-container-disk?tab=tags, and the image can be pulled like any other container image:
However, pulling the image manually is not required; Kubernetes downloads it if it is not present when deploying the VirtualMachine.
"},{"location":"user_workloads/windows_virtio_drivers/#attaching-to-virtualmachine","title":"Attaching to VirtualMachine","text":"
KubeVirt distributes virtio drivers for Microsoft Windows in the form of a container disk. The package contains the virtio drivers and the QEMU guest agent. The disk was tested on Microsoft Windows Server 2012. Windows XP and later versions are supported.
The package is intended to be used as a CD-ROM attached to the virtual machine running Microsoft Windows. It can be used as a SATA CD-ROM during the install phase or to provide drivers in an existing Windows installation.
Attaching the virtio-win package can be done simply by adding a ContainerDisk to your VirtualMachine.
spec:\n  domain:\n    devices:\n      disks:\n        - name: virtiocontainerdisk\n          # Any other disk you want to use must go before virtiocontainerdisk.\n          # KubeVirt boots from disks in the order they are defined,\n          # so virtiocontainerdisk must come after the bootable disk.\n          # The other option is to choose the boot order explicitly:\n          #  - https://kubevirt.io/api-reference/v0.13.2/definitions.html#_v1_disk\n          # NOTE: You either specify bootOrder explicitly or sort the items in\n          #       disks. You can not do both at the same time.\n          # bootOrder: 2\n          cdrom:\n            bus: sata\n  volumes:\n    - containerDisk:\n        image: quay.io/kubevirt/virtio-container-disk\n      name: virtiocontainerdisk\n
Once you are done installing the virtio drivers, you can remove the virtio container disk by removing the disk from the YAML specification and restarting the VirtualMachine.
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-,:!=\\[\\]\\(\\)\"/]+|\\.(?!\\d)","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome","text":"
The KubeVirt User Guide is divided into the following sections:
Architecture: Technical and conceptual overview of KubeVirt components
Quickstarts: A list of resources to help you learn KubeVirt basics
Cluster Administration: Cluster-level administration concepts and tasks
User Workloads: Creating, customizing, using, and monitoring virtual machines
Compute: Resource allocation and optimization for the virtualization layer
Network: Concepts and tasks for the networking and service layers
Storage: Concepts and tasks for the storage layer, including importing and exporting.
Release Notes: The release notes for all KubeVirt releases
Contributing: How you can contribute to this guide or the KubeVirt project
Virtualization Debugging: How to debug your KubeVirt cluster and virtual resources
"},{"location":"#try-it-out","title":"Try it out","text":"
Kubevirt on Killercoda: https://killercoda.com/kubevirt
Kubevirt on Minikube: https://kubevirt.io/quickstart_minikube/
Kubevirt on Kind: https://kubevirt.io/quickstart_kind/
Kubevirt on cloud providers: https://kubevirt.io/quickstart_cloud/
Users requiring virtualization services are speaking to the Virtualization API (see below) which in turn is speaking to the Kubernetes cluster to schedule requested Virtual Machine Instances (VMIs). Scheduling, networking, and storage are all delegated to Kubernetes, while KubeVirt provides the virtualization functionality.
KubeVirt provides additional functionality to your Kubernetes cluster to perform virtual machine management.
If we recall how Kubernetes handles Pods, we remember that Pods are created by posting a Pod specification to the Kubernetes API Server. This specification is then transformed into an object inside the API Server. This object is of a specific type, or kind, as it is called in the specification. A Pod is of the type Pod. Controllers within Kubernetes know how to handle these Pod objects. Thus, once a new Pod object is seen, those controllers perform the necessary actions to bring the Pod alive and to match the required state.
This same mechanism is used by KubeVirt. Thus KubeVirt delivers three things to provide the new functionality:
Additional types - so called Custom Resource Definition (CRD) - are added to the Kubernetes API
Additional controllers for cluster wide logic associated with these new types
Additional daemons for node specific logic associated with new types
Once all three steps have been completed, you are able to
create new objects of these new types in Kubernetes (VMIs in our case)
and the new controllers take care to get the VMIs scheduled on some host,
and a daemon - the virt-handler - is taking care of a host - alongside the kubelet - to launch the VMI and configure it until it matches the required state.
One final note; both controllers and daemons are running as Pods (or similar) on top of the Kubernetes cluster, and are not installed alongside it. The type is - as said before - even defined inside the Kubernetes API server. This allows users to speak to Kubernetes, but modify VMIs.
The following diagram illustrates how the additional controllers and daemons communicate with Kubernetes and where the additional types are stored:
VirtualMachineInstanceReplicaSet (VMIRS) Bar -> VirtualMachineInstance (VMI) Bar
VirtualMachineInstance (VMI) is the custom resource that represents the basic ephemeral building block of an instance. In a lot of cases this object won't be created directly by the user but by a high level resource. High level resources for VMI can be:
VirtualMachine (VM) - StateFul VM that can be stopped and started while keeping the VM data and state.
VirtualMachineInstanceReplicaSet (VMIRS) - Similar to pods ReplicaSet, a group of ephemeral VMIs with similar configuration defined in a template.
KubeVirt is deployed on top of a Kubernetes cluster. This means that you can continue to run your Kubernetes-native workloads next to the VMIs managed through KubeVirt.
Furthermore: if you can run native workloads, and you have KubeVirt installed, you should be able to run VM-based workloads, too. For example, Application Operators should not require additional permissions to use cluster features for VMs, compared to using that feature with a plain Pod.
Security-wise, installing and using KubeVirt must not grant users any permission they do not already have regarding native workloads. For example, a non-privileged Application Operator must never gain access to a privileged Pod by using a KubeVirt feature.
We love virtual machines, think that they are very important and work hard to make them easy to use in Kubernetes. But even more than VMs, we love good design and modular, reusable components. Quite frequently, we face a dilemma: should we solve a problem in KubeVirt in a way that is best optimized for VMs, or should we take a longer path and introduce the solution to Pod-based workloads too?
To decide these dilemmas we came up with the KubeVirt Razor: \"If something is useful for Pods, we should not implement it only for VMs\".
For example, we debated how we should connect VMs to external network resources. The quickest way seems to introduce KubeVirt-specific code, attaching a VM to a host bridge. However, we chose the longer path of integrating with Multus and CNI and improving them.
A VirtualMachine provides additional management capabilities to a VirtualMachineInstance inside the cluster. That includes:
API stability
Start/stop/restart capabilities on the controller level
Offline configuration change with propagation on VirtualMachineInstance recreation
Ensure that the VirtualMachineInstance is running if it should be running
It focuses on a 1:1 relationship between the controller instance and a virtual machine instance. In many ways it is very similar to a StatefulSet with spec.replica set to 1.
"},{"location":"architecture/#how-to-use-a-virtualmachine","title":"How to use a VirtualMachine","text":"
A VirtualMachine makes sure that a VirtualMachineInstance object with an identical name is present in the cluster whenever the VirtualMachine is in a Running state, which is controlled via the spec.runStrategy field. For more information regarding Run Strategies, please refer to Run Strategies.
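As a minimal sketch of such an object, the manifest below defines a VirtualMachine named vm whose spec.runStrategy is Always. The disk name, memory request, and container disk image are illustrative assumptions, not values taken from this page.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm
spec:
  # Always: the controller keeps a VirtualMachineInstance named vm present.
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          disks:
            - name: containerdisk   # assumed disk name
              disk:
                bus: virtio
        resources:
          requests:
            memory: 128Mi           # assumed size
      volumes:
        - name: containerdisk
          containerDisk:
            # assumed demo image
            image: quay.io/kubevirt/cirros-container-disk-demo
```

Saving this as vm.yaml and creating it with kubectl create -f vm.yaml should result in a VirtualMachineInstance with the same name appearing in the cluster.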
"},{"location":"architecture/#starting-and-stopping","title":"Starting and stopping","text":"
Virtual Machines can be turned on/off in an imperative or a declarative manner. Setting a spec.runStrategy like Always or Halted means that the system will continuously try to ensure the Virtual Machine is turned on/off:
# Start the virtual machine:\nkubectl patch virtualmachine vm --type merge -p \\\n '{\"spec\":{\"runStrategy\": \"Always\"}}'\n\n# Stop the virtual machine:\nkubectl patch virtualmachine vm --type merge -p \\\n '{\"spec\":{\"runStrategy\": \"Halted\"}}'\n
However, with the Manual runStrategy, the user imperatively chooses when to turn the VM on or off, without the system performing any automatic actions:
# Start the virtual machine:\nvirtctl start vm\n\n# Stop the virtual machine:\nvirtctl stop vm\n
Find more details about a VM's life-cycle in the relevant section
Once a VirtualMachineInstance is created, its state will be tracked via status.created and status.ready fields of the VirtualMachine. If a VirtualMachineInstance exists in the cluster, status.created will equal true. If the VirtualMachineInstance is also ready, status.ready will equal true too.
If a VirtualMachineInstance reaches a final state but spec.running equals true (or, equivalently, spec.runStrategy is Always), the VirtualMachine controller will set status.ready to false and re-create the VirtualMachineInstance.
Additionally, the status.printableStatus field provides high-level summary information about the state of the VirtualMachine. This information is also displayed when listing VirtualMachines using the CLI:
$ kubectl get virtualmachines\nNAME   AGE   STATUS    VOLUME\nvm1    4m    Running\nvm2    11s   Stopped\n
Here's the list of states currently supported and their meanings. Note that states may be added/removed in future releases, so caution should be used if consumed by automated programs.
Stopped: The virtual machine is currently stopped and isn't expected to start.
Provisioning: Cluster resources associated with the virtual machine (e.g., DataVolumes) are being provisioned and prepared.
Starting: The virtual machine is being prepared for running.
Running: The virtual machine is running.
Paused: The virtual machine is paused.
Migrating: The virtual machine is in the process of being migrated to another host.
Stopping: The virtual machine is in the process of being stopped.
Terminating: The virtual machine is in the process of deletion, as well as its associated resources (VirtualMachineInstance, DataVolumes, \u2026).
Unknown: The state of the virtual machine could not be obtained, typically due to an error in communicating with the host on which it's running.
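Putting the status fields described above together, the status of a running VirtualMachine might look like the following sketch. It is a hypothetical excerpt of what kubectl get virtualmachine vm1 -o yaml could return, reduced to the three fields discussed here; a real object carries more fields.

```yaml
status:
  created: true             # a VirtualMachineInstance exists in the cluster
  ready: true               # that VirtualMachineInstance is also ready
  printableStatus: Running  # summary shown in the STATUS column
```

Because the set of printableStatus values may change between releases, automation is safer when it treats unrecognized values as unknown instead of failing.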
A VirtualMachineInstance restart can be triggered by deleting the VirtualMachineInstance. This will also propagate configuration changes from the template in the VirtualMachine:
# Restart the virtual machine (you delete the instance!):\nkubectl delete virtualmachineinstance vm\n
To restart a VirtualMachine named vm using virtctl:
$ virtctl restart vm\n
This performs a normal restart of the VirtualMachineInstance and reschedules it on a new virt-launcher Pod.
To force restart a VirtualMachine named vm using virtctl:
$ virtctl restart vm --force --grace-period=0\n
This tries to perform a normal restart and also deletes the virt-launcher Pod of the VirtualMachineInstance, setting GracePeriodSeconds to the number of seconds passed in the command.
Currently, only setting grace-period=0 is supported.
Note
Force restart can cause data corruption and should be used only in cases such as a kernel panic or a VirtualMachine that is unresponsive to normal restarts.
A VirtualMachine will never restart or re-create a VirtualMachineInstance until the current instance of the VirtualMachineInstance is deleted from the cluster.
"},{"location":"architecture/#exposing-as-a-service","title":"Exposing as a Service","text":"
A VirtualMachine can be exposed as a service. The actual service will be available once the VirtualMachineInstance starts without additional interaction.
For example, exposing SSH port (22) as a ClusterIP service using virtctl after the VirtualMachine was created, but before it started:
All service exposure options that apply to a VirtualMachineInstance apply to a VirtualMachine.
See Service Objects for more details.
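The exposure can also be described declaratively. The sketch below is a plain Kubernetes Service, not the output of virtctl. It assumes that the VirtualMachine template sets a hypothetical label app: vm-ssh under spec.template.metadata.labels, so that the label reaches the virt-launcher Pod and the selector can match it. The service name vm-ssh is likewise an assumption.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vm-ssh          # assumed service name
spec:
  type: ClusterIP
  selector:
    app: vm-ssh         # assumed label on the VirtualMachine template
  ports:
    - name: ssh
      protocol: TCP
      port: 22          # port exposed by the service
      targetPort: 22    # SSH port inside the guest
```

As with the virtctl approach, the Service can be created before the VirtualMachine is started; it gains an endpoint once a matching virt-launcher Pod is running.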
"},{"location":"architecture/#when-to-use-a-virtualmachine","title":"When to use a VirtualMachine","text":""},{"location":"architecture/#when-api-stability-is-required-between-restarts","title":"When API stability is required between restarts","text":"
A VirtualMachine makes sure that VirtualMachineInstance API configurations are consistent between restarts. A classic example is a license that is bound to the firmware UUID of a virtual machine. The VirtualMachine makes sure that the UUID always stays the same without the user having to take care of it.
One of the main benefits is that a user can still make use of defaulting logic, although a stable API is needed.
"},{"location":"architecture/#when-config-updates-should-be-picked-up-on-the-next-restart","title":"When config updates should be picked up on the next restart","text":"
Use a VirtualMachine if the VirtualMachineInstance configuration should be modifiable inside the cluster and the changes should be picked up on the next VirtualMachineInstance restart. This means that no hotplug is involved.
"},{"location":"architecture/#when-you-want-to-let-the-cluster-manage-your-individual-virtualmachineinstance","title":"When you want to let the cluster manage your individual VirtualMachineInstance","text":"
Kubernetes, as a declarative system, can help you manage the VirtualMachineInstance. You tell it that you want this VirtualMachineInstance with your application running, and the VirtualMachine will try to make sure it stays running.
Note
The current belief is that if it is defined that the VirtualMachineInstance should be running, it should be running. This is different from many classical virtualization platforms, where VMs stay down if they were switched off. Restart policies may be added if needed. Please provide your use-case if you need this!
Whenever you want to manipulate the VirtualMachine through the commandline you can use the kubectl command. The following are examples demonstrating how to do it.
# Define a virtual machine:\n kubectl create -f vm.yaml\n\n # Start the virtual machine:\n kubectl patch virtualmachine vm --type merge -p \\\n '{\"spec\":{\"runStrategy\":\"Always\"}}'\n\n # Look at virtual machine status and associated events:\n kubectl describe virtualmachine vm\n\n # Look at the now created virtual machine instance status and associated events:\n kubectl describe virtualmachineinstance vm\n\n # Stop the virtual machine instance:\n kubectl patch virtualmachine vm --type merge -p \\\n '{\"spec\":{\"runStrategy\":\"Halted\"}}'\n\n # Restart the virtual machine (you delete the instance!):\n kubectl delete virtualmachineinstance vm\n\n # Implicit cascade delete (first deletes the virtual machine and then the virtual machine instance)\n kubectl delete virtualmachine vm\n\n # Explicit cascade delete (first deletes the virtual machine and then the virtual machine instance)\n # Newer kubectl releases deprecate the boolean form --cascade=true in favor of background\n kubectl delete virtualmachine vm --cascade=background\n\n # Orphan delete (the running virtual machine instance is only detached, not deleted)\n # Recreating the virtual machine would lead to the adoption of the virtual machine instance\n # Newer kubectl releases deprecate the boolean form --cascade=false in favor of orphan\n kubectl delete virtualmachine vm --cascade=orphan\n
Welcome!! And thank you for taking the first step to contributing to the KubeVirt project. On this page you should be able to find all the information required to get started on your contribution journey, as well as information on how to become a community member and grow into roles of responsibility.
If you think something might be missing from this page, please help us by raising a bug!
Reviewing the following will prepare you for contributing:
If this is your first step in the world of open source, consider reading the CNCF's Start Contributing to Open Source page for an introduction to key concepts.
You should be comfortable with git. Most contributions follow the GitHub workflow of fork, branch, commit, open pull request, review changes, and merge, so you will need it to work effectively in the KubeVirt community. If you're new to git, git-scm.com has a nice set of tutorials.
Familiarize yourself with the various repositories of the KubeVirt GitHub organization.
Try one of our quick start labs on killercoda, minikube, or kind.
See the \"Other ways to contribute\" section below.
For code contributors:
You need to be familiar with writing code in golang. See the golang tour to familiarize yourself.
To contribute to the core of the project, read the Developer contribution page and the getting started page in the kubevirt/kubevirt repo.
Alternatively, to contribute to its storage management add-on, check out the kubevirt/containerized-data-importer (CDI) repo, and their contribution page.
"},{"location":"contributing/#your-first-contribution","title":"Your first contribution","text":"
The following will help you decide where to start:
Check a repository's issues list for the good-first-issue label, which marks issues that make good entry points.
Open a pull request against the documentation using GitHub. The tutorials at https://lab.github.com/ can be helpful.
Review a pull request from other community members for accuracy and language.
"},{"location":"contributing/#important-community-resources","title":"Important community resources","text":"
You should familiarize yourself with the following documents, which are critical to being a member of the community:
Code of Conduct: Everyone is expected to abide by the CoC to ensure an open and welcoming environment. Our CoC is based on the CNCF Code of Conduct, which also has a variety of translations.
Our community membership policy: How to become a member and grow into roles of responsibility.
KubeVirt v1.4 is built for Kubernetes v1.31 and additionally supported for the previous two versions. See the KubeVirt support matrix for more information.
To see the list of very excellent people who contributed to this release, see the KubeVirt release tag for v1.4.0.
[PR #13030] [alicefr] Removed the ManualRecoveryRequired field from the VolumeMigrationState and converted it to the VM condition ManualRecoveryRequired
[PR #12933] [ShellyKa13] VM admitter: improve validation of vm spec datavolumetemplate
[PR #12986] [lyarwood] The PreferredEfi preference is now only applied when a user has not already enabled either EFI or BIOS within the underlying VirtualMachine.
[PR #12169] [lyarwood] PreferredDiskDedicatedIoThread is now only applied to virtio disk devices
[PR #13090] [acardace] Allow live updating VMs' tolerations
[PR #12629] [jean-edouard] backend-storage now supports RWO FS
[PR #13086] [lyarwood] A new spec.configuration.instancetype.referencePolicy configurable has been added to the KubeVirt CR with support for reference (default), expand and expandAll policies provided.
[PR #12967] [xpivarc] BochsDisplayForEFIGuests is GAed; use the \"kubevirt.io/vga-display-efi-x86\" annotation on the KubeVirt CR before upgrading in case you need to retain compatibility.
[PR #13001] [awels] Relaxed check on modify VM spec during VM snapshot to only check disks/volumes
[PR #13018] [orelmisan] Support Dynamic Primary Pod NIC Name
[PR #13078] [qinqon] Add dynamic pod interface name feature gate
[PR #13059] [EdDev] Network hotplug feature is declared as GA.
[PR #12753] [lyarwood] The CommonInstancetypesDeploymentGate feature gate and underlying feature are graduated to GA and now always enabled by default. A single new KubeVirt configurable is also introduced to allow cluster admins a way of explicitly disabling deployment when required.
[PR #12232] [lyarwood] The NUMA feature gate is now deprecated with the feature state graduated to GA and thus enabled by default
[PR #12943] [Barakmor1] The GPU feature gate is now deprecated with the feature state graduated to GA and thus enabled by default
[PR #13019] [0xFelix] virtctl: The flags --volume-clone-pvc, --volume-datasource and --volume-blank are deprecated in favor of the --volume-import flag.
[PR #12940] [Barakmor1] Deprecate the DockerSELinuxMCS FeatureGate
[PR #12578] [dasionov] Mark Running field as deprecated
[PR #11927] [lyarwood] All preferredCPUTopology constants prefixed with Prefer have been deprecated and will be removed in a future version of the instancetype.kubevirt.io API.
[PR #12848] [iholder101] Reduce default CompletionTimeoutPerGiB from 800s to 150s
[PR #12739] [lyarwood] A new PreferredEfi field has been added to preferences to express the preferred EFI configuration for a given VirtualMachine.
[PR #12617] [Acedus] grpc from go.mod is now correctly shipped in release images
[PR #12419] [nunnatsa] Add timeout to validation webhooks
[PR #11881] [lyarwood] The expand-spec subresource API now applies defaults to the returned VirtualMachine to ensure the VirtualMachineInstanceSpec within is closer to the eventual version used when starting the original VirtualMachine.
[PR #12268] [fossedihelm] Drop ForceRestart and ForceStop methods from client-go
[PR #12053] [vladikr] Only a single vgpu display option with ramfb will be configured per VMI.
[PR #11982] [RamLavi] Introduce validatingAdmissionPolicy to restrict node patches on virt-handler
[PR #13053] [0xFelix] virtctl: Users can specify a sysprep volume in VMs created with virtctl create vm
[PR #12855] [0xFelix] virtctl expose: Drop flag to set deprecated LoadBalancerIP option
[PR #13008] [0xFelix] virtctl: Allow creating a basic cloud-init config with virtctl create vm
[PR #12786] [0xFelix] virtctl: Created VMs can infer an instancetype or preference from PVC, Registry and Snapshot sources now.
[PR #12557] [codingben] Optionally create data source using virtctl image upload.
[PR #13072] [0xFelix] virtctl: virtctl create vm can now use the Access Credentials API to add credentials to a new VM
[PR #12395] [alicefr] Add new condition for VMIStorageLiveMigratable
[PR #12194] [mhenriks] VM supports kubevirt.io/immediate-data-volume-creation: \"false\" which delays creating DataVolumeTemplates until VM is started
[PR #12254] [jkinred] Reduced the severity of log messages when a VolumeSnapshotClass is not found. When snapshots are not enabled for a volume, the reason will still be displayed in the status.volumeSnapshotStatuses field of a VirtualMachine resource.
[PR #12601] [mhenriks] vmsnapshot: Enable status subresource for snapshot.kubevirt.io api group
KubeVirt v1.3 is built for Kubernetes v1.30 and additionally supported for the previous two versions. See the KubeVirt support matrix for more information.
To see the list of fine folks who contributed to this release, see the KubeVirt release tag for v1.3.0.
[PR #11156] [nunnatsa] Move some verification from the VMI create validation webhook to the CRD
[PR #11500] [iholder101] Support HyperV Passthrough: automatically use all available HyperV features
[PR #11641] [alicefr] Add kubevirt.io/testWorkloadUpdateMigrationAbortion annotation and a mechanism to abort workload updates
[PR #11700] [alicefr] Add the updateVolumeStrategy field
[PR #11729] [lyarwood] spreadOptions have been introduced to preferences in order to allow for finer-grained control of the preferSpread preferredCPUTopology. This includes the ability to now spread vCPUs across guest visible sockets, cores and threads.
[PR #10545] [lyarwood] ControllerRevisions containing instance types and preferences are now upgraded to their latest available version when the VirtualMachine owning them is resync'd by virt-controller.
[PR #11922] [alromeros] Bugfix: Fix VM manifest rendering in export controller
[PR #11367] [alromeros] Bugfix: Allow vmexport download redirections by printing logs into stderr
[PR #11219] [alromeros] Bugfix: Improve handling of IOThreads with incompatible buses
[PR #11372] [xpivarc] Bug-fix: Fix nil panic if VM update fails
[PR #11267] [mhenriks] BugFix: Ensure DataVolumes created by virt-controller (DataVolumeTemplates) are recreated and owned by the VM in the case of DR and backup/restore.
[PR #10900] [KarstenB] BugFix: Fixed incorrect APIVersion of APIResourceList
[PR #11306] [fossedihelm] fix(ksm): set the kubevirt.io/ksm-enabled node label to true if the ksm is managed by KubeVirt, instead of reflecting the actual ksm value.
[PR #11701] [EdDev] The SLIRP core binding is deprecated and removed.
[PR #11901] [EdDev] The 'macvtap' core network binding is discontinued and removed.
[PR #11915] [ormergi] The 'passt' core network binding is discontinued and removed.
[PR #11404] [avlitman] KubeVirtComponentExceedsRequestedCPU and KubeVirtComponentExceedsRequestedMemory alerts are deprecated; they do not indicate a genuine issue.
[PR #11498] [acardace] Allow hotplugging memory for VMs with memory limits set
[PR #11479] [vladikr] Virtual machine instances will no longer be stuck in an irrecoverable state after an interrupted postcopy migration. Instead, these will fail and can be restarted again.
[PR #11685] [fossedihelm] Updated go version of the client-go to 1.21
[PR #11344] [aerosouund] Refactor device plugins to use a base plugin and define a common interface
KubeVirt v1.2 is built for Kubernetes v1.29 and additionally supported for the previous two versions. See the KubeVirt support matrix for more information.
[PR #11054] [jean-edouard] New cluster-wide vmRolloutStrategy setting to define whether changes to VMs should either be always staged or live-updated when possible.
[PR #11001] [fossedihelm] Allow kubevirt.io:default clusterRole to get,list kubevirts
[PR #10961] [jcanocan] Reduced VM rescheduling time on node failure
[PR #10918] [orelmisan] VMClone: Emit an event in case restore creation fails
[PR #10567] [awels] Attachment pod creation is now rate limited
[PR #10526] [cfilleke] Documents steps to build the KubeVirt builder container
[PR #10479] [dharmit] Ability to run scripts through hook sidecar
[PR #10244] [hshitomi] Added \u201cadm\u201d subcommand under \u201cvirtctl\u201d, and \u201clog-verbosity\u201d subcommand under \u201cadm\u201d. The log-verbosity command can show the log verbosity of one or more components, set the log verbosity of one or more components, and reset the log verbosity of all components to the default verbosity (2).
[PR #10046] [victortoso] Add v1alpha3 for hooks and fix migration when using sidecars
[#10568][ormergi] Network binding plugin API support CNIs, new integration point on virt-launcher pod creation.
[#10309][lyarwood] cluster-wide common-instancetypes resources can now be deployed by virt-operator using the CommonInstancetypesDeploymentGate feature gate.
[#10463][0xFelix] VirtualMachines: Introduce InferFromVolumeFailurePolicy in Instancetype- and PreferenceMatchers
[#10447][fossedihelm] Add a Feature Gate to KV CR to automatically set memory limits when a resource quota with memory limits is associated to the creation namespace
[#10477][jean-edouard] Dynamic KSM enabling and configuration
[#10110][tiraboschi] Stream guest serial console logs from a dedicated container
[#10015][victortoso] Implements USB host passthrough in permittedHostDevices of KubeVirt CRD
[#10184][acardace] Add memory hotplug feature
[#10231][kvaps] Propagate public-keys to cloud-init NoCloud meta-data
[#9673][germag] DownwardMetrics: Expose DownwardMetrics through virtio-serial channel.
[#10086][vladikr] allow live updating VM affinity and node selector
[#10272][ormergi] Introduce network binding plugin for Slirp networking, interfacing with KubeVirt's new network binding plugin API.
[#10284][AlonaKaplan] Introduce an API for network binding plugins. The feature is behind \"NetworkBindingPlugins\" gate.
[#10101][acardace] Deprecate spec.config.machineType in KubeVirt CR.
[#9878][jean-edouard] The EFI NVRAM can now be configured to persist across reboots
[#9932][lyarwood] ControllerRevisions containing instancetype.kubevirt.io CRDs are now decorated with labels detailing specific metadata of the underlying stashed object
[#10058][alicefr] Add field errorPolicy for disks
[#10004][AlonaKaplan] Hotplug/unplug of interfaces should be done by updating the VM spec template. virtctl and REST API endpoints were removed.
[#9896][ormergi] The VM controller now replicates spec interfaces MAC addresses to the corresponding interfaces in the VMI spec.
[#7708][VirrageS] nodeSelector and schedulerName fields have been added to VirtualMachineInstancetype spec.
[#7197][vasiliy-ul] Experimental support of SEV attestation via the new API endpoints
[#9737][AlonaKaplan] On hotunplug - remove bridge, tap and dummy interface from virt-launcher and the caches (file and volatile) from the node.
[#10566][fossedihelm] Add 100Mi of memory overhead for vmi with dedicatedCPU or that wants GuaranteedQos
[#10496][fossedihelm] Automatically set cpu limits when a resource quota with cpu limits is associated to the creation namespace and the AutoResourceLimits FeatureGate is enabled
[#10543][0xFelix] Clear VM guest memory when ignoring inference failures
[#10366][ormergi] Kubevirt now delegates Slirp networking configuration to Slirp network binding plugin. In case you haven't registered Slirp network binding plugin image yet (i.e.: specify in Kubevirt config) the following default image would be used: quay.io/kubevirt/network-slirp-binding:20230830_638c60fc8. On next release (v1.2.0) no default image will be set and registering an image would be mandatory.
[#10185][AlonaKaplan] Add support to migration based SRIOV hotplug.
[#10116][ormergi] Existing detached interfaces with 'absent' state will be cleared from VMI spec.
[#9958][AlonaKaplan] Disable network interface hotplug/unplug for VMIs. It will be supported for VMs only.
[#10489][maiqueb] Remove the network-attachment-definition list and watch verbs from virt-controller's RBAC
[#10438][lyarwood] A new instancetype.kubevirt.io:view ClusterRole has been introduced that can be bound to users via a ClusterRoleBinding to provide read only access to the cluster scoped VirtualMachineCluster{Instancetype,Preference} resources.
[PR #9651][0xFelix] virtctl: Allow to specify memory of created VMs. Default to 512Mi if no instancetype was specified or is inferred.
[PR #9169][lyarwood] The dedicatedCPUPlacement attribute is once again supported within the VirtualMachineInstancetype and VirtualMachineClusterInstancetype CRDs after a recent bugfix improved VirtualMachine validations, ensuring defaults are applied before any attempt to validate.
[PR #9311][kubevirt-bot] fixes the requests/limits CPU number mismatch for VMs with isolatedEmulatorThread
[PR #9276][fossedihelm] Added foreground finalizer to virtual machine
[PR #9295][kubevirt-bot] Fix bug of possible re-trigger of memory dump
[PR #9270][kubevirt-bot] BugFix: Guestfs image url not constructed correctly
[PR #9234][kubevirt-bot] The dedicatedCPUPlacement attribute is once again supported within the VirtualMachineInstancetype and VirtualMachineClusterInstancetype CRDs after a recent bugfix improved VirtualMachine validations, ensuring defaults are applied before any attempt to validate.
[PR #9267][fossedihelm] This version of KubeVirt includes upgraded virtualization technology based on libvirt 9.0.0 and QEMU 7.2.0.
[PR #9197][kubevirt-bot] Fix addvolume not rejecting adding existing volume source, fix removevolume allowing to remove non hotpluggable volume
[PR #9120][0xFelix] Fix access to portforwarding on VMs/VMIs with the cluster roles kubevirt.io:admin and kubevirt.io:edit
[PR #9116][EdDev] Allow the specification of the ACPI Index on a network interface.
[PR #8774][avlitman] Added new Virtual machines CPU metrics:
[PR #9087][zhuchenwang] Open /dev/vhost-vsock explicitly to ensure that the right vsock module is loaded
[PR #9020][feitnomore] Adding support for status/scale subresources so that VirtualMachinePool now supports HorizontalPodAutoscaler
[PR #9085][0xFelix] virtctl: Add options to infer instancetype and preference when creating a VM
[PR #8917][xpivarc] Kubevirt can be configured with Seccomp profile. It now ships a custom profile for the launcher.
[PR #9054][enp0s3] do not inject LimitRange defaults into VMI
[PR #7862][vladikr] Store the finalized VMI migration status in the migration objects.
[PR #8878][0xFelix] Add 'create vm' command to virtctl
[PR #9040][lyarwood] inferFromVolume now uses labels instead of annotations to lookup default instance type and preference details from a referenced Volume. This has changed in order to provide users with a way of looking up suitably decorated resources through these labels before pointing to them within the VirtualMachine.
[PR #9039][orelmisan] client-go: Added context to additional VirtualMachineInstance's methods.
[PR #9018][orelmisan] client-go: Added context to additional VirtualMachineInstance's methods.
[PR #9025][akalenyu] BugFix: Hotplug pods have hardcoded resource req which don't comply with LimitRange maxLimitRequestRatio of 1
[PR #8908][orelmisan] client-go: Added context to some of VirtualMachineInstance's methods.
[PR #6863][rmohr] The install strategy job will respect the infra node placement from now on
[PR #8649][acardace] KubeVirt is now able to run VMs inside restricted namespaces.
[PR #8992][iholder101] Align with k8s fix for default limit range requirements
[PR #8889][rmohr] Add basic TLS encryption support for vsock websocket connections
[PR #8660][huyinhou] Fix remoteAddress field in virt-api log being truncated when it is an ipv6 address
[PR #8961][rmohr] Bump distroless base images
[PR #8952][rmohr] Fix read-only sata disk validation
[PR #8657][fossedihelm] Use an increasingly exponential backoff before retrying to start the VM, when an I/O error occurs.
[PR #8480][lyarwood] New inferFromVolume attributes have been introduced to the {Instancetype,Preference}Matchers of a VirtualMachine. When provided the Volume referenced by the attribute is checked for the following annotations with which to populate the {Instancetype,Preference}Matchers:
[PR #7762][VirrageS] Service kubevirt-prometheus-metrics now sets ClusterIP to None to make it a headless service.
[PR #8599][machadovilaca] Change KubevirtVmHighMemoryUsage threshold from 20MB to 50MB
[PR #7761][VirrageS] imagePullSecrets field has been added to KubeVirt CR to support deployments from private registries
[PR #8887][iholder101] Bugfix: use virt operator image if provided
[PR #8750][jordigilh] Fixes an issue that prevented running real time workloads in non-root configurations due to libvirt's dependency on CAP_SYS_NICE to change the vcpu's thread's scheduling and priority to FIFO and 1. The change of priority and scheduling is now executed in the virt-launcher for both root and non-root configurations, removing the dependency in libvirt.
[PR #8845][lyarwood] An empty Timer is now correctly omitted from Clock fixing bug #8844.
[PR #8842][andreabolognani] The virt-launcher pod no longer needs the SYS_PTRACE capability.
[PR #8734][alicefr] Change libguestfs-tools image using root appliance in qcow2 format
[PR #8764][ShellyKa13] Add list of included and excluded volumes in vmSnapshot
[PR #8811][iholder101] Custom components: support gs
[PR #8770][dhiller] Add Ginkgo V2 Serial decorator to serial tests as preparation to simplify parallel vs. serial test run logic
[PR #8808][acardace] Apply migration backoff only for evacuation migrations.
[PR #8525][jean-edouard] CR option mediatedDevicesTypes is deprecated in favor of mediatedDeviceTypes
[PR #8792][iholder101] Expose new custom components env vars to csv-generator and manifest-templator
[PR #8701][enp0s3] Consider the ParallelOutboundMigrationsPerNode when evicting VMs
[PR #8740][iholder101] Fix: Align Reenlightenment flows between converter.go and template.go
[PR #8530][acardace] Use exponential backoff for failing migrations
[PR #8720][0xFelix] The expand-spec subresource endpoint was renamed to expand-vm-spec and made namespaced
[PR #8458][iholder101] Introduce support for clones with a snapshot source (e.g. clone snapshot -> VM)
[PR #8716][rhrazdil] Add overhead of interface with Passt binding when no ports are specified
[PR #8619][fossedihelm] virt-launcher: use virtqemud daemon instead of libvirtd
[PR #8736][knopt] Added more precise rest_client_request_latency_seconds histogram buckets
[PR #8624][zhuchenwang] Add the REST API to be able to talk to the application in the guest VM via VSOCK.
[PR #8625][AlonaKaplan] iptables are no longer used by masquerade binding. Nodes with iptables only won't be able to run VMs with masquerade binding.
[PR #8673][iholder101] Allow specifying custom images for core components
[PR #8622][jean-edouard] Built with golang 1.19
[PR #8336][alicefr] Flag for setting the guestfs uid and gid
[PR #8667][huyinhou] connect VM vnc failed when virt-launcher work directory is not /
[PR #8368][machadovilaca] Use collector to set migration metrics
[PR #8558][xpivarc] Bug-fix: LimitRange integration now works when VMI is missing namespace
[PR #8404][andreabolognani] This version of KubeVirt includes upgraded virtualization technology based on libvirt 8.7.0, QEMU 7.1.0 and CentOS Stream 9.
[PR #8652][akalenyu] BugFix: Exporter pod does not comply with restricted PSA
[PR #8563][xpivarc] Kubevirt now runs with nonroot user by default
[PR #8442][kvaps] Add Deckhouse to the Adopters list
[PR #8546][zhuchenwang] Provides the Vsock feature for KubeVirt VMs.
[PR #8598][acardace] VMs configured with hugepages can now run using the default container_t SELinux type
[PR #8594][kylealexlane] Fix permission denied on selinux relabeling on some kernel versions
[PR #8521][akalenyu] Add an option to specify a TTL for VMExport objects
[PR #7918][machadovilaca] Add alerts for VMs unhealthy states
[PR #8516][rhrazdil] When using Passt binding, virt-launcher has unprivileged_port_start set to 0, so that passt may bind to all ports.
[PR #7772][jean-edouard] The SELinux policy for virt-launcher is down to 4 rules, 1 for hugepages and 3 for virtiofs.
[PR #8402][jean-edouard] Most VMIs now run under the SELinux type container_t
[PR #8513][alromeros] [Bug-fix] Fix error handling in virtctl image-upload
[PR #8282][akrejcir] Improves instancetype and preference controller revisions. This is a backwards incompatible change and introduces a new v1alpha2 api for instancetype and preferences.
[PR #8272][jean-edouard] No more empty section in the kubevirt-cr manifest
[PR #8536][qinqon] Don't show a failure if ConfigDrive cloud init has UserDataSecretRef and not NetworkDataSecretRef
[PR #8375][xpivarc] Virtiofs can be used with Nonroot feature gate
[PR #8465][rmohr] Add a vnc screenshot REST endpoint and a \"virtctl vnc screenshot\" command for UI and script integration
[PR #8418][alromeros] Enable automatic token generation for VirtualMachineExport objects
[PR #8488][0xFelix] virtctl: Be less verbose when using the local ssh client
[PR #8396][alicefr] Add group flag for setting the gid and fsgroup in guestfs
[PR #8476][iholder-redhat] Allow setting virt-operator log verbosity through Kubevirt CR
[PR #8366][rthallisey] Move KubeVirt to a 15 week release cadence
[PR #8479][arnongilboa] Enable DataVolume GC by default in cluster-deploy
[PR #8474][vasiliy-ul] Fixed migration failure of VMs with containerdisks on systems with containerd
[PR #8316][ShellyKa13] Fix possible race when deleting unready vmsnapshot and the vm remaining frozen
[PR #8436][xpivarc] Kubevirt is able to run with restricted Pod Security Standard enabled with an automatic escalation of namespace privileges.
[PR #8197][alromeros] Add vmexport command to virtctl
[PR #8252][fossedihelm] Add tlsConfiguration to Kubevirt Configuration
[PR #8431][rmohr] Fix shadow status updates and periodic status updates on VMs, performed by the snapshot controller
[PR #8359][iholder-redhat] [Bugfix]: HyperV Reenlightenment VMIs should be able to start when TSC Frequency is not exposed
[PR #8330][jean-edouard] Important: If you use docker with SELinux enabled, set the DockerSELinuxMCSWorkaround feature gate before upgrading
[PR #8401][machadovilaca] Rename metrics to follow the naming convention
[PR #8129][mlhnono68] Fixes virtctl to support connection to clusters proxied by RANCHER or having special paths
[PR #8337][0xFelix] virtctl's native SSH client is now useable in the Windows console without workarounds
[PR #8257][awels] VirtualMachineExport now supports VM export source type.
[PR #8367][vladikr] fix the guest memory conversion by setting it to resources.requests.memory when guest memory is not explicitly provided
[PR #7990][ormergi] Deprecate SR-IOV live migration feature gate.
[PR #8069][lyarwood] The VirtualMachineInstancePreset resource has been deprecated ahead of removal in a future release. Users should instead use the VirtualMachineInstancetype and VirtualMachinePreference resources to encapsulate any resource or preference characteristics shared by their VirtualMachines.
[PR #8326][0xFelix] virtctl: Do not log wrapped ssh command by default
[PR #8325][rhrazdil] Enable route_localnet sysctl option for masquerade binding at virt-handler
[PR #8159][acardace] Add support for USB disks
[PR #8006][lyarwood] AutoattachInputDevice has been added to Devices allowing an Input device to be automatically attached to a VirtualMachine on start up. PreferredAutoattachInputDevice has also been added to DevicePreferences allowing users to control this behaviour with a set of preferences.
[PR #8134][arnongilboa] Support DataVolume garbage collection
[PR #8157][StefanKro] TrilioVault for Kubernetes now supports KubeVirt for backup and recovery.
[PR #8273][alaypatel07] add server-side validations for spec.topologySpreadConstraints during object creation
[PR #8049][alicefr] Set RunAsNonRoot as default for the guestfs pod
[PR #8107][awels] Allow VirtualMachineSnapshot as a VirtualMachineExport source
[PR #7846][janeczku] Added support for configuring topology spread constraints for virtual machines.
[PR #8215][alaypatel07] support validation for spec.affinity fields during vmi creation
[PR #8071][oshoval] Relax networkInterfaceMultiqueue semantics: multi queue will configure only what it can (virtio interfaces).
[PR #7549][akrejcir] Added new API subresources to expand instancetype and preference.
[PR #7599][iholder-redhat] Introduce a mechanism to abort non-running migrations - fixes \"Unable to cancel live-migration if virt-launcher pod in pending state\" bug
[PR #8027][alaypatel07] Wait for deletion to succeed all the way until objects are finalized in perfscale tests
[PR #8198][rmohr] Improve path handling for non-root virt-launcher workloads
[PR #8136][iholder-redhat] Fix cgroups unit tests: mock out underlying runc cgroup manager
[PR #8047][iholder-redhat] Deprecate live migration feature gate
[PR #7986][iholder-redhat] [Bug-fix]: Windows VM with WSL2 guest fails to migrate
[PR #7849][AlonaKaplan] [TECH PREVIEW] Introducing passt - a new approach to user-mode networking for virtual machines
[PR #7991][ShellyKa13] Virtctl memory dump with create flag to create a new pvc
[PR #8039][lyarwood] The flavor API and associated CRDs of VirtualMachine{Flavor,ClusterFlavor} are renamed to instancetype and VirtualMachine{Instancetype,ClusterInstancetype}.
[PR #8112][AlonaKaplan] Changing the default of the virtctl expose ip-family parameter to be an empty value instead of IPv4.
[PR #8073][orenc1] Bump runc to v1.1.2
[PR #8092][Barakmor1] Bump the version of emicklei/go-restful from 2.15.0 to 2.16.0
[PR #8053][alromeros] [Bug-fix]: Fix mechanism to fetch fs overhead when CDI resource has a different name
[PR #8035][0xFelix] Add option to wrap local scp client to scp command
[PR #7981][lyarwood] Conflicts will now be raised when using flavors if the VirtualMachine defines any CPU or Memory resource requests.
[PR #8068][awels] Set cache mode to match regular disks on hotplugged disks.
[PR #7336][iholder-redhat] Introduce clone CRD, controller and API
[PR #7791][iholder-redhat] Introduction of an initial deprecation policy
[PR #7875][lyarwood] ControllerRevisions of any VirtualMachineFlavorSpec or VirtualMachinePreferenceSpec are stored during the initial start of a VirtualMachine and used for subsequent restarts ensuring changes to the original VirtualMachineFlavor or VirtualMachinePreference do not modify the VirtualMachine and the VirtualMachineInstance it creates.
[PR #7881][ShellyKa13] Enable memory dump to be included in VMSnapshot
[PR #7926][qinqon] tests: Move main clean function to global AfterEach and create a VM per each infra_test.go Entry.
[PR #7845][janeczku] Fixed a bug that caused make generate to fail when API code comments contain backticks. (#7844, @janeczku)
[PR #7932][marceloamaral] Addition of kubevirt_vmi_migration_phase_transition_time_from_creation_seconds metric to monitor how long it takes to transition a VMI Migration object to a specific phase from creation time.
[PR #7879][marceloamaral] Faster VM phase transitions thanks to an increased virt-controller QPS/Burst
[PR #7807][acardace] make cloud-init 'instance-id' persistent across reboots
[PR #7928][iholder-redhat] bugfix: node-labeller now removes \"host-model-cpu.node.kubevirt.io/\" and \"host-model-required-features.node.kubevirt.io/\" prefixes
[PR #7841][jean-edouard] Non-root VMs will now migrate to root VMs after a cluster disables non-root.
[PR #7933][akalenyu] BugFix: Fix vm restore in case of restore size bigger than PVC requested size
[PR #7919][lyarwood] Device preferences are now applied to any default network interfaces or missing volume disks added to a VirtualMachineInstance at runtime.
[PR #7910][qinqon] tests: Create the expected readiness probe instead of liveness
[PR #7732][acardace] Prevent virt-handler from starting a migration twice
[PR #7594][alicefr] Enable the libguestfs-tools pod to run as a non-root user
[PR #7811][raspbeep] User now gets information about the type of commands which the guest agent does not support.
[PR #7590][awels] VMExport allows filesystem PVCs to be exported as either disks or directories.
[PR #7683][alicefr] Add --command and --local-ssh-opts options to virtctl ssh to execute a remote command using the local ssh method
[PR #7801][VirrageS] Empty (nil values) of Address and Driver fields in XML will be omitted.
[PR #7475][raspbeep] Adds the reason of a live-migration failure to a recorded event in case EvictionStrategy is set but live-migration is blocked due to its limitations.
[PR #7739][fossedihelm] Allow virtualmachines/migrate subresource to admin/edit users
[PR #7618][lyarwood] The requirement to define a Disk or Filesystem for each Volume associated with a VirtualMachine has been removed. Any Volumes without a Disk or Filesystem defined will have a Disk defined within the VirtualMachineInstance at runtime.
[PR #7529][xpivarc] NoReadyVirtController and NoReadyVirtOperator should be properly fired.
[PR #7465][machadovilaca] Add metrics for migrations and respective phases
[PR #7592][akalenyu] BugFix: virtctl guestfs incorrectly assumes image name
[PR #7533][akalenyu] Add several VM snapshot metrics
[PR #7574][rmohr] Pull in cdi dependencies with minimized transitive dependencies to ease API adoption
[PR #7318][iholder-redhat] Snapshot restores now support restoring to a target VM different than the source
[PR #7474][borod108] Added the following metrics for live migration: kubevirt_migrate_vmi_data_processed_bytes, kubevirt_migrate_vmi_data_remaining_bytes, kubevirt_migrate_vmi_dirty_memory_rate_bytes
[PR #7441][rmohr] Add virtctl scp to ease copying files from and to VMs and VMIs
[PR #7265][rthallisey] Support steady-state job types in the load-generator tool
[PR #7544][fossedihelm] Upgraded go version to 1.17.8
[PR #7582][acardace] Fix migrations being reported as failed when they were actually successful.
[PR #7546][0xFelix] Update virtio-container-disk to virtio-win version 0.1.217-1
[PR #7493][davidvossel] Adds new EvictionStrategy \"External\" for blocking eviction which is handled by an external controller
[PR #7563][akalenyu] Switch VolumeSnapshot to v1
[PR #7406][acardace] Reject LiveMigrate as a workload-update strategy if the LiveMigration feature gate is not enabled.
[PR #7103][jean-edouard] Non-persistent vTPM now supported. Keep in mind that the state of the TPM is wiped after each shutdown. Do not enable Bitlocker!
[PR #7277][andreabolognani] This version of KubeVirt includes upgraded virtualization technology based on libvirt 8.0.0 and QEMU 6.2.0.
[PR #7130][Barakmor1] Add field to kubevirtCR to set Prometheus ServiceMonitor object's namespace
[PR #7401][iholder-redhat] virt-api deployment is now scalable - replicas are determined by the number of nodes in the cluster
[PR #7500][awels] BugFix: Fixed RBAC for admin/edit user to allow virtualmachine/addvolume and removevolume. This allows for persistent disks
[PR #7328][apoorvajagtap] Don't ignore --identity-file when setting --local-ssh=true on virtctl ssh
[PR #7469][xpivarc] Users can now enable the NonRoot feature gate instead of NonRootExperimental
[PR #7451][fossedihelm] Reduce virt-launcher memory usage by splitting monitoring and launcher processes
[PR #7024][fossedihelm] Add a warning message if the client and server virtctl versions are not aligned
[PR #7486][rmohr] Move stable.txt location to a more appropriate path
[PR #7372][saschagrunert] Fixed KubeVirtComponentExceedsRequestedMemory alert complaining about many-to-many matching not allowed.
[PR #7426][iholder-redhat] Add warning for manually determining core-component replica count in Kubevirt CR
[PR #7424][maiqueb] Provide interface binding types descriptions, which will be featured in the KubeVirt API.
[PR #7422][orelmisan] Fixed setting custom guest pciAddress and bootOrder parameter(s) to a list of SR-IOV NICs.
[PR #7421][rmohr] Fix known_hosts file corruption for virtctl ssh
[PR #6854][rmohr] Make virtctl ssh work with ssh-rsa+ preauthentication
[PR #7267][iholder-redhat] Applied migration configurations can now be found in VMI's status
[PR #7321][iholder-redhat] [Migration Policies]: precedence to VMI labels over Namespace labels
[PR #7326][oshoval] The Ginkgo dependency has been upgraded to v2.1.3 (major version upgrade)
[PR #7361][SeanKnight] Fixed a bug that prevents virtctl from working with clusters accessed via Rancher authentication proxy, or any other cluster where the server URL contains a path component. (#3760)
[PR #7255][tyleraharrison] Users are now able to specify --address [ip_address] when using virtctl vnc rather than only using 127.0.0.1
[PR #7275][enp0s3] Add observedGeneration to virt-operator to have a race-free way to detect KubeVirt config rollouts
[PR #7233][xpivarc] Bug fix: Successfully aborted migrations should be reported now
[PR #7158][AlonaKaplan] Add masquerade VMs support to single stack IPv6.
[PR #7227][rmohr] Remove VMI informer from virt-api to improve scaling characteristics of virt-api
[PR #7288][raspbeep] Users now don't need to specify container for kubectl logs <vmi-pod> and kubectl exec <vmi-pod>.
[PR #6709][xpivarc] Workloads will be migrated to nonroot implementation if NonRoot feature gate is set. (Except VirtioFS)
[PR #7241][lyarwood] Fixed a bug that prevented only an unattend.xml configmap or secret being provided as contents for a sysprep disk. (#7240, @lyarwood)
[PR #7000][iholder-redhat] Adds a possibility to override default libvirt log filters through VMI annotations
[PR #7064][davidvossel] Fixes issue associated with blocked uninstalls when VMIs exist during removal
[PR #7097][iholder-redhat] [Bug fix] VMI with kernel boot stuck on \"Terminating\" status if more disks are defined
[PR #6700][VirrageS] Simplify replacing time.Ticker in agent poller and fix default values for qemu-*-interval flags
[PR #6581][ormergi] SRIOV network interfaces are now hot-plugged when disconnected manually or due to aborted migrations.
[PR #6924][EdDev] Support for legacy GPU definition is removed. Please see https://kubevirt.io/user-guide/virtual_machines/host-devices on how to define host-devices.
[PR #6735][uril] The command migrate_cancel was added to virtctl. It cancels an active VM migration.
[PR #6883][rthallisey] Add instance-type to cloud-init metadata
[PR #6999][maya-r] When expanding disk images, take the minimum between the request and the capacity - avoid using the full underlying file system on storage like NFS, local.
[PR #6946][vladikr] Numa information of an assigned device will be presented in the devices metadata
[PR #6042][iholder-redhat] Fully support cgroups v2, include a new cohesive package and perform major refactoring.
[PR #6968][vladikr] Added Writeback disk cache support
[PR #6995][sradco] Alert OrphanedVirtualMachineImages name was changed to OrphanedVirtualMachineInstances.
[PR #6923][rhrazdil] Fix issue with ssh being unreachable on VMIs with Istio proxy
[PR #6821][jean-edouard] Migrating VMIs that contain dedicated CPUs will now have properly dedicated CPUs on target
[PR #6793][oshoval] Add infoSource field to vmi.status.interfaces.
[PR #7004][iholder-redhat] Bugfix: Avoid setting block migration for volumes used by read-only disks
[PR #6959][enp0s3] generate event when target pod enters unschedulable phase
[PR #6888][assafad] Added common labels into alert definitions
[PR #6166][vasiliy-ul] Experimental support of AMD SEV
[PR #6980][vasiliy-ul] Updated the dependencies to include the fix for CVE-2021-43565 (KubeVirt is not affected)
[PR #6944][iholder-redhat] Remove disabling TLS configuration from Live Migration Policies
[PR #6800][jean-edouard] CPU pinning doesn't require hardware-assisted virtualization anymore
[PR #6501][ShellyKa13] Use virtctl image-upload to upload archive content
[PR #6918][iholder-redhat] Bug fix: Unschedulable host-model VMI alert is now properly triggered
[PR #6796][Barakmor1] 'kubevirt-operator' changed to 'virt-operator' on 'managed-by' label in kubevirt's components made by virt-operator
[PR #6036][jean-edouard] Migrations can now be done over a dedicated multus network
[PR #6933][erkanerol] Add a new lane for monitoring tests
[PR #6949][jean-edouard] KubeVirt components should now be successfully removed on CR deletion, even when using only 1 replica for virt-api and virt-controller
[PR #6954][maiqueb] Update the virtctl exposed services IPFamilyPolicyType default to IPFamilyPolicyPreferDualStack
[PR #6931][fossedihelm] added DryRun to AddVolumeOptions and RemoveVolumeOptions
[PR #6399][iholder-redhat] Introduce live migration policies that allow system-admins to have fine-grained control over migration configuration for different sets of VMs.
[PR #6880][iholder-redhat] Add full Podman support for make and make test
[PR #6702][acardace] implement virt-handler canary upgrade and rollback for faster and safer rollouts
[PR #6717][davidvossel] Introducing the VirtualMachinePools feature for managing stateful VMs at scale
[PR #6698][rthallisey] Add tracing to the virt-controller work queue
[PR #6762][fossedihelm] Added DryRun mode to the virtctl migrate command
[PR #6891][rmohr] Fix \"Make raw terminal failed: The handle is invalid?\" issue with \"virtctl console\" when not executed in a pty
[PR #6783][rmohr] Skip SSH RSA auth if no RSA key was explicitly provided and no key exists at the default location
[PR #6191][marceloamaral] Addition of perfscale-load-generator to perform stress tests to evaluate the control plane
[PR #6248][VirrageS] Reduced logging in hot paths
[PR #6079][weihanglo] Hotplug volume can be unplugged at anytime and reattached after a VM restart.
[PR #6101][rmohr] Make k8s client rate limits configurable
[PR #6204][sradco] This PR adds to each alert the runbook url that points to a runbook that provides additional details on each alert and how to mitigate it.
[PR #5974][vladikr] a list of desired mdev types can now be provided in KubeVirt CR to kubevirt to configure these devices on relevant nodes
[PR #6147][rmohr] Fix rbac permissions for freeze/unfreeze, addvolume/removevolume, guestosinfo, filesystemlist and userlist
[PR #6161][ashleyschuett] Remove HostDevice validation on VMI creation
[PR #6078][zcahana] Report ErrImagePull/ImagePullBackOff VM status when image pull errors occur
[PR #6176][kwiesmueller] Fix goroutine leak in virt-handler, potentially causing issues with a high turnover of VMIs.
[PR #6047][ShellyKa13] Add phases to the vm snapshot api, specifically a failure phase
[PR #6058][acardace] Fix virt-launcher exit pod race condition
[PR #6035][davidvossel] Addition of perfscale-audit tool for auditing performance of control plane during stress tests
[PR #6145][acardace] virt-launcher: disable unencrypted TCP socket for libvirtd.
[PR #6163][davidvossel] Handle qemu processes in defunct (zombie) state
[PR #6105][ashleyschuett] Add VirtualMachineInstancesPerNode to KubeVirt CR under Spec.Configuration
[PR #6104][zcahana] Report FailedUnschedulable VM status when scheduling errors occur
[PR #5905][davidvossel] VM CrashLoop detection and Exponential Backoff
[PR #6070][acardace] Initiate Live-Migration using a unix socket (exposed by virt-handler) instead of an additional TCP<->Unix migration proxy started by virt-launcher
[PR #5728][vasiliy-ul] Live migration of VMs with hotplug volumes is now enabled
[PR #6109][rmohr] Fix virt-controller SCC: Reflect the need for NET_BIND_SERVICE in the virt-controller SCC.
[PR #5942][ShellyKa13] Integrate guest agent to online VM snapshot
[PR #6034][ashleyschuett] Go version updated to version 1.16.6
[PR #6040][yuhaohaoyu] Improved debuggability by keeping the environment of a failed VMI alive.
[PR #6068][dhiller] Add check that not all tests have been skipped
[PR #6041][xpivarc] [Experimental] Virt-launcher can run as non-root user
[PR #6062][iholder-redhat] replace dead \"stress\" binary with new, maintained, \"stress-ng\" binary
[PR #6029][mhenriks] CDI to 1.36.0 with DataSource support
[PR #4089][victortoso] Add support to USB Redirection with usbredir
[PR #5946][vatsalparekh] Add guest-agent based ping probe
[PR #6005][acardace] make containerDisk validation memory usage limit configurable
[PR #5791][zcahana] Added a READY column to the tabular output of \"kubectl get vm/vmi\"
[PR #6006][awels] DataVolumes created by DataVolumeTemplates will follow the associated VMs priority class.
[PR #5891][akalenyu] BugFix: Pending VMIs when creating concurrent bulk of VMs backed by WFFC DVs
[PR #5925][rhrazdil] Fix issue with Windows VMs not being assigned IP address configured in network-attachment-definition IPAM.
[PR #6007][rmohr] Fix: The bandwidth limitation on migrations is no longer ignored. Caution: The default bandwidth limitation of 64Mi is changed to \"unlimited\" to not break existing installations.
[PR #4944][kwiesmueller] Add /portforward subresource to VirtualMachine and VirtualMachineInstance that can tunnel TCP traffic through the API Server using a websocket stream.
[PR #5402][alicefr] Integration of libguestfs-tools and added new command guestfs to virtctl
[PR #5953][ashleyschuett] Allow Failed VMs to be stopped when using --force --gracePeriod 0
[PR #5876][mlsorensen] KubeVirt CR supports specifying a runtime class for virt-launcher pods via 'launcherRuntimeClass'.
[PR #5952][mhenriks] Use CDI beta API. CDI v1.20.0 is now the minimum requirement for kubevirt.
[PR #5846][rmohr] Add \"spec.cpu.numaTopologyPassthrough\" which allows emulating a host-aligned virtual numa topology for high performance
[PR #5894][rmohr] Add spec.migrations.disableTLS to the KubeVirt CR to allow disabling encrypted migrations. They stay secure by default.
[PR #5649][awels] Enhancement: remove one attachment pod per disk limit (behavior on upgrade with running VM with hotplugged disks is undefined)
[PR #5742][rmohr] VMIs which choose evictionStrategy LiveMigrate and request the invtsc cpuflag are now live-migrateable
[PR #5911][dhiller] Bumps kubevirtci, also suppresses kubectl.sh output to avoid confusing checks
[PR #5863][xpivarc] Fix: ioerrors don't cause crash-looping of notify server
[PR #5867][mlsorensen] New build target added to export virt-* images as a tar archive.
[PR #5766][davidvossel] Addition of kubevirt_vmi_phase_transition_seconds_since_creation to monitor how long it takes to transition a VMI to a specific phase from creation time.
[PR #5823][dhiller] Change default branch to main for kubevirt/kubevirt repository
[PR #5763][nunnatsa] Fix bug 1945589: Prevent migration of VMIs that use virtiofs
[PR #5827][mlsorensen] Auto-provisioned disk images on empty PVCs now leave 128KiB unused to avoid edge cases that run the volume out of space.
[PR #5849][davidvossel] Fixes event recording causing a segfault in virt-controller
[PR #5797][rhrazdil] Add serviceAccountDisk automatically when Istio is enabled in VMI annotations
[PR #5723][ashleyschuett] Allow virtctl to stop VM and ignore the graceful shutdown period
[PR #5806][mlsorensen] configmap, secret, and cloud-init raw disks now work when underlying node storage has 4k blocks.
[PR #5623][iholder-redhat] [bugfix]: Allow migration of VMs with host-model CPU to migrate only for compatible nodes
[PR #5716][rhrazdil] Fix issue with virt-launcher becoming NotReady after migration when Istio is used.
[PR #5778][ashleyschuett] Update ca-bundle if it is unable to be parsed
[PR #5787][acardace] migrated references of authorization/v1beta1 to authorization/v1
[PR #5461][rhrazdil] Add support for Istio proxy when no explicit ports are specified on masquerade interface
[PR #5751][acardace] EFI VMIs with secureboot disabled can now be booted even when only OVMF_CODE.secboot.fd and OVMF_VARS.fd are present in the virt-launcher image
[PR #5629][andreyod] Support starting Virtual Machine with its guest CPU paused using virtctl start --paused
[PR #5725][dhiller] Generate REST API coverage report after functional tests
[PR #5758][davidvossel] Fixes kubevirt_vmi_phase_count to include all phases, even those that occur before handler hand off.
[PR #5745][ashleyschuett] Alert when resource usage exceeds resource requests
[PR #5759][mhenriks] Update CDI to 1.34.1
[PR #5038][kwiesmueller] Add exec command to VM liveness and readinessProbe executed through the qemu-guest-agent.
[PR #5431][alonSadan] Add NFT and IPTables rules to allow port-forward to non-declared ports on the VMI. Declaring ports on VMI will limit
[PR #5738][rmohr] Stop releasing jinja2 templates of our operator. Kustomize is the preferred way for customizations.
[PR #5691][ashleyschuett] Allow multiple shutdown events to ensure the event is received by ACPI
[PR #5558][ormergi] Drop virt-launcher SYS_RESOURCE capability
[PR #5694][davidvossel] Fixes null pointer dereference in migration controller
[PR #5416][iholder-redhat] [feature] support booting VMs from a custom kernel/initrd images with custom kernel arguments
[PR #5495][iholder-redhat] Go version updated to version 1.16.1.
[PR #5502][rmohr] Add downwardMetrics volume to expose a limited set of host metrics to guests
[PR #5601][maya-r] Update libvirt-go to 7.3.0
[PR #5661][davidvossel] Validation/Mutation webhooks now explicitly define a 10 second timeout period
[PR #5652][rmohr] Automatically discover kube-prometheus installations and configure kubevirt monitoring
[PR #5631][davidvossel] Expand backport policy to include logging and debug fixes
[PR #5528][zcahana] Introduced a \"status.printableStatus\" field in the VirtualMachine CRD. This field is now displayed in the tabular output of \"kubectl get vm\".
[PR #5200][rhrazdil] Add support for Istio proxy traffic routing with masquerade interface. nftables is required for this feature.
[PR #5560][oshoval] virt-launcher now populates domain's guestOS info and interfaces status according to the guest agent also when doing periodic resyncs.
[PR #5514][rhrazdil] Fix live-migration failing when VM with masquerade iface has explicitly specified any of these ports: 22222, 49152, 49153
[PR #5583][dhiller] Reenable coverage
[PR #5129][davidvossel] Gracefully shutdown virt-api connections and ensure zero exit code under normal shutdown conditions
[PR #5582][dhiller] Fix flaky unit tests
[PR #5600][davidvossel] Improved logging around VM/VMI shutdown and restart
[PR #5564][omeryahud] virtctl rename support is dropped
[PR #5585][iholder-redhat] [bugfix] - reject VM defined with volume with no matching disk
[PR #5595][zcahana] Fixes adoption of orphan DataVolumes
[PR #5566][davidvossel] Release branches are now cut on the first business day of the month rather than the first day.
[PR #5108][Omar007] Fixes handling of /proc//mountpoint by working on the device information instead of mount information
[PR #5250][mlsorensen] Controller health checks will no longer actively test connectivity to the Kubernetes API. They will rely on the health of their watches to determine if they have API connectivity.
[PR #5563][ashleyschuett] Set KubeVirt resources flags in the KubeVirt CR
[PR #5328][andreabolognani] This version of KubeVirt includes upgraded virtualization technology based on libvirt 7.0.0 and QEMU 5.2.0.
[PR #5738][rmohr] Stop releasing jinja2 templates of our operator. Kustomize is the preferred way for customizations.
[PR #5691][ashleyschuett] Allow multiple shutdown events to ensure the event is received by ACPI
[PR #5558][ormergi] Drop virt-launcher SYS_RESOURCE capability
[PR #5694][davidvossel] Fixes null pointer dereference in migration controller
[PR #5416][iholder-redhat] [feature] support booting VMs from a custom kernel/initrd images with custom kernel arguments
[PR #5495][iholder-redhat] Go version updated to version 1.16.1.
[PR #5502][rmohr] Add downwardMetrics volume to expose a limited set of host metrics to guests
[PR #5601][maya-r] Update libvirt-go to 7.3.0
[PR #5661][davidvossel] Validation/Mutation webhooks now explicitly define a 10 second timeout period
[PR #5652][rmohr] Automatically discover kube-prometheus installations and configure kubevirt monitoring
[PR #5631][davidvossel] Expand backport policy to include logging and debug fixes
[PR #5528][zcahana] Introduced a \"status.printableStatus\" field in the VirtualMachine CRD. This field is now displayed in the tabular output of \"kubectl get vm\".
[PR #5200][rhrazdil] Add support for Istio proxy traffic routing with masquerade interface. nftables is required for this feature.
[PR #5560][oshoval] virt-launcher now populates domain's guestOS info and interfaces status according to the guest agent also when doing periodic resyncs.
[PR #5514][rhrazdil] Fix live-migration failing when VM with masquerade iface has explicitly specified any of these ports: 22222, 49152, 49153
[PR #5583][dhiller] Reenable coverage
[PR #5129][davidvossel] Gracefully shutdown virt-api connections and ensure zero exit code under normal shutdown conditions
[PR #5582][dhiller] Fix flaky unit tests
[PR #5600][davidvossel] Improved logging around VM/VMI shutdown and restart
[PR #5564][omeryahud] virtctl rename support is dropped
[PR #5585][iholder-redhat] [bugfix] - reject VM defined with volume with no matching disk
[PR #5595][zcahana] Fixes adoption of orphan DataVolumes
[PR #5566][davidvossel] Release branches are now cut on the first business day of the month rather than the first day.
[PR #5108][Omar007] Fixes handling of /proc//mountpoint by working on the device information instead of mount information
[PR #5250][mlsorensen] Controller health checks will no longer actively test connectivity to the Kubernetes API. They will rely on the health of their watches to determine if they have API connectivity.
[PR #5563][ashleyschuett] Set KubeVirt resources flags in the KubeVirt CR
[PR #5328][andreabolognani] This version of KubeVirt includes upgraded virtualization technology based on libvirt 7.0.0 and QEMU 5.2.0.
[PR #6196][ashleyschuett] Allow multiple shutdown events to ensure the event is received by ACPI
[PR #6194][kubevirt-bot] Allow Failed VMs to be stopped when using --force --gracePeriod 0
[PR #6039][akalenyu] BugFix: Pending VMIs when creating concurrent bulk of VMs backed by WFFC DVs
[PR #5917][davidvossel] Fixes event recording causing a segfault in virt-controller
[PR #5886][ashleyschuett] Allow virtctl to stop VM and ignore the graceful shutdown period
[PR #5866][xpivarc] Fix: Kubevirt build with golang 1.14+ will not fail on validation of container disk with memory allocation error
[PR #5873][kubevirt-bot] Update ca-bundle if it is unable to be parsed
[PR #5822][kubevirt-bot] migrated references of authorization/v1beta1 to authorization/v1
[PR #5704][davidvossel] Fix virt-controller clobbering in progress vmi migration state during virt handler handoff
[PR #5707][kubevirt-bot] Fixes null pointer dereference in migration controller
[PR #5685][stu-gott] [bugfix] - reject VM defined with volume with no matching disk
[PR #5670][stu-gott] Validation/Mutation webhooks now explicitly define a 10 second timeout period
[PR #5653][kubevirt-bot] virt-launcher now populates domain's guestOS info and interfaces status according to the guest agent also when doing periodic resyncs.
[PR #5644][kubevirt-bot] Fix live-migration failing when VM with masquerade iface has explicitly specified any of these ports: 22222, 49152, 49153
[PR #5646][kubevirt-bot] virtctl rename support is dropped
[PR #5467][rmohr] Fixes upgrades from KubeVirt v0.36
[PR #5350][jean-edouard] Removal of entire permittedHostDevices section will now remove all user-defined host device plugins.
[PR #5242][jean-edouard] Creating more than 1 migration at the same time for a given VMI will now fail
[PR #4907][vasiliy-ul] Initial cgroupv2 support
[PR #5324][jean-edouard] Default feature gates can now be defined in the provider configuration.
[PR #5006][alicefr] Add discard=unmap option
[PR #5022][davidvossel] Fixes race condition between operator adding service and webhooks that can result in installs/uninstalls failing
[PR #5310][ashleyschuett] Reconcile CRD resources
[PR #5102][iholder-redhat] Go version updated to 1.14.14
[PR #4746][ashleyschuett] Reconcile Deployments, DaemonSets, MutatingWebhookConfigurations and ValidatingWebhookConfigurations
[PR #5037][ormergi] Hot-plug SR-IOV VF interfaces to VMs after a successful migration.
[PR #5269][mlsorensen] Prometheus metrics scraped from virt-handler are now served from the VMI informer cache, rather than calling back to the Kubernetes API for VMI information.
[PR #5138][davidvossel] virt-handler now waits up to 5 minutes for all migrations on the node to complete before shutting down.
[PR #5191][yuvalturg] Added a metric for monitoring CPU affinity
[PR #5215][xphyr] Enable detection of Intel GVT-g vGPU.
[PR #4760][rmohr] Make virt-handler heartbeat more efficient and robust: Only one combined PATCH and no need to detect different cluster types anymore.
[PR #5091][iholder-redhat] QEMU SeaBios debug logs are being seen as part of virt-launcher log.
[PR #5221][rmohr] Remove workload placement validation webhook which blocks placement updates when VMIs are running
[PR #5128][yuvalturg] Modified memory related metrics by adding several new metrics and splitting the swap traffic bytes metric
[PR #5084][ashleyschuett] Add validation to CustomizeComponents object on the KubeVirt resource
[PR #5182][davidvossel] New [release-blocker] functional test marker to signify tests that can never be disabled before making a release
[PR #5137][davidvossel] Added our policy around release branch backporting in docs/release-branch-backporting.md
[PR #5096][yuvalturg] Modified networking metrics by adding new metrics, splitting existing ones by rx/tx and using the device alias for the interface name when available
[PR #5088][awels] Hotplug works with hostpath storage.
[PR #4908][dhiller] Move travis tag and master builds to kubevirt prow.
[PR #4741][EdDev] Allow live migration for SR-IOV VM/s without preserving the VF interfaces.
[PR #6597][jean-edouard] VMs with cloud-init data should now properly migrate from older KubeVirt versions
[PR #5854][rthallisey] Prometheus metrics scraped from virt-handler are now served from the VMI informer cache, rather than calling back to the Kubernetes API for VMI information.
[PR #5561][kubevirt-bot] Fix docker save issues with kubevirt images
[PR #5010][jean-edouard] Migrated VMs stay persistent and can therefore survive S3, among other things.
[PR #4952][ashleyschuett] Create warning NodeUnresponsive event if a node is running a VMI pod but not a virt-handler pod
[PR #4686][davidvossel] Automated workload updates via new KubeVirt WorkloadUpdateStrategy API
[PR #4886][awels] Hotplug support for WFFC datavolumes.
[PR #5026][AlonaKaplan] virt-launcher, masquerade binding - prefer nft over iptables.
[PR #4921][borod108] Added support for Sysprep in the API. A user can now add an answer file through a ConfigMap or a Secret. The User Guide is updated accordingly.
[PR #4874][ormergi] Add new feature-gate SRIOVLiveMigration
[PR #4917][iholder-redhat] Now it is possible to enable QEMU SeaBios debug logs setting virt-launcher log verbosity to be greater than 5.
[PR #4966][arnongilboa] Solve virtctl \"Error when closing file ... file already closed\" that shows after successful image upload
[PR #4489][salanki] Fix a bug where a disk.img file was created on filesystems mounted via Virtio-FS
[PR #4982][xpivarc] Fixing handling of transient domain
[PR #4984][ashleyschuett] Change customizeComponents.patches such that '*' resourceName or resourceType matches all, all fields of a patch (type, patch, resourceName, resourceType) are now required.
[PR #4972][vladikr] allow disabling pvspinlock to support older guest kernels
[PR #4927][yuhaohaoyu] Fix of XML and JSON marshalling/unmarshalling for user defined device alias names which can make migrations fail.
[PR #4552][rthallisey] VMs using bridged networking will survive a kubelet restart by having kubevirt create a dummy interface on the virt-launcher pods, so that some Kubernetes CNIs, that have implemented the CHECK RPC call, will not cause VMI pods to enter a failed state.
[PR #4883][iholder-redhat] Bug fixed: Enabling libvirt debug logs only if debugLogs label value is \"true\", disabling otherwise.
[PR #4840][alicefr] Generate k8s events on IO errors
[PR #4940][vladikr] permittedHostDevices will support both upper and lowercase letters in the device ID
[PR #6596][jean-edouard] VMs with cloud-init data should now properly migrate from older KubeVirt versions
[PR #5853][rthallisey] Prometheus metrics scraped from virt-handler are now served from the VMI informer cache, rather than calling back to the Kubernetes API for VMI information.
[PR #4571][yuvalturg] Added os, workflow and flavor labels to the kubevirt_vmi_phase_count metric
[PR #4659][salanki] Fixed an issue where non-root users inside a guest could not write to a Virtio-FS mount.
[PR #4844][xpivarc] Fixed limits/requests to accept int again
[PR #4850][rmohr] virtio-scsi now respects the useTransitionalVirtio flag instead of assigning a virtio version depending on the machine layout
[PR #4672][vladikr] allow increasing logging verbosity of infra components in KubeVirt CR
[PR #4838][rmohr] Fix an issue where it may not be able to update the KubeVirt CR after creation for up to minutes due to certificate propagation delays
[PR #4806][rmohr] Make the mutating webhooks for VMIs and VMs required to avoid letting entities into the cluster which are not properly defaulted
[PR #4779][brybacki] Error message on virtctl image-upload to WaitForFirstConsumer DV
[PR #4749][davidvossel] KUBEVIRT_CLIENT_GO_SCHEME_REGISTRATION_VERSION env var for specifying exactly what client-go scheme version is registered
[PR #4772][jean-edouard] Faster VMI phase transitions thanks to an increased number of VMI watch threads in virt-controller
[PR #4730][rmohr] Add spec.domain.devices.useVirtioTransitional boolean to support virtio-transitional for old guests
[PR #4872][kubevirt-bot] Add spec.domain.devices.useVirtioTransitional boolean to support virtio-transitional for old guests
[PR #4855][kubevirt-bot] Fix an issue where it may not be able to update the KubeVirt CR after creation for up to minutes due to certificate propagation delays
[PR #4669][kwiesmueller] Add nodeSelector to kubevirt components restricting them to run on linux nodes only.
[PR #4648][davidvossel] Update libvirt base container to be based on packages in rhel-av 8.3
[PR #4653][qinqon] Allow configure cloud-init with networkData only.
[PR #4644][ashleyschuett] Operator validation webhook will deny updates to the workloads object of the KubeVirt CR if there are running VMIs
[PR #3349][davidvossel] KubeVirt v1 GA api
[PR #4645][maiqueb] Re-introduce the CAP_NET_ADMIN, to allow migration of VMs already having it.
[PR #4546][yuhaohaoyu] Failure detection and handling for VM with EFI Insecure Boot in KubeVirt environments where EFI Insecure Boot is not supported by design.
[PR #4625][awels] virtctl upload now shows error when specifying access mode of ReadOnlyMany
[PR #4667][kubevirt-bot] Update libvirt base container to be based on packages in rhel-av 8.3
[PR #4634][kubevirt-bot] Failure detection and handling for VM with EFI Insecure Boot in KubeVirt environments where EFI Insecure Boot is not supported by design.
[PR #4647][kubevirt-bot] Re-introduce the CAP_NET_ADMIN, to allow migration of VMs already having it.
[PR #4458][awels] It is now possible to hotplug DataVolume and PVC volumes into a running Virtual Machine.
[PR #4025][brybacki] Adds a special handling for DataVolumes in WaitForFirstConsumer state to support CDI's delayed binding mode.
[PR #4217][mfranczy] Set only an IP address for interfaces reported by qemu-guest-agent. Previously that was CIDR.
[PR #4195][davidvossel] AccessCredentials API for dynamic user/password and ssh public key injection
[PR #4335][oshoval] VMI status displays SRIOV interfaces with their network name only when they have originally
[PR #4408][andreabolognani] This version of KubeVirt includes upgraded virtualization technology based on libvirt 6.6.0 and QEMU 5.1.0.
[PR #4514][ArthurSens] domain label removed from metric kubevirt_vmi_memory_unused_bytes
[PR #4542][danielBelenky] Fix double migration on node evacuation
[PR #4506][maiqueb] Remove CAP_NET_ADMIN from the virt-launcher pod.
[PR #4501][AlonaKaplan] CAP_NET_RAW removed from virt-launcher.
[PR #4488][salanki] Disable Virtio-FS metadata cache to prevent OOM conditions on the host.
[PR #3937][vladikr] Generalize host devices assignment. Provides an interface between kubevirt and external device plugins. Provides a mechanism for accesslisting host devices.
[PR #4443][rmohr] All KubeVirt webhooks now support dry-runs.
[PR #4409][vladikr] Increase the static memory overhead by 10Mi
[PR #4272][maiqueb] Add ip-family to the virtctl expose command.
[PR #4398][rmohr] VMIs reflect deleted stuck virt-launcher pods with the \"PodTerminating\" reason in the ready condition. The VMIRs detects this reason and immediately creates replacement VMIs.
[PR #4393][salanki] Disable legacy service links in virt-launcher Pods to speed up Pod instantiation and decrease Kubelet load in namespaces with many services.
[PR #2935][maiqueb] Add the macvtap bind mechanism.
[PR #4132][mstarostik] fixes a bug that prevented unique device name allocation when configuring both SCSI and SATA drives
[PR #3257][xpivarc] Added support of kubectl explain for Kubevirt resources.
[PR #4288][ezrasilvera] Adding DownwardAPI volumes type
[PR #4233][maya-r] Update base image used for pods to Fedora 31.
[PR #4192][xpivarc] We now run gosec in Kubevirt
[PR #4328][stu-gott] Version 2.x QEMU guest agents are supported.
[PR #4289][AlonaKaplan] Masquerade binding - set the virt-launcher pod interface MTU on the bridge.
[PR #4300][maiqueb] Update the NetworkInterfaceMultiqueue openAPI documentation to better specify its semantics within KubeVirt.
[PR #4277][awels] PVCs populated by DVs are now allowed as volumes.
[PR #4265][dhiller] Fix virtctl help text when running as a plugin
[PR #4273][dhiller] Only run Travis build for PRs against release branches
[PR #4315][kubevirt-bot] PVCs populated by DVs are now allowed as volumes.
[PR #3837][jean-edouard] VM interfaces with no bootOrder will no longer be candidates for boot when using the BIOS bootloader, as documented
[PR #3879][ashleyschuett] KubeVirt should now be configured through the KubeVirt CR configuration key. The usage of the kubevirt-config configMap will be deprecated in the future.
[PR #4074][stu-gott] Fixed bug preventing non-admin users from pausing/unpausing VMs
[PR #4016][ashleyschuett] Allow for post copy VMI migrations
[PR #4235][davidvossel] Fixes timeout failure that occurs when pulling large containerDisk images
[PR #4263][rmohr] Add readiness and liveness probes to virt-handler, to clearly indicate readiness
[PR #4248][maiqueb] always compile KubeVirt with selinux support on pure go builds.
[PR #4012][danielBelenky] Added support for the eviction API for VMIs with eviction strategy. This enables VMIs to be live-migrated when the node is drained or when the descheduler wants to move a VMI to a different node.
[PR #4075][ArthurSens] Metric kubevirt_vmi_vcpu_seconds' state label is now exposed as a human-readable state instead of an integer
[PR #4162][vladikr] introduce a cpuAllocationRatio config parameter to normalize the number of CPUs requested for a pod, based on the number of vCPUs
[PR #4177][maiqueb] Use vishvananda/netlink instead of songgao/water to create tap devices.
[PR #4092][stu-gott] Allow specifying nodeSelectors, affinity and tolerations to control where KubeVirt components will run
[PR #3927][ArthurSens] Adds new metric kubevirt_vmi_memory_unused_bytes
[PR #3493][vladikr] virtio-fs is being added as experimental, protected by a feature-gate that needs to be enabled in the kubevirt config by the administrator
[PR #4193][mhenriks] Add snapshot.kubevirt.io to admin/edit/view roles
[PR #4149][qinqon] Bump kubevirtci to k8s-1.19
[PR #3471][crobinso] Allow hiding that the VM is running on KVM, so that Nvidia graphics cards can be passed through
[PR #4115][phoracek] Add conformance automation and manifest publishing
[PR #3733][davidvossel] each PRs description.
[PR #4082][mhenriks] VirtualMachineRestore API and implementation
[PR #4154][davidvossel] Fixes issue with Service endpoints not being updated properly in place during KubeVirt updates.
[PR #3289][vatsalparekh] Add option to run only VNC Proxy in virtctl
[PR #4027][alicefr] Added memfd as default memory backend for hugepages. This introduces the new annotation kubevirt.io/memfd to disable memfd as default and fallback to the previous behavior.
[PR #3612][ashleyschuett] Adds customizeComponents to the kubevirt api
[PR #4029][cchengleo] Fix an issue which prevented virt-operator from installing monitoring resources in custom namespaces.
[PR #4031][rmohr] Initial support for sonobuoy for conformance testing
[PR #3226][vatsalparekh] Added tests to verify custom pciAddress slots and function
[PR #4048][davidvossel] Improved reliability for failed migration retries
[PR #3585][mhenriks] \"virtctl image-upload pvc ...\" will create the PVC if it does not exist
[PR #3945][xpivarc] KubeVirt is now being built with Go1.13.14
[PR #3845][ArthurSens] action required: The domain label from VMI metrics is being removed and may break dashboards that use the domain label to identify VMIs. Use name and namespace labels instead
[PR #4011][dhiller] ppc64le arch has been disabled for the moment, see https://github.com/kubevirt/kubevirt/issues/4037
[PR #3875][stu-gott] Resources created by KubeVirt are now labelled more clearly in terms of relationship and role.
[PR #3791][ashleyschuett] Mark node as kubevirt.io/schedulable=false on virt-handler restart
[PR #3998][vladikr] the local provider is usable again.
[PR #3290][maiqueb] Have virt-handler (KubeVirt agent) create the tap devices on behalf of the virt-launchers.
[PR #3957][AlonaKaplan] virt-launcher supports IPv6 on dual-stack clusters.
[PR #3952][davidvossel] Fixes rare situation where vmi may not properly terminate if failure occurs before domain starts.
[PR #3973][xpivarc] Fixes VMs with clock.timezone set.
[PR #3923][danielBelenky] Add support to configure QEMU I/O mode for VMIs
[PR #3889][rmohr] The status fields for our CRDs are now protected on normal PATCH and PUT operations. The /status subresource is now used where possible for status updates.
[PR #3921][vladikr] use correct memory units in libvirt xml
[PR #3893][davidvossel] Adds recurring period that rsyncs virt-launcher domains with virt-handler
[PR #3880][sgarbour] Better error message when input parameters are not the expected number of parameters for each argument. Help menu will popup in case the number of parameters is incorrect.
[PR #3785][xpivarc] Vcpu wait metrics available
[PR #3642][vatsalparekh] Add a way to update VMI Status with latest Pod IP for Masquerade bindings
[PR #3636][ArthurSens] Adds kubernetes metadata.labels as VMI metrics' label
[PR #3825][awels] Virtctl now prints error messages from the response body on upload errors.
[PR #3830][davidvossel] Fixes re-establishing domain notify client connections when domain notify server restarts due to an error event.
[PR #3778][danielBelenky] Do not emit a SyncFailed event if we fail to sync a VMI in a final state
[PR #3803][andreabolognani] Not sure what to write here (see above)
[PR #2694][rmohr] Use native go libraries for selinux to not rely on python-selinux tools like semanage, which are not always present.
[PR #3692][victortoso] QEMU logs can now be fetched from outside the pod
[PR #3738][enp0s3] Restrict creation of VMI if it has labels that are used internally by Kubevirt components.
[PR #3725][danielBelenky] The tests binary is now part of the release and can be consumed from the GitHub release page.
[PR #3684][rmohr] Log if critical devices, like kvm, which virt-handler wants to expose are not present on the node.
[PR #3166][petrkotas] Introduce new virtctl commands:
[PR #3708][andreabolognani] Make qemu work on GCE by pulling in a fix for https://bugzilla.redhat.com/show_bug.cgi?id=1822682
Container disks are now secure and they are not copied anymore on every start. Old container disks can still be used in the same secure way, but new container disks can't be used on older kubevirt releases
Create specific SecurityContextConstraints on OKD instead of using the privileged SCC
Added clone authorization check for DataVolumes with PVC source
The sidecar feature is feature-gated now
Use container image shasums instead of tags for KubeVirt deployments
Protect control plane components against voluntary evictions with a PodDisruptionBudget of MinAvailable=1
Replaced hardcoded virtctl by using the basename of the call, this enables nicer output when installed via krew plugin package manager
Added RNG device to all Fedora VMs in tests and examples (newer kernels might block booting while waiting for entropy)
The virtual memory is now set to match the memory limit, if memory limit is specified and guest memory is not
Support nftable for CoreOS
Added a block-volume flag to the virtctl image-upload command
Improved virtctl console/vnc data flow
Removed DataVolumes feature gate in favor of auto-detecting CDI support
Removed SR-IOV feature gate, it is enabled by default now
VMI-related metrics have been renamed from kubevirt_vm_ to kubevirt_vmi_ to better reflect their purpose
Added metric to report the VMI count
Improved integration with HCO by adding a CSV generator tool and modified KubeVirt CR conditions
KubeVirt has a set of features that are not mature enough to be enabled by default. As such, they are protected by a Kubernetes concept called feature gates.
"},{"location":"cluster_admin/activating_feature_gates/#how-to-activate-a-feature-gate","title":"How to activate a feature gate","text":"
You can activate a specific feature gate directly in KubeVirt's CR, by provisioning the following yaml, which uses the LiveMigration feature gate as an example:
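A minimal sketch of such a manifest, assuming the KubeVirt CR is named kubevirt and lives in the kubevirt namespace (both names are assumptions; adjust them to your installation):

```yaml
# KubeVirt CR enabling a single feature gate.
# The CR name and namespace below are assumptions.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        # Feature gate names are case sensitive.
        - LiveMigration
```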
Note: the name of the feature gates is case sensitive.
The snippet above assumes KubeVirt is installed in the kubevirt namespace. Change the namespace to suit your installation.
"},{"location":"cluster_admin/activating_feature_gates/#list-of-feature-gates","title":"List of feature gates","text":"
The list of feature gates (which evolves over time) can be checked directly in the source code.
"},{"location":"cluster_admin/annotations_and_labels/","title":"Annotations and labels","text":"
KubeVirt builds on and exposes a number of labels and annotations that either are used for internal implementation needs or expose useful information to API users. This page documents the labels and annotations that may be useful for regular API consumers. This page intentionally does not list labels and annotations that are merely part of internal implementation.
Note: Annotations and labels that are not specific to KubeVirt are also documented here.
This label marks resources that belong to KubeVirt. An optional value may indicate which specific KubeVirt component a resource belongs to. This label may be used to list all resources that belong to KubeVirt, for example, to uninstall it from a cluster.
This annotation is regularly updated by virt-handler to help determine if a particular node is alive and hence should be available for new virtual machine instance scheduling.
The KubeVirt VirtualMachineInstance API is implemented using a Kubernetes Custom Resource Definition (CRD). Because of this, KubeVirt is able to leverage a couple of features Kubernetes provides in order to perform validation checks on our API as objects are created and updated on the cluster.
"},{"location":"cluster_admin/api_validation/#how-api-validation-works","title":"How API Validation Works","text":""},{"location":"cluster_admin/api_validation/#crd-openapiv3-schema","title":"CRD OpenAPIv3 Schema","text":"
The KubeVirt API is registered with Kubernetes at install time through a series of CRD definitions. KubeVirt includes an OpenAPIv3 schema in these definitions which indicates to the Kubernetes Apiserver some very basic information about our API, such as what fields are required and what type of data is expected for each value.
This OpenAPIv3 schema validation is installed automatically and requires no action on the user's part to enable.
"},{"location":"cluster_admin/api_validation/#admission-control-webhooks","title":"Admission Control Webhooks","text":"
The OpenAPIv3 schema validation is limited. It only validates that the general structure of a KubeVirt object looks correct. It does not, however, verify that the contents of that object make sense.
With OpenAPIv3 validation alone, users can easily make simple mistakes (like not referencing a volume's name correctly with a disk) and the cluster will still accept the object. However, the VirtualMachineInstance will of course not start if these errors in the API exist. Ideally, we'd like to catch configuration issues as early as possible and not allow an object to even be posted to the cluster if we can detect there's a problem with the object's Spec.
To perform this advanced validation, KubeVirt implements its own admission controller, which is registered with Kubernetes as an admission controller webhook at install time. As KubeVirt objects are posted to the cluster, the Kubernetes API server forwards creation requests to our webhook for validation before persisting the object into storage.
Note, however, that the KubeVirt admission controller only works if the required admission features are enabled on the cluster.
"},{"location":"cluster_admin/api_validation/#enabling-kubevirt-admission-controller-on-kubernetes","title":"Enabling KubeVirt Admission Controller on Kubernetes","text":"
When provisioning a new Kubernetes cluster, ensure that both the MutatingAdmissionWebhook and ValidatingAdmissionWebhook values are present in the Apiserver's --admission-control CLI argument.
Below is an example of the --admission-control values we use during development:
Note that the old --admission-control flag was deprecated in 1.10 and replaced with --enable-admission-plugins. MutatingAdmissionWebhook and ValidatingAdmissionWebhook are enabled by default.
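On clusters that use the newer flag, the two plugins can be listed explicitly. The fragment below is only a sketch of a kube-apiserver static Pod manifest; it assumes the API server runs as a static Pod, and the NodeRestriction entry is an assumption about what the cluster already enables, not something this page prescribes:

```yaml
# Fragment of a kube-apiserver static Pod manifest (all other fields omitted).
# Assumption: the cluster runs kube-apiserver as a static Pod.
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        # Both webhook plugins are on by default in current Kubernetes
        # releases; listing them here just makes the requirement explicit.
        - --enable-admission-plugins=NodeRestriction,MutatingAdmissionWebhook,ValidatingAdmissionWebhook
```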
"},{"location":"cluster_admin/api_validation/#enabling-kubevirt-admission-controller-on-okd","title":"Enabling KubeVirt Admission Controller on OKD","text":"
OKD also requires the admission control webhooks to be enabled at install time. The process is slightly different, though: with OKD, we enable webhooks using an admission plugin.
These admission control plugins can be configured in openshift-ansible by setting the following value in the Ansible inventory file.
KubeVirt authorization is performed using Kubernetes's Role-Based Access Control (RBAC) system. RBAC allows cluster admins to grant access to cluster resources by binding RBAC roles to users.
For example, an admin creates an RBAC role that represents the permissions required to create a VirtualMachineInstance. The admin can then bind that role to users in order to grant them the permissions required to launch a VirtualMachineInstance.
With RBAC roles, admins can grant users targeted access to various KubeVirt features.
The kubevirt.io:view ClusterRole gives users permissions to view all KubeVirt resources in the cluster. The permissions to create, delete, modify or access any KubeVirt resources beyond viewing the resource's spec are not included in this role. This means a user with this role could see that a VirtualMachineInstance is running, but could neither shut it down nor gain access to it via console/VNC.
The kubevirt.io:edit ClusterRole gives users permissions to modify all KubeVirt resources in the cluster. For example, a user with this role can create new VirtualMachineInstances, delete VirtualMachineInstances, and gain access to both console and VNC.
The kubevirt.io:admin ClusterRole grants users full permissions to all KubeVirt resources, including the ability to delete collections of resources.
The admin role also grants users access to view and modify the KubeVirt runtime config. This config exists within the KubeVirt Custom Resource under the configuration key, in the namespace in which the KubeVirt operator is running.
NOTE: Users are only guaranteed the ability to modify the KubeVirt runtime configuration if a ClusterRoleBinding is used. A RoleBinding provides access to the KubeVirt CR only if it targets the same namespace that the KubeVirt CR exists in.
"},{"location":"cluster_admin/authorization/#binding-default-clusterroles-to-users","title":"Binding Default ClusterRoles to Users","text":"
The KubeVirt default ClusterRoles are granted to users by creating either a ClusterRoleBinding or RoleBinding object.
"},{"location":"cluster_admin/authorization/#binding-within-all-namespaces","title":"Binding within All Namespaces","text":"
With a ClusterRoleBinding, users receive the permissions granted by the role across all namespaces.
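As a sketch, a ClusterRoleBinding that grants the kubevirt.io:view ClusterRole cluster-wide could look like the following. The binding name and the user name are hypothetical placeholders:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubevirt-view-everywhere   # hypothetical binding name
subjects:
  - kind: User
    name: alice                    # hypothetical user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: kubevirt.io:view           # default KubeVirt ClusterRole
  apiGroup: rbac.authorization.k8s.io
```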
"},{"location":"cluster_admin/authorization/#binding-within-single-namespace","title":"Binding within Single Namespace","text":"
With a RoleBinding, users receive the permissions granted by the role only within a targeted namespace.
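A comparable sketch for a single namespace binds the same kind of ClusterRole through a RoleBinding. The binding name, namespace and user below are hypothetical placeholders:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kubevirt-edit-team-a   # hypothetical binding name
  namespace: team-a            # hypothetical target namespace
subjects:
  - kind: User
    name: alice                # hypothetical user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  # A RoleBinding may reference a ClusterRole; the permissions then
  # apply only inside the RoleBinding's own namespace.
  kind: ClusterRole
  name: kubevirt.io:edit
  apiGroup: rbac.authorization.k8s.io
```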
"},{"location":"cluster_admin/authorization/#extending-kubernetes-default-roles-with-kubevirt-permissions","title":"Extending Kubernetes Default Roles with KubeVirt permissions","text":"
The aggregated ClusterRole Kubernetes feature facilitates combining multiple ClusterRoles into a single aggregated ClusterRole. This feature is commonly used to extend the default Kubernetes roles with permissions to access custom resources that do not exist in the Kubernetes core.
In order to extend the default Kubernetes roles to provide permission to access KubeVirt resources, we need to add the following labels to the KubeVirt ClusterRoles.
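For illustration, the aggregation label on the kubevirt.io:view ClusterRole would sit in its metadata as sketched below. This is a metadata fragment only, with the rules section omitted, so it is not meant to be applied as-is:

```yaml
# Fragment of the kubevirt.io:view ClusterRole (rules omitted).
kind: ClusterRole
metadata:
  name: kubevirt.io:view
  labels:
    # Standard Kubernetes aggregation label for the default view role.
    rbac.authorization.k8s.io/aggregate-to-view: 'true'
# The edit and admin roles would analogously carry
#   rbac.authorization.k8s.io/aggregate-to-edit: 'true'
#   rbac.authorization.k8s.io/aggregate-to-admin: 'true'
```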
By adding these labels, any user with a RoleBinding or ClusterRoleBinding involving one of the default Kubernetes roles will automatically gain access to the equivalent KubeVirt roles as well.
More information about aggregated cluster roles can be found here
If the default KubeVirt ClusterRoles are not expressive enough, admins can create their own custom RBAC roles to grant user access to KubeVirt resources. RBAC roles are purely additive: a role can only grant access, never deny it.
Below is an example of what KubeVirt's default admin ClusterRole looks like. A custom RBAC role can be created by reducing the permissions in this example role.
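A reduced custom role might look like the sketch below. It is not the actual default admin ClusterRole; the role name is hypothetical, and the resource and verb lists are an illustrative subset that only allows viewing VirtualMachineInstances and opening their console:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vmi-console-viewer   # hypothetical role name
rules:
  # Read-only access to VirtualMachineInstance objects.
  - apiGroups:
      - kubevirt.io
    resources:
      - virtualmachineinstances
    verbs:
      - get
      - list
      - watch
  # Console access is exposed as a subresource in a separate API group.
  - apiGroups:
      - subresources.kubevirt.io
    resources:
      - virtualmachineinstances/console
    verbs:
      - get
```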
"},{"location":"cluster_admin/customize_components/","title":"Customize components","text":""},{"location":"cluster_admin/customize_components/#customize-kubevirt-components","title":"Customize KubeVirt Components","text":""},{"location":"cluster_admin/customize_components/#customize-components-using-patches","title":"Customize components using patches","text":"
If the patch created is invalid, KubeVirt will not be able to update or deploy the system. This feature is intended for special use cases and should not be used unless you know what you are doing.
Valid resource types are: Deployment, DaemonSet, Service, ValidatingWebhookConfiguration, MutatingWebhookConfiguration, APIService, and CertificateSecret. More information can be found in the API spec.
The above example will update the virt-controller deployment to have an annotation in its metadata that says patch: true and will remove the livenessProbe from the container definition.
If the flags are invalid or become invalid on update, the component will not be able to run.
When the customize flags option is used for a component, all of that component's default flags are removed and only the specified flags are used. The resources whose flags can be changed are api, controller and handler. You can find more details about the API in the API spec.
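A sketch of what a flags customization for the api component could look like is shown below. The CR name and namespace are assumptions, and the flag names and values are illustrative rather than a recommended set; because the defaults are dropped, every flag the component needs has to be listed:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt        # assumed CR name
  namespace: kubevirt   # assumed namespace
spec:
  customizeComponents:
    flags:
      # Keys are flag names without leading dashes; values are strings.
      # The entries below are illustrative assumptions.
      api:
        v: '5'
        port: '8443'
        console-server-port: '8186'
        subresources-only: 'true'
```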
"},{"location":"cluster_admin/device_status_on_Arm64/","title":"Device Status on Arm64","text":"
This page is based on https://github.com/kubevirt/kubevirt/issues/8916
| Devices | Description | Status on Arm64 |
| --- | --- | --- |
| DisableHotplug | | supported |
| Disks | sata/ virtio bus | support virtio bus |
| Watchdog | i6300esb | not supported |
| UseVirtioTransitional | virtio-transitional | supported |
| Interfaces | e1000/ virtio-net-device | support virtio-net-device |
| Inputs | tablet virtio/usb bus | supported |
| AutoattachPodInterface | connect to /net/tun (devices.kubevirt.io/tun) | supported |
| AutoattachGraphicsDevice | create a virtio-gpu device / vga device | support virtio-gpu |
| AutoattachMemBalloon | virtio-balloon-pci-non-transitional | supported |
| AutoattachInputDevice | auto add tablet | supported |
| Rng | virtio-rng-pci-non-transitional host:/dev/urandom | supported |
| BlockMultiQueue | \"driver\":\"virtio-blk-pci-non-transitional\",\"num-queues\":$cpu_number | supported |
| NetworkInterfaceMultiQueue | -netdev tap,fds=21:23:24:25,vhost=on,vhostfds=26:27:28:29,id=hostua-default (fd number equals to queue number) | supported |
| GPUs | | not verified |
| Filesystems | virtiofs, vhost-user-fs-pci, need to enable featuregate: ExperimentalVirtiofsSupport | supported |
| ClientPassthrough | https://www.linaro.org/blog/kvm-pciemsi-passthrough-armarm64/ on x86_64, iommu need to be enabled | not verified |
| Sound | ich9/ ac97 | not supported |
| TPM | tpm-tis-device https://qemu.readthedocs.io/en/latest/specs/tpm.html | supported |
| Sriov | vfio-pci | not verified |
"},{"location":"cluster_admin/feature_gate_status_on_Arm64/","title":"Feature Gate Status on Arm64","text":"
This page is based on https://github.com/kubevirt/kubevirt/issues/9749. It records the feature gate status on the Arm64 platform. The statuses have the following meanings:
Supported: the feature gate is supported on the Arm64 platform.
Not supported yet: some dependencies of the feature gate do not support Arm64, so the feature is not supported for now. The dependencies may be supported in the future.
Not supported: the feature gate is not supported on Arm64.
Not verified: the feature has not been verified yet.
| FEATURE GATE | STATUS | NOTES |
| --- | --- | --- |
| ExpandDisksGate | Not supported yet | CDI is needed |
| CPUManager | Supported | use taskset to do CPU pinning, do not support kvm-hint-dedicated (this is only works on x86 platform) |
| NUMAFeatureGate | Not supported yet | Need to support Hugepage on Arm64 |
| IgnitionGate | Supported | This feature is only used for CoreOS/RhCOS |
| LiveMigrationGate | Supported | Verified live migration with masquerade network |
| SRIOVLiveMigrationGate | Not verified | Need two same Machine and SRIOV device |
| HypervStrictCheckGate | Not supported | Hyperv does not work on Arm64 |
| SidecarGate | Supported | |
| GPUGate | Not verified | Need GPU device |
| HostDevicesGate | Not verified | Need GPU or sound card |
| SnapshotGate | Supported | Need snapshotter support https://github.com/kubernetes-csi/external-snapshotter |
| VMExportGate | Partially supported | Need snapshotter support https://kubevirt.io/user-guide/operations/export_api/, support exporting pvc, not support exporting DataVolumes and MemoryDump which rely on CDI |
| HotplugVolumesGate | Not supported yet | Rely on datavolume and CDI |
| HostDiskGate | Supported | |
| VirtIOFSGate | Supported | |
| MacvtapGate | Not supported yet | quay.io/kubevirt/macvtap-cni not support Arm64, https://github.com/kubevirt/macvtap-cni#deployment |
| PasstGate | Supported | VM have same ip with pods; start a process for network /usr/bin/passt --runas 107 -e -t 8080 |
| DownwardMetricsFeatureGate | need more information | It used to let guest get host information, failed on both Arm64 and x86_64. The block is successfully attached and can see the following information: -blockdev {\"driver\":\"file\",\"filename\":\"/var/run/kubevirt-private/downwardapi-disks/vhostmd0\",\"node-name\":\"libvirt-1-storage\",\"cache\":{\"direct\":true,\"no-flush\":false},\"auto-read-only\":true,\"discard\":\"unmap\"} But unable to get information via vm-dump-metrics: LIBMETRICS: read_mdisk(): Unable to read metrics disk LIBMETRICS: get_virtio_metrics(): Unable to export metrics: open(/dev/virtio-ports/org.github.vhostmd.1) No such file or directory LIBMETRICS: get_virtio_metrics(): Unable to read metrics |
| NonRootDeprecated | Supported | |
| NonRoot | Supported | |
| Root | Supported | |
| ClusterProfiler | Supported | |
| WorkloadEncryptionSEV | Not supported | SEV is only available on x86_64 |
| VSOCKGate | Supported | |
| HotplugNetworkIfacesGate | Not supported yet | Need to setup multus-cni and multus-dynamic-networks-controller: https://github.com/k8snetworkplumbingwg/multus-cni cat ./deployments/multus-daemonset-thick.yml \\| kubectl apply -f - https://github.com/k8snetworkplumbingwg/multus-dynamic-networks-controller kubectl apply -f manifests/dynamic-networks-controller.yaml Currently, the image ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick does not support Arm64 server. For more information please refer to https://github.com/k8snetworkplumbingwg/multus-cni/pull/1027. |
| CommonInstancetypesDeploymentGate | Not supported yet | Support of common-instancetypes instancetypes needs to be tested, common-instancetypes preferences for ARM workloads are still missing |
"},{"location":"cluster_admin/gitops/","title":"Managing KubeVirt with GitOps","text":"
The GitOps way uses Git repositories as a single source of truth to deliver infrastructure as code. Automation is employed to keep the desired and the live state of clusters in sync at all times. This means any change to a repository is automatically applied to one or more clusters while changes to a cluster will be automatically reverted to the state described in the single source of truth.
With GitOps, separating testing and production environments, improving the availability of applications and working with multi-cluster environments become considerably easier.
A few requirements need to be met before you can begin:
Kubernetes cluster or derivative (such as OpenShift) based on one of the latest three Kubernetes releases that are out at the time the KubeVirt release is made.
Kubernetes apiserver must have --allow-privileged=true in order to run KubeVirt's privileged DaemonSet.
KubeVirt is currently supported on the following container runtimes:
containerd
crio (with runv)
Other container runtimes, which do not use virtualization features, should work too. However, the mentioned ones are the main target.
"},{"location":"cluster_admin/installation/#integration-with-apparmor","title":"Integration with AppArmor","text":"
In most scenarios, KubeVirt can run normally on systems with AppArmor. However, there are several known use cases that may require additional user interaction.
On a system with AppArmor enabled, the locally installed profiles may block the execution of the KubeVirt privileged containers. That usually results in initialization failure of the virt-handler pod:
Here, the host AppArmor profile for libvirtd does not allow the execution of the /usr/libexec/qemu-kvm binary. In the future this will hopefully work out of the box (tracking issue), but until then there are a couple of possible workarounds.
The first (and simplest) one is to remove the libvirt package from the host: assuming the host is a dedicated Kubernetes node, you likely won't need it anyway.
If you actually need libvirt to be present on the host, then you can add the following rule to the AppArmor profile for libvirtd (usually /etc/apparmor.d/usr.sbin.libvirtd):
# vim /etc/apparmor.d/usr.sbin.libvirtd\n...\n/usr/libexec/qemu-kvm PUx,\n...\n# apparmor_parser -r /etc/apparmor.d/usr.sbin.libvirtd # or systemctl reload apparmor.service\n
The default AppArmor profile used by the container runtimes usually denies mount calls for workloads. That may prevent VMs with VirtIO-FS from running. This is a known issue. The current workaround is to run such a VM as unconfined by adding the following annotation to the VM or VMI object:
Hardware with virtualization support is recommended. You can use virt-host-validate to ensure that your hosts are capable of running virtualization workloads:
$ virt-host-validate qemu\n QEMU: Checking for hardware virtualization : PASS\n QEMU: Checking if device /dev/kvm exists : PASS\n QEMU: Checking if device /dev/kvm is accessible : PASS\n QEMU: Checking if device /dev/vhost-net exists : PASS\n QEMU: Checking if device /dev/net/tun exists : PASS\n...\n
SELinux-enabled nodes need container-selinux installed. The minimum version is documented inside the kubevirt/kubevirt repository, in docs/getting-started.md, under \"SELinux support\".
For (older) release branches that don't specify a container-selinux version, version 2.170.0 or newer is recommended.
"},{"location":"cluster_admin/installation/#installing-kubevirt-on-kubernetes","title":"Installing KubeVirt on Kubernetes","text":"
KubeVirt can be installed using the KubeVirt operator, which manages the lifecycle of all the KubeVirt core components. Below is an example of how to install KubeVirt's latest official release. This method supports deploying KubeVirt on both x86_64 and Arm64 platforms.
# Point at latest release\n$ export RELEASE=$(curl https://storage.googleapis.com/kubevirt-prow/release/kubevirt/kubevirt/stable.txt)\n# Deploy the KubeVirt operator\n$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-operator.yaml\n# Create the KubeVirt CR (instance deployment request) which triggers the actual installation\n$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-cr.yaml\n# wait until all KubeVirt components are up\n$ kubectl -n kubevirt wait kv kubevirt --for condition=Available\n
If hardware virtualization is not available, then a software emulation fallback can be enabled by setting spec.configuration.developerConfiguration.useEmulation to true in the KubeVirt CR as follows:
Note: Prior to release v0.20.0 the condition for the kubectl wait command was named \"Ready\" instead of \"Available\"
Note: Prior to KubeVirt 0.34.2 a ConfigMap called kubevirt-config in the install-namespace was used to configure KubeVirt. Since 0.34.2 this method is deprecated. If the ConfigMap exists, it still takes precedence over the configuration on the CR, but it will not receive future updates and you should migrate any custom configurations to spec.configuration on the KubeVirt CR.
All new components will be deployed under the kubevirt namespace:
Once privileges are granted, KubeVirt can be deployed as described above.
"},{"location":"cluster_admin/installation/#web-user-interface-on-okd","title":"Web user interface on OKD","text":"
No additional steps are required to extend OKD's web console for KubeVirt.
The virtualization extension is automatically enabled when KubeVirt deployment is detected.
"},{"location":"cluster_admin/installation/#from-service-catalog-as-an-apb","title":"From Service Catalog as an APB","text":"
You can find KubeVirt in the OKD Service Catalog and install it from there. In order to do that please follow the documentation in the KubeVirt APB repository.
"},{"location":"cluster_admin/installation/#installing-kubevirt-on-k3os","title":"Installing KubeVirt on k3OS","text":"
The following configuration needs to be added to all nodes prior to KubeVirt deployment:
k3os:\n modules:\n - kvm\n - vhost_net\n
Once nodes are restarted with this configuration, KubeVirt can be deployed as described above.
"},{"location":"cluster_admin/installation/#installing-the-daily-developer-builds","title":"Installing the Daily Developer Builds","text":"
KubeVirt releases a developer build from the current main branch daily. One can see when the last release happened by looking at our nightly-build-jobs.
To install the latest developer build, run the following commands:
KubeVirt alone does not bring any additional network plugins; it only allows users to utilize them. If you want to attach your VMs to multiple networks (Multus CNI) or have full control over L2 (OVS CNI), you need to deploy the respective network plugins. For more information, refer to the OVS CNI installation guide.
Note: KubeVirt Ansible network playbook installs these plugins by default.
You can restrict the placement of the KubeVirt components across your cluster nodes by editing the KubeVirt CR:
The placement of the KubeVirt control plane components (virt-controller, virt-api) is governed by the .spec.infra.nodePlacement field in the KubeVirt CR.
The placement of the virt-handler DaemonSet pods (and consequently, the placement of the VM workloads scheduled to the cluster) is governed by the .spec.workloads.nodePlacement field in the KubeVirt CR.
For each of these .nodePlacement objects, the .affinity, .nodeSelector and .tolerations sub-fields can be configured. See the description in the API reference for further information about using these fields.
For example, to restrict the virt-controller and virt-api pods to only run on the control-plane nodes:
"},{"location":"cluster_admin/ksm/#enabling-ksm-through-kubevirt-cr","title":"Enabling KSM through KubeVirt CR","text":"
KSM can be enabled on nodes through spec.configuration.ksmConfiguration in the KubeVirt CR. ksmConfiguration determines on which nodes KSM will be enabled by exposing a nodeLabelSelector. nodeLabelSelector is a LabelSelector that filters nodes based on their labels. If a node's labels match the label selector term, KSM will be enabled on that node.
NOTE: If nodeLabelSelector is nil, KSM will not be enabled on any node. An empty nodeLabelSelector will enable KSM on every node.
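The difference between a nil and an empty selector can be sketched in Python as follows. This is only an illustration of the semantics described above, not KubeVirt code: the function name ksm_enabled_on_node is hypothetical, and only the matchLabels part of a LabelSelector is modelled (matchExpressions are ignored).

```python
from typing import Optional


def ksm_enabled_on_node(selector: Optional[dict], node_labels: dict) -> bool:
    # Hypothetical helper: models only matchLabels, not matchExpressions.
    if selector is None:
        # nil nodeLabelSelector: KSM is not enabled on any node.
        return False
    match_labels = selector.get('matchLabels', {})
    # An empty selector has nothing to check, so all() is True for every node.
    return all(node_labels.get(k) == v for k, v in match_labels.items())
```

Passing None models an absent nodeLabelSelector, while passing an empty dict models a selector that is present but empty.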
"},{"location":"cluster_admin/ksm/#annotation-and-restore-mechanism","title":"Annotation and restore mechanism","text":"
On those nodes where KubeVirt enables the KSM via configuration, an annotation will be added (kubevirt.io/ksm-handler-managed). This annotation is an internal record to keep track of which nodes are currently managed by virt-handler, so that it is possible to distinguish which nodes should be restored in case of future ksmConfiguration changes.
Let's imagine this scenario:
There are 3 nodes in the cluster and one of them (node01) has KSM externally enabled.
An admin patches the KubeVirt CR adding a ksmConfiguration which enables KSM for node02 and node03.
After a while, an admin patches the KubeVirt CR again, deleting the ksmConfiguration.
Thanks to the annotation, virt-handler is able to disable KSM on only those nodes where it had itself enabled it (node02 and node03), leaving the others unchanged (node01).
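The scenario can be replayed with a small Python simulation. It is a sketch under stated assumptions rather than the virt-handler implementation: the node dictionaries and the apply_ksm_config function are hypothetical, and the sketch assumes that a node whose KSM was already enabled externally is never annotated.

```python
ANNOTATION = 'kubevirt.io/ksm-handler-managed'


def apply_ksm_config(nodes: dict, selected: set) -> None:
    # nodes maps a node name to {'ksm': bool, 'annotations': set} (hypothetical model).
    for name, node in nodes.items():
        if name in selected and not node['ksm']:
            # The handler enables KSM and records that it did so.
            node['ksm'] = True
            node['annotations'].add(ANNOTATION)
        elif name not in selected and ANNOTATION in node['annotations']:
            # Restore only nodes that the handler itself had enabled.
            node['ksm'] = False
            node['annotations'].discard(ANNOTATION)


nodes = {
    'node01': {'ksm': True, 'annotations': set()},   # KSM enabled externally
    'node02': {'ksm': False, 'annotations': set()},
    'node03': {'ksm': False, 'annotations': set()},
}
apply_ksm_config(nodes, {'node02', 'node03'})  # admin adds ksmConfiguration
apply_ksm_config(nodes, set())                 # admin deletes ksmConfiguration
```

After both calls, node01 still has KSM enabled, while node02 and node03 are back to their original state and no longer carry the annotation.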
KubeVirt can discover on which nodes KSM is enabled and will mark them with a special label (kubevirt.io/ksm-enabled) with value true. This label can be used to schedule VMs on nodes with or without KSM enabled.
Migration policies provide a new way of applying migration configurations to Virtual Machines. The policies can refine the KubeVirt CR's MigrationConfiguration, which sets the cluster-wide migration configurations. This way, the cluster-wide settings serve as a default that can be refined (i.e. changed, removed or added) by the migration policy.
Please bear in mind that migration policies are in version v1alpha1. This means that this API is not fully stable yet and that APIs may change in the future.
KubeVirt supports Live Migrations of Virtual Machine workloads. Before migration policies were introduced, migration settings could be configured only at cluster-wide scope by editing the KubeVirt CR's spec, or more specifically its MigrationConfiguration.
Several aspects (although not all) of migration behaviour that can be customized are:
- Bandwidth
- Auto-convergence
- Post/Pre-copy
- Max number of parallel migrations
- Timeout
Migration policies generalize the concept of defining migration configurations, so it would be possible to apply different configurations to specific groups of VMs.
This capability is useful whenever workloads need to be treated differently. Different configurations may be needed because workloads have different priorities, security segregation needs or requirements, to help converge workloads which aren't migration-friendly, and for many other reasons.
Currently the MigrationPolicy spec will only include the following configurations from KubevirtCR's MigrationConfiguration (in the future more configurations that aren't part of Kubevirt CR are intended to be added):
All above fields are optional. When omitted, the configuration will be applied as defined in KubevirtCR's MigrationConfiguration. This way, KubevirtCR will serve as a configurable set of defaults for both VMs that are not bound to any MigrationPolicy and VMs that are bound to a MigrationPolicy that does not define all fields of the configurations.
"},{"location":"cluster_admin/migration_policies/#matching-policies-to-vms","title":"Matching Policies to VMs","text":"
Next in the spec are the selectors that define the group of VMs on which to apply the policy. The options to do so are the following.
This policy applies to the VMs in namespaces that have all the required labels:
apiVersion: migrations.kubevirt.io/v1alpha1\nkind: MigrationPolicy\nspec:\n  selectors:\n    namespaceSelector:\n      hpc-workloads: \"true\" # Matches a key and a value\n
This policy applies to the VMs that have all the required labels:
apiVersion: migrations.kubevirt.io/v1alpha1\nkind: MigrationPolicy\nspec:\n  selectors:\n    virtualMachineInstanceSelector:\n      workload-type: db # Matches a key and a value\n
It is possible that multiple policies apply to the same VMI. In such cases, the precedence is in the same order as the bullets above (VMI labels first, then namespace labels). It is not allowed to define two policies with the exact same selectors.
If multiple policies apply to the same VMI:
* The most detailed policy will be applied, that is, the policy with the highest number of matching labels.
* If multiple policies match a VMI with the same number of matching labels, the policies will be sorted by the lexicographic order of the matching label keys. The first one in this order will be applied.
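The two precedence rules can be sketched in Python. This is a simplified illustration, not the KubeVirt controller logic: the function select_policy and the field names vmi_selector and namespace_selector are hypothetical, both selectors are treated as plain label maps, and VMI and namespace labels are counted together.

```python
def select_policy(policies, vmi_labels, namespace_labels):
    # Each policy is a dict with 'name', 'vmi_selector' and 'namespace_selector'
    # (hypothetical field names; both selectors are plain label maps).
    candidates = []
    for policy in policies:
        vmi_sel = policy.get('vmi_selector', {})
        ns_sel = policy.get('namespace_selector', {})
        vmi_ok = all(vmi_labels.get(k) == v for k, v in vmi_sel.items())
        ns_ok = all(namespace_labels.get(k) == v for k, v in ns_sel.items())
        if vmi_ok and ns_ok:
            keys = sorted(list(vmi_sel) + list(ns_sel))
            candidates.append((-len(keys), keys, policy['name']))
    if not candidates:
        return None
    # Most matching labels first, then lexicographic order of the label keys.
    return min(candidates)[2]
```

A policy only becomes a candidate when all of its selector labels match. Among the candidates, the negated label count puts the most detailed policy first, and the sorted key list breaks ties.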
Before removing a Kubernetes node from the cluster, users will want to ensure that VirtualMachineInstances have been gracefully terminated before powering down the node. Since all VirtualMachineInstances are backed by a Pod, the recommended method of evicting VirtualMachineInstances is to use the kubectl drain command, or in the case of OKD the oc adm drain command.
"},{"location":"cluster_admin/node_maintenance/#evict-all-vms-from-a-node","title":"Evict all VMs from a Node","text":"
Select the node you'd like to evict VirtualMachineInstances from by identifying the node from the list of cluster nodes.
kubectl get nodes
The following command will gracefully terminate all VMs on a specific node. Replace <node-name> with the name of the node where the eviction should occur.
Below is a breakdown of why each argument passed to the drain command is required.
kubectl drain <node-name> is selecting a specific node as a target for the eviction
--delete-local-data is a required flag that is necessary for removing any pod that utilizes an emptyDir volume. The VirtualMachineInstance Pod does use emptyDir volumes, however the data in those volumes is ephemeral, which means it is safe to delete after termination.
--ignore-daemonsets=true is a required flag because every node running a VirtualMachineInstance will also be running our helper DaemonSet called virt-handler. DaemonSets are not allowed to be evicted using kubectl drain. By default, if this command encounters a DaemonSet on the target node, the command will fail. This flag tells the command it is safe to proceed with the eviction and to just ignore DaemonSets.
--force is a required flag because VirtualMachineInstance pods are not owned by a ReplicaSet or DaemonSet controller. This means kubectl can't guarantee that the pods being terminated on the target node will get re-scheduled replacements placed elsewhere in the cluster after the pods are evicted. KubeVirt has its own controllers which manage the underlying VirtualMachineInstance pods. Each controller reacts differently to a VirtualMachineInstance being evicted. That behavior is outlined further down in this document.
--pod-selector=kubevirt.io=virt-launcher means only VirtualMachineInstance pods managed by KubeVirt will be removed from the node.
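As a recap, the flags explained above can be assembled into a single command line. The Python helper below is a hypothetical convenience (the name build_drain_command and the only_vmi_pods parameter are not part of KubeVirt or kubectl); it only combines the flags listed in this section and does not run anything.

```python
def build_drain_command(node_name: str, only_vmi_pods: bool = True) -> list:
    # Flags are the ones explained above; the helper itself is hypothetical.
    cmd = [
        'kubectl', 'drain', node_name,
        '--delete-local-data',
        '--ignore-daemonsets=true',
        '--force',
    ]
    if only_vmi_pods:
        # Restrict the eviction to virt-launcher pods managed by KubeVirt.
        cmd.append('--pod-selector=kubevirt.io=virt-launcher')
    return cmd


print(' '.join(build_drain_command('node01')))
```

The returned list could be handed to subprocess.run on a machine with cluster access. Passing only_vmi_pods=False leaves out the pod selector, so the eviction is no longer limited to virt-launcher pods.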
"},{"location":"cluster_admin/node_maintenance/#evict-all-vms-and-pods-from-a-node","title":"Evict all VMs and Pods from a Node","text":"
By removing the --pod-selector argument from the previous command, we can issue the eviction of all Pods on a node. This command ensures Pods associated with VMs as well as all other Pods are evicted from the target node.
"},{"location":"cluster_admin/node_maintenance/#evacuate-vmis-via-live-migration-from-a-node","title":"Evacuate VMIs via Live Migration from a Node","text":"
If the LiveMigration feature gate is enabled, it is possible to specify an evictionStrategy on VMIs which will react with live-migrations on specific taints on nodes. The following snippet on a VMI or the VMI templates in a VM ensures that the VMI is migrated during node eviction:
Behind the scenes a PodDisruptionBudget is created for each VMI which has an evictionStrategy defined. This ensures that evictions are blocked on these VMIs and that we can guarantee that a VMI will be migrated instead of shut off.
Note Prior to v0.34 the drain process with live migrations was detached from the kubectl drain itself and required in addition specifying a special taint on the nodes: kubectl taint nodes foo kubevirt.io/drain=draining:NoSchedule. This is no longer needed. The taint will still be respected if provided but is obsolete.
"},{"location":"cluster_admin/node_maintenance/#re-enabling-a-node-after-eviction","title":"Re-enabling a Node after Eviction","text":"
The kubectl drain will result in the target node being marked as unschedulable. This means the node will not be eligible for running new VirtualMachineInstances or Pods.
If it is decided that the target node should become schedulable again, the following command must be run.
kubectl uncordon <node name>
or, in the case of OKD:
oc adm uncordon <node name>
"},{"location":"cluster_admin/node_maintenance/#shutting-down-a-node-after-eviction","title":"Shutting down a Node after Eviction","text":"
From KubeVirt's perspective, a node is safe to shut down once all VirtualMachineInstances have been evicted from the node. In a multi-use cluster where VirtualMachineInstances are being scheduled alongside other containerized workloads, it is up to the cluster admin to ensure all other pods have been safely evicted before powering down the node.
The eviction of any VirtualMachineInstance that is owned by a VirtualMachine set to running=true will result in the VirtualMachineInstance being re-scheduled to another node.
The VirtualMachineInstance in this case will be forced to power down and restart on another node. In the future once KubeVirt introduces live migration support, the VM will be able to seamlessly migrate to another node during eviction.
The eviction of VirtualMachineInstances owned by a VirtualMachineInstanceReplicaSet will result in the VirtualMachineInstanceReplicaSet scheduling replacements for the evicted VirtualMachineInstances on other nodes in the cluster.
Hotplug Network Interfaces are not supported on Arm64, because the image ghcr.io/k8snetworkplumbingwg/multus-cni:snapshot-thick does not support the Arm64 platform. For more information please refer to https://github.com/k8snetworkplumbingwg/multus-cni/pull/1027.
The Hugepages feature is not supported on Arm64. The hugepage mechanism differs between X86_64 and Arm64. Currently, KubeVirt is only verified on systems with a 4k page size.
"},{"location":"cluster_admin/operations_on_Arm64/#containerized-data-importer","title":"Containerized Data Importer","text":"
This project is not yet supported on Arm64, but support is planned.
Export API is partially supported on the Arm64 platform. As CDI is not supported yet, the export of DataVolumes and MemoryDump is not supported on Arm64.
Scheduling is the process of matching Pods/VMs to Nodes. By default, the scheduler used is kube-scheduler. Further details can be found at Kubernetes Scheduler Documentation.
Custom schedulers can be used if the default scheduler does not satisfy your needs. For instance, you might want to schedule VMs using a load aware scheduler such as Trimaran Schedulers.
"},{"location":"cluster_admin/scheduler/#creating-a-custom-scheduler","title":"Creating a Custom Scheduler","text":"
KubeVirt is compatible with custom schedulers. The configuration steps are described in the Official Kubernetes Documentation. Please note, the Kubernetes version KubeVirt is running on and the Kubernetes version used to build the custom scheduler have to match. To get the Kubernetes version KubeVirt is running on, you can run the following command:
Pay attention to the Server line. In this case, the Kubernetes version is v1.22.13. You have to check out the matching Kubernetes version and build the Kubernetes project:
$ cd kubernetes\n$ git checkout v1.22.13\n$ make\n
Then, you can follow the configuration steps described here. Additionally, the ClusterRole system:kube-scheduler needs permissions to use the verbs watch, list and get on StorageClasses.
"},{"location":"cluster_admin/scheduler/#scheduling-vms-with-the-custom-scheduler","title":"Scheduling VMs with the Custom Scheduler","text":"
The second scheduler should be up and running. You can check it with:
$ kubectl get all -n kube-system\n
The deployment my-scheduler should be up and running if everything is set up properly. In order to launch the VM using the custom scheduler, you need to set the SchedulerName in the VM's spec to my-scheduler. Here is an example VM definition:
In case the specified SchedulerName does not match any existing scheduler, the virt-launcher pod will stay in state Pending until the specified scheduler can be found. You can check whether the VM has been scheduled by my-scheduler by inspecting the events of the virt-launcher pod associated with the VM. The pod should have been scheduled with my-scheduler.
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vm-fedora-dpc87 2/2 Running 0 24m\n\n$ kubectl describe pod virt-launcher-vm-fedora-dpc87\n[...] \nEvents:\n Type Reason Age From Message\n ---- ------ ---- ---- -------\n Normal Scheduled 21m my-scheduler Successfully assigned default/virt-launcher-vm-fedora-dpc87 to node01\n[...]\n
"},{"location":"cluster_admin/tekton_tasks/#manipulate-pvcs-with-libguestfs-tools","title":"Manipulate PVCs with libguestfs tools","text":"
disk-virt-customize - execute virt-customize commands in PVCs.
disk-virt-sysprep - execute virt-sysprep commands in PVCs.
"},{"location":"cluster_admin/tekton_tasks/#wait-for-virtual-machine-instance-status","title":"Wait for Virtual Machine Instance Status","text":"
wait-for-vmi-status - Waits for a VMI to be running.
"},{"location":"cluster_admin/tekton_tasks/#modify-windows-iso","title":"Modify Windows iso","text":"
modify-windows-iso-file - modifies windows iso (replaces prompt bootloader with no-prompt bootloader) and replaces original iso in PVC with updated one. This helps with automated installation of Windows in EFI boot mode. By default Windows in EFI boot mode uses a prompt bootloader, which will not continue with the boot process until a key is pressed. By replacing it with the non-prompt bootloader no key press is required to boot into the Windows installer.
All these Tasks can be used for creating Pipelines. We prepared example Pipelines which show what you can do with the KubeVirt Tasks.
Windows efi installer - This Pipeline will prepare a Windows 10/11/2k22 datavolume with virtio drivers installed. The user has to provide a working link to a Windows 10/11/2k22 iso file. The Pipeline is suitable for Windows versions which require EFI (e.g. Windows 10/11/2k22). More information about the Pipeline can be found here
Windows customize - This Pipeline will install a SQL server or a VS Code in a Windows VM. More information about Pipeline can be found here
Note
If you define a different namespace for Pipelines and a different namespace for Tasks, you will have to create a cluster resolver object.
By default, example Pipelines create the resulting datavolume in the kubevirt-os-images namespace.
In case you would like to create the resulting datavolume in a different namespace (by specifying the baseDvNamespace attribute in the Pipeline), additional RBAC permissions will be required (a list of all required RBAC permissions can be found here).
In case you would like to live migrate the VM while a given Pipeline is running, the following prerequisites must be met
KubeVirt has its own node daemon, called virt-handler. In addition to the usual k8s methods of detecting issues on nodes, the virt-handler daemon has its own heartbeat mechanism. This allows for fine-tuned error handling of VirtualMachineInstances.
If a VirtualMachineInstance gets scheduled, the scheduler only considers nodes where kubevirt.io/schedulable is true. This can be seen when looking at the corresponding pod of a VirtualMachineInstance:
In case there is a communication issue or the host goes down, virt-handler can't update its labels and annotations anymore. Once the last kubevirt.io/heartbeat timestamp is older than five minutes, the KubeVirt node-controller kicks in and sets the kubevirt.io/schedulable label to false. As a consequence no more VMIs will be scheduled to this node until virt-handler is connected again.
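The heartbeat rule amounts to a simple age check. The Python sketch below is illustrative only: node_schedulable and HEARTBEAT_TIMEOUT are hypothetical names, and the real node-controller works on the node's labels and annotations rather than on Python datetime values.

```python
from datetime import datetime, timedelta, timezone

# Five-minute window taken from the description above.
HEARTBEAT_TIMEOUT = timedelta(minutes=5)


def node_schedulable(last_heartbeat: datetime, now: datetime) -> bool:
    # A node stays schedulable while its heartbeat is not older than five minutes.
    return now - last_heartbeat <= HEARTBEAT_TIMEOUT


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = node_schedulable(now - timedelta(minutes=1), now)
stale = node_schedulable(now - timedelta(minutes=6), now)
```

Here fresh models a node whose virt-handler reported a minute ago, and stale models a node whose last heartbeat is six minutes old and would therefore be marked unschedulable.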
"},{"location":"cluster_admin/unresponsive_nodes/#deleting-stuck-vmis-when-virt-handler-is-unresponsive","title":"Deleting stuck VMIs when virt-handler is unresponsive","text":"
In cases where virt-handler has some issues but the node is in general fine, a VirtualMachineInstance can be deleted as usual via kubectl delete vmi <myvm>. Pods of a VirtualMachineInstance will be told by the cluster-controllers they should shut down. As soon as the Pod is gone, the VirtualMachineInstance will be moved to Failed state, if virt-handler did not manage to update its heartbeat in the meantime. If virt-handler could recover in the meantime, virt-handler will move the VirtualMachineInstance to failed state instead of the cluster-controllers.
"},{"location":"cluster_admin/unresponsive_nodes/#deleting-stuck-vmis-when-the-whole-node-is-unresponsive","title":"Deleting stuck VMIs when the whole node is unresponsive","text":"
If the whole node is unresponsive, deleting a VirtualMachineInstance via kubectl delete vmi <myvmi> alone will never remove the VirtualMachineInstance. In this case all pods on the unresponsive node need to be force-deleted: First make sure that the node is really dead. Then delete all pods on the node via a force-delete: kubectl delete pod --force --grace-period=0 <mypod>.
As soon as the pod disappears and the heartbeat from virt-handler timed out, the VMIs will be moved to Failed state. If they were already marked for deletion they will simply disappear. If not, they can be deleted and will disappear almost immediately.
It takes up to five minutes until the KubeVirt cluster components can detect that virt-handler is unhealthy. During that time-frame it is possible that new VMIs are scheduled to the affected node. If virt-handler is not capable of connecting to these pods on the node, the pods will sooner or later go to failed state. As soon as the cluster finally detects the issue, the VMIs will be set to failed by the cluster.
"},{"location":"cluster_admin/updating_and_deletion/","title":"Updating and deletion","text":""},{"location":"cluster_admin/updating_and_deletion/#updating-kubevirt-control-plane","title":"Updating KubeVirt Control Plane","text":"
Zero downtime rolling updates are supported starting with release v0.17.0 onward. Updating from any release prior to the KubeVirt v0.17.0 release is not supported.
Note: Updating is only supported from N-1 to N release.
Updates are triggered in one of two ways.
By changing the imageTag value in the KubeVirt CR's spec.
For example, updating from v0.17.0-alpha.1 to v0.17.0 is as simple as patching the KubeVirt CR with the imageTag: v0.17.0 value. From there the KubeVirt operator will begin the process of rolling out the new version of KubeVirt. Existing VM/VMIs will remain uninterrupted both during and after the update succeeds.
Or, by updating the kubevirt operator if no imageTag value is set.
When no imageTag value is set in the kubevirt CR, the system assumes that the version of KubeVirt is locked to the version of the operator. This means that updating the operator will result in the underlying KubeVirt installation being updated as well.
The first way provides a fine-grained approach where you have full control over what version of KubeVirt is installed independently of what version of the KubeVirt operator you might be running. The second approach allows you to lock both the operator and operand to the same version.
Newer KubeVirt may require additional or extended RBAC rules. In this case, the #1 update method may fail, because the virt-operator present in the cluster doesn't have these RBAC rules itself. In this case, you need to update the virt-operator first, and then proceed to update kubevirt. See this issue for more details.
Workload updates are supported as an opt in feature starting with v0.39.0
By default, when KubeVirt is updated this only involves the control plane components. Any existing VirtualMachineInstance (VMI) workloads that are running before an update occurs remain 100% untouched. The workloads continue to run and are not interrupted as part of the default update process.
It's important to note that these VMI workloads do involve components such as libvirt, qemu, and virt-launcher, which can optionally be updated during the KubeVirt update process as well. However that requires opting in to having virt-operator perform automated actions on workloads.
Opting in to VMI updates involves configuring the workloadUpdateStrategy field on the KubeVirt CR. This field controls the methods virt-operator will use when updating the VMI workload pods.
There are two methods supported.
LiveMigrate: Which results in VMIs being updated by live migrating the virtual machine guest into a new pod with all the updated components enabled.
Evict: Which results in the VMI's pod being shutdown. If the VMI is controlled by a higher level VirtualMachine object with runStrategy: always, then a new VMI will spin up in a new pod with updated components.
The least disruptive way to update VMI workloads is to use LiveMigrate. Any VMI workload that is not live migratable will be left untouched. If live migration is not enabled in the cluster, then the only option available for virt-operator managed VMI updates is the Evict method.
Example: Enabling VMI workload updates via LiveMigration
Example: Enabling VMI workload updates via Evict with batch tunings
The batch tunings allow configuring how quickly VMIs are evicted. In large clusters, it's desirable to ensure that VMIs are evicted in batches in order to distribute load.
Example: Enabling VMI workload updates with both LiveMigrate and Evict
When both LiveMigrate and Evict are specified, then any workloads which are live migratable will be guaranteed to be live migrated. Only workloads which are not live migratable will be evicted.
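How the configured methods combine for a single VMI can be sketched as a small decision function. This is an illustration of the rules described in this section, not virt-operator code: plan_workload_update and its return values are hypothetical names.

```python
def plan_workload_update(methods: list, live_migratable: bool) -> str:
    # methods mirrors the configured update methods, e.g. ['LiveMigrate', 'Evict'].
    if live_migratable and 'LiveMigrate' in methods:
        return 'live-migrate'
    if 'Evict' in methods:
        return 'evict'
    # No applicable method: the VMI is left untouched.
    return 'untouched'
```

With only LiveMigrate configured, a VMI that cannot be live migrated is left untouched. With both methods configured, the same VMI is evicted, and migratable VMIs are always live migrated. An empty list models the default, where workloads are not updated.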
To delete KubeVirt, first delete the KubeVirt custom resource and then delete the KubeVirt operator.
$ export RELEASE=v0.17.0\n$ kubectl delete -n kubevirt kubevirt kubevirt --wait=true # --wait=true should anyway be default\n$ kubectl delete apiservices v1.subresources.kubevirt.io # this needs to be deleted to avoid stuck terminating namespaces\n$ kubectl delete mutatingwebhookconfigurations virt-api-mutator # not blocking but would be left over\n$ kubectl delete validatingwebhookconfigurations virt-operator-validator # not blocking but would be left over\n$ kubectl delete validatingwebhookconfigurations virt-api-validator # not blocking but would be left over\n$ kubectl delete -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-operator.yaml --wait=false\n
Note: If you deleted the operator first by mistake, the KubeVirt custom resource will get stuck in the Terminating state. To fix it, manually delete the finalizer from the resource.
Note: The apiservice and the webhookconfigurations need to be deleted manually due to a bug.
Currently, Node-labeller is partially supported on Arm64 platform. It does not yet support parsing virsh_domcapabilities.xml and capabilities.xml, and extracting related information such as CPU features.
As Hugepages are a precondition of the NUMA feature, and Hugepages are not enabled on the Arm64 platform, the NUMA feature does not work on Arm64.
"},{"location":"cluster_admin/virtual_machines_on_Arm64/#disks-and-volumes","title":"Disks and Volumes","text":"
Arm64 only supports virtio and scsi disk bus types.
"},{"location":"cluster_admin/virtual_machines_on_Arm64/#interface-and-networks","title":"Interface and Networks","text":""},{"location":"cluster_admin/virtual_machines_on_Arm64/#macvlan","title":"macvlan","text":"
We do not support macvlan network because the project https://github.com/kubevirt/macvtap-cni does not support Arm64.
Support for redirection of client's USB device was introduced in release v0.44. This feature is not enabled by default. To enable it, add an empty clientPassthrough under devices, as such:
This configuration currently adds 4 USB slots to the VMI that can only be used with virtctl.
There are two ways of redirecting the same USB device: either using the device's vendor and product information, or using the actual bus and device address information. In Linux, you can gather this information with lsusb; a redacted example is shown below:
"},{"location":"compute/client_passthrough/#using-vendor-and-product","title":"Using Vendor and Product","text":"
Redirecting the Kingston storage device.
virtctl usbredir 0951:1666 vmi-name\n
"},{"location":"compute/client_passthrough/#using-bus-and-device-address","title":"Using Bus and Device address","text":"
Redirecting the integrated camera
virtctl usbredir 01-03 vmi-name\n
"},{"location":"compute/client_passthrough/#requirements-for-virtctl-usbredir","title":"Requirements for virtctl usbredir","text":"
The virtctl command uses an application called usbredirect to handle the client's USB device by unplugging the device from the client OS and channeling the communication between the device and the VMI.
The usbredirect binary comes from the usbredir project and is supported by most Linux distros. You can either fetch the latest release or MSI installer for Windows support.
Managing USB devices requires privileged access in most operating systems. The user running virtctl usbredir needs to be privileged or run it in a privileged manner (e.g. with sudo).
The CPU hotplug feature was introduced in KubeVirt v1.0, making it possible to configure the VM workload to allow for adding or removing virtual CPUs while the VM is running.
A virtual CPU (vCPU) is the CPU that is seen by the guest VM OS. A VM owner can manage the number of vCPUs from the VM spec template using the CPU topology fields (spec.template.spec.domain.cpu). The cpu object has the integers cores, sockets and threads, and the number of vCPUs is calculated by the following formula: cores * sockets * threads.
Before CPU hotplug was introduced, the VM owner could change these integers in the VM template while the VM is running, and they were staged until the next boot cycle. With CPU hotplug, it is possible to patch the sockets integer in the VM template and the change will take effect right away.
For each new socket that is hot-plugged, the number of new vCPUs seen by the guest is cores * threads, since the overall number of vCPUs is cores * sockets * threads.
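The arithmetic above can be illustrated with a small sketch. The class and function names below are hypothetical helpers written for this page, not part of any KubeVirt API; only the cores * sockets * threads formula comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuTopology:
    """Guest CPU topology, mirroring the cores/sockets/threads integers."""
    cores: int = 1
    sockets: int = 1
    threads: int = 1

    def vcpus(self) -> int:
        # Total vCPUs visible to the guest: cores * sockets * threads.
        return self.cores * self.sockets * self.threads


def vcpus_added_by_hotplug(topology: CpuTopology, new_sockets: int) -> int:
    """vCPUs gained when only the sockets count is raised to new_sockets."""
    if new_sockets < topology.sockets:
        raise ValueError("this sketch only models adding sockets")
    # Each hot-plugged socket contributes cores * threads vCPUs.
    return (new_sockets - topology.sockets) * topology.cores * topology.threads


before = CpuTopology(cores=2, sockets=1, threads=2)
print(before.vcpus())                     # 4
print(vcpus_added_by_hotplug(before, 3))  # 8
```

Because only the sockets value changes during a hotplug, the guest always gains vCPUs in multiples of cores * threads.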
"},{"location":"compute/cpu_hotplug/#configure-the-workload-update-strategy","title":"Configure the workload update strategy","text":"
Current implementation of the hotplug process requires the VM to live-migrate. The migration will be triggered automatically by the workload updater. The workload update strategy in the KubeVirt CR must be configured with LiveMigrate, as follows:
"},{"location":"compute/cpu_hotplug/#configure-the-vm-rollout-strategy","title":"Configure the VM rollout strategy","text":"
Hotplug requires a VM rollout strategy of LiveUpdate, so that the changes made to the VM object propagate to the VMI without a restart. This is also done in the KubeVirt CR configuration:
Let's assume we have a running VM with 4 vCPUs, which were configured with sockets:4 cores:1 threads:1. In the VMI status we can observe the current CPU topology the VM is running with:
Please note the condition HotVCPUChange, which indicates that the hotplug process is taking place. You can also see the VirtualMachineInstanceMigration object that was created for this VM:
NAME PHASE VMI\nkubevirt-workload-update-kflnl Running vm-cirros\n
When the hotplug process has completed, the currentCPUTopology will be updated with the new number of sockets and the migration is marked as successful.
vCPU hotplug is currently not supported on the ARM64 architecture.
Current hotplug implementation involves live-migration of the VM workload.
"},{"location":"compute/dedicated_cpu_resources/","title":"Dedicated CPU resources","text":"
Certain workloads that require predictable latency and enhanced performance during their execution would benefit from obtaining dedicated CPU resources. KubeVirt, relying on the Kubernetes CPU manager, is able to pin a guest's vCPUs to the host's pCPUs.
"},{"location":"compute/dedicated_cpu_resources/#kubernetes-cpu-manager","title":"Kubernetes CPU manager","text":"
Kubernetes CPU manager is a mechanism that affects the scheduling of workloads, placing them on a host which can allocate Guaranteed resources and pinning certain Pod containers to host pCPUs, if the following requirements are met:
Pod's QoS is Guaranteed
resources requests and limits are equal
all containers in the Pod express CPU and memory requirements
Requested number of CPUs is an Integer
Additional information:
Enabling the CPU manager on Kubernetes
Enabling the CPU manager on OKD
Kubernetes blog explaining the feature
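As a rough illustration of the requirements listed above, the following sketch checks a simplified Pod description. The dictionary layout, the function names, and the container names are assumptions made for this example, and quantities are compared as plain strings, which is a simplification of how Kubernetes compares resource quantities.

```python
def is_whole_cpu(quantity: str) -> bool:
    """True for whole-number CPU quantities such as "2" or "2000m"."""
    if quantity.endswith("m"):
        millis = quantity[:-1]
        return millis.isdigit() and int(millis) > 0 and int(millis) % 1000 == 0
    return quantity.isdigit() and int(quantity) > 0


def pod_is_guaranteed(containers: list) -> bool:
    """Every container sets CPU and memory, with requests equal to limits."""
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            if resource not in requests or requests[resource] != limits.get(resource):
                return False
    return True


def pinnable_containers(containers: list) -> list:
    """Names of containers that satisfy the pinning requirements."""
    if not pod_is_guaranteed(containers):
        return []
    return [c["name"] for c in containers if is_whole_cpu(c["requests"]["cpu"])]


# Hypothetical pod: one container with whole CPUs, one with a fractional CPU.
pod = [
    {"name": "compute",
     "requests": {"cpu": "2", "memory": "1Gi"},
     "limits": {"cpu": "2", "memory": "1Gi"}},
    {"name": "sidecar",
     "requests": {"cpu": "200m", "memory": "64M"},
     "limits": {"cpu": "200m", "memory": "64M"}},
]
print(pinnable_containers(pod))  # ['compute']
```

The Pod as a whole still qualifies as Guaranteed here; only the container requesting a fractional CPU is left unpinned.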
"},{"location":"compute/dedicated_cpu_resources/#requesting-dedicated-cpu-resources","title":"Requesting dedicated CPU resources","text":"
Setting spec.domain.cpu.dedicatedCpuPlacement to true in a VMI spec indicates the desire to allocate dedicated CPU resources to the VMI.
KubeVirt will verify that all the necessary conditions are met for the Kubernetes CPU manager to pin the virt-launcher container to dedicated host CPUs. Once virt-launcher is running, the VMI's vCPUs will be pinned to the pCPUs that have been dedicated to the virt-launcher container.
The desired number of vCPUs for the VMI can be expressed either by setting the guest topology in spec.domain.cpu (sockets, cores, threads) or by setting spec.domain.resources.[requests/limits].cpu to a whole number ([1-9]+) indicating the number of vCPUs requested for the VMI. The number of vCPUs is counted as sockets * cores * threads, or, if spec.domain.cpu is empty, it is taken from spec.domain.resources.requests.cpu or spec.domain.resources.limits.cpu.
Note: Users should not specify both spec.domain.cpu and spec.domain.resources.[requests/limits].cpu
Note: spec.domain.resources.requests.cpu must be equal to spec.domain.resources.limits.cpu
Note: Multiple cpu-bound microbenchmarks show a significant performance advantage when using spec.domain.cpu.sockets instead of spec.domain.cpu.cores.
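The counting rule described above can be sketched as follows. This is only an illustration: the function name is hypothetical, the input is a plain dictionary shaped like spec.domain, and treating an omitted topology value as 1 is an assumption of this sketch.

```python
def dedicated_vcpu_count(domain: dict) -> int:
    """Number of vCPUs expressed by a dictionary shaped like spec.domain."""
    cpu = domain.get("cpu") or {}
    topology = [cpu.get(key) for key in ("sockets", "cores", "threads")]
    if any(value is not None for value in topology):
        count = 1
        for value in topology:
            # Assumption: an omitted topology value counts as 1.
            count *= value if value is not None else 1
        return count

    # No topology given: fall back to the CPU request, then the CPU limit.
    resources = domain.get("resources") or {}
    for section in ("requests", "limits"):
        value = (resources.get(section) or {}).get("cpu")
        if value is not None:
            text = str(value)
            if not text.isdigit() or int(text) < 1:
                raise ValueError("dedicated CPUs must be a whole number")
            return int(text)
    raise ValueError("no vCPU count expressed")


by_topology = {"cpu": {"dedicatedCpuPlacement": True, "sockets": 2, "cores": 2, "threads": 1}}
by_resources = {
    "cpu": {"dedicatedCpuPlacement": True},
    "resources": {"requests": {"cpu": "4"}, "limits": {"cpu": "4"}},
}
print(dedicated_vcpu_count(by_topology))   # 4
print(dedicated_vcpu_count(by_resources))  # 4
```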
"},{"location":"compute/dedicated_cpu_resources/#requesting-dedicated-cpu-for-qemu-emulator","title":"Requesting dedicated CPU for QEMU emulator","text":"
A number of QEMU threads, such as QEMU main event loop, async I/O operation completion, etc., also execute on the same physical CPUs as the VMI's vCPUs. This may affect the expected latency of a vCPU. In order to enhance the real-time support in KubeVirt and provide improved latency, KubeVirt will allocate an additional dedicated CPU, exclusively for the emulator thread, to which it will be pinned. This will effectively \"isolate\" the emulator thread from the vCPUs of the VMI. If ioThreadsPolicy is set to auto, IOThreads will also be \"isolated\" and placed on the same physical CPU as the QEMU emulator thread.
This functionality can be enabled by specifying isolateEmulatorThread: true inside the spec.domain.cpu section of the VMI spec. Naturally, this setting has to be specified in combination with dedicatedCpuPlacement: true.
KubeVirt will then add one or two dedicated CPUs for the emulator threads, in a way that completes the total CPU count to be even.
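Read literally, the sentence above implies a simple parity rule. The sketch below models only that reading; the function name is hypothetical and the actual allocation logic in KubeVirt may take more factors into account.

```python
def emulator_thread_cpus(vcpus: int) -> int:
    """Extra dedicated CPUs so that vcpus + extra is an even number."""
    if vcpus < 1:
        raise ValueError("a VMI needs at least one vCPU")
    # Odd vCPU count: one extra CPU makes the total even.
    # Even vCPU count: two extra CPUs keep the total even.
    return 1 if vcpus % 2 == 1 else 2


for vcpus in (3, 4):
    extra = emulator_thread_cpus(vcpus)
    print(vcpus, extra, vcpus + extra)
```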
"},{"location":"compute/dedicated_cpu_resources/#identifying-nodes-with-a-running-cpu-manager","title":"Identifying nodes with a running CPU manager","text":"
At this time, Kubernetes doesn't label the nodes that have the CPU manager running on them.
KubeVirt has a mechanism to identify which nodes have the CPU manager running and automatically add a cpumanager=true label to them. This label will be removed when KubeVirt identifies that the CPU manager is no longer running on the node. This automatic identification should be viewed as a temporary workaround until Kubernetes provides the required functionality. Therefore, this feature should be manually enabled by activating the CPUManager feature gate in the KubeVirt CR.
When automatic identification is disabled, a cluster administrator may manually add the above label to all the nodes where the CPU manager is running.
Nodes' labels are viewable: kubectl describe nodes
Administrators may manually label a missing node: kubectl label node [node_name] cpumanager=true
"},{"location":"compute/dedicated_cpu_resources/#sidecar-containers-and-cpu-allocation-overhead","title":"Sidecar containers and CPU allocation overhead","text":"
Note: In order to run sidecar containers, KubeVirt requires the Sidecar feature gate to be enabled in KubeVirt's CR.
According to the Kubernetes CPU manager model, for the Pod to reach the required QoS level Guaranteed, all containers in the Pod must express CPU and memory requirements. At this time, KubeVirt often uses a sidecar container to mount the VMI's registry disk. It also uses a sidecar container for its hooking mechanism. These additional resources can be viewed as overhead and should be taken into account when calculating node capacity.
Note: The current defaults for the sidecar's resources are CPU: 200m and Memory: 64M. As the CPU resource is not expressed as a whole number, the CPU manager will not attempt to pin the sidecar container to a host CPU.
KubeVirt provides a mechanism for assigning host devices to a virtual machine. This mechanism is generic and allows various types of PCI devices, such as accelerators (including GPUs) or any other devices attached to a PCI bus, to be assigned. It also allows Linux Mediated devices, such as pre-configured virtual GPUs to be assigned using the same mechanism.
"},{"location":"compute/host-devices/#host-preparation-for-pci-passthrough","title":"Host preparation for PCI Passthrough","text":"
Host Devices passthrough requires the virtualization extension and the IOMMU extension (Intel VT-d or AMD IOMMU) to be enabled in the BIOS.
To enable IOMMU, depending on the CPU type, a host should be booted with an additional kernel parameter, intel_iommu=on for Intel and amd_iommu=on for AMD.
Append these parameters to the end of the GRUB_CMDLINE_LINUX line in the grub configuration file.
The vfio-pci kernel module should be enabled on the host.
# modprobe vfio-pci\n
"},{"location":"compute/host-devices/#preparation-of-pci-devices-for-passthrough","title":"Preparation of PCI devices for passthrough","text":"
At this time, KubeVirt is only able to assign PCI devices that are using the vfio-pci driver. To prepare a specific device for device assignment, it should first be unbound from its original driver and bound to the vfio-pci driver.
"},{"location":"compute/host-devices/#preparation-of-mediated-devices-such-as-vgpu","title":"Preparation of mediated devices such as vGPU","text":"
In general, configuration of mediated devices (mdevs), such as vGPUs, should be done according to the vendor's directions. KubeVirt can now facilitate the creation of the mediated devices / vGPUs on the cluster nodes. This assumes that the required vendor driver is already installed on the nodes. See Mediated devices and virtual GPUs to learn more about this functionality.
Once the mdev is configured, KubeVirt will be able to discover and use it for device assignment.
Administrators can control which host devices are exposed and permitted to be used in the cluster. Permitted host devices need to be allowlisted in the KubeVirt CR by their vendor:product selector for PCI devices or by their mediated device names.
pciVendorSelector is a PCI vendor ID and product ID tuple in the form vendor_id:product_id. This tuple can identify specific types of devices on a host. For example, the identifier 10de:1eb8, shown above, can be found using lspci.
mdevNameSelector is a name of a Mediated device type that can identify specific types of Mediated devices on a host.
You can see what mediated types a given PCI device supports by examining the contents of /sys/bus/pci/devices/DOMAIN:BUS:SLOT.FUNCTION/mdev_supported_types/TYPE/name. For example, if you have an NVIDIA T4 GPU on your system, and you substitute the DOMAIN, BUS, SLOT, and FUNCTION values that are correct for your system into the above path name, you will see that a TYPE of nvidia-226 contains the selector string GRID T4-2A in its name file.
Taking GRID T4-2A and specifying it as the mdevNameSelector allows KubeVirt to find a corresponding mediated device by matching it against /sys/class/mdev_bus/DOMAIN:BUS:SLOT.FUNCTION/$mdevUUID/mdev_type/name for some values of DOMAIN:BUS:SLOT.FUNCTION and $mdevUUID.
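To look up the available selector strings without browsing sysfs by hand, something like the following sketch could be used. The function name is hypothetical and the PCI address in the usage line is a placeholder; the only thing taken from the text above is the mdev_supported_types/TYPE/name layout under the PCI device directory.

```python
from pathlib import Path


def mdev_type_names(pci_device_dir) -> dict:
    """Map each supported mdev TYPE to the contents of its name file."""
    types_dir = Path(pci_device_dir) / "mdev_supported_types"
    if not types_dir.is_dir():
        # The device (or its driver) does not expose mediated device types.
        return {}
    names = {}
    for type_dir in sorted(types_dir.iterdir()):
        name_file = type_dir / "name"
        if name_file.is_file():
            names[type_dir.name] = name_file.read_text().strip()
    return names


if __name__ == "__main__":
    # Placeholder address: substitute the values that match your system.
    for mdev_type, name in mdev_type_names("/sys/bus/pci/devices/0000:00:00.0").items():
        print(mdev_type, "->", name)
```

Each printed name is a candidate value for mdevNameSelector.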
External providers: externalResourceProvider field indicates that this resource is being provided by an external device plugin. In this case, KubeVirt will only permit the usage of this device in the cluster but will leave the allocation and monitoring to an external device plugin.
"},{"location":"compute/host-devices/#starting-a-virtual-machine","title":"Starting a Virtual Machine","text":"
Host devices can be assigned to virtual machines via the gpus and hostDevices fields. The deviceNames can reference both PCI and Mediated device resource names.
Passing through an NVMe device is very similar to the GPU case. The device needs to be listed under permittedHostDevices in the KubeVirt CR and under hostDevices in the VM declaration.
Currently, the KubeVirt device plugin doesn't allow the user to select a specific device by specifying the address. Therefore, if multiple NVMe devices with the same vendor and product id exist in the cluster, they could be randomly assigned to a VM. If the devices are not on the same node, then the nodeSelector mitigates the issue.
Cluster admin privilege to edit the KubeVirt CR in order to:
Enable the HostDevices feature gate
Edit the permittedHostDevices configuration to expose node USB devices to the cluster
"},{"location":"compute/host-devices/#exposing-usb-devices","title":"Exposing USB Devices","text":"
In order to assign USB devices to your VMI, you'll need to expose those devices to the cluster under a resource name. The device allowlist can be edited in KubeVirt CR under configuration.permittedHostDevices.usb.
For this example, we will use the kubevirt.io/storage resource name for the device with vendor: \"46f4\" and product: \"0001\".
After adding the usb configuration under permittedHostDevices to the KubeVirt CR, KubeVirt's device-plugin will expose this resource name and you can use it in your VMI.
"},{"location":"compute/host-devices/#adding-usb-to-your-vm","title":"Adding USB to your VM","text":"
Now, in the VMI configuration, you can add the devices.hostDevices.deviceName and reference the resource name provided in the previous step, and also give it a local name, for example:
You can find a working example, which uses QEMU's emulated USB storage, under examples/vmi-usb.yaml.
"},{"location":"compute/host-devices/#bundle-of-usb-devices","title":"Bundle of USB devices","text":"
You might want to redirect more than one USB device to a VMI, for example, a keyboard, a mouse and a smartcard device. The KubeVirt CR supports assigning multiple USB devices under the same resource name, so you could do:
To enable hugepages on Kubernetes, check the official documentation.
To enable hugepages on OKD, check the official documentation.
"},{"location":"compute/hugepages/#pre-allocate-hugepages-on-a-node","title":"Pre-allocate hugepages on a node","text":"
To pre-allocate hugepages at boot time, specify the hugepages kernel boot parameters, for example hugepagesz=2M hugepages=64, and restart your machine.
You can find more about hugepages in the official documentation.
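To get a feeling for how much memory a given set of boot parameters reserves, the arithmetic can be sketched as below. The function name is hypothetical, and the sketch ignores a hugepages= value that is not preceded by hugepagesz=, because the default page size in that case depends on the architecture.

```python
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def preallocated_hugepage_bytes(cmdline: str) -> int:
    """Total bytes reserved by hugepagesz=/hugepages= pairs on a kernel command line."""
    total = 0
    page_size = None
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == "hugepagesz":
            # For example "2M" becomes 2 * 1024**2 bytes.
            page_size = int(value[:-1]) * _UNITS[value[-1].upper()]
        elif key == "hugepages" and page_size is not None:
            total += int(value) * page_size
    return total


reserved = preallocated_hugepage_bytes("hugepagesz=2M hugepages=64")
print(reserved // 1024 ** 2, "MiB")  # 128 MiB
```

With the parameters from the text, 64 pages of 2 MiB each reserve 128 MiB of node memory that is no longer available to regular workloads.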
Live migration is a process during which a running Virtual Machine Instance moves to another compute node while the guest workload continues to run and remains accessible.
"},{"location":"compute/live_migration/#enabling-the-live-migration-support","title":"Enabling the live-migration support","text":"
Live migration is enabled by default in recent versions of KubeVirt. In versions prior to v0.56, it must be enabled through the feature gates: the feature gates field in the KubeVirt CR must be expanded by adding LiveMigration to it.
Virtual machines using a PersistentVolumeClaim (PVC) must have a shared ReadWriteMany (RWX) access mode to be live migrated.
Live migration is not allowed with a pod network binding of bridge interface type.
Live migration requires ports 49152 and 49153 to be available in the virt-launcher pod. If these ports are explicitly specified in a masquerade interface, live migration will not function.
Live migration requires the virt-launcher pod's primary network interface to have the same name on both source and target pods.
"},{"location":"compute/live_migration/#initiate-live-migration","title":"Initiate live migration","text":"
Live migration is initiated by posting a VirtualMachineInstanceMigration (VMIM) object to the cluster. The example below starts a migration process for a virtual machine instance vmi-fedora:
"},{"location":"compute/live_migration/#using-virtctl-to-initiate-live-migration","title":"Using virtctl to initiate live migration","text":"
Live migration can also be initiated using virtctl:
virtctl migrate vmi-fedora\n
"},{"location":"compute/live_migration/#migration-status-reporting","title":"Migration Status Reporting","text":""},{"location":"compute/live_migration/#condition-and-migration-method","title":"Condition and migration method","text":"
When a virtual machine instance starts, KubeVirt also calculates whether the machine is live migratable. The result is stored in VMI.status.conditions. The calculation can be based on multiple parameters of the VMI; however, at the moment it is largely based on the access mode of the VMI volumes. Live migration is only permitted when the volume access mode is set to ReadWriteMany. Requests to migrate a non-LiveMigratable VMI will be rejected.
The reported Migration Method is also calculated during VMI start. BlockMigration indicates that some of the VMI disks require copying from the source to the destination. LiveMigration means that only the instance memory will be copied.
The migration progress status is reported in VMI.status. Most importantly, it indicates whether the migration has Completed or Failed.
"},{"location":"compute/live_migration/#canceling-a-live-migration","title":"Canceling a live migration","text":"
Live migration can also be canceled by simply deleting the migration object. A successfully aborted migration will indicate that the abort was requested (Abort Requested) and that it succeeded (Abort Status: Succeeded). The migration in this case will be both Completed and Failed.
KubeVirt puts some limits in place, so that migrations don't overwhelm the cluster. By default, it is configured to only run 5 migrations in parallel with an additional limit of a maximum of 2 outbound migrations per node. Finally, every migration is limited to a bandwidth of 64MiB/s.
Bear in mind that most of these settings can be overridden and fine-tuned for a specific group of VMs. For more information, please see Migration Policies.
"},{"location":"compute/live_migration/#understanding-different-migration-strategies","title":"Understanding different migration strategies","text":"
Live migration is a complex process. During a migration, the source VM needs to transfer its whole state (mainly RAM) to the target VM. If there are enough resources available, such as network bandwidth and CPU power, migrations should converge nicely. If this is not the scenario, however, the migration might get stuck without an ability to progress.
The main factor that affects migrations from the guest perspective is its dirty rate, which is the rate at which the VM dirties memory. Guests with a high dirty rate lead to a race during migration: on the one hand, memory is transferred continuously to the target, and on the other, the same memory gets dirtied again by the guest. In such scenarios, consider using more advanced migration strategies.
Let's explain the 3 supported migration strategies as of today.
Pre-copy is the default strategy. It should be used for most cases.
It works as follows:
The target VM is created, but the guest keeps running on the source VM.
The source starts sending chunks of VM state (mostly memory) to the target. This continues until all of the state has been transferred to the target.
The guest starts executing on the target VM.
The source VM is removed.
Pre-copy is the safest and fastest strategy for most cases. Furthermore, it can be easily cancelled, can utilize multithreading, and more. If there is no real reason to use another strategy, this is definitely the strategy to go with.
However, in some cases migrations might not converge easily: by the time a chunk of source VM state is received by the target VM, it may already have been mutated on the source VM (which is the VM the guest executes on). There are many reasons for migrations to fail to converge, such as a high dirty rate or scarce resources like network bandwidth and CPU. In such scenarios, consider the alternative strategies below.
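The race between copying and dirtying memory can be made concrete with a deliberately simplified model. This is not how QEMU or KubeVirt compute anything: the function name, the fixed rates, and the stop threshold are all assumptions of this sketch. Each round re-sends the memory dirtied during the previous round, so the amount left shrinks only when the dirty rate is lower than the available bandwidth.

```python
def precopy_rounds(memory_mib, bandwidth_mib_s, dirty_rate_mib_s,
                   stop_below_mib=64.0, max_rounds=50):
    """Copy rounds needed until the remaining dirty memory is small enough.

    Returns None if the model does not converge within max_rounds.
    """
    remaining = float(memory_mib)
    for rounds in range(max_rounds + 1):
        if remaining <= stop_below_mib:
            return rounds
        # Time needed to send what is currently outstanding.
        seconds = remaining / bandwidth_mib_s
        # Memory dirtied meanwhile must be sent again in the next round.
        remaining = min(float(memory_mib), dirty_rate_mib_s * seconds)
    return None


print(precopy_rounds(4096, bandwidth_mib_s=64, dirty_rate_mib_s=16))  # 3
print(precopy_rounds(4096, bandwidth_mib_s=64, dirty_rate_mib_s=64))  # None
```

In the second call the guest dirties memory as fast as it can be transferred, so the outstanding amount never shrinks and the model never converges.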
Post-copy migrations work as follows:
The target VM is created.
The guest starts running on the target VM.
The source starts sending chunks of VM state (mostly memory) to the target.
When the guest, running on the target VM, accesses memory:
If the memory exists on the target VM, the guest can access it.
Otherwise, the target VM asks for a chunk of memory from the source VM.
Once all of the memory state is updated at the target VM, the source VM is removed.
The main idea here is that the guest starts to run immediately on the target VM. This approach has advantages and disadvantages:
advantages:
The same memory chunk is never transferred twice. This is possible because with post-copy it doesn't matter that a page has been dirtied, since the guest is already running on the target VM.
This means that a high dirty-rate has much less effect.
Consumes less network bandwidth.
disadvantages:
When using post-copy, the VM state has no single source of truth. When the guest (running on the target VM) writes to memory, this memory is one part of the guest's state, but some other parts of it may still be up to date only at the source VM. This situation is generally dangerous, since, for example, if either the source or the target VM crashes, the state cannot be recovered.
Slow warmup: when the guest starts executing, no memory is present at the target VM. Therefore, the guest has to wait for a lot of memory to arrive in a short period of time.
Auto-converge is a technique to help pre-copy migrations converge faster without changing the core algorithm of how the migration works.
Since a high dirty rate is usually the most significant factor preventing migrations from converging, auto-converge simply throttles the guest's CPU. If the migration converges fast enough, the guest's CPU is not throttled, or is throttled only negligibly. But if the migration does not converge fast enough, the CPU is throttled more and more as time goes on.
This technique dramatically increases the probability of the migration converging eventually.
"},{"location":"compute/live_migration/#using-a-different-network-for-migrations","title":"Using a different network for migrations","text":"
Live migrations can be configured to happen on a different network than the one Kubernetes is configured to use. That potentially allows for more determinism, control and/or bandwidth, depending on use-cases.
"},{"location":"compute/live_migration/#creating-a-migration-network-on-a-cluster","title":"Creating a migration network on a cluster","text":"
A separate physical network is required, meaning that every node on the cluster has to have at least 2 NICs, and the NICs that will be used for migrations need to be interconnected, i.e. all plugged to the same switch. The examples below assume that eth1 will be used for migrations.
It is also required for the Kubernetes cluster to have multus installed.
If the desired network doesn't include a DHCP server, then whereabouts will be needed as well.
Finally, a NetworkAttachmentDefinition needs to be created in the namespace where KubeVirt is installed. Here is an example:
"},{"location":"compute/live_migration/#configuring-kubevirt-to-migrate-vmis-over-that-network","title":"Configuring KubeVirt to migrate VMIs over that network","text":"
This is just a matter of adding the name of the NetworkAttachmentDefinition to the KubeVirt CR, like so:
That change will trigger a restart of the virt-handler pods, as they get connected to that new network.
From now on, migrations will happen over that network.
"},{"location":"compute/live_migration/#configuring-kubevirtci-for-testing-migration-networks","title":"Configuring KubeVirtCI for testing migration networks","text":"
Developers and people wanting to test the feature before deploying it on a real cluster might want to configure a dedicated migration network in KubeVirtCI.
KubeVirtCI can simply be configured to include a virtual secondary network, as well as automatically install multus and whereabouts. The following environment variables just have to be declared before running make cluster-up:
Depending on the type, the live migration process will copy virtual machine memory pages and disk blocks to the destination. During this process non-locked pages and blocks are being copied and become free for the instance to use again. To achieve a successful migration, it is assumed that the instance will write to the free pages and blocks (pollute the pages) at a lower rate than these are being copied.
In some cases the virtual machine can have a high dirty-rate, which means it will write to different memory pages / disk blocks at a higher rate than these can be copied over. This situation will prevent the migration process from completing in a reasonable amount of time.
In this case, a timeout can be defined so that live migration will either be aborted or switched to post-copy mode (if it's enabled) if it is running for a long period of time.
The timeout is calculated based on the size of the VMI, its memory and the ephemeral disks that are needed to be copied. The configurable parameter completionTimeoutPerGiB, which defaults to 800s, is the maximum amount of time per GiB of data allowed before the migration gets aborted / switched to post-copy mode. For example, with the default value, a VMI with 8GiB of memory will time-out after 6400 seconds.
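The timeout arithmetic described above is a simple product, sketched here with a hypothetical helper. The split into memory and ephemeral disk sizes is an assumption about how the "size of the VMI" is composed; only the 800s default and the 8GiB example come from the text.

```python
def completion_timeout_seconds(memory_gib, ephemeral_disk_gib=0.0,
                               completion_timeout_per_gib=800):
    """Seconds a migration may run before it is aborted or switched to post-copy."""
    size_gib = memory_gib + ephemeral_disk_gib
    return size_gib * completion_timeout_per_gib


print(completion_timeout_seconds(8))  # 6400
```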
Live migration will also be aborted when copying memory doesn't make any progress. The time to wait for live migration to make progress in transferring data is configurable by the progressTimeout parameter, which defaults to 150s.
Note: While this increases performance it may allow MITM attacks. Be careful.
"},{"location":"compute/mediated_devices_configuration/","title":"Mediated devices and virtual GPUs","text":""},{"location":"compute/mediated_devices_configuration/#configuring-mediated-devices-and-virtual-gpus","title":"Configuring mediated devices and virtual GPUs","text":"
KubeVirt aims to facilitate the configuration of mediated devices on large clusters. Administrators can use the mediatedDevicesConfiguration API in the KubeVirt CR to create or remove mediated devices in a declarative way, by providing a list of the desired mediated device types that they expect to be configured in the cluster.
You can also include the nodeMediatedDeviceTypes option to provide a more specific configuration that targets a specific node or a group of nodes directly with a node selector. The nodeMediatedDeviceTypes option must be used in combination with mediatedDevicesTypes in order to override the global configuration set in the mediatedDevicesTypes section.
KubeVirt will use the provided configuration to automatically create the relevant mdev/vGPU devices on nodes that can support it.
Currently, a single mdev type per card will be configured. The maximum number of instances of the selected mdev type will be configured per card.
Note: Some vendors, such as NVIDIA, require a driver to be installed on the nodes to provide mediated devices, including vGPUs.
Example snippet of a KubeVirt CR configuration that includes both nodeMediatedDeviceTypes and mediatedDevicesTypes:
"},{"location":"compute/mediated_devices_configuration/#configuration-scenarios","title":"Configuration scenarios","text":""},{"location":"compute/mediated_devices_configuration/#example-large-cluster-with-multiple-cards-on-each-node","title":"Example: Large cluster with multiple cards on each node","text":"
On nodes with multiple cards that can support similar vGPU types, the relevant desired types will be created in a round-robin manner.
For example, considering the following KubeVirt CR configuration:
This cluster has nodes with two different PCIe cards:
Nodes with 3 Tesla T4 cards, where each card can support multiple device types:
nvidia-222
nvidia-223
nvidia-228
...
Nodes with 2 Tesla V100 cards, where each card can support multiple device types:
nvidia-105
nvidia-108
nvidia-217
nvidia-299
...
KubeVirt will then create the following devices:
Nodes with 3 Tesla T4 cards will be configured with:
16 vGPUs of type nvidia-222 on card 1
2 vGPUs of type nvidia-228 on card 2
16 vGPUs of type nvidia-222 on card 3
Nodes with 2 Tesla V100 cards will be configured with:
16 vGPUs of type nvidia-105 on card 1
2 vGPUs of type nvidia-108 on card 2
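The round-robin selection in this scenario can be sketched as follows. This is an illustration rather than the actual KubeVirt implementation: the function name is hypothetical, and the list of desired types is an assumption chosen to be consistent with the outcome described above, since the configuration itself is not reproduced here.

```python
def assign_types(desired_types, cards):
    """Pick one mdev type per card, cycling through the supported desired types.

    cards is a list with one set of supported type names per card.
    """
    assignment = []
    next_index = 0
    for supported in cards:
        # Keep the order of the desired list, dropping unsupported types.
        candidates = [t for t in desired_types if t in supported]
        if not candidates:
            assignment.append(None)
            continue
        assignment.append(candidates[next_index % len(candidates)])
        next_index += 1
    return assignment


# Assumed desired list, in the order an administrator might have written it.
desired = ["nvidia-222", "nvidia-228", "nvidia-105", "nvidia-108"]
t4 = {"nvidia-222", "nvidia-223", "nvidia-228"}
v100 = {"nvidia-105", "nvidia-108", "nvidia-217", "nvidia-299"}

print(assign_types(desired, [t4, t4, t4]))  # ['nvidia-222', 'nvidia-228', 'nvidia-222']
print(assign_types(desired, [v100, v100]))  # ['nvidia-105', 'nvidia-108']
```

With a single card the same function simply returns the first supported type from the desired list.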
"},{"location":"compute/mediated_devices_configuration/#example-single-card-on-a-node-multiple-desired-vgpu-types-are-supported","title":"Example: Single card on a node, multiple desired vGPU types are supported","text":"
When nodes only have a single card, the first supported type from the list will be configured.
For example, consider the following list of desired types, where nvidia-223 and nvidia-224 are supported:
In this case, nvidia-223 will be configured on the node because it is the first supported type in the list."},{"location":"compute/mediated_devices_configuration/#overriding-configuration-on-a-specifc-node","title":"Overriding configuration on a specific node","text":"
To override the global configuration set by mediatedDevicesTypes, include the nodeMediatedDeviceTypes option, specifying the node selector and the mediatedDevicesTypes that you want to override for that node.
"},{"location":"compute/mediated_devices_configuration/#example-overriding-the-configuration-for-a-specific-node-in-a-large-cluster-with-multiple-cards-on-each-node","title":"Example: Overriding the configuration for a specific node in a large cluster with multiple cards on each node","text":"
In this example, the KubeVirt CR includes the nodeMediatedDeviceTypes option to override the global configuration specifically for node 2, which will only use the nvidia-234 type.
The cluster has two nodes that both have 3 Tesla T4 cards.
Each card can support a long list of types, including:
nvidia-222
nvidia-223
nvidia-224
nvidia-230
...
KubeVirt will then create the following devices:
Node 1
type nvidia-230 on card 1
type nvidia-223 on card 2
Node 2
type nvidia-234 on card 1 and card 2
Node 1 has been configured in a round-robin manner based on the global configuration but node 2 only uses the nvidia-234 that was specified for it.
"},{"location":"compute/mediated_devices_configuration/#updating-and-removing-vgpu-types","title":"Updating and Removing vGPU types","text":"
Changes made to the mediatedDevicesTypes section of the KubeVirt CR will trigger a re-evaluation of the configured mdevs/vGPU types on the cluster nodes.
Any change to the node labels that match the nodeMediatedDeviceTypes nodeSelector in the KubeVirt CR will trigger a similar re-evaluation.
Consequently, mediated devices will be reconfigured or entirely removed based on the updated configuration.
"},{"location":"compute/mediated_devices_configuration/#assigning-vgpumdev-to-a-virtual-machine","title":"Assigning vGPU/MDEV to a Virtual Machine","text":"
See the Host Devices Assignment to learn how to consume the newly created mediated devices/vGPUs.
KubeVirt supports getting a VM memory dump for analysis purposes. The memory dump can be used to diagnose, identify and resolve issues in the VM. It typically provides information about the last state of the programs, applications and system before they were terminated or crashed.
Note: This memory dump is not used for saving VM state and resuming it later.
The memory dump process mounts a PVC to the virt-launcher pod in order to write the output to that PVC, so the hotplug volumes feature gate must be enabled: add HotplugVolumes to the feature gates field in the KubeVirt CR.
The PVC must be big enough to hold the memory dump. The required size is (VMMemorySize + 100Mi) * FileSystemOverhead, where VMMemorySize is the memory size of the VM, 100Mi is space reserved for the memory dump overhead, and FileSystemOverhead is the value used to adjust the requested PVC size for the filesystem overhead. The PVC must also have the Filesystem volume mode.
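As a worked example of the size formula above, here is a minimal sketch. It assumes FileSystemOverhead is expressed as a multiplier (for example 1.055 for a 5.5% overhead); the function name and that example value are hypothetical.

```python
import math

MI = 1024 * 1024  # bytes in one mebibyte


def memory_dump_pvc_size(vm_memory_bytes, fs_overhead_factor):
    # (VMMemorySize + 100Mi) * FileSystemOverhead, rounded up to whole bytes.
    reserved = 100 * MI
    return math.ceil((vm_memory_bytes + reserved) * fs_overhead_factor)


# A VM with 1Gi of memory and an assumed overhead multiplier of 1.055.
size = memory_dump_pvc_size(1024 * MI, 1.055)
print(size, 'bytes, roughly', math.ceil(size / MI), 'Mi')
```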
By adding the --output flag, the memory will be dumped to the PVC and then downloaded to the given output path.
$ virtctl memory-dump get myvm --claim-name=memoryvolume --create-claim --output=memoryDump.dump.gz\n
For downloading the last memory dump from the PVC associated with the VM, without triggering another memory dump, use the memory dump download command.
During the process, the volumeStatus on the VMI is updated with process information such as the attachment pod information and messages. If all goes well, once the process is completed the PVC is unmounted from the virt-launcher pod and the volumeStatus is deleted. A memory dump annotation with the memory dump file name is added to the PVC.
"},{"location":"compute/memory_dump/#retriggering-the-memory-dump","title":"Retriggering the memory dump","text":"
Getting a new memory dump to the same PVC is possible without the need to use any flag:
$ virtctl memory-dump get my-vm\n
Note: Each memory-dump command will delete the previous dump in that PVC.
To get a memory dump to a different PVC, you need to remove the current memory dump PVC and then run a new get with the new PVC name.
To remove the associated memory dump PVC, run the memory-dump remove command. This allows you to replace the current PVC and get the memory dump to a new one.
$ virtctl memory-dump remove my-vm\n
"},{"location":"compute/memory_dump/#handle-the-memory-dump","title":"Handle the memory dump","text":"
Once the memory dump process is completed, the PVC will hold the output. You can manage the dump in one of the following ways:
- Download the memory dump.
- Create a pod with troubleshooting tools that mounts the PVC, and inspect the dump within the pod.
- Include the memory dump in the VM Snapshot (which will include both the memory dump and the disks) to save a snapshot of the VM at that point in time and inspect it when needed. The VM Snapshot can be exported and downloaded.
The output of the memory dump can be inspected with memory analysis tools, for example Volatility3.
"},{"location":"compute/memory_hotplug/#configure-the-workload-update-strategy","title":"Configure the Workload Update Strategy","text":"
Configure LiveMigrate as workloadUpdateStrategy in the KubeVirt CR, since the current implementation of the hotplug process requires the VM to live-migrate.
"},{"location":"compute/memory_hotplug/#configure-the-vm-rollout-strategy","title":"Configure the VM rollout strategy","text":"
Finally, set the VM rollout strategy to LiveUpdate, so that the changes made to the VM object propagate to the VMI without a restart. This is also done in the KubeVirt CR configuration:
NOTE: If memory hotplug is enabled/disabled on an already running VM, a reboot is necessary for the changes to take effect.
More information can be found on the VM Rollout Strategies page.
"},{"location":"compute/memory_hotplug/#optional-set-a-cluster-wide-maximum-amount-of-memory","title":"[OPTIONAL] Set a cluster-wide maximum amount of memory","text":"
You can set the maximum amount of memory for the guest using a cluster level setting in the KubeVirt CR.
The VM-level configuration will take precedence over the cluster-wide one.
"},{"location":"compute/memory_hotplug/#memory-hotplug-in-action","title":"Memory Hotplug in Action","text":"
First we enable the VMLiveUpdateFeatures feature gate, set the rollout strategy to LiveUpdate and set LiveMigrate as workloadUpdateStrategy in the KubeVirt CR.
The Virtual Machine will automatically start and once booted it will report the currently available memory to the guest in the status.memory field inside the VMI.
$ kubectl get vmi vm-cirros -o json | jq .status.memory\n
After the hotplug request is processed and the Virtual Machine is live migrated, the new amount of memory should be available to the guest and visible in the VMI object.
$ kubectl get vmi vm-cirros -o json | jq .status.memory\n
Setting spec.nodeSelector requirements constrains the scheduler to only schedule VMs on nodes that carry the specified labels. In the following example the VMI specifies the labels cpu: slow and storage: fast:
Thus the scheduler will only schedule the VMI to nodes that have these labels in their metadata. It works exactly like the Pod nodeSelector. See the Pod nodeSelector Documentation for more examples.
"},{"location":"compute/node_assignment/#affinity-and-anti-affinity","title":"Affinity and anti-affinity","text":"
The spec.affinity field allows specifying hard- and soft-affinity for VMs. It is possible to write matching rules against workloads (VMs and Pods) and Nodes. Since VMs are a workload type based on Pods, Pod-affinity affects VMs as well.
An example for podAffinity and podAntiAffinity may look like this:
Affinity and anti-affinity work exactly like Pod affinity. This includes podAffinity, podAntiAffinity and nodeAffinity; node anti-affinity is expressed through nodeAffinity rules with operators such as NotIn. See the Pod affinity and anti-affinity Documentation for more examples and details.
"},{"location":"compute/node_assignment/#taints-and-tolerations","title":"Taints and Tolerations","text":"
Affinity, as described above, is a property of VMs that attracts them to a set of nodes (either as a preference or a hard requirement). Taints are the opposite: they allow a node to repel a set of VMs.
Taints and tolerations work together to ensure that VMs are not scheduled onto inappropriate nodes. One or more taints are applied to a node; this marks that the node should not accept any VMs that do not tolerate the taints. Tolerations are applied to VMs, and allow (but do not require) the VMs to schedule onto nodes with matching taints.
You add a taint to a node using kubectl taint. For example,
"},{"location":"compute/node_assignment/#node-balancing-with-descheduler","title":"Node balancing with Descheduler","text":"
In some cases we might need to rebalance the cluster based on the current scheduling policy and load conditions. Descheduler can find pods that violate, for example, scheduling decisions and evict them based on descheduler policies. KubeVirt VMs are handled as pods with local storage, so by default the descheduler will not evict them. This can be easily overridden by adding a special annotation to the VMI template in the VM:
This annotation allows the descheduler to evict the VM's pod, which can then be scheduled by the scheduler onto a different node. A VirtualMachine will never restart or re-create a VirtualMachineInstance until the current instance of the VirtualMachineInstance is deleted from the cluster.
When the VM rollout strategy is set to LiveUpdate, changes to a VM's node selector or affinities will dynamically propagate to the VMI (unless the RestartRequired condition is set). Changes to tolerations will not dynamically propagate, and will trigger a RestartRequired condition if changed on a running VM.
Modifications of the node selector or affinities will only take effect on the next migration. The change alone will not trigger one.
KubeVirt does not yet support classical Memory Overcommit Management or Memory Ballooning. In other words VirtualMachineInstances can't give back memory they have allocated. However, a few other things can be tweaked to reduce the memory footprint and overcommit the per-VMI memory overhead.
"},{"location":"compute/node_overcommit/#remove-the-graphical-devices","title":"Remove the Graphical Devices","text":"
The first and safest option to reduce the memory footprint is removing the graphical device from the VMI by setting spec.domain.devices.autoattachGraphicsDevice to false. See the video and graphics device documentation for further details and examples.
This will save a constant amount of 16MB per VirtualMachineInstance but also disable VNC access.
"},{"location":"compute/node_overcommit/#overcommit-the-guest-overhead","title":"Overcommit the Guest Overhead","text":"
Before you continue, make sure you are familiar with the Out of Resource Management of Kubernetes.
Every VirtualMachineInstance requests slightly more memory from Kubernetes than what was requested by the user for the Operating System. The additional memory is used for the per-VMI overhead consisting of our infrastructure which is wrapping the actual VirtualMachineInstance process.
In order to increase the VMI density on the node, it is possible to not request the additional overhead by setting spec.domain.resources.overcommitGuestOverhead to true:
This will work fine for as long as most of the VirtualMachineInstances will not request the whole memory. That is especially the case if you have short-lived VMIs. But if you have long-lived VirtualMachineInstances or do extremely memory intensive tasks inside the VirtualMachineInstance, your VMIs will use all memory they are granted sooner or later.
The third option is real memory overcommit on the VMI. In this scenario the VMI is explicitly told that it has more memory available than what is requested from the cluster by setting spec.domain.memory.guest to a value higher than spec.domain.resources.requests.memory.
The following definition requests 1024MB from the cluster but tells the VMI that it has 2048MB of memory available:
As long as there is enough free memory available on the node, the VMI can consume up to 2048MB. This VMI will get the Burstable resource class assigned by Kubernetes (see QoS classes in Kubernetes for more details). The same eviction rules as for Pods apply to the VMI if the node comes under memory pressure.
Implicit memory overcommit is disabled by default. This means that when the memory request is not specified, it is set to match spec.domain.memory.guest. However, it can be enabled using spec.configuration.developerConfiguration.memoryOvercommit in the KubeVirt CR. For example, by setting memoryOvercommit: \"150\" we define that when the memory request is not explicitly set, it will be implicitly set to achieve a memory overcommit of 150%. For instance, with spec.domain.memory.guest: 3072M, an omitted memory request is set to 2048M. Note that the actual memory request depends on additional configuration options like OvercommitGuestOverhead.
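The arithmetic behind the 150% example can be sketched as follows. This is an illustration under simplifying assumptions: the function name is hypothetical, values are plain integers in M, and additional adjustments such as the guest overhead are ignored.

```python
def implicit_memory_request(guest_memory_m, overcommit_percent):
    # The request is the guest memory scaled down by the overcommit percentage.
    # Integer division is a simplification; the actual rounding is not specified here.
    return guest_memory_m * 100 // overcommit_percent


print(implicit_memory_request(3072, 150))  # 2048
print(implicit_memory_request(3072, 100))  # 3072, no overcommit
```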
"},{"location":"compute/node_overcommit/#configuring-the-memory-pressure-behavior-of-nodes","title":"Configuring the memory pressure behavior of nodes","text":"
If the node gets under memory pressure, depending on the kubelet configuration the virtual machines may get killed by the OOM handler or by the kubelet itself. It is possible to tweak that behaviour based on the requirements of your VirtualMachineInstances by:
Configuring Soft Eviction Thresholds
Configuring Hard Eviction Thresholds
Requesting the right QoS class for VirtualMachineInstances
Note: Soft Eviction will effectively shut down VirtualMachineInstances. They are not paused, hibernated or migrated. Further, Soft Eviction is disabled by default.
If configured, VirtualMachineInstances get evicted once the available memory falls below the threshold specified via --eviction-soft, and the VirtualMachineInstance is given the chance to perform a shutdown within a timespan specified via --eviction-max-pod-grace-period. The flag --eviction-soft-grace-period specifies how long a soft eviction condition must hold before soft evictions are triggered.
If set properly according to the demands of the VMIs, overcommitting should only lead to soft evictions in rare cases for some VMIs. They may even get re-scheduled to the same node with less initial memory demand. For some workload types, this can be perfectly fine and lead to better overall memory-utilization.
"},{"location":"compute/node_overcommit/#configuring-hard-eviction-thresholds","title":"Configuring Hard Eviction Thresholds","text":"
Note: If unspecified, the kubelet will do hard evictions for Pods once memory.available falls below 100Mi.
Limits set via --eviction-hard will lead to immediate eviction of VirtualMachineInstances or Pods. This stops VMIs without a grace period and is comparable with power-loss on a real computer.
If the hard limit is hit, VMIs may from time to time simply be killed. They may be re-scheduled to the same node immediately again, since they start with less memory consumption again. This can be a simple option, if the memory threshold is only very seldom hit and the work performed by the VMIs is reproducible or it can be resumed from some checkpoints.
"},{"location":"compute/node_overcommit/#requesting-the-right-qos-class-for-virtualmachineinstances","title":"Requesting the right QoS Class for VirtualMachineInstances","text":"
Different QoS classes get assigned to Pods and VirtualMachineInstances based on the requests.memory and limits.memory. KubeVirt right now supports the QoS classes Burstable and Guaranteed. Burstable VMIs are evicted before Guaranteed VMIs.
This allows creating two classes of VMIs:
One type can have equal requests.memory and limits.memory set and therefore gets the Guaranteed class assigned. This one will not get evicted and should never run into memory issues, but is more demanding.
One type can have no limits.memory or a limits.memory which is greater than requests.memory and therefore gets the Burstable class assigned. These VMIs will be evicted first.
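The two classes described above can be expressed as a small decision function. This sketch mirrors only the memory rule stated in this section; Kubernetes also takes CPU requests and limits into account when assigning QoS classes, and the function name is hypothetical.

```python
def memory_qos_class(requests_memory, limits_memory=None):
    # Equal request and limit: Guaranteed (evicted last).
    if limits_memory is not None and limits_memory == requests_memory:
        return 'Guaranteed'
    # No limit, or a limit greater than the request: Burstable (evicted first).
    return 'Burstable'


print(memory_qos_class(1024, 1024))  # Guaranteed
print(memory_qos_class(1024, 2048))  # Burstable
print(memory_qos_class(1024))        # Burstable
```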
"},{"location":"compute/node_overcommit/#setting-system-reserved-and-kubelet-reserved","title":"Setting --system-reserved and --kubelet-reserved","text":"
It may be important to reserve some memory for other daemons (not DaemonSets) which are running on the same node (ssh, dhcp servers, etc). The reservation can be done with the --system-reserved switch. Further, a separate flag exists for the kubelet and the container runtime: --kube-reserved (the kubelet's name for this flag).
The KSM (Kernel same-page merging) daemon can be started on the node. Depending on its tuning parameters, it can more or less aggressively try to merge identical pages between applications and VirtualMachineInstances. The more aggressively it is configured, the more CPU it will use itself, so the memory overcommit advantage comes with a slight CPU performance hit.
Config file tuning allows changes to scanning frequency (how often will KSM activate) and aggressiveness (how many pages per second will it scan).
Note: This will definitely make sure that your VirtualMachines can't crash or get evicted from the node but it comes with the cost of pretty unpredictable performance once the node runs out of memory and the kubelet may not detect that it should evict Pods to increase the performance again.
Enabling swap is in general not recommended on Kubernetes right now. However, it can be useful in combination with KSM, since KSM merges identical pages over time. Swap allows the VMIs to successfully allocate memory which will then effectively never be used because of the later de-duplication done by KSM.
"},{"location":"compute/node_overcommit/#node-cpu-allocation-ratio","title":"Node CPU allocation ratio","text":"
KubeVirt runs Virtual Machines in a Kubernetes Pod. This pod requests a certain amount of CPU time from the host. On the other hand, the Virtual Machine is created with a certain number of vCPUs. The number of vCPUs does not necessarily correlate with the number of CPUs requested by the pod. Depending on the QoS of the pod, vCPUs can be scheduled on a variable number of physical CPUs; this depends on the available CPU resources on a node. When fewer CPUs are available on the node than the requested vCPUs, the vCPUs will be overcommitted.
By default, each pod requests 100m of CPU time. The CPU requested on the pod sets the cgroups cpu.shares, which serves as a priority for the scheduler to provide CPU time for vCPUs in this pod. As the number of vCPUs increases, this reduces the amount of CPU time each vCPU may get when competing with other processes on the node or with other Virtual Machine Instances that have fewer vCPUs.
The cpuAllocationRatio normalizes the amount of CPU time the pod will request based on the number of vCPUs: pod CPU request = number of vCPUs * 1/cpuAllocationRatio. When cpuAllocationRatio is set to 1, a full CPU is requested for each vCPU of the pod.
Note: In Kubernetes, one full core is 1000m of CPU time. More Information
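To make the formula concrete, the sketch below computes the pod CPU request in millicores for a few ratios. The function name is hypothetical and the integer rounding is an assumption; the example ratios are chosen for illustration only.

```python
def pod_cpu_request_millicores(vcpus, cpu_allocation_ratio):
    # pod CPU request = number of vCPUs * 1/cpuAllocationRatio,
    # expressed in millicores (one full core is 1000m).
    return vcpus * 1000 // cpu_allocation_ratio


print(pod_cpu_request_millicores(4, 1))   # 4000: one full core per vCPU
print(pod_cpu_request_millicores(4, 10))  # 400: 100m per vCPU
```

A higher ratio therefore lowers the request per vCPU and allows more vCPUs to be packed onto the same node.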
Administrators can change this ratio by updating the KubeVirt CR.
NUMA support in KubeVirt is at this stage limited to a small set of special use-cases and will improve over time together with improvements made to Kubernetes.
In general, the goal is to map the host NUMA topology as efficiently as possible to the Virtual Machine topology to improve the performance.
The following NUMA mapping strategies can be used:
GuestMappingPassthrough will pass through the node numa topology to the guest. The topology is based on the dedicated CPUs which the VMI got assigned from the kubelet via the CPU Manager. It can be requested by setting spec.domain.cpu.numa.guestMappingPassthrough on the VMI.
Since KubeVirt does not know upfront which exclusive CPUs the VMI will get from the kubelet, there are some limitations:
Guests may see different NUMA topologies when being rescheduled.
The resulting NUMA topology may be asymmetrical.
The VMI may fail to start on the node if not enough hugepages are available on the assigned NUMA nodes.
While this NUMA modelling strategy has its limitations, aligning the guest's NUMA architecture with the node's can be critical for high-performance applications.
It is possible to deploy Virtual Machines that run a real-time kernel and make use of libvirtd's guest CPU and memory optimizations that improve the overall latency. These changes mostly leverage settings that are already available in KubeVirt, as we will see shortly, but the VMI manifest now exposes two new settings that instruct KubeVirt to configure the generated libvirt XML with the recommended tuning settings for running real-time workloads.
To make use of the optimized settings, two new settings have been added to the VMI schema:
spec.domain.cpu.realtime: When defined, it instructs KubeVirt to configure the Linux scheduler for the vCPUs to run processes in FIFO scheduling policy (SCHED_FIFO) with priority 1. This setting guarantees that all processes running in the host will be executed with real-time priority.
spec.domain.cpu.realtime.mask: It defines which vCPUs assigned to the VM are used for real-time. If not defined, libvirt will configure all assigned vCPUs to run processes in FIFO scheduling with the highest priority (1).
A prerequisite to running real-time workloads is locking resources in the cluster to allow the real-time VM exclusive usage. This translates into one or more nodes that have been configured with a dedicated set of CPUs and that also support NUMA with a free number of hugepages of 2Mi or 1Gi size (depending on the configuration in the VMI). Additionally, the node must be configured to allow the scheduler to run processes with real-time policy.
"},{"location":"compute/numa/#nodes-capable-of-running-real-time-workloads","title":"Nodes capable of running real-time workloads","text":"
When the KubeVirt pods are deployed in a node, it will check if it is capable of running processes in real-time scheduling policy and label the node as real-time capable (kubevirt.io/realtime). If, on the other hand, the node is not able to deliver such capability, the label is not applied. To check which nodes are able to host real-time VM workloads run this command:
$ kubectl get nodes -l kubevirt.io/realtime\nNAME STATUS ROLES AGE VERSION\nworker-0-0 Ready worker 12d v1.20.0+df9c838\n
Internally, the KubeVirt pod running on each node checks whether the kernel setting kernel.sched_rt_runtime_us equals -1, which allows processes to run in real-time scheduling policy for an unlimited amount of time.
"},{"location":"compute/numa/#configuring-a-vm-manifest","title":"Configuring a VM Manifest","text":"
Here is an example of a VM manifest that runs a custom fedora container disk configured to run with a real-time kernel. The settings have been configured for optimal efficiency.
CPU:
- model: host-passthrough to allow the guest to see the host CPU without masking any capability.
- dedicated CPU Placement: The VM needs to have dedicated CPUs assigned to it. The Kubernetes CPU Manager takes care of this aspect.
- isolatedEmulatorThread: to request an additional CPU to run the emulator on, thus avoiding the use of CPU cycles from the workload CPUs.
- ioThreadsPolicy: Set to auto to let the dedicated IO thread run on the same CPU as the emulator thread.
- NUMA: defining guestMappingPassthrough enables NUMA support for this VM.
- realtime: instructs the virt-handler to configure this VM for real-time workloads, such as configuring the vCPUs to use the FIFO scheduler policy and setting the priority to 1.
When this configuration is applied, KubeVirt will only set the first vCPU for real-time scheduler policy, leaving the remaining vCPUs to use the default scheduler policy. Other examples of valid masks are:
- 0-3: Use cores 0 to 3 for real-time scheduling, assuming that the VM has requested at least 4 cores.
- 0-3,^1: Use cores 0, 2 and 3 for real-time scheduling only, assuming that the VM has requested at least 4 cores.
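The mask syntax used in these examples (ranges such as 0-3 and exclusions prefixed with ^) can be illustrated with a small parser. This is a sketch for understanding the notation only; the function name is hypothetical and this is not how KubeVirt or libvirt parse the mask internally.

```python
def parse_realtime_mask(mask):
    # Returns the sorted list of vCPU indexes selected by the mask.
    included, excluded = set(), set()
    for part in mask.split(','):
        part = part.strip()
        target = included
        if part.startswith('^'):
            # A leading caret removes the given index or range.
            target = excluded
            part = part[1:]
        if '-' in part:
            first, last = part.split('-')
            target.update(range(int(first), int(last) + 1))
        else:
            target.add(int(part))
    return sorted(included - excluded)


print(parse_realtime_mask('0'))       # [0]
print(parse_realtime_mask('0-3'))     # [0, 1, 2, 3]
print(parse_realtime_mask('0-3,^1'))  # [0, 2, 3]
```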
Kubernetes provides additional NUMA components that may be relevant to your use-case but typically are not enabled by default. Please consult the Kubernetes documentation for details on configuration of these components.
Topology Manager provides optimizations related to CPU isolation, memory and device locality. It is useful, for example, where an SR-IOV network adaptor VF allocation needs to be aligned with a NUMA node.
Memory Manager is analogous to CPU Manager. It is useful, for example, where you want to align hugepage allocations with a NUMA node. It works in conjunction with Topology Manager.
The Memory Manager employs a hint generation protocol to yield the most suitable NUMA affinity for a pod. The Memory Manager feeds the central manager (Topology Manager) with these affinity hints. Based on both the hints and the Topology Manager policy, the pod is rejected or admitted to the node.
"},{"location":"compute/persistent_tpm_and_uefi_state/","title":"Persistent TPM and UEFI state","text":"
FEATURE STATE: KubeVirt v1.0.0
For both TPM and UEFI, libvirt supports persisting data created by a virtual machine as files on the virtualization host. In KubeVirt, the virtualization host is the virt-launcher pod, which is ephemeral (created on VM start and destroyed on VM stop). As of v1.0.0, KubeVirt supports using a PVC to persist those files. KubeVirt usually refers to that storage area as \"backend storage\".
KubeVirt automatically creates backend storage PVCs for VMs that need it. However, the admin must first enable the VMPersistentState feature gate, and tell KubeVirt which storage class to use by setting the vmStateStorageClass configuration parameter in the KubeVirt Custom Resource (CR). The storage class must support read-write-many (RWX) in filesystem mode (FS). Here's an example of KubeVirt CR that sets both:
As mentioned above, the backend storage PVC can only be created using a storage class that supports RWX FS. There is ongoing work to support block storage in future versions of KubeVirt.
Backend storage is currently incompatible with VM snapshot. It is planned to add snapshot support in the future.
"},{"location":"compute/persistent_tpm_and_uefi_state/#tpm-with-persistent-state","title":"TPM with persistent state","text":"
Since KubeVirt v0.53.0, a TPM device can be added to a VM (with just tpm: {}). However, the data stored in it does not persist across reboots. Support for persistence was added in v1.0.0 using a simple persistent boolean parameter that defaults to false, to preserve previous behavior. Backend storage must be configured before adding a persistent TPM to a VM (see above). Here's a portion of a VM definition that includes a persistent TPM:
The Microsoft Windows 11 installer requires the presence of a TPM device, even though it doesn't use it. Persistence is not required in this case, however.
Some disk encryption software has optional or mandatory TPM support. For example, Bitlocker requires a persistent TPM device.
The TPM device exposed to the virtual machine is fully emulated (vTPM). The worker nodes do not need to have a TPM device.
When TPM persistence is enabled, the tpm-crb model is used (instead of tpm-tis for non-persistent vTPMs).
A virtual TPM does not provide the same security guarantees as a physical one.
"},{"location":"compute/persistent_tpm_and_uefi_state/#efi-with-persistent-vars","title":"EFI with persistent VARS","text":"
EFI support is handled by libvirt using OVMF. OVMF data usually consists of 2 files, CODE and VARS. VARS is where persistent data from the guest can be stored. When EFI persistence is enabled on a VM, the VARS file will be persisted inside the backend storage. Of course, backend storage must first be configured before enabling EFI persistence on a VM. See above. Here's a portion of a VM definition that includes a persistent EFI:
The boot entries/order can, and most likely will, get overridden by libvirt. This is to satisfy the VM specifications. Do not expect manual boot setting changes to persist.
"},{"location":"compute/resources_requests_and_limits/","title":"Resources requests and limits","text":"
In this document, we are talking about the resources values set on the virt-launcher compute container, referred to as \"the container\" below for simplicity.
Cluster admins can define a label selector in the KubeVirt CR. Once that label selector is defined, if the creation namespace matches the selector, all VM(I)s created in it will have a CPU limit set.
"},{"location":"compute/resources_requests_and_limits/#memory","title":"Memory","text":""},{"location":"compute/resources_requests_and_limits/#memory-requests-on-the-container","title":"Memory requests on the container","text":"
VM(I)s must specify a desired amount of memory, in either spec.domain.memory.guest or spec.domain.resources.requests.memory (ignoring hugepages, see the dedicated page). If both are set, the memory requests take precedence. A calculated amount of overhead will be added to it, forming the memory request value for the container.
"},{"location":"compute/resources_requests_and_limits/#memory-limits-on-the-container","title":"Memory limits on the container","text":"
By default, no memory limit is set on the container.
If auto memory limits is enabled (see next section), then the container will have a limit of 2x the requested memory.
Manually setting a memory limit on the VM(I) will set the same value on the container.
Memory limits have to be more than memory requests + overhead, otherwise the container will have memory requests > limits and be rejected by Kubernetes.
Memory usage bursts could lead to VM crashes when memory limits are set.
KubeVirt provides a feature gate (AutoResourceLimitsGate) to automatically set memory limits on VM(I)s. When this feature gate is enabled, memory limits will be added to the VMI if all the following conditions are true:
The namespace where the VMI will be created has a ResourceQuota containing memory limits.
The VMI has no manually set memory limits.
The VMI is not requesting dedicated CPU.
If all the previous conditions are true, the memory limits will be set to 2x the memory requests. This ratio can be adjusted, per namespace, by adding the annotation alpha.kubevirt.io/auto-memory-limits-ratio with the desired custom value. For example, with alpha.kubevirt.io/auto-memory-limits-ratio: 1.2, the memory limits will be set to 1.2x the memory requests.
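The conditions and the ratio can be summarized in a short sketch. The function and parameter names are hypothetical and the rounding is an assumption; only the annotation key, the 2x default and the three conditions come from the text above.

```python
from decimal import Decimal

RATIO_ANNOTATION = 'alpha.kubevirt.io/auto-memory-limits-ratio'
DEFAULT_RATIO = Decimal('2')


def should_set_auto_limit(quota_has_memory_limits, vmi_has_memory_limit, vmi_dedicated_cpu):
    # All three conditions from the list above must hold.
    return quota_has_memory_limits and not vmi_has_memory_limit and not vmi_dedicated_cpu


def auto_memory_limit(memory_request, namespace_annotations):
    # Annotation values are strings, so parse the ratio exactly with Decimal.
    raw = namespace_annotations.get(RATIO_ANNOTATION)
    ratio = Decimal(raw) if raw is not None else DEFAULT_RATIO
    return int(memory_request * ratio)


print(auto_memory_limit(1000, {}))                         # 2000
print(auto_memory_limit(1000, {RATIO_ANNOTATION: '1.2'}))  # 1200
```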
VirtualMachines have a Running setting that determines whether or not there should be a guest running. Because KubeVirt will always immediately restart a VirtualMachineInstance for VirtualMachines with spec.running: true, a simple boolean is not always enough to fully describe desired behavior. For instance, there are cases when a user would like the ability to shut down a guest from inside the virtual machine. With spec.running: true, KubeVirt would immediately restart the VirtualMachineInstance.
To allow for greater variation of user states, the RunStrategy field has been introduced. This is mutually exclusive with Running as they have somewhat overlapping conditions. There are currently five RunStrategies defined:
Always: The system is tasked with keeping the VM in a running state. This is achieved by respawning a VirtualMachineInstance whenever the current one terminated in a controlled (e.g. shutdown from inside the guest) or uncontrolled (e.g. crash) way. This behavior is equal to spec.running: true.
RerunOnFailure: Similar to Always, except that the VM is only restarted if it terminated in an uncontrolled way (e.g. crash) and due to an infrastructure reason (i.e. the node crashed, the KVM related process OOMed). This allows a user to determine when the VM should be shut down by initiating the shut down inside the guest. Note: Guest sided crashes (i.e. BSOD) are not covered by this. In such cases liveness checks or the use of a watchdog can help.
Once: The VM will run once and not be restarted upon completion regardless if the completion is of phase Failure or Success.
Manual: The system will not automatically turn the VM on or off. Instead, the user manually controls the VM status by issuing start, stop, and restart commands on the VirtualMachine subresource endpoints.
Halted: The system is asked to ensure that no VM is running. This is achieved by stopping any VirtualMachineInstance that is associated with the VM. If a guest is already running, it will be stopped. This behavior is equal to spec.running: false.
Note: RunStrategy and running are mutually exclusive, because they can be contradictory. The API server will reject VirtualMachine resources that define both.
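As an illustration, a VirtualMachine that is restarted only after infrastructure failures could be declared as sketched below. The name example-vm, the memory value and the abbreviated template are assumptions made for this example.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: example-vm          # hypothetical name
spec:
  # Set runStrategy instead of running; a resource defining both is rejected.
  runStrategy: RerunOnFailure
  template:
    spec:
      domain:
        devices: {}
        resources:
          requests:
            memory: 64Mi    # illustrative value
```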
The start, stop and restart methods of virtctl will invoke their respective subresources of VirtualMachines. This can have an effect on the runStrategy of the VirtualMachine as below:
| RunStrategy | start | stop | restart |
|---|---|---|---|
| Always | - | Halted | Always |
| RerunOnFailure | RerunOnFailure | RerunOnFailure | RerunOnFailure |
| Manual | Manual | Manual | Manual |
| Halted | Always | - | - |
Table entries marked with - don't make sense, so won't have an effect on RunStrategy.
Fine-tuning different aspects of the hardware which are not device related (BIOS, mainboard, etc.) is sometimes necessary to allow guest operating systems to properly boot and reboot.
QEMU is able to work with two different classes of chipsets for x86_64, so-called machine types. The x86_64 chipsets are i440fx (also called pc) and q35. They are versioned based on qemu-system-${ARCH}, following the format pc-${machine_type}-${qemu_version}, e.g. pc-i440fx-2.10 and pc-q35-2.10.
KubeVirt defaults to QEMU's newest q35 machine type. If a custom machine type is desired, it is configurable through the following structure:
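A minimal sketch of such a structure, assuming a VMI named myvmi and one of the versioned machine types mentioned above:

```yaml
metadata:
  name: myvmi
spec:
  domain:
    machine:
      # this sets the machine type
      type: pc-q35-2.10
```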
Enabling EFI automatically enables Secure Boot, unless the secureBoot field under efi is set to false. Secure Boot itself requires the SMM CPU feature to be enabled as above, which does not happen automatically, for security reasons.
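The relationship between EFI, Secure Boot and SMM can be sketched as the following fragment. It is an illustrative excerpt of a VMI spec rather than a complete manifest.

```yaml
spec:
  domain:
    features:
      # Secure Boot requires SMM, which is not enabled automatically
      smm:
        enabled: true
    firmware:
      bootloader:
        efi:
          # set to false to use EFI without Secure Boot
          secureBoot: true
```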
In order to provide a consistent view on the virtualized hardware for the guest OS, the SMBIOS UUID can be set to a constant value via spec.firmware.uuid:
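For example, a fragment along these lines pins the UUID; the UUID value itself is only an illustration:

```yaml
metadata:
  name: myvmi
spec:
  domain:
    firmware:
      # this sets the SMBIOS UUID to a constant value (illustrative value)
      uuid: 5d307ca9-b3ef-428c-8861-06e72d69f223
```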
"},{"location":"compute/virtual_hardware/#labeling-nodes-with-cpu-models-and-cpu-features","title":"Labeling nodes with cpu models and cpu features","text":"
KubeVirt can create node selectors based on VM cpu models and features. With these node selectors, VMs will be scheduled on the nodes that support the matching VM cpu model and features.
To properly label the nodes, users can either use the KubeVirt node-labeller, which creates all necessary labels, or create the node labels themselves.
The KubeVirt node-labeller creates 3 types of labels: cpu models, cpu features and kvm info. It uses libvirt to get all supported cpu models and cpu features on the host and then creates labels from the cpu models.
Node-labeller supports a list of obsolete cpu models and a minimal baseline cpu model for features. Both can be set via the KubeVirt CR:
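A sketch of how both settings might look in the KubeVirt CR, assuming the CR is named kubevirt and lives in the kubevirt namespace; the listed models are examples only:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    # models listed here are not turned into node labels
    obsoleteCPUModels:
      '486': true
      pentium: true
    # minimal baseline cpu model used for feature labels
    minCPUModel: Penryn
```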
Obsolete cpus will not be inserted in labels. If the KubeVirt CR doesn't contain the obsoleteCPUModels variable, the labeller uses default values (\"pentium, pentium2, pentium3, pentiumpro, coreduo, n270, core2duo, Conroe, athlon, phenom, kvm32, kvm64, qemu32 and qemu64\").
Users can change obsoleteCPUModels by adding or removing cpu models in the config map. KubeVirt then updates the nodes with new labels.
For homogeneous clusters, or clusters without live migration enabled, it's possible to disable the node-labeller and avoid adding labels to the nodes by adding the following annotation to the nodes:
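Expressed as node metadata, the annotation could look like the sketch below. The annotation key node-labeller.kubevirt.io/skip-node is given from memory of the upstream documentation and should be verified against your KubeVirt version; node01 is a placeholder node name.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node01              # placeholder node name
  annotations:
    # assumed key; tells the node-labeller to skip this node
    node-labeller.kubevirt.io/skip-node: 'true'
```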
Note: If the CPU model isn't defined, the VM will get the CPU model closest to the one used on the node where the VM is running.
Note: CPU model is case sensitive.
Setting the CPU model is possible via spec.domain.cpu.model. The following VM will have a CPU with the Conroe model:
apiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  domain:\n    cpu:\n      # this sets the CPU model\n      model: Conroe\n...\n
You can check the list of available models here.
When the CPUNodeDiscovery feature gate is enabled and the VM has a cpu model, KubeVirt creates a node selector with the format: cpu-model.node.kubevirt.io/<cpuModel>, e.g. cpu-model.node.kubevirt.io/Conroe. When the VM doesn't have a cpu model, no node selector is created.
"},{"location":"compute/virtual_hardware/#enabling-default-cluster-cpu-model","title":"Enabling default cluster cpu model","text":"
To enable the default cpu model, user may add the cpuModel field in the KubeVirt CR.
The default CPU model is applied when a VMI doesn't have any cpu model set. When the VMI has a cpu model set, the VMI's cpu model is preferred. When neither the default cpu model nor the VMI's cpu model is set, host-model will be used. The default cpu model can be changed while KubeVirt is running. When the CPUNodeDiscovery feature gate is enabled, KubeVirt creates a node selector with the default cpu model.
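A minimal sketch of the corresponding KubeVirt CR, assuming the CR is named kubevirt in the kubevirt namespace; EPYC is only an example model:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    # used for every VMI that does not set spec.domain.cpu.model
    cpuModel: EPYC
```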
"},{"location":"compute/virtual_hardware/#cpu-model-special-cases","title":"CPU model special cases","text":"
As special cases you can set spec.domain.cpu.model to:
host-passthrough to pass the node CPU through to the VM
metadata:\n  name: myvmi\nspec:\n  domain:\n    cpu:\n      # this passes the node CPU through to the VM\n      model: host-passthrough\n...\n
host-model to get a CPU on the VM close to the node one
metadata:\n  name: myvmi\nspec:\n  domain:\n    cpu:\n      # this sets the VM CPU close to the node one\n      model: host-model\n...\n
Setting CPU features is possible via spec.domain.cpu.features, which can contain zero or more CPU features:
metadata:\n  name: myvmi\nspec:\n  domain:\n    cpu:\n      # this sets the CPU features\n      features:\n      # this is the feature's name\n      - name: \"apic\"\n      # this is the feature's policy\n        policy: \"require\"\n...\n
Note: Policy attribute can either be omitted or contain one of the following policies: force, require, optional, disable, forbid.
Note: In case a policy is omitted for a feature, it will default to require.
Behaviour according to Policies:
All policies will be passed to libvirt during virtual machine creation.
In case the feature gate \"CPUNodeDiscovery\" is enabled and the policy is omitted or has \"require\" value, then the virtual machine could be scheduled only on nodes that support this feature.
In case the feature gate \"CPUNodeDiscovery\" is enabled and the policy has \"forbid\" value, then the virtual machine would not be scheduled on nodes that support this feature.
Full description about features and policies can be found here.
When the CPUNodeDiscovery feature gate is enabled, KubeVirt creates node selectors from cpu features with the format: cpu-feature.node.kubevirt.io/<cpuFeature>, e.g. cpu-feature.node.kubevirt.io/apic. When the VM doesn't have any cpu feature, no node selector is created.
hpet is disabled, pit and rtc are configured to use a specific tickPolicy. Finally, hyperv is made available too.
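A timer configuration matching this description could look like the following sketch; the specific tickPolicy values are illustrative choices:

```yaml
metadata:
  name: myvmi
spec:
  domain:
    clock:
      timer:
        hpet:
          present: false      # hpet is disabled
        pit:
          tickPolicy: delay   # illustrative policy
        rtc:
          tickPolicy: catchup # illustrative policy
        hyperv: {}            # hyperv timer is made available
```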
See the Timer API Reference for all possible configuration options.
Note: Timers can be part of a machine type. Thus it may be necessary to explicitly disable them. We may in the future decide to add them via cluster-level defaulting, if they are part of a QEMU machine definition.
"},{"location":"compute/virtual_hardware/#random-number-generator-rng","title":"Random number generator (RNG)","text":"
You may want to use entropy collected by your cluster nodes inside your guest. KubeVirt allows adding a virtio RNG device to a virtual machine.
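As a sketch, the RNG device is requested with an empty rng entry under the devices section:

```yaml
metadata:
  name: vmi-with-rng        # hypothetical name
spec:
  domain:
    devices:
      # adds a virtio RNG device backed by the node's entropy
      rng: {}
```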
For Linux guests, the virtio-rng kernel module should be loaded early in the boot process to acquire access to the entropy source. Other systems may require similar adjustments to work with the virtio RNG device.
Note: Some guest operating systems or user payloads may require the RNG device with enough entropy and may fail to boot without it. For example, fresh Fedora images with newer kernels (4.16.4+) may require the virtio RNG device to be present to boot to login.
"},{"location":"compute/virtual_hardware/#video-and-graphics-device","title":"Video and Graphics Device","text":"
By default a minimal Video and Graphics device configuration will be applied to the VirtualMachineInstance. The video device is vga compatible and comes with a memory size of 16 MB. This device allows connecting to the OS via vnc.
It is possible to not attach it by setting spec.domain.devices.autoattachGraphicsDevice to false.
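For example, a fragment like the following disables the default device:

```yaml
metadata:
  name: myvmi
spec:
  domain:
    devices:
      # do not attach the default video and graphics device
      autoattachGraphicsDevice: false
```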
KubeVirt supports a range of virtualization features which may be tweaked in order to allow non-Linux based operating systems to properly boot. Most noteworthy are
acpi
apic
hyperv
A common feature configuration is shown by the following example:
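One possible configuration, shown here as a sketch; the selected hyperv enlightenments and the spinlocks value are illustrative choices, not requirements:

```yaml
metadata:
  name: myvmi
spec:
  domain:
    features:
      acpi: {}
      apic: {}
      hyperv:
        relaxed: {}
        vapic: {}
        spinlocks:
          spinlocks: 8191   # illustrative retry count
```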
See the Features API Reference for all available features and configuration options.
"},{"location":"compute/virtual_hardware/#resources-requests-and-limits","title":"Resources Requests and Limits","text":"
An optional resource request can be specified by the users to allow the scheduler to make a better decision in finding the most suitable Node to place the VM.
Specifying CPU limits will determine the amount of cpu shares set on the control group the VM is running in, in other words, the amount of time the VM's CPUs can execute on the assigned resources when there is a competition for CPU resources.
For more information please refer to how Pods with resource limits are run.
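Requests and limits are placed under spec.domain.resources. The sketch below uses arbitrary values that should be adapted to the workload:

```yaml
metadata:
  name: myvmi
spec:
  domain:
    resources:
      requests:
        memory: 1Gi   # helps the scheduler pick a suitable node
        cpu: '1'
      limits:
        memory: 2Gi
        cpu: '2'      # determines the cpu shares of the control group
```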
Various VM resources, such as a video adapter, IOThreads, and supplementary system software, consume additional memory from the Node, beyond the requested memory intended for the guest OS consumption. In order to provide a better estimate for the scheduler, this memory overhead will be calculated and added to the requested memory.
Please see how Pods with resource requests are scheduled for additional information on resource requests and limits.
KubeVirt gives you the possibility to use hugepages as backing memory for your VM. You will need to provide the desired amount of memory in resources.requests.memory and the size of hugepages to use in memory.hugepages.pageSize. For example, for the x86_64 architecture it can be 2Mi.
hugepages size cannot be bigger than requested memory
requested memory must be divisible by hugepages size
Hugepages use memfd by default. Memfd is supported from kernel >= 4.14. If you run on an older host (e.g. CentOS 7.9), it is required to disable memfd with the annotation kubevirt.io/memfd: \"false\" in the VMI metadata annotations.
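Putting these rules together, a VMI backed by 2Mi hugepages might look like the sketch below. The requested 64Mi is an illustrative value that is divisible by the page size, and the memfd annotation is only needed on older hosts.

```yaml
metadata:
  name: myvmi
  annotations:
    # only needed on hosts with kernels older than 4.14
    kubevirt.io/memfd: 'false'
spec:
  domain:
    resources:
      requests:
        memory: 64Mi        # must be divisible by the hugepages size
    memory:
      hugepages:
        pageSize: 2Mi       # must not exceed the requested memory
```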
KubeVirt supports input devices. The only supported type is tablet. The tablet input device supports only the virtio and usb buses. The bus can be left empty, in which case usb will be selected.
Right now KubeVirt uses virtio-serial for local guest-host communication. Currently it is used in KubeVirt by libvirt and qemu to communicate with the qemu-guest-agent. Virtio-serial can also be used by other agents, but it is a little bit cumbersome due to:
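A tablet device on the virtio bus could be declared as sketched here; the device name tablet1 is a placeholder:

```yaml
metadata:
  name: myvmi
spec:
  domain:
    devices:
      inputs:
      - type: tablet
        bus: virtio     # omit to fall back to usb
        name: tablet1   # placeholder name
```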
A small set of ports on the virtio-serial device
Low bandwidth
No socket based communication possible, which requires every agent to establish their own protocols, or work with translation layers like SLIP to be able to use protocols like gRPC for reliable communication.
No easy and supportable way to get a virtio-serial socket assigned and being able to access it without entering the virt-launcher pod.
Due to the point above, privileges are required for services.
With virtio-vsock we get support for easy guest-host communication which solves the above issues from a user/admin perspective.
NOTE: The /dev/vhost-vsock device is NOT NEEDED to connect or bind to a VSOCK socket.
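Assuming the VSOCK feature gate is enabled, a VMI can request a vsock device with a fragment like this sketch; the VMI name is hypothetical:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-vsock         # hypothetical name
spec:
  domain:
    devices:
      # attach a virtio-vsock device; the CID is assigned by virt-controller
      autoattachVSOCK: true
```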
To make the VSOCK feature secure, the following measures are put in place:
The whole VSOCK feature will live behind a feature gate.
By default the first 1024 ports of a vsock device are privileged. Services trying to bind to those require CAP_NET_BIND_SERVICE capability.
The AF_VSOCK socket syscall gets blocked in containerd 1.7+ (containerd/containerd#7442). It is right now the responsibility of the vendor to ensure that the used CRI selects a default seccomp policy which blocks VSOCK socket calls in a similar way to what was done for containerd.
CIDs are assigned by virt-controller and are unique per Virtual Machine Instance to ensure that virt-handler has an easy way of tracking the identity without races. While this still allows virt-launcher to fake-use an assigned CID, it eliminates possible assignment races which attackers could make use of to redirect VSOCK calls.
The purpose of this document is to explain how to install virtio drivers for Microsoft Windows running in a fully virtualized guest.
"},{"location":"compute/windows_virtio_drivers/#do-i-need-virtio-drivers","title":"Do I need virtio drivers?","text":"
Yes. Without the virtio drivers, you cannot use paravirtualized hardware properly. It would either not work or suffer a severe performance penalty.
For more information about VirtIO and paravirtualization, see VirtIO and paravirtualization
For more details on configuring your VirtIO driver please refer to Installing VirtIO driver on a new Windows virtual machine and Installing VirtIO driver on an existing Windows virtual machine.
"},{"location":"compute/windows_virtio_drivers/#which-drivers-i-need-to-install","title":"Which drivers I need to install?","text":"
There are usually up to 8 possible devices that are required to run Windows smoothly in a virtualized environment. KubeVirt currently supports only:
viostor, the block driver, applies to SCSI Controller in the Other devices group.
viorng, the entropy source driver, applies to PCI Device in the Other devices group.
NetKVM, the network driver, applies to Ethernet Controller in the Other devices group. Available only if a virtio NIC is configured.
Other virtio drivers that exist and might be supported in the future:
Balloon, the balloon driver, applies to PCI Device in the Other devices group
vioserial, the paravirtual serial driver, applies to PCI Simple Communications Controller in the Other devices group.
vioscsi, the SCSI block driver, applies to SCSI Controller in the Other devices group.
qemupciserial, the emulated PCI serial driver, applies to PCI Serial Port in the Other devices group.
qxl, the paravirtual video driver, applies to Microsoft Basic Display Adapter in the Display adapters group.
pvpanic, the paravirtual panic driver, applies to Unknown device in the Other devices group.
Note
Some drivers are required in the installation phase. When you are installing Windows onto virtio block storage, you have to provide an appropriate virtio driver. Namely, choose the viostor driver for your version of Microsoft Windows, e.g. do not install the XP driver when you run Windows 10.
Other drivers can be installed after the successful Windows installation. Again, please install only drivers matching your Windows version.
"},{"location":"compute/windows_virtio_drivers/#how-to-install-during-windows-install","title":"How to install during Windows install?","text":"
To install drivers before Windows starts its installation, make sure you have the virtio-win package attached to your VirtualMachine as a SATA CD-ROM. In the Windows installation, choose advanced install and load driver. Then navigate to the loaded virtio CD-ROM and install either viostor or vioscsi, depending on which one you have set up.
"},{"location":"compute/windows_virtio_drivers/#how-to-install-after-windows-install","title":"How to install after Windows install?","text":"
After the Windows installation, go to Device Manager. There you should see undetected devices in the \"available devices\" section. You can install the virtio drivers one by one, going through this list.
For more details on how to choose a proper driver and how to install the driver, please refer to the Windows Guest Virtual Machines on Red Hat Enterprise Linux 7.
"},{"location":"compute/windows_virtio_drivers/#how-to-obtain-virtio-drivers","title":"How to obtain virtio drivers?","text":"
The virtio Windows drivers are distributed in the form of a containerDisk, which can simply be mounted to the VirtualMachine. The container image containing the disk is located at: https://quay.io/repository/kubevirt/virtio-container-disk?tab=tags and the image can be pulled like any other container image.
However, pulling the image manually is not required. Kubernetes will download it if it is not present when deploying the VirtualMachine.
"},{"location":"compute/windows_virtio_drivers/#attaching-to-virtualmachine","title":"Attaching to VirtualMachine","text":"
KubeVirt distributes virtio drivers for Microsoft Windows in the form of a container disk. The package contains the virtio drivers and the QEMU guest agent. The disk was tested on Microsoft Windows Server 2012. The supported Windows versions are XP and up.
The package is intended to be used as CD-ROM attached to the virtual machine with Microsoft Windows. It can be used as SATA CDROM during install phase or to provide drivers in an existing Windows installation.
Attaching the virtio-win package can be done simply by adding a ContainerDisk to your VirtualMachine.
spec:\n  domain:\n    devices:\n      disks:\n        - name: virtiocontainerdisk\n          # Any other disk you want to use must go before virtioContainerDisk.\n          # KubeVirt boots from disks in the order they are defined.\n          # Therefore virtioContainerDisk must be after the bootable disk.\n          # The other option is to choose the boot order explicitly:\n          #  - https://kubevirt.io/api-reference/v0.13.2/definitions.html#_v1_disk\n          # NOTE: You either specify bootOrder explicitly or sort the items in\n          #       disks. You can not do both at the same time.\n          # bootOrder: 2\n          cdrom:\n            bus: sata\n  volumes:\n    - containerDisk:\n        image: quay.io/kubevirt/virtio-container-disk\n      name: virtiocontainerdisk\n
Once you are done installing virtio drivers, you can remove virtio container disk by simply removing the disk from yaml specification and restarting the VirtualMachine.
KubeVirt produces a lot of logging throughout its codebase. Some log entries have a verbosity level assigned to them. The verbosity level assigned to a log entry determines the minimum verbosity level required to expose the log entry.
In code, a log entry looks similar to: log.Log.V(verbosity).Infof(\"...\"), where verbosity is the minimum verbosity level for this entry.
For example, if the log verbosity for some log entry is 3, then the log would be exposed only if the log verbosity is defined to be equal or greater than 3, or else it would be filtered out.
Currently, log verbosity can be defined per-component or per-node. The most up-to-date API is detailed here.
"},{"location":"debug_virt_stack/debug/#setting-verbosity-per-kubevirt-component","title":"Setting verbosity per KubeVirt component","text":"
One way of raising log verbosity is to manually determine it for the different components in KubeVirt CR:
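A sketch of per-component verbosity in the KubeVirt CR, assuming the CR is named kubevirt in the kubevirt namespace; the numeric levels are arbitrary examples:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      logVerbosity:
        virtLauncher: 2
        virtHandler: 3
        virtController: 4
        virtAPI: 5
        virtOperator: 6
```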
nodeVerbosity is essentially a map from string to int where the key is the node name and the value is the verbosity level. The verbosity level would be defined for all the different components in that node (e.g. virt-handler, virt-launcher, etc).
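For instance, a per-node setting could be sketched as follows; the node names are placeholders:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      logVerbosity:
        nodeVerbosity:
          node01: 4           # placeholder node name
          otherNodeName: 6    # placeholder node name
```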
"},{"location":"debug_virt_stack/debug/#how-to-retrieve-kubevirt-components-logs","title":"How to retrieve KubeVirt components' logs","text":"
In Kubernetes, logs are defined at the Pod level. Therefore, we first need to list the Pods of KubeVirt's core components. To do that, we can list the Pods under KubeVirt's install namespace.
Then, we can pick one of the pods and fetch its logs. For example:
$> kubectl logs -n <KubeVirt Install Namespace> virt-handler-2m86x | head -n8\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"set verbosity to 2\",\"pos\":\"virt-handler.go:453\",\"timestamp\":\"2022-04-17T08:58:37.373695Z\"}\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"set verbosity to 2\",\"pos\":\"virt-handler.go:453\",\"timestamp\":\"2022-04-17T08:58:37.373726Z\"}\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"setting rate limiter to 5 QPS and 10 Burst\",\"pos\":\"virt-handler.go:462\",\"timestamp\":\"2022-04-17T08:58:37.373782Z\"}\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"CPU features of a minimum baseline CPU model: map[apic:true clflush:true cmov:true cx16:true cx8:true de:true fpu:true fxsr:true lahf_lm:true lm:true mca:true mce:true mmx:true msr:true mtrr:true nx:true pae:true pat:true pge:true pni:true pse:true pse36:true sep:true sse:true sse2:true sse4.1:true ssse3:true syscall:true tsc:true]\",\"pos\":\"cpu_plugin.go:96\",\"timestamp\":\"2022-04-17T08:58:37.390221Z\"}\n{\"component\":\"virt-handler\",\"level\":\"warning\",\"msg\":\"host model mode is expected to contain only one model\",\"pos\":\"cpu_plugin.go:103\",\"timestamp\":\"2022-04-17T08:58:37.390263Z\"}\n{\"component\":\"virt-handler\",\"level\":\"info\",\"msg\":\"node-labeller is running\",\"pos\":\"node_labeller.go:94\",\"timestamp\":\"2022-04-17T08:58:37.391011Z\"}\n
Obviously, for both examples above, <KubeVirt Install Namespace> needs to be replaced with the actual namespace KubeVirt is installed in.
Using the cluster-profiler client tool, a developer can get the PProf profiling data for every component in the Kubevirt Control plane. Here is a user guide:
"},{"location":"debug_virt_stack/launch-qemu-gdb/","title":"Launch QEMU with gdb and connect locally with gdb client","text":"
This guide is for cases where QEMU encounters very early failures and it is hard to synchronize with it at a later point in time.
"},{"location":"debug_virt_stack/launch-qemu-gdb/#image-creation-and-pvc-population","title":"Image creation and PVC population","text":"
This scenario is a slight variation of the guide about starting strace, hence some of the details on the image build and the PVC population are simply skipped and explained in the other section.
In this example, QEMU will be launched with gdbserver and later we will connect to it using a local gdb client.
In this scenario, we use an additional container image containing gdb and the same qemu binary as the target process to debug. This image will be run locally with podman.
In order to build this image, we need to identify the image of the virt-launcher container we want to debug. Based on the KubeVirt installation, the namespace and the name of the KubeVirt CR could vary. In this example, we'll assume that KubeVirt CR is called kubevirt and installed in the kubevirt namespace.
You can easily find out the right names in your cluster by searching with:
$ kubectl get kubevirt -A\nNAMESPACE NAME AGE PHASE\nkubevirt kubevirt 3h11m Deployed\n
The steps to build the image are:
Get the registry of the images of the KubeVirt installation:
Podman will replace the registry and tag arguments provided on the command line. In this way, we can specify the image registry and shasum for the KubeVirt version to debug.
"},{"location":"debug_virt_stack/launch-qemu-gdb/#run-the-vm-to-troubleshoot","title":"Run the VM to troubleshoot","text":"
For this example, we add an annotation to keep the virt-launcher pod running even if any errors occur:
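The relevant part of such a VMI manifest might look like the sketch below. The annotation key is quoted from memory of the upstream debug example and should be checked against your KubeVirt version; the rest of the spec is omitted.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-debug-tools
  annotations:
    # assumed key; keeps virt-launcher running after a failure
    kubevirt.io/keep-launcher-alive-after-failure: 'true'
```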
$ kubectl apply -f debug-vmi.yaml\nvirtualmachineinstance.kubevirt.io/vmi-debug-tools created\n$ kubectl get vmi\nNAME AGE PHASE IP NODENAME READY\nvmi-debug-tools 28s Scheduled node01 False\n$ kubectl get po\nNAME READY STATUS RESTARTS AGE\npopulate-pvc-dnxld 0/1 Completed 0 4m17s\nvirt-launcher-vmi-debug-tools-tfh28 4/4 Running 0 25s\n
The wrapping script starts the gdbserver and exposes it on port 1234 inside the container. In order to be able to connect remotely to the gdbserver, we can use the command kubectl port-forward to expose the gdb port on our machine.
$ kubectl port-forward virt-launcher-vmi-debug-tools-tfh28 1234\nForwarding from 127.0.0.1:1234 -> 1234\nForwarding from [::1]:1234 -> 1234\n
Finally, we can start the gdb client in the container:
$ podman run -ti --network host gdb-client:latest\n$ gdb /usr/libexec/qemu-kvm -ex 'target remote localhost:1234'\nGNU gdb (GDB) Red Hat Enterprise Linux 10.2-12.el9\nCopyright (C) 2021 Free Software Foundation, Inc.\nLicense GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>\nThis is free software: you are free to change and redistribute it.\nThere is NO WARRANTY, to the extent permitted by law.\nType \"show copying\" and \"show warranty\" for details.\nThis GDB was configured as \"x86_64-redhat-linux-gnu\".\nType \"show configuration\" for configuration details.\nFor bug reporting instructions, please see:\n<https://www.gnu.org/software/gdb/bugs/>.\nFind the GDB manual and other documentation resources online at:\n <http://www.gnu.org/software/gdb/documentation/>.\n\nFor help, type \"help\".\n--Type <RET> for more, q to quit, c to continue without paging--\nType \"apropos word\" to search for commands related to \"word\"...\nReading symbols from /usr/libexec/qemu-kvm...\n\nReading symbols from /root/.cache/debuginfod_client/26221a84fabd219a68445ad0cc87283e881fda15/debuginfo...\nRemote debugging using localhost:1234\nReading /lib64/ld-linux-x86-64.so.2 from remote target...\nwarning: File transfers from remote targets can be slow. Use \"set sysroot\" to access files locally instead.\nReading /lib64/ld-linux-x86-64.so.2 from remote target...\nReading symbols from target:/lib64/ld-linux-x86-64.so.2...\nDownloading separate debug info for /system-supplied DSO at 0x7ffc10eff000...\n0x00007f1a70225e70 in _start () from target:/lib64/ld-linux-x86-64.so.2\n
For simplicity, we started podman with the option --network host. In this way, the container is able to access any port mapped on the host.
"},{"location":"debug_virt_stack/launch-qemu-strace/","title":"Launch QEMU with strace","text":"
This guide explains how to launch QEMU with a debugging tool in the virt-launcher pod. This method can be useful to debug early failures or to start QEMU as a child of a debug tool relying on ptrace. The second point is particularly relevant when a process is operating in a non-privileged environment since otherwise, it would need root access to be able to ptrace the process.
Ephemeral containers are among the emerging techniques to overcome the lack of debugging tools inside the original image. This solution does, however, come with a number of limitations. For example, it is possible to spawn a new container inside the same pod of the application to debug and share the same PID namespace. Though they share the same PID namespace, KubeVirt's usage of unprivileged containers makes it, for example, impossible to ptrace a running container. Therefore, this technique isn't appropriate for our needs.
For security and image size reasons, KubeVirt container images are based on distroless containers. These kinds of images are extremely beneficial for deployments, but they are challenging to troubleshoot because there is no package management, which prevents the installation of additional tools on the fly.
Wrapping the QEMU binary in a script is one practical method for debugging QEMU launched by Libvirt. This script launches the QEMU as a child of this process together with the debugging tool (such as strace or valgrind).
The final part that needs to be added is the configuration for Libvirt to use the wrapped script rather than calling the QEMU program directly.
It is possible to alter the generated XML with the help of KubeVirt sidecars. This allows us to use the wrapping script in place of the built-in emulator.
The primary concept behind this configuration is that all of the additional tools, scripts, and final output files will be stored in a PersistentVolumeClaim (PVC) that this guide refers to as debug-tools. The virt-launcher pod that we wish to debug will have this PVC attached to it.
In this guide, we'll apply the above concepts to debug QEMU inside virt-launcher using strace without the need to build a custom virt-launcher image.
You can see a full demo of this setup:
"},{"location":"debug_virt_stack/launch-qemu-strace/#how-to-bring-the-debug-tools-and-wrapping-script-into-distroless-containers","title":"How to bring the debug tools and wrapping script into distroless containers","text":"
This section provides an example of how to provide extra tools into the distroless container that will be supplied as a PVC using a Dockerfile. Although there are several ways to accomplish this, this covers a relatively simple technique. Alternatively, you could run a pod and manually populate the PVC by execing into the pod.
Dockerfile:
FROM quay.io/centos/centos:stream9 as build\n\nENV DIR /debug-tools\nRUN mkdir -p ${DIR}/logs\n\nRUN yum install --installroot=${DIR} -y strace && yum clean all\n\nCOPY ./wrap_qemu_strace.sh $DIR/wrap_qemu_strace.sh\nRUN chmod 0755 ${DIR}/wrap_qemu_strace.sh\nRUN chown 107:107 ${DIR}/wrap_qemu_strace.sh\nRUN chown 107:107 ${DIR}/logs\n
The directory debug-tools stores the content that will later be copied into the debug-tools PVC. We are essentially adding the missing utilities in the custom directory with yum install --installroot=${DIR}, and the parent image matches the parent image of virt-launcher.
The wrap_qemu_strace.sh is the wrapping script that will be used to launch QEMU with strace similarly as the example with valgrind.
It is important to set the dynamic library path LD_LIBRARY_PATH to the path where the PVC will be mounted in the virt-launcher container.
Then, you will simply need to build the image and your debug setup is ready. The Dockerfile and the script wrap_qemu_strace.sh need to be in the same directory where you run the command.
$ podman build -t debug .\n
The second step is to populate the PVC. This can be easily achieved using a kubernetes Job like:
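A sketch of such a Job, assuming the debug image was pushed to a registry reachable from the cluster (registry.example.com is a placeholder) and that a PVC named debug-tools already exists:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: populate-pvc
spec:
  backoffLimit: 4
  template:
    spec:
      restartPolicy: Never
      volumes:
      - name: populate
        persistentVolumeClaim:
          claimName: debug-tools
      containers:
      - name: populate
        image: registry.example.com/debug:latest   # placeholder registry
        # copy the prepared tools from the image into the PVC
        command: ['sh', '-c', 'cp -r /debug-tools/* /vol']
        volumeMounts:
        - name: populate
          mountPath: /vol
```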
The image referenced in the Job is the image we built in the previous step. Once this is applied and the job has completed, the debug-tools PVC is ready to be used.
"},{"location":"debug_virt_stack/launch-qemu-strace/#how-to-start-qemu-launched-by-a-debugging-tool-eg-strace","title":"How to start qemu launched by a debugging tool (e.g strace)","text":"
This part is achieved by using ConfigMaps and a KubeVirt sidecar (more details in the section Using ConfigMap to run custom script).
The script that replaces the QEMU binary with the wrapping script in the XML is stored in the configmap my-config-map. This script will run as a hook, as explained in full in the documentation for the KubeVirt sidecar.
Once all the objects are created, we can finally run the guest to debug.
The VMI example is a simple VM instance declaration. The interesting parts are the annotations for the hook:
image refers to the sidecar-shim already built and shipped with KubeVirt
pvc refers to the PVC populated with the debug setup. The name refers to the claim name, the volumePath is the path inside the sidecar container where the volume is mounted, while the sharedComputePath is the path of the same volume inside the compute container.
configMap refers to the configmap containing the script to modify the XML for the wrapping script
Once the VM is declared, the hook will modify the emulator section and Libvirt will call the wrapping script instead of QEMU directly.
"},{"location":"debug_virt_stack/launch-qemu-strace/#how-to-fetch-the-output","title":"How to fetch the output","text":"
The wrapping script configures strace to store the output in the PVC. In this way, it is possible to retrieve the output file at a later time, for example using an additional pod like:
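One way to sketch such a pod is shown below. The pod name, the busybox image and the mount path are assumptions; user and group 107 match the ownership set in the Dockerfile above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fetch-logs            # hypothetical name
spec:
  securityContext:
    runAsUser: 107
    fsGroup: 107
  volumes:
  - name: debug-tools
    persistentVolumeClaim:
      claimName: debug-tools
  containers:
  - name: fetch
    image: busybox:latest     # any small image with a shell works
    # keep the pod alive so the files can be copied out
    command: ['tail', '-f', '/dev/null']
    volumeMounts:
    - name: debug-tools
      mountPath: /vol
```

With the pod running, the files under /vol/logs can be copied out with kubectl cp.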
"},{"location":"debug_virt_stack/logging/","title":"Control libvirt logging for each component","text":"
Generally, cluster admins can control the log verbosity of each KubeVirt component in KubeVirt CR. For more details, please, check the KubeVirt documentation.
Nonetheless, regular users can also adjust the qemu component logging to have a finer control over it. The annotation kubevirt.io/libvirt-log-filters enables you to modify each component's log level.
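As a sketch, the annotation takes a space-separated list of libvirt log filters in the form level:match. The VMI name and the chosen filters are examples only.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: my-vmi                # hypothetical name
  annotations:
    # level 2 (info) for the QEMU monitor, level 3 (warning) for the rest
    kubevirt.io/libvirt-log-filters: '2:qemu.qemu_monitor 3:*'
```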
The annotation enables the filter from the container creation. However, in certain cases you might desire to change the logging level dynamically once the container and libvirt have already been started. In this case, virt-admin comes to the rescue.
Otherwise, if you prefer to redirect the output to a file and fetch it later, you can rely on kubectl cp to retrieve the file. In this case, we are saving the file in the /var/run/libvirt directory because the compute container has the permissions to write there.
Example:
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vmi-ephemeral-nqcld 3/3 Running 0 26m\n$ kubectl exec -ti virt-launcher-vmi-ephemeral-nqcld -- virt-admin -c virtqemud:///session daemon-log-outputs \"1:file:/var/run/libvirt/libvirtd.log\"\n$ kubectl cp virt-launcher-vmi-ephemeral-nqcld:/var/run/libvirt/libvirtd.log libvirt-kubevirt.log\ntar: Removing leading `/' from member names\n
"},{"location":"debug_virt_stack/privileged-node-debugging/","title":"Privileged debugging on the node","text":"
This article describes the scenarios in which you can create privileged pods and have root access to the cluster nodes.
With privileged pods, you may access devices in /dev, utilize host namespaces and ptrace processes that are running on the node, and use the hostPath volume to mount node directories in the container.
A quick way to verify if you are allowed to create privileged pods is to create a sample pod with the --dry-run=server option, like:
"},{"location":"debug_virt_stack/privileged-node-debugging/#build-the-container-image","title":"Build the container image","text":"
KubeVirt uses distroless containers, and those images don't have a package manager. For this reason, it isn't possible to use the image as a parent for installing additional packages.
In certain debugging scenarios, the tools require exactly the same binary to be available. However, if the debug tools are operating in a different container, this can be especially difficult as the filesystems of the containers are isolated.
This section will cover how to build a container image with the debug tools plus binaries of the KubeVirt version you want to debug.
Based on your installation the namespace and the name of the KubeVirt CR could vary. In this example, we'll assume that KubeVirt CR is called kubevirt and installed in the kubevirt namespace. You can easily find out how it is called in your cluster by searching with kubectl get kubevirt -A. This is necessary as we need to retrieve the original virt-launcher image to have exactly the same QEMU binary we want to debug.
Get the registry of the images of the KubeVirt installation:
The privileged option is required to have access to almost all the resources on the node.
The nodeName ensures that the debugging pod will be scheduled on the desired node. In order to select the right node, you can use the -owide option with kubectl get po, which reports the node where each pod is running.
Example:
$ kubectl get pods -owide\nNAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES\nlocal-volume-provisioner-4jtkb 1/1 Running 0 152m 10.244.196.129 node01 <none> <none>\nnode01-debug 1/1 Running 0 44m 192.168.66.101 node01 <none> <none>\nvirt-launcher-vmi-ephemeral-xg98p 3/3 Running 0 2m54s 10.244.196.148 node01 <none> 1/1\n
In the volumes section, you can specify the directories you want to be directly mounted in the debugging container. For example, /usr/lib/modules is particularly useful if you need to load some kernel modules.
Sharing the host PID namespace with the option hostPID allows you to see all the processes on the node and attach to them with tools like gdb and strace.
exec-ing into the pod gives you a shell with privileged access to the node plus the tooling you installed into the image:
$ kubectl exec -ti node01-debug -- bash\n
The following examples assume you have already execed into the node01-debug pod.
"},{"location":"debug_virt_stack/privileged-node-debugging/#validating-the-host-for-virtualization","title":"Validating the host for virtualization","text":"
The tool virt-host-validate is a utility that validates whether the host is able to run libvirt hypervisors. It can be used, for example, to check if a particular node is KVM capable.
Example:
$ virt-host-validate\n QEMU: Checking for hardware virtualization : PASS\n QEMU: Checking if device /dev/kvm exists : PASS\n QEMU: Checking if device /dev/kvm is accessible : PASS\n QEMU: Checking if device /dev/vhost-net exists : PASS\n QEMU: Checking if device /dev/net/tun exists : PASS\n QEMU: Checking for cgroup 'cpu' controller support : PASS\n QEMU: Checking for cgroup 'cpuacct' controller support : PASS\n QEMU: Checking for cgroup 'cpuset' controller support : PASS\n QEMU: Checking for cgroup 'memory' controller support : PASS\n QEMU: Checking for cgroup 'devices' controller support : PASS\n QEMU: Checking for cgroup 'blkio' controller support : PASS\n QEMU: Checking for device assignment IOMMU support : PASS\n QEMU: Checking if IOMMU is enabled by kernel : PASS\n QEMU: Checking for secure guest support : WARN (Unknown if this platform has Secure\n
"},{"location":"debug_virt_stack/privileged-node-debugging/#run-a-command-directly-on-the-node","title":"Run a command directly on the node","text":"
The volumes section of the debug container mounts the host filesystem under /host. This can be particularly useful if you want to access the node filesystem or execute a command directly on the host. However, the tool already needs to be present on the node.
# chroot /host\nsh-5.1# cat /etc/os-release\nNAME=\"CentOS Stream\"\nVERSION=\"9\"\nID=\"centos\"\nID_LIKE=\"rhel fedora\"\nVERSION_ID=\"9\"\nPLATFORM_ID=\"platform:el9\"\nPRETTY_NAME=\"CentOS Stream 9\"\nANSI_COLOR=\"0;31\"\nLOGO=\"fedora-logo-icon\"\nCPE_NAME=\"cpe:/o:centos:centos:9\"\nHOME_URL=\"https://centos.org/\"\nBUG_REPORT_URL=\"https://bugzilla.redhat.com/\"\nREDHAT_SUPPORT_PRODUCT=\"Red Hat Enterprise Linux 9\"\nREDHAT_SUPPORT_PRODUCT_VERSION=\"CentOS Stream\"\n
"},{"location":"debug_virt_stack/privileged-node-debugging/#attach-to-a-running-process-eg-strace-or-gdb","title":"Attach to a running process (e.g strace or gdb)","text":"
This requires the field hostPID: true. With it set, you are able to list all the processes running on the node.
"},{"location":"debug_virt_stack/privileged-node-debugging/#debugging-using-crictl","title":"Debugging using crictl","text":"
crictl is a CLI for CRI runtimes and can be particularly useful to troubleshoot container failures (for a more detailed guide, please refer to this Kubernetes article).
In this example, we'll focus on finding where libvirt creates its files and directories in the compute container of the virt-launcher pod.
"},{"location":"debug_virt_stack/virsh-commands/","title":"Execute virsh commands in virt-launcher pod","text":"
virsh is a powerful utility to check and troubleshoot the VM state, and it is already installed in the compute container of the virt-launcher pod.
For example, it is possible to run any QMP command.
For a full list of QMP commands, please refer to the QEMU documentation.
Then, you can, for example, pause and then unpause the guest and check the triggered events:
$ virtctl pause vmi vmi-ephemeral\nVMI vmi-ephemeral was scheduled to pause\n$ virtctl unpause vmi vmi-ephemeral\nVMI vmi-ephemeral was scheduled to unpause\n
From the monitored events:
$ kubectl exec -ti virt-launcher-vmi-ephemeral-nqcld -- virsh qemu-monitor-event --pretty --loop\nevent STOP at 1698405797.422823 for domain 'default_vmi-ephemeral': <null>\nevent RESUME at 1698405823.162458 for domain 'default_vmi-ephemeral': <null>\n
In order to create unique DNS records per VirtualMachineInstance, it is possible to set spec.hostname and spec.subdomain. If a subdomain is set and a headless service with a name matching the subdomain exists, kube-dns will create unique DNS entries for every VirtualMachineInstance which matches the selector of the service. Have a look at the DNS for Services and Pods documentation for additional information.
The following example consists of a VirtualMachine and a headless Service which matches the labels and the subdomain of the VirtualMachineInstance:
As a consequence, when we enter the VirtualMachineInstance via e.g. virtctl console vmi-fedora and ping myvmi.mysubdomain, we find a DNS entry for myvmi.mysubdomain.default.svc.cluster.local which points to 10.244.0.57, which is the IP of the VirtualMachineInstance (not of the Service):
[fedora@myvmi ~]$ ping myvmi.mysubdomain\nPING myvmi.mysubdomain.default.svc.cluster.local (10.244.0.57) 56(84) bytes of data.\n64 bytes from myvmi.mysubdomain.default.svc.cluster.local (10.244.0.57): icmp_seq=1 ttl=64 time=0.029 ms\n[fedora@myvmi ~]$ ip a\n2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000\n link/ether 0a:58:0a:f4:00:39 brd ff:ff:ff:ff:ff:ff\n inet 10.244.0.57/24 brd 10.244.0.255 scope global dynamic eth0\n valid_lft 86313556sec preferred_lft 86313556sec\n inet6 fe80::858:aff:fef4:39/64 scope link\n valid_lft forever preferred_lft forever\n
So spec.hostname and spec.subdomain get translated to a DNS A-record of the form <vmi.spec.hostname>.<vmi.spec.subdomain>.<vmi.metadata.namespace>.svc.cluster.local. If no spec.hostname is set, then we fall back to the VirtualMachineInstance name itself. The resulting DNS A-record then looks like this: <vmi.metadata.name>.<vmi.spec.subdomain>.<vmi.metadata.namespace>.svc.cluster.local.
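The naming rule above can be sketched in a few lines of Python. This is only an illustration and not KubeVirt code: the function name vmi_dns_name and its parameters are hypothetical, and the cluster domain is assumed to be cluster.local as in the example output.

```python
def vmi_dns_name(name, namespace, subdomain, hostname=None, cluster_domain='cluster.local'):
    # Hypothetical helper: builds the A-record name described in the text.
    # When spec.hostname is not set, the VMI metadata.name is used instead.
    host = hostname if hostname else name
    return '.'.join([host, subdomain, namespace, 'svc', cluster_domain])


# VMI from the example: name vmi-fedora, hostname myvmi, subdomain mysubdomain
print(vmi_dns_name('vmi-fedora', 'default', 'mysubdomain', hostname='myvmi'))
# Same VMI without spec.hostname: the VMI name becomes the first label
print(vmi_dns_name('vmi-fedora', 'default', 'mysubdomain'))
```

The record only resolves if a headless service matching the subdomain exists, which this sketch does not check.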
Adding an interface to a KubeVirt Virtual Machine first requires an interface to be added to a running pod. This is not trivial, and has some requirements:
Multus Dynamic Networks Controller: this daemon will listen to annotation changes, and trigger Multus to configure a new attachment for this pod.
Multus CNI running as a thick plugin: this Multus version exposes an endpoint to create attachments for a given pod on demand.
Note: For older KubeVirt versions (from v1.1 until v1.3), the HotplugNICs feature gate must be enabled. From KubeVirt v1.4, the feature gate is not needed and should be removed if set.
"},{"location":"network/hotplug_interfaces/#adding-an-interface-to-a-running-vm","title":"Adding an interface to a running VM","text":"
First start a VM. You can refer to the following example:
You should configure a network attachment definition - where the pod interface configuration is held. The snippet below shows an example of a very simple one:
Please refer to the Multus documentation for more information.
Once the virtual machine is running, and the attachment configuration provisioned, the user can request the interface hotplug operation by editing the VM spec template and adding the desired interface and network:
Note: The virtctl addinterface and removeinterface commands are no longer available; interfaces are hotplugged and unplugged by editing the VM spec template.
KubeVirt will add the interface and network to the corresponding VMI object as well.
You can now check the VMI status for the presence of this new interface:
kubectl get vmi vm-fedora -ojsonpath=\"{ @.status.interfaces }\"\n
"},{"location":"network/hotplug_interfaces/#removing-an-interface-from-a-running-vm","title":"Removing an interface from a running VM","text":"
Following the example above, the user can request an interface unplug operation by editing the VM spec template and setting the desired interface state to absent:
KubeVirt will set the interface in the corresponding VMI object to state 'absent' as well.
Note: Existing VMs from version v0.59.0 and below do not support interface hot-unplug.
"},{"location":"network/hotplug_interfaces/#migration-based-hotplug","title":"Migration based hotplug","text":"
If your cluster doesn't run Multus as a thick plugin together with the Multus Dynamic Networks Controller, it's possible to hotplug an interface by migrating the VM.
The actual attachment won't take place immediately, and the new interface will be available in the guest once the migration is completed.
"},{"location":"network/hotplug_interfaces/#add-new-interface","title":"Add new interface","text":"
Add the desired interface and network to the VM spec template:
Please refer to the Live Migration documentation for more information.
Once the migration is completed the VM will have the new interface attached.
Note: It is recommended to avoid performing migrations in parallel to a hotplug operation. It is safer to ensure the hotplug succeeded, or at least reached the VMI specification, before issuing a migration.
Please refer to the Live Migration documentation for more information.
Once the VM is migrated, the interface will not exist in the migration target pod.
Note: It is recommended to avoid performing migrations in parallel to an unplug operation. It is safer to ensure the unplug succeeded, or at least reached the VMI specification, before issuing a migration.
Please refer to the Live Migration documentation for more information.
Once the VM is migrated, the interface will not exist in the migration target pod. Because the Kubernetes device plugin API cannot allocate resources dynamically, the SR-IOV device plugin cannot allocate additional SR-IOV resources for KubeVirt to hotplug. Thus, SR-IOV interface hotplug is limited to migration-based hotplug only, regardless of the Multus \"thick\" version.
The hotplugged interfaces have model: virtio. This imposes several limitations: each interface consumes a PCI slot in the VM, and at most 32 slots are available in total. Furthermore, other devices also use these PCI slots (e.g. disks, guest-agent, etc.).
KubeVirt reserves resources for 4 interfaces to allow later hotplug operations. The actual maximum amount of available resources depends on the machine type (e.g. q35 adds another PCI slot). For more information on maximum limits, see the libvirt documentation.
Yet, upon a VM restart, the hotplugged interface will become part of the standard networks; this mitigates the maximum hotplug interfaces (per machine type) limitation.
Note: The user can execute this command against a stopped VM - i.e. a VM without an associated VMI. When this happens, KubeVirt mutates the VM spec template on behalf of the user.
"},{"location":"network/interfaces_and_networks/","title":"Interfaces and Networks","text":"
Connecting a virtual machine to a network consists of two parts. First, networks are specified in spec.networks. Then, interfaces backed by the networks are added to the VM by specifying them in spec.domain.devices.interfaces.
Each interface must have a corresponding network with the same name.
An interface defines a virtual network interface of a virtual machine. A network specifies the backend of an interface and declares which logical or physical device it is connected to.
There are multiple ways of configuring an interface as well as a network.
All possible configuration options are available in the Interface API Reference and Network API Reference.
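The name-matching rule between interfaces and networks can be illustrated with a small Python check. This is a sketch under stated assumptions, not part of KubeVirt: the function check_interfaces_and_networks is hypothetical, and it operates on a plain dictionary shaped like the spec.template.spec part of a VirtualMachine.

```python
def check_interfaces_and_networks(vmi_spec):
    # Hypothetical helper: pairs interfaces with networks by name and
    # returns a list of human-readable problems (empty when all match).
    devices = vmi_spec.get('domain', {}).get('devices', {})
    iface_names = [i['name'] for i in devices.get('interfaces', [])]
    net_names = [n['name'] for n in vmi_spec.get('networks', [])]
    problems = []
    for name in dict.fromkeys(net_names):
        if net_names.count(name) > 1:
            problems.append('duplicate network name: ' + name)
    for name in dict.fromkeys(iface_names):
        if name not in net_names:
            problems.append('interface without network: ' + name)
    for name in dict.fromkeys(net_names):
        if name not in iface_names:
            problems.append('network without interface: ' + name)
    return problems


spec = {
    'domain': {'devices': {'interfaces': [
        {'name': 'default', 'masquerade': {}},
        {'name': 'bridge-net', 'bridge': {}},
    ]}},
    'networks': [
        {'name': 'default', 'pod': {}},
        {'name': 'bridge-net', 'multus': {'networkName': 'linux-bridge-net-ipam'}},
    ],
}
print(check_interfaces_and_networks(spec))
```

For the spec above every interface has a network of the same name, so the list of problems is empty.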
Networks are configured in the VM's spec.template.spec.networks. A network must have a unique name.
Each network should declare its type by defining one of the following fields:
Type | Description
pod | Default Kubernetes network
multus | Secondary network provided using Multus or Primary network when Multus is defined as default"},{"location":"network/interfaces_and_networks/#pod","title":"pod","text":"
Represents the default (aka primary) pod interface (typically eth0) that is configured by the cluster network solution and present in each pod. The main advantage of this network type is that it is native to Kubernetes, allowing VMs to benefit from all network services provided by Kubernetes.
# partial example - kept short for brevity \napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n template:\n spec:\n domain:\n devices:\n interfaces:\n - name: default\n masquerade: {}\n networks:\n - name: default\n pod: {} # Stock pod network\n
Secondary networks in Kubernetes allow pods to connect to additional networks beyond the default network, enabling more complex network topologies. These secondary networks are supported by meta-plugins like Multus, which let each pod attach to multiple network interfaces. KubeVirt supports connecting VMs to secondary networks using Multus. This assumes that Multus is installed across your cluster and a corresponding NetworkAttachmentDefinition CRD was created.
The following example defines a secondary network which uses the bridge CNI plugin, which will connect the VM to Linux bridge br10. Other CNI plugins such as ptp, bridge-cni or sriov-cni might be used as well. For their installation and usage refer to the respective project documentation.
First the NetworkAttachmentDefinition needs to be created. That is usually done by an administrator. Users can then reference the definition.
With the following definition, the VM will be connected to the default pod network and to the secondary bridge network, referencing the NetworkAttachmentDefinition shown above (in the same namespace).
# partial example - kept short for brevity \napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n template:\n spec:\n domain:\n devices:\n interfaces:\n - name: default\n masquerade: {}\n - name: bridge-net\n bridge: {}\n networks:\n - name: default\n pod: {} # Stock pod network\n - name: bridge-net\n multus: # Secondary multus network\n networkName: linux-bridge-net-ipam #ref to NAD name\n
"},{"location":"network/interfaces_and_networks/#multus-as-primary-network-provider","title":"Multus as primary network provider","text":"
It is also possible to define a Multus network as the default pod network by setting the VM's spec.template.spec.networks.multus.default to true. See the Multus documentation for further information.
Note: A Multus default network and a pod network type are mutually exclusive.
The multus delegate chosen as default must return at least one IP address.
Network interfaces are configured in spec.domain.devices.interfaces. They describe properties of virtual interfaces as \"seen\" inside guest instances. The same network may be connected to a virtual machine in multiple different ways, each with their own connectivity guarantees and characteristics.
Note: Networks and interfaces must have a one-to-one relationship.
The mandatory interface configuration includes: - A name, which references a network name - The name of a supported core network binding from the table below, or a reference to a network binding plugin.
Type | Description
bridge | Connect using a linux bridge
sriov | Connect using a passthrough SR-IOV VF via vfio
masquerade | Connect using nftables rules to NAT the traffic both egress and ingress
Each interface may also have additional configuration fields that modify properties \"seen\" inside guest instances, as listed below:
Name | Format | Default value | Description
model | One of: e1000, e1000e, ne2k_pci, pcnet, rtl8139, virtio | virtio | NIC type. Note: Use e1000 model if your guest image doesn't ship with virtio drivers
macAddress | ff:ff:ff:ff:ff:ff or FF-FF-FF-FF-FF-FF | | MAC address as seen inside the guest system, for example: de:ad:00:00:be:af
ports | | empty (i.e. all ports) | Allow-list of ports to be forwarded to the virtual machine
pciAddress | 0000:81:00.1 | | Set network interface PCI address, for example: 0000:81:00.1
# partial example - kept short for brevity \napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n template:\n spec:\n domain:\n devices:\n interfaces:\n - name: default\n model: e1000 # expose e1000 NIC to the guest\n masquerade: {} # connect through a masquerade\n ports:\n - name: http\n port: 80 # allow only http traffic ingress\n networks:\n - name: default\n pod: {}\n
Note: For secondary interfaces, when a MAC address is specified for a virtual machine interface, it is passed to the underlying CNI plugin which is, in turn, expected to configure the network provider to allow for this particular MAC. Not every plugin has native support for custom MAC addresses.
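Before passing a custom MAC address on to a CNI plugin, it can help to check that it matches one of the two formats listed for macAddress above. The following Python sketch is only a format check with a hypothetical function name; it does not reflect KubeVirt's own validation and says nothing about whether a given plugin will accept the address.

```python
HEX_DIGITS = set('0123456789abcdefABCDEF')


def looks_like_mac(value):
    # Accepts six two-digit hex groups separated consistently by ':' or '-'.
    for separator in (':', '-'):
        groups = value.split(separator)
        if len(groups) == 6 and all(len(g) == 2 and set(g) <= HEX_DIGITS for g in groups):
            return True
    return False


print(looks_like_mac('de:ad:00:00:be:af'))
print(looks_like_mac('DE-AD-00-00-BE-AF'))
print(looks_like_mac('de:ad-00:00:be:af'))
```

The first two calls use the documented formats; the third mixes separators and is rejected.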
Note: For some CNI plugins without native support for custom MAC addresses, there is a workaround, which is to use the tuning CNI plugin to adjust pod interface MAC address. This can be used as follows:
Name | Format | Required | Description
name | | no | Name
port | 1 - 65535 | yes | Port to expose
protocol | TCP,UDP | no | Connection protocol
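The constraints in the ports table can be expressed as a short Python check. This is an illustrative sketch only: validate_port_entry is a hypothetical helper working on plain dictionaries, not the validation KubeVirt performs.

```python
ALLOWED_PROTOCOLS = ('TCP', 'UDP')


def validate_port_entry(entry):
    # Returns a list of problems for one entry of an interface ports list.
    errors = []
    port = entry.get('port')
    if port is None:
        errors.append('port is required')
    elif isinstance(port, bool) or not isinstance(port, int) or not 1 <= port <= 65535:
        errors.append('port must be an integer between 1 and 65535')
    protocol = entry.get('protocol')
    if protocol is not None and protocol not in ALLOWED_PROTOCOLS:
        errors.append('protocol must be TCP or UDP')
    return errors


print(validate_port_entry({'name': 'http', 'port': 80}))
print(validate_port_entry({'port': 70000, 'protocol': 'SCTP'}))
```

The name and protocol fields are optional, so an entry with only a valid port passes.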
If spec.domain.devices.interfaces is omitted, the virtual machine is connected using the default pod network interface of bridge type. If you'd like to have a virtual machine instance without any network connectivity, you can use the autoattachPodInterface field as follows:
# partial example - kept short for brevity \napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n template:\n spec:\n domain:\n devices:\n autoattachPodInterface: false\n
In bridge mode, virtual machines are connected to the network backend through a linux \"bridge\". The pod network IPv4 address (if one exists) is delegated to the virtual machine via DHCPv4. The virtual machine should be configured to use DHCP to acquire IPv4 addresses.
Note: If a specific MAC address is not configured in the virtual machine interface spec, the MAC address from the relevant pod interface is delegated to the virtual machine.
# partial example - kept short for brevity \napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n template:\n spec:\n domain:\n devices:\n interfaces:\n - name: red\n bridge: {} # connect through a bridge\n networks:\n - name: red\n multus:\n networkName: red\n
At this time, bridge mode doesn't support additional configuration fields.
Note: due to IPv4 address delegation, in bridge mode the pod doesn't have an IP address configured, which may introduce issues with third-party solutions that may rely on it. For example, Istio may not work in this mode.
Note: An admin can forbid using the bridge interface type for pod networks via a designated configuration flag. To do so, the admin should set the following option to false:
Note: Binding the pod network using the bridge interface type may cause issues. Other than the third-party issue mentioned in the above note, live migration is not allowed with a pod network binding of bridge interface type, and some CNI plugins might not allow the use of a custom MAC address for your VM instances. If you think you may be affected by any of the issues mentioned above, consider changing the default interface type to masquerade, and disabling the bridge type for the pod network, as shown in the example above.
In masquerade mode, KubeVirt allocates internal IP addresses to virtual machines and hides them behind NAT. All the traffic exiting virtual machines is \"source NAT'ed\" using pod IP addresses; thus, cluster workloads should use the pod's IP address to contact the VM over this interface. This IP address is reported in the VMI's status.interfaces. A guest operating system should be configured to use DHCP to acquire IPv4 addresses.
To allow the VM to live-migrate or hard restart (both cause the VM to run on a different pod, with a different IP address) and still be reachable, it should be exposed by a Kubernetes service.
To allow traffic on specific ports into virtual machines, the template ports section of the interface should be configured as follows. If the ports section is missing, all ports are forwarded into the VM.
# partial example - kept short for brevity \napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n template:\n spec:\n domain:\n devices:\n interfaces:\n - name: red\n masquerade: {} # connect using masquerade mode\n ports:\n - port: 80 # allow incoming traffic on port 80 to get into the virtual machine\n networks:\n - name: red\n pod: {}\n
Note: Masquerade is only allowed to connect to the pod network.
Note: The network CIDR can be configured in the pod network section using the vmNetworkCIDR attribute.
"},{"location":"network/interfaces_and_networks/#masquerade-ipv4-and-ipv6-dual-stack-support","title":"masquerade - IPv4 and IPv6 dual-stack support","text":"
masquerade mode can be used in IPv4 and IPv6 dual-stack clusters to provide a VM with IP connectivity over both protocols.
As with the IPv4 masquerade mode, the VM can be contacted using the pod's IP address - which will be in this case two IP addresses, one IPv4 and one IPv6. Outgoing traffic is also \"NAT'ed\" to the pod's respective IP address from the given family.
Unlike in IPv4, the configuration of the IPv6 address and the default route is not automatic; it should be configured via cloud init, as shown below:
# partial example - kept short for brevity \napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n template:\n spec:\n domain:\n devices:\n interfaces:\n - name: red\n masquerade: {} # connect using masquerade mode\n ports:\n - port: 80 # allow incoming traffic on port 80 to get into the virtual machine\n networks:\n - name: red\n pod: {}\n
Note: The IPv6 address for the VM and default gateway must be the ones shown above.
masquerade mode can be used in IPv6 single-stack clusters to provide a VM with IPv6-only connectivity.
As with the IPv4 masquerade mode, the VM can be contacted using the pod's IP address - which will be in this case the IPv6 one. Outgoing traffic is also \"NAT'ed\" to the pod's respective IPv6 address.
As with the dual-stack cluster, the configuration of the IPv6 address and the default route is not automatic; it should be configured via cloud init, as shown in the dual-stack section.
Unlike the dual-stack cluster, which has a DHCP server for IPv4, the IPv6 single stack cluster has no DHCP server at all. Therefore, the VM won't have the search domains information and reaching a destination using its FQDN is not possible. Tracking issue - https://github.com/kubevirt/kubevirt/issues/7184
With the sriov core network binding, SR-IOV Virtual Functions' PCI devices are directly exposed to virtual machines. The SR-IOV device plugin and CNI can be used to manage SR-IOV devices in Kubernetes, making them available for KubeVirt to consume. The device is passed through into the guest operating system as a host device, using the vfio userspace interface, to maintain high networking performance.
"},{"location":"network/interfaces_and_networks/#how-to-expose-sr-iov-vfs-to-kubevirt","title":"How to expose SR-IOV VFs to KubeVirt","text":"
To simplify the procedure, use the SR-IOV network operator to deploy and configure SR-IOV components in your cluster. For details on how to use the operator, please refer to its documentation.
Note: KubeVirt relies on the VFIO userspace driver to pass PCI devices into the VM guest. Because of that, when configuring SR-IOV operator policies, make sure you define a pool of VF resources that uses deviceType: vfio-pci.
"},{"location":"network/interfaces_and_networks/#start-an-sr-iov-vm","title":"Start an SR-IOV VM","text":"
Assuming that sriov-device-plugin and sriov-cni are deployed on the cluster nodes, create a network-attachment-definition CR as shown here. The name of the CR should correspond with the reference in the VM networks spec (see example below).
Finally, to create a VM that will attach to the aforementioned Network, refer to the following VM spec:
Note: for some NICs (e.g. Mellanox), the kernel module needs to be installed in the guest VM.
Note: Placement on dedicated CPUs can only be achieved if the Kubernetes CPU manager is running on the SR-IOV capable workers. For further details please refer to the dedicated cpu resources documentation.
MAC spoofing refers to the ability to generate traffic with an arbitrary source MAC address. An attacker may use this option to generate attacks on the network.
In order to protect against such scenarios, it is possible to enable the mac-spoof-check support in CNI plugins that support it.
The pod primary network which is served by the cluster network provider is not covered by this documentation. Please refer to the relevant provider to check how to enable spoofing check. The following text refers to the secondary networks, served using multus.
There are two known CNI plugins that support mac-spoof-check:
sriov-cni: Through the spoofchk parameter.
bridge-cni: Through the macspoofchk parameter.
The configuration is to be done on the NetworkAttachmentDefinition by the operator, and any interface that refers to it will have this feature enabled.
Below is an example of using the bridge CNI with macspoofchk enabled:
"},{"location":"network/interfaces_and_networks/#limitations-and-known-issues","title":"Limitations and known issues","text":""},{"location":"network/interfaces_and_networks/#invalid-cnis-for-secondary-networks","title":"Invalid CNIs for secondary networks","text":"
The following list of CNIs is known not to work for bridge interfaces - which are most common for secondary interfaces.
macvlan
ipvlan
The reason is similar: the bridge interface type moves the pod interface MAC address to the VM, leaving the pod interface with a different address. The aforementioned CNIs require the pod interface to have the original MAC address.
These issues are tracked individually:
macvlan
ipvlan
Feel free to discuss and / or propose fixes for them; we'd like to have these plugins as valid options in our ecosystem.
The bridge CNI supports mac-spoof-check through nftables, therefore the node must support nftables and have the nft binary deployed.
There are two methods for the MTU to be propagated to the guest interface.
Libvirt - for this, the guest machine needs a sufficiently recent virtio network driver that understands the data passed into the guest via a PCI config register in the emulated device.
DHCP - for this the guest DHCP client should be able to read the MTU from the DHCP server response.
On non-virtio interfaces of Windows guests, the MTU has to be set manually using netsh or another tool, since the Windows DHCP client doesn't request/read the MTU.
The table below summarizes the MTU propagation to the guest.
Model | masquerade | bridge with CNI IP | bridge with no CNI IP | Windows
virtio | DHCP & libvirt | DHCP & libvirt | libvirt | libvirt
non-virtio | DHCP | DHCP | X | X
bridge with CNI IP - means the CNI gives IP to the pod interface and bridge binding is used to bind the pod interface to the guest.
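The MTU table above can be read as a simple lookup. The Python sketch below encodes it as a dictionary, assuming that X means the MTU is not propagated and must be set manually; the names MTU_PROPAGATION, mtu_methods and needs_manual_mtu are hypothetical and exist only for this illustration.

```python
# Rows are the interface model kind, columns the scenario from the table.
MTU_PROPAGATION = {
    'virtio': {
        'masquerade': ['DHCP', 'libvirt'],
        'bridge with CNI IP': ['DHCP', 'libvirt'],
        'bridge with no CNI IP': ['libvirt'],
        'Windows': ['libvirt'],
    },
    'non-virtio': {
        'masquerade': ['DHCP'],
        'bridge with CNI IP': ['DHCP'],
        'bridge with no CNI IP': [],  # X in the table
        'Windows': [],                # X in the table
    },
}


def mtu_methods(model_kind, scenario):
    # Returns the propagation methods listed in the table for this cell.
    return MTU_PROPAGATION[model_kind][scenario]


def needs_manual_mtu(model_kind, scenario):
    # Assumption: an empty cell (X) means the MTU has to be set by hand.
    return not mtu_methods(model_kind, scenario)


print(mtu_methods('virtio', 'masquerade'))
print(needs_manual_mtu('non-virtio', 'Windows'))
```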
Setting networkInterfaceMultiqueue to true enables the multi-queue functionality, increasing the number of vhost queues for interfaces configured with a virtio model.
# partial example - kept short for brevity \napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nspec:\n template:\n spec:\n domain:\n devices:\n networkInterfaceMultiqueue: true\n
Users of a Virtual Machine with multiple vCPUs may benefit from increased network throughput and performance.
Currently, the number of queues is determined by the number of vCPUs of a VM. This is because multi-queue support optimizes RX interrupt affinity and TX queue selection in order to make a specific queue private to a specific vCPU.
Without enabling the feature, network performance does not scale as the number of vCPUs increases. Guests cannot transmit or retrieve packets in parallel, as virtio-net has only one TX and RX queue.
Virtio interfaces advertise on their status.interfaces.interface entry a field named queueCount. The queueCount field indicates how many queues were assigned to the interface. The queue count value is derived from the domain XML. In case the number of queues can't be determined (e.g. for an interface that is reported by the guest-agent only), it will be omitted.
NOTE: Although the virtio-net multiqueue feature provides a performance benefit, it has some limitations and therefore should not be unconditionally enabled.
"},{"location":"network/interfaces_and_networks/#some-known-limitations","title":"Some known limitations","text":"
The guest OS is limited to ~200 MSI vectors. Each NIC queue requires an MSI vector, as does any virtio device or assigned PCI device. Defining an instance with multiple virtio NICs and vCPUs might lead to hitting the guest MSI limit.
virtio-net multiqueue works well for incoming traffic, but can occasionally cause a performance degradation for outgoing traffic. Specifically, this may occur when sending packets under 1,500 bytes over a Transmission Control Protocol (TCP) stream.
Enabling virtio-net multiqueue increases the total network throughput, but in parallel it also increases the CPU consumption.
Enabling virtio-net multiqueue in the host QEMU config does not enable the functionality in the guest OS. The guest OS administrator needs to manually turn it on for each guest NIC that requires this feature, using ethtool.
MSI vectors would still be consumed (wasted), if multiqueue was enabled in the host, but has not been enabled in the guest OS by the administrator.
In case the number of vNICs in a guest instance is proportional to the number of vCPUs, enabling the multiqueue feature is less important.
Each virtio-net queue consumes 64 KiB of kernel memory for the vhost driver.
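To get a feel for the figures in the list above, the following Python sketch estimates queue, MSI vector and vhost memory usage. It takes the numbers in this section at face value and rests on simplifying assumptions that are not from KubeVirt itself: every virtio NIC gets one queue per vCPU, each queue costs exactly one MSI vector, and vectors used by other devices are ignored. The function name multiqueue_estimate is hypothetical.

```python
VHOST_KIB_PER_QUEUE = 64       # kernel memory per queue, as quoted above
GUEST_MSI_VECTOR_LIMIT = 200   # approximate guest limit, as quoted above


def multiqueue_estimate(virtio_nics, vcpus):
    # Assumption: queues per NIC equals the number of vCPUs.
    queues = virtio_nics * vcpus
    return {
        'queues': queues,
        'msi_vectors_for_queues': queues,
        'vhost_memory_kib': queues * VHOST_KIB_PER_QUEUE,
        'exceeds_msi_limit': queues > GUEST_MSI_VECTOR_LIMIT,
    }


print(multiqueue_estimate(virtio_nics=4, vcpus=16))
print(multiqueue_estimate(virtio_nics=8, vcpus=32))
```

Under these assumptions, 4 NICs on a 16 vCPU guest use 64 queues and 4096 KiB of vhost memory, while 8 NICs on a 32 vCPU guest would need 256 vectors for NIC queues alone, which is above the quoted limit.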
NOTE: Virtio-net multiqueue should be enabled in the guest OS manually, using ethtool. For example: ethtool -L <NIC> combined #num_of_queues
For more information, refer to KVM/QEMU MultiQueue.
"},{"location":"network/istio_service_mesh/","title":"Istio service mesh","text":"
A service mesh lets you monitor, visualize and control traffic between pods. KubeVirt supports running VMs as part of an Istio service mesh.
"},{"location":"network/istio_service_mesh/#create-a-virtualmachineinstance-with-enabled-istio-proxy-injecton","title":"Create a VirtualMachineInstance with enabled Istio proxy injecton","text":"
The example below specifies a VMI with masquerade network interface and sidecar.istio.io/inject annotation to register the VM to the service mesh.
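A minimal sketch of such a VMI manifest, written to a local file. The VMI name, label, memory request and container disk image are assumptions chosen for illustration; only the masquerade interface and the sidecar.istio.io/inject annotation come from the text above.

```shell
cat > vmi-istio.yaml <<'EOF'
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: vmi-istio                       # hypothetical name
  labels:
    app: vmi-istio                      # hypothetical label
  annotations:
    sidecar.istio.io/inject: 'true'     # must be a string, not a boolean
spec:
  domain:
    devices:
      disks:
        - name: containerdisk
          disk:
            bus: virtio
      interfaces:
        - name: default
          masquerade: {}                # masquerade binding on the pod network
    resources:
      requests:
        memory: 1024M
  networks:
    - name: default
      pod: {}
  volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/kubevirt/cirros-container-disk-demo   # assumed demo image
EOF

# Create the VMI (requires cluster access):
# kubectl apply -f vmi-istio.yaml
```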
Verify that the istio-proxy sidecar is deployed and able to synchronize with the Istio control plane using the istioctl proxy-status command. See the Istio Debugging Envoy and Istiod documentation section for more information about the proxy-status subcommand.
"},{"location":"network/istio_service_mesh/#troubleshooting","title":"Troubleshooting","text":""},{"location":"network/istio_service_mesh/#istio-sidecar-is-not-deployed","title":"Istio sidecar is not deployed","text":"
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vmi-istio-jnw6p 2/2 Running 0 37s\n\n$ kubectl get pods virt-launcher-vmi-istio-jnw6p -o jsonpath='{.spec.containers[*].name}'\ncompute volumecontainerdisk\n
Resolution: Make sure the istio-injection=enabled label is added to the target namespace. If the issue persists, consult the relevant part of the Istio documentation.
"},{"location":"network/istio_service_mesh/#istio-sidecar-is-not-ready","title":"Istio sidecar is not ready","text":"
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vmi-istio-lg5gp 2/3 Running 0 90s\n\n$ kubectl describe pod virt-launcher-vmi-istio-lg5gp\n ...\n Warning Unhealthy 2d8h (x3 over 2d8h) kubelet Readiness probe failed: Get \"http://10.244.186.222:15021/healthz/ready\": dial tcp 10.244.186.222:15021: connect: no route to host\n Warning Unhealthy 2d8h (x4 over 2d8h) kubelet Readiness probe failed: Get \"http://10.244.186.222:15021/healthz/ready\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)\n
Resolution: Make sure the sidecar.istio.io/inject: \"true\" annotation is defined in the created VMI and that masquerade or passt binding is used for pod network interface.
"},{"location":"network/istio_service_mesh/#virt-launcher-pod-for-vmi-is-stuck-at-initialization-phase","title":"Virt-launcher pod for VMI is stuck at initialization phase","text":"
$ kubectl get pods\nNAME READY STATUS RESTARTS AGE\nvirt-launcher-vmi-istio-44mws 0/3 Init:0/3 0 29s\n\n$ kubectl describe pod virt-launcher-vmi-istio-44mws\n ...\n Multus: [default/virt-launcher-vmi-istio-44mws]: error loading k8s delegates k8s args: TryLoadPodDelegates: error in getting k8s network for pod: GetNetworkDelegates: failed getting the delegate: getKubernetesDelegate: cannot find a network-attachment-definition (istio-cni) in namespace (default): network-attachment-definitions.k8s.cni.cncf.io \"istio-cni\" not found\n
Resolution: Make sure the istio-cni NetworkAttachmentDefinition (provided in the Prerequisites section) is created in the target namespace.
A modular plugin which integrates with Kubevirt to implement a network binding.
Limited Support: Kubevirt provides regular support for the network binding plugin infrastructure for plugin authors. However, individual network plugin bindings are subject to limited, best-effort support from the Kubevirt community.
Clusters with Kubevirt deployments that utilize a network binding plugin should contact the plugin vendor for support on any issue that may be encountered, be it network or other issue.
In order to request support from the Kubevirt core project and its community, please use a setup without any network binding plugin. The plugin examples listed below are an exception to this rule, as they are maintained by the Kubevirt network core maintainers.
For a VM to have access to external network(s), several layers need to be defined and configured, depending on the required connectivity characteristics.
These layers include:
Host connectivity: Network provider.
Host to Pod connectivity: CNI.
Pod to domain connectivity: Network Binding.
This guide focuses on the Network Binding portion.
The network bindings have been part of Kubevirt core API and codebase. With the increase of the number of network bindings added and frequent requests to tweak and change the existing network bindings, a decision has been made to create a network binding plugin infrastructure.
The plugin infrastructure provides means to compose a network binding plugin and integrate it into Kubevirt in a modular manner.
Kubevirt is providing several network binding plugins as references. The following plugins are available:
Depending on the plugin, some components need to be deployed in the cluster. Not all network binding plugins require all these components, therefore these steps are optional.
Binding CNI plugin: When it is required to change the pod network stack (and a core domain-attachment is not a fit), a custom CNI plugin is composed to serve the network binding plugin.
This binary needs to be deployed on each node of the cluster, like any other CNI plugin.
The binary can be built from source or consumed from an existing artifact.
Note: The location of the CNI plugins binaries depends on the platform used and its configuration. A frequently used path for such binaries is /opt/cni/bin/.
Binding NetworkAttachmentDefinition: It references the binding CNI plugin, with optional configuration settings. The manifest needs to be deployed on the cluster at a namespace which is accessible by the VM and its pod.
Note: It is possible to deploy the NetworkAttachmentDefinition on the default namespace, where all other namespaces can access it. Nevertheless, it is recommended (for security reasons) to define the NetworkAttachmentDefinition in the same namespace the VM resides.
Multus: For the network binding CNI and the NetworkAttachmentDefinition to operate, Multus must be deployed on the cluster. For more information, check the Quickstart Installation Guide.
Sidecar image: When a core domain-attachment is not a fit, a sidecar is used to configure the vNIC domain configuration. In more complex scenarios, the sidecar also runs services like DHCP to deliver IP information to the guest.
The sidecar image is built and usually pushed to an image registry for consumption. Therefore, the cluster needs to have access to the image.
The image can be built from source and pushed to an accessible registry or used from a given registry that already contains it.
Feature Gate: The network binding plugin is currently (v1.1.0) in Alpha stage, protected by a feature gate (FG) named NetworkBindingPlugins.
It is therefore necessary to set the FG in the Kubevirt CR.
Example (valid when the FG subtree is already defined):
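A minimal sketch of such a change, written as a JSON patch in YAML form (kubectl accepts patches in either form). It assumes KubeVirt is installed in the kubevirt namespace with a CR named kubevirt; the file name is arbitrary. The add operation appends to the existing featureGates list, which is why that list must already be defined.

```shell
cat > enable-network-binding-fg.yaml <<'EOF'
- op: add
  path: /spec/configuration/developerConfiguration/featureGates/-
  value: NetworkBindingPlugins
EOF

# Apply the patch (requires cluster access):
# kubectl patch kubevirts -n kubevirt kubevirt --type=json --patch-file enable-network-binding-fg.yaml
```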
In order to use a network binding plugin, the cluster admin needs to register the binding. Registration includes the addition of the binding name with all its parameters to the Kubevirt CR.
The following (optional) parameters are currently supported:
Use the format to specify the NetworkAttachmentDefinition that defines the CNI plugin and the configuration the binding plugin uses. Used when the binding plugin needs to change the pod network namespace."},{"location":"network/network_binding_plugins/#sidecarimage","title":"sidecarImage","text":"
From: v1.1.0
Specify a container image in a registry. Used when the binding plugin needs to modify the domain vNIC configuration or when a service needs to be executed (e.g. DHCP server).
The Domain Attachment type is a pre-defined core kubevirt method to attach an interface to the domain.
Specify the name of a core domain attachment type. A possible alternative to a sidecar, to configure the domain vNIC.
Supported types:
tap (from v1.1.1): The domain configuration is set to use an existing tap device. It also supports existing macvtap devices.
When both the domainAttachmentType and sidecarImage are specified, the domain will first be configured according to the domainAttachmentType and then the sidecarImage may modify it.
Specify whether the network binding plugin supports migration. It is possible to specify a migration method. Supported migration method types:
link-refresh (from v1.2.0): after migration, the guest NIC is deactivated and then activated again. This can be useful to renew the DHCP lease.
Note: In some deployments the Kubevirt CR is controlled by an external controller (e.g. HCO). In such cases, make sure to configure the wrapper operator/controller so the changes will get preserved.
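As an illustration of the registration parameters described above, the sketch below registers a binding named passt. The NetworkAttachmentDefinition reference and the sidecar image are placeholders, and the CR name and namespace are assumptions. Note that an add operation on /spec/configuration/network replaces any network configuration already present there.

```shell
cat > register-passt-binding.yaml <<'EOF'
- op: add
  path: /spec/configuration/network
  value:
    binding:
      passt:                                                  # binding name referenced from the VM spec
        networkAttachmentDefinition: default/netbindingpasst  # assumed NAD reference
        sidecarImage: registry.example.com/network-passt-binding:latest   # placeholder image
        migration:
          method: link-refresh
EOF

# Apply the patch (requires cluster access):
# kubectl patch kubevirts -n kubevirt kubevirt --type=json --patch-file register-passt-binding.yaml
```

A binding that relies on a core domain attachment instead of a sidecar would set domainAttachmentType (for example tap) in place of sidecarImage.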
Some plugins may need additional resources to be added to the compute container of the virt-launcher pod.
It is possible to specify compute resource overhead that will be added to the compute container of virt-launcher pods derived from virtual machines using the plugin.
Note: At the moment, only memory overhead requests are supported.
Note: In some deployments the Kubevirt CR is controlled by an external controller (e.g. HCO). In such cases, make sure to configure the wrapper operator/controller so the changes will get preserved.
Every compute container in a virt-launcher pod derived from a VM using the passt network binding plugin will have an additional 500Mi memory overhead.
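A configuration with this effect can be sketched as a patch that adds computeResourceOverhead to an already registered passt binding. The CR name, namespace and file name are assumptions, and the patch fails if no binding named passt is registered yet.

```shell
cat > passt-memory-overhead.yaml <<'EOF'
- op: add
  path: /spec/configuration/network/binding/passt/computeResourceOverhead
  value:
    requests:
      memory: 500Mi      # only memory requests are supported at the moment
EOF

# Apply the patch (requires cluster access):
# kubectl patch kubevirts -n kubevirt kubevirt --type=json --patch-file passt-memory-overhead.yaml
```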
When configuring the VM/VMI network interface, the binding plugin name can be specified. If it exists in the Kubevirt CR, it will be used to set up the network interface.
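The relevant part of a VMI spec could look like the fragment below, which assumes a binding registered under the name passt. The interface name is hypothetical, and the rest of the VMI definition is omitted.

```shell
cat > vmi-binding-fragment.yaml <<'EOF'
spec:
  domain:
    devices:
      interfaces:
        - name: passtnet        # hypothetical interface name
          binding:
            name: passt         # must match a binding registered in the Kubevirt CR
  networks:
    - name: passtnet
      pod: {}
EOF
```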
Before creating NetworkPolicy objects, make sure you are using a networking solution which supports NetworkPolicy. Network isolation is controlled entirely by NetworkPolicy objects. By default, all VMIs in a namespace are accessible from other VMIs and network endpoints. To isolate one or more VMIs in a project, you can create NetworkPolicy objects in that namespace to indicate the allowed incoming connections.
Note: VMIs and pods are treated equally by network policies, since labels are passed through to the pods which contain the running VMI. In other words, labels on VMIs can be matched by spec.podSelector on the policy.
"},{"location":"network/networkpolicy/#create-networkpolicy-to-deny-all-traffic","title":"Create NetworkPolicy to Deny All Traffic","text":"
To make a project \"deny by default\", add a NetworkPolicy object that matches all VMIs but accepts no traffic.
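A minimal sketch of such a policy, using the standard networking.k8s.io/v1 API. The policy and file names are arbitrary.

```shell
cat > deny-by-default.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-by-default
spec:
  podSelector: {}     # empty selector: applies to every pod and VMI in the namespace
  ingress: []         # no ingress rules: nothing is allowed in
EOF

# kubectl apply -f deny-by-default.yaml
```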
"},{"location":"network/networkpolicy/#create-networkpolicy-to-only-accept-connections-from-vmis-within-namespaces","title":"Create NetworkPolicy to only Accept connections from vmis within namespaces","text":"
To make VMIs accept connections from other VMIs in the same namespace, but reject all connections from VMIs in other namespaces:
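One way to express this is a policy whose only ingress rule allows traffic from an empty podSelector, which selects peers in the policy's own namespace. The names are arbitrary.

```shell
cat > allow-same-namespace.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}     # any pod or VMI in the same namespace
EOF

# kubectl apply -f allow-same-namespace.yaml
```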
"},{"location":"network/networkpolicy/#create-networkpolicy-to-only-allow-http-and-https-traffic","title":"Create NetworkPolicy to only allow HTTP and HTTPS traffic","text":"
To enable only HTTP and HTTPS access to the VMIs, add a NetworkPolicy object similar to:
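The sketch below assumes the workloads listen on the conventional ports 80 and 443; adjust the ports to match the services actually running in the guests.

```shell
cat > allow-http-https.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-http-https
spec:
  podSelector: {}
  ingress:
    - ports:
        - protocol: TCP
          port: 80       # assumed HTTP port
        - protocol: TCP
          port: 443      # assumed HTTPS port
EOF

# kubectl apply -f allow-http-https.yaml
```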
"},{"location":"network/networkpolicy/#create-networkpolicy-to-deny-traffic-by-labels","title":"Create NetworkPolicy to deny traffic by labels","text":"
To make one specific VMI, labeled type: test, reject all traffic from other VMIs, create:
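A minimal sketch: the podSelector narrows the policy to VMIs carrying the type: test label, and the empty ingress list allows no incoming traffic. The policy name is arbitrary.

```shell
cat > deny-by-label.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-by-label
spec:
  podSelector:
    matchLabels:
      type: test       # label set on the VMI and passed through to its pod
  ingress: []
EOF

# kubectl apply -f deny-by-label.yaml
```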
Once a VirtualMachineInstance is started, you can create a Service object to connect to it. Currently, three types of service are supported: ClusterIP, NodePort and LoadBalancer. The default type is ClusterIP.
Note: Labels on a VirtualMachineInstance are passed through to the pod, so simply add your labels for service creation to the VirtualMachineInstance. From there on it works like exposing any other k8s resource, by referencing these labels in a service.
"},{"location":"network/service_objects/#expose-virtualmachineinstance-as-a-clusterip-service","title":"Expose VirtualMachineInstance as a ClusterIP Service","text":"
Given a VirtualMachineInstance with the label special: key:
Notes:
If --target-port is not set, it takes the same value as --port.
The cluster IP is usually allocated automatically, but it may also be forced to a specific value using the --cluster-ip flag (assuming the value is in the valid range and not taken).
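As a sketch, the same exposure can be written as a plain Service manifest that selects the VMI by its special: key label. The service name and port 27017 match the query output shown in this section; the target port 22 (SSH) and the VMI name in the commented virtctl call are assumptions.

```shell
cat > vmiservice.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: vmiservice
spec:
  type: ClusterIP
  selector:
    special: key         # label set on the VirtualMachineInstance
  ports:
    - protocol: TCP
      port: 27017        # port exposed on the cluster IP
      targetPort: 22     # assumed SSH port inside the guest
EOF

# kubectl apply -f vmiservice.yaml
# Roughly equivalent, with a hypothetical VMI name:
# virtctl expose virtualmachineinstance vmi-ephemeral --name vmiservice --port 27017 --target-port 22
```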
Query the service object:
$ kubectl get service\nNAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE\nvmiservice ClusterIP 172.30.3.149 <none> 27017/TCP 2m\n
You can connect to the VirtualMachineInstance by service IP and service port inside the cluster network:
$ ssh cirros@172.30.3.149 -p 27017\n
"},{"location":"network/service_objects/#expose-virtualmachineinstance-as-a-nodeport-service","title":"Expose VirtualMachineInstance as a NodePort Service","text":"
Expose the SSH port (22) of a VirtualMachineInstance running on KubeVirt by creating a NodePort service:
Notes:
If --node-port is not set, its value will be allocated dynamically (in the range above 30000).
If the --node-port value is set, it must be unique across all services.
The service can be listed by querying for the service objects:
$ kubectl get service\nNAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE\nnodeport NodePort 172.30.232.73 <none> 27017:30000/TCP 5m\n
Connect to the VirtualMachineInstance by using a node IP and node port outside the cluster network:
$ ssh cirros@$NODE_IP -p 30000\n
"},{"location":"network/service_objects/#expose-virtualmachineinstance-as-a-loadbalancer-service","title":"Expose VirtualMachineInstance as a LoadBalancer Service","text":"
Expose the RDP port (3389) of a VirtualMachineInstance running on KubeVirt by creating a LoadBalancer service. Here is an example:
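A minimal sketch of such a service as a manifest. The service name and the selector label are hypothetical, and an external IP is only assigned when the cluster has a load balancer implementation.

```shell
cat > lbsvc.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: lbsvc                 # hypothetical service name
spec:
  type: LoadBalancer
  selector:
    app: vmi-windows          # hypothetical label set on the VirtualMachineInstance
  ports:
    - protocol: TCP
      port: 3389
      targetPort: 3389        # RDP port inside the guest
EOF

# kubectl apply -f lbsvc.yaml
```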
With the macvtap binding plugin, virtual machines are directly exposed to the Kubernetes node's L2 network. This is achieved by 'extending' an existing network interface with a virtual device that has its own MAC address.
Its main benefits are:
Direct connection to the node NIC with no intermediate bridges.
"},{"location":"network/net_binding_plugins/macvtap/#functionality-support","title":"Functionality support","text":"Functionality Support Run without extra capabilities (on pod) Yes Migration support No IPAM support (on pod) No Primary network (pod network) No Secondary network Yes"},{"location":"network/net_binding_plugins/macvtap/#known-issues","title":"Known Issues","text":"
Live migration is not fully supported, see issue #5912
Warning: On KinD clusters, the user needs to adjust the cluster configuration, mounting dev of the running host onto the KinD nodes, because of a known issue.
In order to use macvtap, the following points need to be covered:
Deploy the CNI plugin binary on the nodes.
Deploy the Device Plugin daemon on the nodes.
Configure which node interfaces are exposed.
Define a NetworkAttachmentDefinition that points to the CNI plugin.
"},{"location":"network/net_binding_plugins/macvtap/#macvtap-cni-and-dp-deployment-on-nodes","title":"Macvtap CNI and DP deployment on nodes","text":"
To simplify the procedure, use the Cluster Network Addons Operator to deploy and configure the macvtap components in your cluster.
The aforementioned operator effectively deploys the macvtap cni and device plugin.
"},{"location":"network/net_binding_plugins/macvtap/#expose-node-interface-to-the-macvtap-device-plugin","title":"Expose node interface to the macvtap device plugin","text":"
There are two different alternatives to configure which host interfaces get exposed to the user, enabling them to create macvtap interfaces on top of:
select the host interfaces: indicates which host interfaces are exposed.
expose all interfaces: all interfaces of all hosts are exposed.
Both options are configured via the macvtap-deviceplugin-config ConfigMap, and more information on how to configure it can be found in the macvtap-cni repo.
This is a minimal example, in which the eth0 interface of the Kubernetes nodes is exposed, via the lowerDevice attribute.
This step can be omitted, since the default configuration of the aforementioned ConfigMap is to expose all host interfaces (which is represented by the following configuration):
The object should be created either in the \"default\" namespace, which all other namespaces can access, or in the same namespace the VMs reside in.
The requested k8s.v1.cni.cncf.io/resourceName annotation must point to an exposed host interface (via the lowerDevice attribute, on the macvtap-deviceplugin-config ConfigMap).
The binding plugin replaces the experimental core macvtap binding implementation (including its API).
Note: The network binding plugin infrastructure and the macvtap plugin specifically are in Alpha stage. Please use them with care, preferably on a non-production deployment.
The macvtap binding plugin consists of the following components:
Macvtap CNI plugin.
The plugin needs to:
Enable the network binding plugin framework FG.
Register the binding plugin on the Kubevirt CR.
Reference the network binding by name from the VM spec interface.
Note: The macvtap plugin has no FG of its own. It is up to the cluster admin to decide if the plugin is to be available in the cluster. The macvtap binding is still in evaluation; use it with care.
Plug A Simple Socket Transport is an enhanced alternative to SLIRP, providing user-space network connectivity.
passt is a universal tool which implements a translation layer between a Layer-2 network interface and native Layer-4 sockets (TCP, UDP, ICMP/ICMPv6 echo) on a host.
Its main benefits are:
Doesn't require extra network capabilities such as CAP_NET_RAW and CAP_NET_ADMIN.
Allows integration with service meshes (which expect applications to run locally) out of the box.
Supports IPv6 out of the box (in contrast to the existing bindings which require configuring IPv6 manually).
"},{"location":"network/net_binding_plugins/passt/#functionality-support","title":"Functionality support","text":"Functionality Support Migration support Yes Service Mesh support Yes Pod IP in guest Yes Custom CIDR in guest No Require extra capabilities (on pod) to operate No Primary network (pod network) Yes Secondary network No"},{"location":"network/net_binding_plugins/passt/#node-optimization-requirementsrecommendations","title":"Node optimization requirements/recommendations:","text":"
To get better performance the node should be configured with:
To run multiple passt VMs with no explicit ports, the node's fs.file-max should be increased (for a VM that forwards all IPv4 and IPv6 ports, for TCP and UDP, passt needs to create ~2^18 sockets):
sysctl -w fs.file-max=9223372036854775807\n
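The ~2^18 figure can be reproduced with simple arithmetic, assuming one socket per port for each combination of IP family and transport protocol:

```shell
ports=65536        # port numbers per protocol
ip_families=2      # IPv4 and IPv6
protocols=2        # TCP and UDP

sockets=$((ports * ip_families * protocols))
echo $sockets      # prints 262144, which is 2^18
```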
NOTE: To achieve optimal memory consumption with Passt binding, specify ports required for your workload. When no ports are explicitly specified, all ports are forwarded, leading to memory overhead of up to 800 Mi.
The binding plugin replaces the experimental core passt binding implementation (including its API).
Note: The network binding plugin infrastructure and the passt plugin specifically are in Alpha stage. Please use them with care, preferably on a non-production deployment.
The passt binding plugin consists of the following components:
Passt CNI plugin.
Sidecar image.
As described in the definition & flow section, the passt plugin needs to:
Deploy the CNI plugin binary on the nodes.
Define a NetworkAttachmentDefinition that points to the CNI plugin.
Assure access to the sidecar image.
Enable the network binding plugin framework FG.
Register the binding plugin on the Kubevirt CR.
Reference the network binding by name from the VM spec interface.
And in detail:
"},{"location":"network/net_binding_plugins/passt/#passt-cni-deployment-on-nodes","title":"Passt CNI deployment on nodes","text":"
The CNI plugin binary can be retrieved directly from the kubevirt release assets (on GitHub) or built from source.
Note: The kubevirt project uses Bazel to build the binaries and container images. For more information on how to build the whole project, visit the developer getting started guide.
Once the binary is ready, you may rename it to a meaningful name (e.g. kubevirt-passt-binding). This name is used in the NetworkAttachmentDefinition configuration.
Copy the binary to each node in your cluster. The location of the CNI plugins may vary between platforms and versions. One common path is /opt/cni/bin/.
Note: The passt plugin has no FG of its own. It is up to the cluster admin to decide if the plugin is to be available in the cluster. The passt binding is still in evaluation; use it with care.
The clone.kubevirt.io API Group defines resources for cloning KubeVirt objects. Currently, the only supported cloning type is VirtualMachine, but more types are planned to be supported in the future (see future roadmap below).
Please bear in mind that the clone API is in version v1alpha1. This means that this API is not fully stable yet and that APIs may change in the future.
Under the hood, the clone API relies upon Snapshot & Restore APIs. Therefore, in order to be able to use the clone API, please see Snapshot & Restore prerequisites.
As noted above, the clone API relies on the Snapshot & Restore APIs under the hood, so the Snapshot & Restore user-guide page is a useful reference.
The source and target indicate the source/target API group, kind and name. A few important notes:
Currently, the only supported kinds are VirtualMachine (of the kubevirt.io API group) and VirtualMachineSnapshot (of the snapshot.kubevirt.io API group), but more types are expected to be supported in the future. See \"future roadmap\" below for more info.
The target name is optional. If unspecified, the clone controller will generate a name for the target automatically.
The target and source must reside in the same namespace.
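A minimal sketch of a clone object with a VirtualMachine as both source and target. The object and VM names are hypothetical; the API version follows the v1alpha1 version mentioned above.

```shell
cat > vm-clone.yaml <<'EOF'
apiVersion: clone.kubevirt.io/v1alpha1
kind: VirtualMachineClone
metadata:
  name: testclone              # hypothetical name
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-source            # hypothetical existing VM
  target:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: vm-clone-target      # optional; generated when omitted
EOF

# Create it in the namespace of the source VM (requires cluster access):
# kubectl create -f vm-clone.yaml
```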
These spec fields are intended to determine which labels / annotations are being copied to the target or stripped away.
The filters are a list of strings. Each string represents a key that may exist at the source. Every source key that matches one of these values is copied to the cloned target. In addition, special regular-expression-like characters can be used:
Wildcard character (*) can be used to match anything. Wildcard can be only used at the end of the filter.
These filters are valid:
\"*\"
\"some/key*\"
These filters are invalid:
\"some/*/key\"
\"*/key\"
Negation character (!) can be used to avoid matching certain keys. Negation can be only used at the beginning of a filter. Note that a Negation and Wildcard can be used together.
These filters are valid:
\"!some/key\"
\"!some/*\"
These filters are invalid:
\"key!\"
\"some/!key\"
Setting label / annotation filters is optional. If unset, all labels / annotations will be copied as a default.
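The filter rules above can be illustrated with a small shell function. This is a sketch of the described semantics, not the clone controller's code. It assumes that a matching negated filter always wins, that filters are valid (wildcard only at the end), and that keys contain no other shell pattern characters such as ? or [.

```shell
# copied KEY [FILTER...]: succeeds when KEY would be copied to the clone target.
copied() {
  key=$1
  shift
  [ $# -eq 0 ] && return 0          # no filters set: everything is copied
  result=1                          # 1 means not copied
  for f do                          # iterate over the filters
    case $f in
      '!'*)                         # negated filter: a match excludes the key
        neg=${f#?}
        case $key in $neg) return 1 ;; esac ;;
      *)                            # plain filter; a trailing * matches any suffix
        case $key in $f) result=0 ;; esac ;;
    esac
  done
  return $result
}

copied other/label '*' '!some/key*' && echo other/label is copied
copied some/key/a '*' '!some/key*' || echo some/key/a is stripped
```

The sketch reuses shell pattern matching because a trailing wildcard in a valid filter behaves like a trailing * in a case pattern.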
Some network CNIs such as Kube-OVN or OVN-Kubernetes inject network information into the annotations of a VM. When such a VM is cloned, the clone carries the same annotations and therefore uses the same network. To avoid this, you can use template label and annotation filters.
This field is used to explicitly replace MAC addresses for certain interfaces. The field is a string to string map; the keys represent interface names and the values represent the new MAC address for the clone target.
This field is optional. By default, all mac addresses are stripped out. This suits situations when kube-mac-pool is deployed in the cluster which would automatically assign the target with a fresh valid MAC address.
The clone API is in an early alpha version and may change dramatically. There are many improvements and features that are expected to be added, the most significant goals are:
Add more supported source types like VirtualMachineInstance in the future.
Add a cross-namespace clone support. This needs to be supported for snapshots / restores first.
"},{"location":"storage/clone_api/#using-clones-as-a-golden-vm-image","title":"Using clones as a \"golden VM image\"","text":"
One of the great things that could be accomplished with the clone API when the source is of kind VirtualMachineSnapshot is to create \"golden VM images\" (a.k.a. Templates / Bookmark VMs / etc). In other words, the following workflow would be available:
Create a golden image
Create a VM
Prepare a \"golden VM\" environment
This can mean different things in different contexts. For example, write files, install applications, apply configurations, or anything else.
Snapshot the VM
Delete the VM
Then, this \"golden image\" can be duplicated as many times as needed. To instantiate a VM from the snapshot:
Create a Clone object where the source would point to the previously taken snapshot
Create as many VMs as you need
This feature is still under discussion and may be implemented differently than explained here.
"},{"location":"storage/containerized_data_importer/","title":"Containerized Data Importer","text":"
The Containerized Data Importer (CDI) project provides facilities for enabling Persistent Volume Claims (PVCs) to be used as disks for KubeVirt VMs by way of DataVolumes. The three main CDI use cases are:
Import a disk image from a web server or container registry to a DataVolume
Clone an existing PVC to a DataVolume
Upload a local disk image to a DataVolume
This document deals with the third use case. So you should have CDI installed in your cluster, a VM disk that you'd like to upload, and virtctl in your path.
CDI supports the raw and qcow2 image formats which are supported by qemu. See the qemu documentation for more details. Bootable ISO images can also be used and are treated like raw images. Images may be compressed with either the gz or xz format.
The example in this document uses this CirrOS image
virtctl has an image-upload command with the following options:
virtctl image-upload --help\nUpload a VM image to a DataVolume/PersistentVolumeClaim.\n\nUsage:\n virtctl image-upload [flags]\n\nExamples:\n # Upload a local disk image to a newly created DataVolume:\n virtctl image-upload dv fedora-dv --size=10Gi --image-path=/images/fedora30.qcow2\n\n # Upload a local disk image to an existing DataVolume\n virtctl image-upload dv fedora-dv --no-create --image-path=/images/fedora30.qcow2\n\n # Upload a local disk image to a newly created PersistentVolumeClaim\n virtctl image-upload pvc fedora-pvc --size=10Gi --image-path=/images/fedora30.qcow2\n\n # Upload a local disk image to a newly created PersistentVolumeClaim and label it with a default instance type and preference\n virtctl image-upload pvc fedora-pvc --size=10Gi --image-path=/images/fedora30.qcow2 --default-instancetype=n1.medium --default-preference=fedora\n\n # Upload a local disk image to an existing PersistentVolumeClaim\n virtctl image-upload pvc fedora-pvc --no-create --image-path=/images/fedora30.qcow2\n\n # Upload to a DataVolume with explicit URL to CDI Upload Proxy\n virtctl image-upload dv fedora-dv --uploadproxy-url=https://cdi-uploadproxy.mycluster.com --image-path=/images/fedora30.qcow2\n\n # Upload a local disk archive to a newly created DataVolume:\n virtctl image-upload dv fedora-dv --size=10Gi --archive-path=/images/fedora30.tar\n\nFlags:\n --access-mode string The access mode for the PVC.\n --archive-path string Path to the local archive.\n --default-instancetype string The default instance type to associate with the image.\n --default-instancetype-kind string The default instance type kind to associate with the image.\n --default-preference string The default preference to associate with the image.\n --default-preference-kind string The default preference kind to associate with the image.\n --force-bind Force bind the PVC, ignoring the WaitForFirstConsumer logic.\n -h, --help help for image-upload\n --image-path string Path to the local VM image.\n 
--insecure Allow insecure server connections when using HTTPS.\n --no-create Don't attempt to create a new DataVolume/PVC.\n --size string The size of the DataVolume to create (ex. 10Gi, 500Mi).\n --storage-class string The storage class for the PVC.\n --uploadproxy-url string The URL of the cdi-upload proxy service.\n --volume-mode string Specify the VolumeMode (block/filesystem) used to create the PVC. Default is the storageProfile default. For archive upload default is filesystem.\n --wait-secs uint Seconds to wait for upload pod to start. (default 300)\n\nUse \"virtctl options\" for a list of global command-line options (applies to all commands).\n
virtctl image-upload works by creating a DataVolume of the requested size, sending an UploadTokenRequest to the cdi-apiserver, and uploading the file to the cdi-uploadproxy.
virtctl image-upload dv cirros-vm-disk --size=500Mi --image-path=/home/mhenriks/images/cirros-0.4.0-x86_64-disk.img --uploadproxy-url=<url to upload proxy service>\n
"},{"location":"storage/containerized_data_importer/#addressing-certificate-issues-when-uploading-images","title":"Addressing Certificate Issues when Uploading Images","text":"
Issues with the certificates can be circumvented by using the --insecure flag to prevent the virtctl command from verifying the remote host. It is better to resolve certificate issues that prevent uploading images using the virtctl image-upload command and not use the --insecure flag.
The following are some common issues with certificates and some easy ways to fix them.
"},{"location":"storage/containerized_data_importer/#does-not-contain-any-ip-sans","title":"Does not contain any IP SANs","text":"
This issue happens when trying to upload images using an IP address instead of a resolvable name. For example, trying to upload to the IP address 192.168.39.32 at port 31001 would produce the following error.
virtctl image-upload dv f33 \\\n --size 5Gi \\\n --image-path Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \\\n --uploadproxy-url https://192.168.39.32:31001\n\nPVC default/f33 not found \nDataVolume default/f33 created\nWaiting for PVC f33 upload pod to be ready...\nPod now ready\nUploading data to https://192.168.39.32:31001\n\n 0 B / 193.89 MiB [-------------------------------------------------------] 0.00% 0s\n\nPost https://192.168.39.32:31001/v1beta1/upload: x509: cannot validate certificate for 192.168.39.32 because it doesn't contain any IP SANs\n
It is easily fixed by adding an entry to your local name resolution service. This could be a DNS server or the local hosts file. The upload proxy URL should be changed to use the resolvable name.
The Subject and the Subject Alternative Name in the certificate contain valid names that can be used for resolution. Only one of these names needs to be resolvable. Use the openssl command to view the names of the cdi-uploadproxy service.
Adding the following entry to the /etc/hosts file, if it provides name resolution, should fix this issue. Any service that provides name resolution for the system could be used.
virtctl image-upload dv f33 \\\n --size 5Gi \\\n --image-path Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \\\n --uploadproxy-url https://cdi-uploadproxy:31001\n\nPVC default/f33 not found \nDataVolume default/f33 created\nWaiting for PVC f33 upload pod to be ready...\nPod now ready\nUploading data to https://cdi-uploadproxy:31001\n\n 193.89 MiB / 193.89 MiB [=============================================] 100.00% 1m38s\n\nUploading data completed successfully, waiting for processing to complete, you can hit ctrl-c without interrupting the progress\nProcessing completed successfully\nUploading Fedora-Cloud-Base-33-1.2.x86_64.raw.xz completed successfully\n
"},{"location":"storage/containerized_data_importer/#certificate-signed-by-unknown-authority","title":"Certificate Signed by Unknown Authority","text":"
This happens because the cdi-uploadproxy certificate is self signed and the system does not trust the cdi-uploadproxy as a Certificate Authority.
virtctl image-upload dv f33 \\\n --size 5Gi \\\n --image-path Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \\\n --uploadproxy-url https://cdi-uploadproxy:31001\n\nPVC default/f33 not found \nDataVolume default/f33 created\nWaiting for PVC f33 upload pod to be ready...\nPod now ready\nUploading data to https://cdi-uploadproxy:31001\n\n 0 B / 193.89 MiB [-------------------------------------------------------] 0.00% 0s\n\nPost https://cdi-uploadproxy:31001/v1beta1/upload: x509: certificate signed by unknown authority\n
This can be fixed by adding the certificate to the system's trust store. Download the cdi-uploadproxy-server-cert.
virtctl image-upload dv f33 \\\n --size 5Gi \\\n --image-path Fedora-Cloud-Base-33-1.2.x86_64.raw.xz \\\n --uploadproxy-url https://cdi-uploadproxy:31001\n\nPVC default/f33 not found \nDataVolume default/f33 created\nWaiting for PVC f33 upload pod to be ready...\nPod now ready\nUploading data to https://cdi-uploadproxy:31001\n\n 193.89 MiB / 193.89 MiB [=============================================] 100.00% 1m36s\n\nUploading data completed successfully, waiting for processing to complete, you can hit ctrl-c without interrupting the progress\nProcessing completed successfully\nUploading Fedora-Cloud-Base-33-1.2.x86_64.raw.xz completed successfully\n
"},{"location":"storage/containerized_data_importer/#setting-the-url-of-the-cdi-upload-proxy-service","title":"Setting the URL of the cdi-upload Proxy Service","text":"
Setting the URL for the cdi-upload proxy service allows the virtctl image-upload command to upload the images without specifying the --uploadproxy-url flag. Permanently setting the URL is done by patching the CDI configuration.
The following will set the default upload proxy to use port 31001 of cdi-uploadproxy. An IP address could also be used instead of the DNS name.
See the section Addressing Certificate Issues when Uploading for why cdi-uploadproxy was chosen and issues that can be encountered when using an IP address.
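As a hedged sketch, the resulting setting in the CDI custom resource could look like the fragment below. It assumes the CDI resource is named cdi and that the override is stored in spec.config.uploadProxyURLOverride; verify both against your installed CDI version.

```yaml
apiVersion: cdi.kubevirt.io/v1beta1
kind: CDI
metadata:
  name: cdi   # assumed name of the CDI resource
spec:
  config:
    # Default upload proxy URL picked up by virtctl image-upload
    uploadProxyURLOverride: https://cdi-uploadproxy:31001
```

With this in place, the --uploadproxy-url flag can be omitted from virtctl image-upload.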
"},{"location":"storage/containerized_data_importer/#connect-to-virtualmachineinstance-console","title":"Connect to VirtualMachineInstance console","text":"
Use virtctl to connect to the newly created VirtualMachineInstance.
virtctl console cirros-vm\n
"},{"location":"storage/disks_and_volumes/","title":"Filesystems, Disks and Volumes","text":"
Making persistent storage in the cluster (volumes) accessible to VMs consists of three parts. First, volumes are specified in spec.volumes. Second, disks are added to the VM by specifying them in spec.domain.devices.disks. Finally, a reference to the specified volume is added to the disk specification by name.
Like all other VMI devices, a spec.domain.devices.disks element has a mandatory name. Furthermore, the disk's name must reference the name of a volume inside spec.volumes.
A disk can be made accessible via four different types:
lun
disk
cdrom
filesystems
All possible configuration options are available in the Disk API Reference.
All types allow you to specify the bus attribute. The bus attribute determines how the disk will be presented to the guest operating system.
It is possible to reserve a LUN through the SCSI Persistent Reserve commands. In order to issue privileged SCSI ioctls, the VM requires activation of the persistent reservation flag:
Note: The persistent reservation feature enables an additional privileged component to be deployed together with virt-handler. Because this feature allows for sensitive security procedures, it is disabled by default and requires cluster administrator configuration.
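A minimal sketch of a lun disk with the reservation flag set is shown below. The volume and claim names are illustrative, and the fragment assumes the cluster administrator has already enabled the persistent reservation feature.

```yaml
spec:
  domain:
    devices:
      disks:
        - name: mypvcdisk
          lun:
            # Allow the guest to issue SCSI persistent reservation commands
            reservation: true
  volumes:
    - name: mypvcdisk
      persistentVolumeClaim:
        claimName: mypvc   # illustrative claim name
```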
A disk of type disk exposes the volume as an ordinary disk to the VM.
A minimal example which attaches a PersistentVolumeClaim named mypvc as a disk device to the VM:
metadata:\n  name: testvmi-disk\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nspec:\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: mypvcdisk\n        # This makes it a disk\n        disk: {}\n  volumes:\n    - name: mypvcdisk\n      persistentVolumeClaim:\n        claimName: mypvc\n
You can set the disk bus type, overriding the defaults, which in turn depend on the chipset the VM is configured to use:
metadata:\n  name: testvmi-disk\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nspec:\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: mypvcdisk\n        # This makes it a disk\n        disk:\n          # This makes it exposed as /dev/vda, being the only and thus first\n          # disk attached to the VM\n          bus: virtio\n  volumes:\n    - name: mypvcdisk\n      persistentVolumeClaim:\n        claimName: mypvc\n
A cdrom disk will expose the volume as a cdrom drive to the VM. It is read-only by default.
A minimal example which attaches a PersistentVolumeClaim named mypvc as a cdrom device to the VM:
metadata:\n  name: testvmi-cdrom\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nspec:\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: mypvcdisk\n        # This makes it a cdrom\n        cdrom:\n          # This makes the cdrom writeable\n          readonly: false\n          # This makes the cdrom be exposed as SATA device\n          bus: sata\n  volumes:\n    - name: mypvcdisk\n      persistentVolumeClaim:\n        claimName: mypvc\n
A filesystem device will expose the volume as a filesystem to the VM. filesystems rely on virtiofs to make external filesystems visible to KubeVirt VMs. Further information about virtiofs can be found at the Official Virtiofs Site.
Compared with disk, filesystems allow changes in the source to be dynamically reflected in the volumes inside the VM. For instance, if a given configMap is shared with filesystems, any change made to it will be reflected in the VMs. However, it is important to note that filesystems do not allow live migration.
Additionally, filesystem devices must be mounted inside the VM. This can be done through cloudInitNoCloud or by manually connecting to the VM shell and running the same command. The main challenge is knowing the device tag that identifies the new filesystem, so that it can be mounted with the mount -t virtiofs [device tag] [path] command. The tag is the name assigned to the filesystem in the VM spec under spec.domain.devices.filesystems.name. For instance, if a given VM spec sets spec.domain.devices.filesystems.name: foo, the command required inside the VM to mount the filesystem at the /tmp/foo path is mount -t virtiofs foo /tmp/foo.
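The mount step can be automated at boot. The fragment below is a sketch, assuming a PersistentVolumeClaim named mypvc and a guest image with cloud-init and virtiofs support; the disk and volume names are illustrative.

```yaml
spec:
  domain:
    devices:
      filesystems:
        - name: foo          # device tag seen by the guest
          virtiofs: {}
      disks:
        - name: cloudinitdisk
          disk:
            bus: virtio
  volumes:
    - name: foo
      persistentVolumeClaim:
        claimName: mypvc     # illustrative claim name
    - name: cloudinitdisk
      cloudInitNoCloud:
        userData: |
          #cloud-config
          bootcmd:
            # Create the mount point, then mount the share by its tag
            - mkdir -p /tmp/foo
            - mount -t virtiofs foo /tmp/foo
```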
Note: As stated, filesystems rely on virtiofs. Moreover, virtiofs requires Linux kernel support to work in the VM. To check if the Linux image of the VM has the required support, run the following command: modprobe virtiofs. If the command output is modprobe: FATAL: Module virtiofs not found, the Linux image of the VM does not support virtiofs. You can also check that the kernel version is 5.4 or later on any Linux distribution, or 4.18 or later on CentOS/RHEL. To check this, run the following command: uname -r.
Refer to section Sharing Directories with VMs for usage examples of filesystems.
The error policy controls how the hypervisor behaves when an I/O error occurs on a disk read or write. By default, the guest is stopped and a Kubernetes event is generated. However, it is possible to change the value to one of:
report: the error is reported in the guest
ignore: the error is ignored, but the read/write failure goes undetected
enospace: error when there isn't enough space on the disk
The error policy can be specified per disk or lun.
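As a sketch, a per-disk policy could be expressed as below. The field name errorPolicy and its placement next to the disk name are assumptions here; check the Disk API Reference for your KubeVirt version.

```yaml
spec:
  domain:
    devices:
      disks:
        - name: mypvcdisk
          disk:
            bus: virtio
          # Assumed field name: report I/O errors to the guest
          # instead of stopping it
          errorPolicy: report
```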
Allows attaching cloudInitNoCloud data-sources to the VM. If the VM contains a proper cloud-init setup, it will pick up the disk as a user-data source.
A simple example which attaches a Secret as a cloud-init disk datasource may look like this:
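A minimal sketch, assuming a Secret named testsecret that holds the user-data and a boot disk backed by a claim named mypvc (all names are illustrative):

```yaml
spec:
  domain:
    devices:
      disks:
        - name: mybootdisk
          disk: {}
        - name: mynoclouddisk
          disk: {}
  volumes:
    - name: mybootdisk
      persistentVolumeClaim:
        claimName: mypvc
    - name: mynoclouddisk
      cloudInitNoCloud:
        secretRef:
          name: testsecret   # Secret containing the cloud-init user-data
```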
Allows attaching cloudInitConfigDrive data-sources to the VM. If the VM contains a proper cloud-init setup, it will pick up the disk as a user-data source.
A simple example which attaches a Secret as a cloud-init disk datasource may look like this:
Allows connecting a PersistentVolumeClaim to a VM disk.
Use a PersistentVolumeClaim when the VirtualMachineInstance's disk needs to persist after the VM terminates. This allows for the VM's data to remain persistent between restarts.
A PersistentVolume can be in \"filesystem\" or \"block\" mode:
Filesystem: For KubeVirt to be able to consume the disk present on a PersistentVolume's filesystem, the disk must be named disk.img and be placed in the root path of the filesystem. Currently the disk is also required to be in raw format. Important: The disk.img image file needs to be owned by the user-id 107 in order to avoid permission issues.
Note: If the disk.img image file has not been created manually before starting a VM then it will be created automatically with the PersistentVolumeClaim size. Since not every storage provisioner provides volumes with the exact usable amount of space as requested (e.g. due to filesystem overhead), KubeVirt tolerates up to 10% less available space. This can be configured with the developerConfiguration.pvcTolerateLessSpaceUpToPercent value in the KubeVirt CR (kubectl edit kubevirt kubevirt -n kubevirt).
Block: Use a block volume for consuming raw block devices. Note: you need to enable the BlockVolume feature gate.
A simple example which attaches a PersistentVolumeClaim as a disk may look like this:
"},{"location":"storage/disks_and_volumes/#thick-and-thin-volume-provisioning","title":"Thick and thin volume provisioning","text":"
Sparsification can make a disk thin-provisioned. In other words, it converts freed space within the disk image back into free space on the host. The fstrim utility can be used on a mounted filesystem to discard the blocks not used by the filesystem. In order to be able to sparsify a disk inside the guest, the disk needs to be configured in the libvirt XML with the option discard=unmap. In KubeVirt, every disk is passed with this option enabled by default. It is possible to check if the trim configuration is supported in the guest by running lsblk -D and checking the discard options supported on every disk.
However, in certain cases like preallocation or when the disk is thick provisioned, the option needs to be disabled. The disk's PVC has to be marked with an annotation that contains /storage.preallocation or /storage.thick-provisioned, and set to true. If the volume is preprovisioned using CDI and the preallocation is enabled, then the PVC is automatically annotated with: cdi.kubevirt.io/storage.preallocation: true and the discard passthrough option is disabled.
Example of a PVC definition with the annotation to disable discard passthrough:
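A sketch of such a PVC follows. The claim name, size and access mode are illustrative; the annotation key uses a hypothetical prefix, since only the /storage.thick-provisioned suffix is significant according to the text above.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc
  annotations:
    # Hypothetical prefix; annotation values must be strings
    user.custom.annotation/storage.thick-provisioned: 'true'
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```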
For some storage methods, Kubernetes may support expanding storage in-use (allowVolumeExpansion feature). KubeVirt can respond to it by making the additional storage available for the virtual machines. This feature is currently off by default, and requires enabling a feature gate. To enable it, add the ExpandDisks feature gate in the kubevirt object:
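A minimal sketch of the KubeVirt object with the gate enabled, assuming KubeVirt is installed as the resource kubevirt in the kubevirt namespace:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - ExpandDisks
```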
Enabling this feature does two things:
Notifies the virtual machine about size changes.
If the disk is a Filesystem PVC, expands the matching file to the remaining size (while reserving some space for file system overhead).
To use an externally managed local block device from a host (e.g. /dev/sdb, zvol, LVM, etc.) directly in a VM, you need a provisioner that supports block devices, such as OpenEBS LocalPV.
Alternatively, local volumes can be provisioned by hand. I.e. the following PVC:
DataVolumes are a way to automate importing virtual machine disks onto PVCs during the virtual machine's launch flow. Without using a DataVolume, users have to prepare a PVC with a disk image before assigning it to a VM or VMI manifest. With a DataVolume, both the PVC creation and the import are automated on behalf of the user.
"},{"location":"storage/disks_and_volumes/#datavolume-vm-behavior","title":"DataVolume VM Behavior","text":"
DataVolumes can be defined in the VM spec directly by adding the DataVolumes to the dataVolumeTemplates list. Below is an example.
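A minimal sketch of such a VM is given here. All names, the storage size and the import URL are hypothetical placeholders, and the VM is created in a stopped state.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-alpine-datavolume          # hypothetical name
spec:
  runStrategy: Halted                 # do not start the VM automatically
  template:
    spec:
      domain:
        devices:
          disks:
            - name: datavolumedisk1
              disk:
                bus: virtio
        resources:
          requests:
            memory: 64M
      volumes:
        - name: datavolumedisk1
          dataVolume:
            name: alpine-dv           # must match the template name below
  dataVolumeTemplates:
    - metadata:
        name: alpine-dv
      spec:
        # Spec of the PVC that will host the imported data
        pvc:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 2Gi
        # Where the disk image is imported from (placeholder URL)
        source:
          http:
            url: http://images.example.com/alpine.iso
```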
The DataVolume defined in the dataVolumeTemplates section has two parts: the source and the pvc.
The source part declares that there is a disk image living on an http server that we want to use as a volume for this VM. The pvc part declares the spec that should be used to create the PVC that hosts the source data.
When this VM manifest is posted to the cluster, as part of the launch flow a PVC will be created using the spec provided and the source data will be automatically imported into that PVC before the VM starts. When the VM is deleted, the storage provisioned by the DataVolume will automatically be deleted as well.
For a VMI object, DataVolumes can be referenced as a volume source for the VMI. When this is done, it is expected that the referenced DataVolume exists in the cluster. The VMI will consume the DataVolume, but the DataVolume's life-cycle will not be tied to the VMI.
Below is an example of a DataVolume being referenced by a VMI. It is expected that the DataVolume alpine-datavolume was created prior to posting the VMI manifest to the cluster. It is okay to post the VMI manifest to the cluster while the DataVolume is still having data imported. KubeVirt knows not to start the VMI until all referenced DataVolumes have finished their clone and import phases.
A DataVolume is a custom resource provided by the Containerized Data Importer (CDI) project. KubeVirt integrates with CDI in order to provide users a workflow for dynamically creating PVCs and importing data into those PVCs.
In order to take advantage of the DataVolume volume source on a VM or VMI, CDI must be installed.
Installing CDI
Go to the CDI release page
Pick the latest stable release and post the corresponding cdi-controller-deployment.yaml manifest to your cluster.
An ephemeral volume is a local COW (copy on write) image that uses a network volume as a read-only backing store. With an ephemeral volume, the network backing store is never mutated. Instead all writes are stored on the ephemeral image which exists on local storage. KubeVirt dynamically generates the ephemeral images associated with a VM when the VM starts, and discards the ephemeral images when the VM stops.
Ephemeral volumes are useful in any scenario where disk persistence is not desired. The COW image is discarded when VM reaches a final state (e.g., succeeded, failed).
Currently, only PersistentVolumeClaim may be used as a backing store of the ephemeral volume.
Up-to-date information on supported backing stores can be found in the KubeVirt API.
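A sketch of an ephemeral volume backed by a PersistentVolumeClaim, with illustrative disk and claim names:

```yaml
spec:
  domain:
    devices:
      disks:
        - name: mypvcdisk
          disk: {}
  volumes:
    - name: mypvcdisk
      ephemeral:
        # Read-only backing store; writes go to a local COW image
        persistentVolumeClaim:
          claimName: mypvc
```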
containerDisk was originally named registryDisk; update your code where needed.
The containerDisk feature provides the ability to store and distribute VM disks in the container image registry. containerDisks can be assigned to VMs in the disks section of the VirtualMachineInstance spec.
No network shared storage devices are utilized by containerDisks. The disks are pulled from the container registry and reside on the local node hosting the VMs that consume the disks.
"},{"location":"storage/disks_and_volumes/#when-to-use-a-containerdisk","title":"When to use a containerDisk","text":"
containerDisks are ephemeral storage devices that can be assigned to any number of active VirtualMachineInstances. This makes them an ideal tool for users who want to replicate a large number of VM workloads that do not require persistent data. containerDisks are commonly used in conjunction with VirtualMachineInstanceReplicaSets.
"},{"location":"storage/disks_and_volumes/#when-not-to-use-a-containerdisk","title":"When Not to use a containerDisk","text":"
containerDisks are not a good solution for any workload that requires persistent root disks across VM restarts.
Users can inject a VirtualMachineInstance disk into a container image in a way that is consumable by the KubeVirt runtime. Disks must be placed into the /disk directory inside the container. Raw and qcow2 formats are supported. Qcow2 is recommended in order to reduce the container image's size. containerDisks can and should be based on scratch. No content except the image is required.
Note: Prior to KubeVirt 0.20, the containerDisk image needed to have kubevirt/container-disk-v1alpha as base image.
Note: The containerDisk needs to be readable for the user with the UID 107 (qemu).
Example: Inject a local VirtualMachineInstance disk into a container image.
Note that a containerDisk is file-based and therefore cannot be attached as a lun device to the VM.
"},{"location":"storage/disks_and_volumes/#custom-disk-image-path","title":"Custom disk image path","text":"
A containerDisk can also store the disk image in any folder, when required. The process is the same as described above. The main difference is that KubeVirt does not scan a custom location for images, so it is your responsibility to provide the full path to the disk image. Providing an image path is optional. When no path is provided, KubeVirt searches for disk images in the default location: /disk.
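A sketch of a containerDisk volume that points at a custom image path. The registry, image name and path are hypothetical placeholders.

```yaml
spec:
  domain:
    devices:
      disks:
        - name: containerdisk
          disk: {}
  volumes:
    - name: containerdisk
      containerDisk:
        image: registry.example.com/vmidisks/fedora:latest   # placeholder image
        # Full path to the disk image inside the container image
        path: /custom-disk-path/fedora.img
```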
An emptyDisk works similarly to an emptyDir in Kubernetes. An extra sparse qcow2 disk will be allocated and it will live as long as the VM. Thus it will survive guest-side VM reboots, but not a VM re-creation. The disk capacity needs to be specified.
Example: Boot cirros with an extra emptyDisk with a size of 2GiB:
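A minimal sketch of the relevant part of the spec, assuming a cirros containerDisk image is available at the registry location shown (the image reference is an assumption):

```yaml
spec:
  domain:
    devices:
      disks:
        - name: containerdisk
          disk:
            bus: virtio
        - name: emptydisk
          disk:
            bus: virtio
  volumes:
    - name: containerdisk
      containerDisk:
        image: quay.io/kubevirt/cirros-container-disk-demo:latest   # assumed image
    - name: emptydisk
      emptyDisk:
        capacity: 2Gi   # required: size of the sparse disk
```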
"},{"location":"storage/disks_and_volumes/#when-to-use-an-emptydisk","title":"When to use an emptyDisk","text":"
Ephemeral VMs very often come with read-only root images and limited tmpfs space. In many cases this is not enough to install application dependencies and provide enough disk space for the application data. While this data is not critical and thus can be lost, it is still needed for the application to function properly during its lifetime. This is where an emptyDisk can be useful. An emptyDisk is often used and mounted somewhere in /var/lib or /var/run.
A hostDisk volume type provides the ability to create or use a disk image located somewhere on a node. It works similarly to a hostPath in Kubernetes and provides two usage types:
DiskOrCreate: if a disk image does not exist at the given location, one is created
Disk: a disk image must exist at the given location
Note: you need to enable the HostDisk feature gate.
Example: Create a 1Gi disk image located at /data/disk.img and attach it to a VM.
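A sketch of the relevant part of the spec, with an illustrative volume name:

```yaml
spec:
  domain:
    devices:
      disks:
        - name: host-disk
          disk:
            bus: virtio
  volumes:
    - name: host-disk
      hostDisk:
        capacity: 1Gi
        path: /data/disk.img
        # Create the image if it does not exist yet
        type: DiskOrCreate
```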
A configMap is a reference to a ConfigMap in Kubernetes. A configMap can be presented to the VM as disks or as a filesystem. Each method is described in the following sections and both have some advantages and disadvantages, e.g. disk does not support dynamic change propagation and filesystem does not support live migration. Therefore, depending on the use-case, one or the other may be more suitable.
"},{"location":"storage/disks_and_volumes/#as-a-disk","title":"As a disk","text":"
By using disk, an extra iso disk will be allocated which has to be mounted on a VM. To mount the configMap, users can use cloudInit and the disk's serial number. The name needs to be set for a reference to the created Kubernetes ConfigMap.
Note: Currently, ConfigMap updates are not propagated into the VMI. If a ConfigMap is updated, only a pod will be aware of changes, not running VMIs.
Note: Due to a Kubernetes CRD issue, you cannot control the paths within the volume where ConfigMap keys are projected.
Example: Attach the configMap to a VM and use cloudInit to mount the iso disk:
"},{"location":"storage/disks_and_volumes/#as-a-filesystem","title":"As a filesystem","text":"
By using filesystem, configMaps are shared through virtiofs. In contrast with using disk for sharing configMaps, filesystem allows you to dynamically propagate changes on configMaps to VMIs (i.e. the VM does not need to be rebooted).
Note: Currently, VMIs cannot be live migrated since virtiofs does not support live migration.
To share a given configMap, the following VM definition could be used:
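A sketch of the relevant part of such a definition, assuming a ConfigMap named app-config already exists (the names are illustrative):

```yaml
spec:
  domain:
    devices:
      filesystems:
        - name: config-fs      # device tag used when mounting in the guest
          virtiofs: {}
  volumes:
    - name: config-fs
      configMap:
        name: app-config       # existing ConfigMap to share
```

Inside the guest, the share would then be mounted with mount -t virtiofs config-fs followed by the target path.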
A secret is a reference to a Secret in Kubernetes. A secret can be presented to the VM as disks or as a filesystem. Each method is described in the following sections and both have some advantages and disadvantages, e.g. disk does not support dynamic change propagation and filesystem does not support live migration. Therefore, depending on the use-case, one or the other may be more suitable.
"},{"location":"storage/disks_and_volumes/#as-a-disk_1","title":"As a disk","text":"
By using disk, an extra iso disk will be allocated which has to be mounted on a VM. To mount the secret, users can use cloudInit and the disk's serial number. The secretName needs to be set for a reference to the created Kubernetes Secret.
Note: Currently, Secret update propagation is not supported. If a Secret is updated, only a pod will be aware of changes, not running VMIs.
Note: Due to a Kubernetes CRD issue, you cannot control the paths within the volume where Secret keys are projected.
Example: Attach the secret to a VM and use cloudInit to mount the iso disk:
"},{"location":"storage/disks_and_volumes/#as-a-filesystem_1","title":"As a filesystem","text":"
By using filesystem, secrets are shared through virtiofs. In contrast with using disk for sharing secrets, filesystem allows you to dynamically propagate changes on secrets to VMIs (i.e. the VM does not need to be rebooted).
Note: Currently, VMIs cannot be live migrated since virtiofs does not support live migration.
To share a given secret, the following VM definition could be used:
A serviceAccount volume references a Kubernetes ServiceAccount. A serviceAccount can be presented to the VM as disks or as a filesystem. Each method is described in the following sections and both have some advantages and disadvantages, e.g. disk does not support dynamic change propagation and filesystem does not support live migration. Therefore, depending on the use-case, one or the other may be more suitable.
"},{"location":"storage/disks_and_volumes/#as-a-disk_2","title":"As a disk","text":"
By using disk, a new iso disk will be allocated with the content of the service account (namespace, token and ca.crt), which needs to be mounted in the VM. For automatic mounting, see the configMap and secret examples above.
Note: Currently, ServiceAccount update propagation is not supported. If a ServiceAccount is updated, only a pod will be aware of changes, not running VMIs.
"},{"location":"storage/disks_and_volumes/#as-a-filesystem_2","title":"As a filesystem","text":"
By using filesystem, serviceAccounts are shared through virtiofs. In contrast with using disk for sharing serviceAccounts, filesystem allows you to dynamically propagate changes on serviceAccounts to VMIs (i.e. the VM does not need to be rebooted).
Note: Currently, VMIs cannot be live migrated since virtiofs does not support live migration.
To share a given serviceAccount, the following VM definition could be used:
downwardMetrics expose a limited set of VM and host metrics to the guest. The format is compatible with vhostmd.
Getting a limited set of host and VM metrics is in some cases required to allow third parties to diagnose performance issues on their appliances. One prominent example is SAP HANA.
In order to expose downwardMetrics to VMs, the methods disk and virtio-serial port are supported.
Note: The DownwardMetrics feature gate must be enabled to use the metrics. Available starting with KubeVirt v0.42.0.
This method uses a virtio-serial port to expose the metrics data to the VM. KubeVirt creates a port named /dev/virtio-ports/org.github.vhostmd.1 inside the VM, in which the Virtio Transport protocol is supported. downwardMetrics can be retrieved from this port. See vhostmd documentation under Virtio Transport for further information.
To expose the metrics using a virtio-serial port, a downwardMetrics device must be added (i.e., spec.domain.devices.downwardMetrics: {}).
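In manifest form, the device from the sentence above sits at the following position in the VMI spec (a fragment, assuming the DownwardMetrics feature gate is enabled):

```yaml
spec:
  domain:
    devices:
      # Adds the virtio-serial port that carries the metrics
      downwardMetrics: {}
```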
vm-dump-metrics is useful as a standalone tool to verify the serial port is working and to inspect the metrics. However, applications that consume metrics will usually connect to the virtio-serial port themselves.
Note: The tool vm-dump-metrics provides the option --virtio for the case where the virtio-serial port is used. Refer to vm-dump-metrics --help for further information.
Libvirt has the ability to use IOThreads for dedicated disk access (for supported devices). These are dedicated event loop threads that perform block I/O requests and improve scalability on SMP systems. KubeVirt exposes this libvirt feature through the ioThreadsPolicy setting. Additionally, each Disk device exposes a dedicatedIOThread setting. This is a boolean that indicates the specified disk should be allocated an exclusive IOThread that will never be shared with other disks.
Currently valid policies are shared and auto. If ioThreadsPolicy is omitted entirely, use of IOThreads will be disabled. However, if any disk requests a dedicated IOThread, ioThreadsPolicy will be enabled and default to shared.
An ioThreadsPolicy of shared indicates that KubeVirt should use one thread that will be shared by all disk devices. This policy stems from the fact that a large number of IOThreads is generally not useful, as additional context switching is incurred for each thread.
Disks with dedicatedIOThread set to true will not use the shared thread, but will instead be allocated an exclusive thread. This is generally useful if a specific Disk is expected to have heavy I/O traffic, e.g. a database spindle.
auto IOThreads indicates that KubeVirt should use a pool of IOThreads and allocate disks to IOThreads in a round-robin fashion. The pool size is generally limited to twice the number of vCPUs allocated to the VM. This essentially attempts to dedicate disks to separate IOThreads, but only up to a reasonable limit. This comes into play, for instance, on systems with a large number of disks and a smaller number of CPUs.
As a caveat to the size of the IOThread pool, disks with dedicatedIOThread will always be guaranteed their own thread. This effectively diminishes the upper limit of the number of threads allocated to the rest of the disks. For example, a VM with 2 CPUs would normally use 4 IOThreads for all disks. However if one disk had dedicatedIOThread set to true, then KubeVirt would only use 3 IOThreads for the shared pool.
There is always guaranteed to be at least one thread for disks that will use the shared IOThreads pool. Thus if a sufficiently large number of disks have dedicated IOThreads assigned, auto and shared policies would essentially result in the same layout.
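The settings described above can be sketched as follows. The disk names are illustrative; with 2 cores the auto pool is capped at 4 IOThreads, one of which is reserved for the disk that requests a dedicated thread.

```yaml
spec:
  domain:
    ioThreadsPolicy: auto
    cpu:
      cores: 2
    devices:
      disks:
        - name: db-disk
          # This disk gets an exclusive IOThread
          dedicatedIOThread: true
          disk:
            bus: virtio
        - name: scratch-disk
          # This disk is served from the shared pool
          disk:
            bus: virtio
```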
"},{"location":"storage/disks_and_volumes/#iothreads-with-dedicated-pinned-cpus","title":"IOThreads with Dedicated (pinned) CPUs","text":"
When a guest's vCPUs are pinned to a host's physical CPUs, it is also best to pin the IOThreads to specific CPUs to prevent them from floating between the CPUs. KubeVirt will automatically calculate and pin each IOThread to a CPU or a set of CPUs, depending on the ratio between them. If there are more IOThreads than CPUs, each IOThread will be pinned to a CPU in a round-robin fashion. Otherwise, when there are fewer IOThreads than CPUs, each IOThread will be pinned to a set of CPUs.
"},{"location":"storage/disks_and_volumes/#iothreads-with-qemu-emulator-thread-and-dedicated-pinned-cpus","title":"IOThreads with QEMU Emulator thread and Dedicated (pinned) CPUs","text":"
To further improve vCPU latency, KubeVirt can allocate an additional dedicated physical CPU, exclusively for the emulator thread, to which it will be pinned. This will effectively \"isolate\" the emulator thread from the vCPUs of the VMI. When ioThreadsPolicy is set to auto, IOThreads will also be \"isolated\" from the vCPUs and placed on the same physical CPU as the QEMU emulator thread.
This VM is identical to the first, except it requests auto IOThreads. emptydisk and emptydisk2 will still be allocated individual IOThreads, but the rest of the disks will be split across 2 separate IOThreads (twice the number of CPU cores is 4).
Block Multi-Queue is a framework for the Linux block layer that maps device I/O queries to multiple queues. This splits I/O processing up across multiple threads, and therefore multiple CPUs. libvirt recommends that the number of queues used should match the number of CPUs allocated for optimal performance.
This feature is enabled by the blockMultiQueue setting under devices:
Note: Due to the way KubeVirt implements CPU allocation, blockMultiQueue can only be used if a specific CPU allocation is requested. If a specific number of CPUs hasn't been allocated to a VirtualMachine, KubeVirt will use all CPUs on the node on a best effort basis. In that case the amount of CPU allocated to a VM at the host level could change over time. If blockMultiQueue were to request a number of queues to match all the CPUs on a node, that could lead to over-allocation scenarios. To avoid this, KubeVirt enforces that a specific slice of CPU resources is requested in order to take advantage of this feature.
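A sketch that combines the setting with an explicit CPU request, as the note requires. The disk name and the resource values are illustrative.

```yaml
spec:
  domain:
    resources:
      requests:
        memory: 64M
        cpu: 4               # explicit CPU request, needed for blockMultiQueue
    devices:
      blockMultiQueue: true
      disks:
        - name: mydisk
          disk:
            bus: virtio
```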
KubeVirt supports none, writeback, and writethrough KVM/QEMU cache modes.
none: I/O from the guest is not cached on the host. Use this option for guests with large I/O requirements. This option is generally the best choice.
writeback: I/O from the guest is cached on the host and written through to the physical media when the guest OS issues a flush.
writethrough: I/O from the guest is cached on the host but must be written through to the physical medium before the write operation completes.
Important: none cache mode is set as default if the file system supports direct I/O, otherwise, writethrough is used.
Note: It is possible to force a specific cache mode. However, if none mode has been chosen and the file system does not support direct I/O, the started VMI will return an error.
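A sketch of forcing a cache mode on a single disk, assuming the per-disk cache field described in the Disk API Reference (the disk name is illustrative):

```yaml
spec:
  domain:
    devices:
      disks:
        - name: mypvcdisk
          disk:
            bus: virtio
          # One of: none, writeback, writethrough
          cache: writethrough
```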
Shareable disks allow multiple VMs to share the same underlying storage. In order to use this feature, special care is required because this could lead to data corruption and the loss of important data. Shareable disks demand either data synchronization at the application level or the use of clustered filesystems. These advanced configurations are not within the scope of this documentation and are use-case specific.
If the shareable option is set, it indicates to libvirt/QEMU that the disk is going to be accessed by multiple VMs and not to create a lock for the writes.
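A sketch of a disk marked as shareable. The disk and claim names are illustrative; the same PVC would be referenced from every VM that shares it.

```yaml
spec:
  domain:
    devices:
      disks:
        - name: shared-disk
          disk:
            bus: virtio
          # Tell libvirt/QEMU not to take a write lock on this disk
          shareable: true
  volumes:
    - name: shared-disk
      persistentVolumeClaim:
        claimName: block-pvc   # illustrative claim name
```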
In this example, we use Rook Ceph to dynamically provision the PVC.
We can now attempt to write a string from the first guest and then read the string from the second guest to test that the sharing is working.
$ virtctl console vm-block-1\n$ printf \"Test awesome shareable disks\" | sudo dd of=/dev/vdc bs=1 count=150 conv=notrunc\n28+0 records in\n28+0 records out\n28 bytes copied, 0.0264182 s, 1.1 kB/s\n# Log into the second guest\n$ virtctl console vm-block-2\n$ sudo dd if=/dev/vdc bs=1 count=150 conv=notrunc\nTest awesome shareable disks150+0 records in\n150+0 records out\n150 bytes copied, 0.136753 s, 1.1 kB/s\n
If you are using local devices or RWO PVCs, setting the affinity on the VMs that share the storage guarantees they will be scheduled on the same node. In the example, we set the affinity on the second VM using the label used on the first VM. If you are using shared storage with RWX PVCs, then the affinity rule is not necessary as the storage can be attached simultaneously on multiple nodes.
"},{"location":"storage/disks_and_volumes/#sharing-directories-with-vms","title":"Sharing Directories with VMs","text":"
Virtiofs makes external filesystems visible to KubeVirt VMs. Virtiofs is a shared file system that lets VMs access a directory tree on the host. Further details can be found at the Official Virtiofs Site.
"},{"location":"storage/disks_and_volumes/#non-privileged-and-privileged-sharing-modes","title":"Non-Privileged and Privileged Sharing Modes","text":"
KubeVirt supports two PVC sharing modes: non-privileged and privileged.
The non-privileged mode is enabled by default. This mode has the advantage of not requiring any administrative privileges for creating the VM. However, it has some limitations:
The virtiofsd daemon (the daemon in charge of sharing the PVC with the VM) will run with the QEMU UID/GID (107), and cannot switch between different UIDs/GIDs. Therefore, it will only have access to directories and files that UID/GID 107 has permission to access. Additionally, new files will always be created with QEMU's UID/GID regardless of the UID/GID of the process within the guest.
Extended attributes are not supported.
To switch to the privileged mode, the feature gate ExperimentalVirtiofsSupport has to be enabled. Take into account that this mode requires privileges to run rootful containers.
"},{"location":"storage/disks_and_volumes/#configuration-inside-the-vm","title":"Configuration Inside the VM","text":"
The following configuration can be done using a startup script. See the cloudInitNoCloud section for more details. However, we can also do it manually by logging in to the VM and mounting the shared directory. Here are examples of how to mount it in Linux and Windows VMs:
Sharing hostpath directories is also allowed. The following configuration example is shown for illustrative purposes. However, the PVC method is preferred since using hostpath is generally discouraged for security reasons.
"},{"location":"storage/disks_and_volumes/#configuration-inside-the-node","title":"Configuration Inside the Node","text":"
To share the directory with the VMs, we need to log in to the node, create the shared directory (if it does not already exist), and set the proper SELinux context label container_file_t to the shared directory. In this example we are going to share a new directory /mnt/data (if the desired directory is an existing one, you can skip the mkdir command):
Note: If you are attempting to share an existing directory, you must first check the SELinux context label with the command ls -Z <directory>. In the case that the label is not present or is not container_file_t you need to label it with the chcon command.
The updateVolumesStrategy field specifies the strategy for updating the volumes of a running VM. The following strategies are supported:
Replacement: the updated volumes are replaced when the VM restarts.
Migration: updating the volumes triggers a storage migration from the old volumes to the new ones. More details about volume migration can be found in the volume migration documentation.
Volume migration on update depends on the VolumesUpdateStrategy feature gate, which in turn depends on the VMLiveUpdateFeatures feature gate and configuration.
It can be desirable to export a Virtual Machine and its related disks out of a cluster so you can import that Virtual Machine into another system or cluster. The Virtual Machine disks are the most prominent things you will want to export. The export API makes it possible to declaratively export Virtual Machine disks. It is also possible to export individual PVCs and their contents, for instance when you have created a memory dump from a VM or are using virtio-fs to have a Virtual Machine populate a PVC.
In order not to overload the Kubernetes API server, the data is transferred through a dedicated export proxy server. The proxy server can then be exposed to the outside world through a service associated with an Ingress/Route or NodePort. As an alternative, the port-forward flag can be used with the virtctl integration to bypass the need for an Ingress/Route.
VMExport support must be enabled in the feature gates to be available. To do so, add VMExport to the feature gates field in the KubeVirt CR.
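As a rough sketch of that change, the shell function below prints a KubeVirt CR with the given feature gates listed under spec.configuration.developerConfiguration.featureGates. The function name render_feature_gates is hypothetical, and the CR name and namespace (kubevirt in the kubevirt namespace) are assumptions about a default installation; adjust them to suit yours.

```shell
# Print a KubeVirt CR that lists the feature gates passed as arguments.
# Assumption: the CR is named "kubevirt" and lives in the "kubevirt" namespace.
render_feature_gates() {
  cat <<'EOF'
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
EOF
  # One list entry per gate name; gate names are case sensitive.
  for gate in "$@"; do
    printf '        - %s\n' "$gate"
  done
}

render_feature_gates VMExport
```

The output can be piped into kubectl apply -f -. Gates that are already enabled should be passed as well, since the list in the manifest is likely to replace the existing one rather than be merged with it.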
In order to securely export a Virtual Machine Disk, you must create a token that is used to authorize users accessing the export endpoint. This token is stored in a secret that must be in the same namespace as the Virtual Machine. The contents of the secret can be passed as a token header or parameter to the export URL. The name of the header or argument is x-kubevirt-export-token with a value that matches the content of the secret. The secret can have any valid secret name in the namespace. We recommend you generate an alphanumeric token of at least 12 characters. The data key should be token. For example:
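One possible way to produce such a secret is sketched below: a random 20-character alphanumeric token is generated and placed under the token key of a Secret manifest. The secret name example-token and the function name render_token_secret are placeholders chosen for this illustration.

```shell
# Generate a 20-character alphanumeric token from the kernel random source.
token="$(head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | cut -c1-20)"

# Print a Secret whose "token" key holds the export token passed as $1.
# Assumption: "example-token" is a placeholder secret name.
render_token_secret() {
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: example-token
stringData:
  token: "$1"
EOF
}

render_token_secret "$token"
```

The manifest has to be applied in the namespace of the Virtual Machine. Note that this sketch prints the token in plain text, so the output should be handled with care.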
After you have created the token you can now create a VMExport CR that identifies the Virtual Machine you want to export. You can create a VMExport that looks like this:
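A minimal sketch of such an object follows. The names example-export, example-vm and example-token are placeholders, and the API version export.kubevirt.io/v1beta1 is an assumption about a recent KubeVirt release (older releases served v1alpha1).

```shell
# Print a VirtualMachineExport that exports the volumes of a VirtualMachine.
# $1: export name, $2: VirtualMachine name, $3: name of the token secret.
# Assumption: export.kubevirt.io/v1beta1 is served by the cluster.
render_vmexport() {
  cat <<EOF
apiVersion: export.kubevirt.io/v1beta1
kind: VirtualMachineExport
metadata:
  name: $1
spec:
  tokenSecretRef: $3
  source:
    apiGroup: "kubevirt.io"
    kind: VirtualMachine
    name: $2
EOF
}

render_vmexport example-export example-vm example-token
```

The source section identifies the object to export; the export object and the token secret live in the same namespace as the Virtual Machine.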
The following volumes present in the VM will be exported:
PersistentVolumeClaims
DataVolumes
MemoryDump
All other volume types are not exported. To avoid the export of inconsistent data, a Virtual Machine can only be exported while it is powered off. Any active VM exports will be terminated if the Virtual Machine is started. To export data from a running Virtual Machine you must first create a Virtual Machine Snapshot (see below).
If the VM contains multiple volumes that can be exported, each volume will get its own URL links. If the VM contains no volumes that can be exported, the VMExport will go into a Skipped phase, and no export server is started.
When you create a VMExport based on a Virtual Machine Snapshot, the controller will attempt to create PVCs from the volume snapshots contained in the Virtual Machine Snapshot. Once all the PVCs are ready, the export server will start and you can begin the export. If the Virtual Machine Snapshot contains multiple volumes that can be exported, each volume will get its own URL links. If the Virtual Machine Snapshot contains no volumes that can be exported, the VMExport will go into a Skipped phase, and no export server is started.
In this example the PVC name is example-pvc. Note that the PVC doesn't need to contain a Virtual Machine Disk. It can contain any content, but the main use case is exporting Virtual Machine Disks. After you post this yaml to the cluster, a new export server is created in the same namespace as the PVC. If the source PVC is in use by another pod (such as the virt-launcher pod) then the export will remain pending until the PVC is no longer in use. If the export server is active and another pod starts using the PVC, the export server will be terminated until the PVC is not in use anymore.
"},{"location":"storage/export_api/#export-status-links","title":"Export status links","text":"
The VirtualMachineExport CR will contain a status with internal and external links to the export service. The internal links are only valid inside the cluster, and the external links are valid for external access through an Ingress or Route. The cert field will contain the CA that signed the certificate of the export server for internal links, or the CA that signed the Route or Ingress.
The following is an example of exporting a PVC that contains a KubeVirt disk image. The controller determines if the PVC contains a KubeVirt disk by checking if there is a special annotation on the PVC, or if there is a DataVolume ownerReference on the PVC, or if the PVC has a volumeMode of block.
The archive content-type is automatically selected if we are unable to determine that the PVC contains a KubeVirt disk. The archive will contain all the files that are in the PVC.
The VirtualMachine manifests can be retrieved by accessing the manifests in the VirtualMachineExport status. The all type will return the VirtualMachine manifest, any DataVolumes, and a configMap that contains the public CA certificate of the Ingress/Route of the external URL, or the CA of the export server of the internal URL. The auth-header-secret will be a secret that contains a Containerized Data Importer (CDI) compatible header. This header contains a text version of the export token.
Both internal and external links will contain a manifests field. If there are no external links, then there will not be any external manifests either. The virtualMachine manifests field is only available if the source is a VirtualMachine or VirtualMachineSnapshot. Exporting a PersistentVolumeClaim will not generate a Virtual Machine manifest.
Gzip. The raw KubeVirt disk image, gzipped to improve transfer efficiency.
Dir. A directory listing, allowing you to find the files contained in the PVC.
Tar.gz. The contents of the PVC tarred and gzipped in a single file.
Raw and Gzip will be selected if the PVC is determined to be a KubeVirt disk. KubeVirt disks contain a single disk.img file (or are a block device). Dir will return a list of the files in the PVC. To download a specific file, you can replace /dir in the URL with the path and file name. For instance, if the PVC contains the file /example/data.txt, you can replace /dir with /example/data.txt to download just the data.txt file. Alternatively, you can use the tar.gz URL to get all the contents of the PVC in a tar file.
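The /dir substitution can be sketched as a small shell helper. The function name file_url_from_dir and the URL used here are hypothetical; real links are taken from the VirtualMachineExport status.

```shell
# Build the download URL for a single file from a "dir" link.
# $1: dir URL from the export status, $2: absolute path of the file in the PVC.
file_url_from_dir() {
  # Strip the trailing "/dir" and append the file path in its place.
  printf '%s%s\n' "${1%/dir}" "$2"
}

# Hypothetical dir link, used only for illustration.
dir_url="https://export.example.com/volumes/example-pvc/dir"
file_url_from_dir "$dir_url" /example/data.txt
```

The resulting URL can then be requested with any HTTP client, passing the export token in the x-kubevirt-export-token header.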
"},{"location":"storage/export_api/#internal-link-certificates","title":"Internal link certificates","text":"
The export server certificate is valid for 7 days after which it is rotated by deleting the export server pod and associated secret and generating a new one. If for whatever reason the export server pod dies, the associated secret is also automatically deleted and a new pod and secret are generated. The VirtualMachineExport object status will be automatically updated to reflect the new certificate.
"},{"location":"storage/export_api/#external-link-certificates","title":"External link certificates","text":"
The external link certificates are associated with the Ingress/Route that points to the service created by the KubeVirt operator. The CA that signed the Ingress/Route will be part of the certificates.
"},{"location":"storage/export_api/#ttl-time-to-live-for-an-export","title":"TTL (Time to live) for an Export","text":"
For various reasons (security being one), users should be able to specify a TTL for the VMExport objects that limits the lifetime of an export. This is done via the ttlDuration field which accepts a k8s duration, which defaults to 2 hours when not specified.
# Creates a VMExport object according to the specified flag.\n\n# The flag should either be:\n\n# --pvc, to specify the name of the pvc to export.\n# --snapshot, to specify the name of the VM snapshot to export.\n# --vm, to specify the name of the Virtual Machine to export.\n\n$ virtctl vmexport create name [flags]\n
# Downloads a volume from the defined VMExport object.\n\n# The main available flags are:\n\n# --output, mandatory flag to specify the output file.\n# --volume, optional flag to specify the name of the downloadable volume.\n# --vm|--snapshot|--pvc, if specified, are used to create the VMExport object assuming it doesn't exist. The name of the object to export has to be specified.\n# --format, optional flag to specify whether to download the file in compressed (default) or raw format.\n# --port-forward, optional flag to easily download the volume without the need of an ingress or route. Also, the local port can be optionally specified with the --local-port flag.\n\n$ virtctl vmexport download name [flags]\n
By default, the volume will be downloaded in compressed format. Users can specify the desired format (gzip or raw) by using the format flag, as shown below:
# Downloads a volume from the defined VMExport object and, if necessary, decompresses it.\n$ virtctl vmexport download name --format=raw [flags]\n
"},{"location":"storage/export_api/#ttl-time-to-live","title":"TTL (Time to live)","text":"
TTL can also be added when creating a VMExport via virtctl:
$ virtctl vmexport create name --ttl=1h\n
For more information about usage and examples:
$ virtctl vmexport --help\n\nExport a VM volume.\n\nUsage:\n virtctl vmexport [flags]\n\nExamples:\n # Create a VirtualMachineExport to export a volume from a virtual machine:\n virtctl vmexport create vm1-export --vm=vm1\n\n # Create a VirtualMachineExport to export a volume from a virtual machine snapshot\n virtctl vmexport create snap1-export --snapshot=snap1\n\n # Create a VirtualMachineExport to export a volume from a PVC\n virtctl vmexport create pvc1-export --pvc=pvc1\n\n # Delete a VirtualMachineExport resource\n virtctl vmexport delete snap1-export\n\n # Download a volume from an already existing VirtualMachineExport (--volume is optional when only one volume is available)\n virtctl vmexport download vm1-export --volume=volume1 --output=disk.img.gz\n\n # Create a VirtualMachineExport and download the requested volume from it\n virtctl vmexport download vm1-export --vm=vm1 --volume=volume1 --output=disk.img.gz\n\nFlags:\n -h, --help help for vmexport\n --insecure When used with the 'download' option, specifies that the http request should be insecure.\n --keep-vme When used with the 'download' option, specifies that the vmexport object should not be deleted after the download finishes.\n --output string Specifies the output path of the volume to be downloaded.\n --pvc string Sets PersistentVolumeClaim as vmexport kind and specifies the PVC name.\n --snapshot string Sets VirtualMachineSnapshot as vmexport kind and specifies the snapshot name.\n --vm string Sets VirtualMachine as vmexport kind and specifies the vm name.\n --volume string Specifies the volume to be downloaded.\n\nUse \"virtctl options\" for a list of global command-line options (applies to all commands).\n
"},{"location":"storage/export_api/#use-cases","title":"Use cases","text":""},{"location":"storage/export_api/#clone-vm-from-one-cluster-to-another-cluster","title":"Clone VM from one cluster to another cluster","text":"
If you want to transfer KubeVirt disk images from a source cluster to another target cluster, you can use the VMExport in the source to expose the disks and use Containerized Data Importer (CDI) in the target cluster to import the image into the target cluster. Let's assume we have an Ingress or Route in the source cluster that exposes the export proxy with the following example domain virt-exportproxy-example.example.com and we have a Virtual Machine in the source cluster with one disk, which looks like this:
This is a VM that has a DataVolume (DV) example-dv that is populated from a container disk and we want to export that disk to the target cluster. To export this VM we have to create a token that we can use in the target cluster to get access to the export, or we can let the export controller generate one for us. For example
Note that in this example we are in the example namespace in the source cluster, which is why the internal links domain ends with .example.svc. The external links are what is visible outside of the source cluster, so we can use them when we import into the target cluster.
Now we are ready to import this disk into the target cluster. In order for CDI to import, we will need to provide appropriate yaml that contains the following:
CA cert (as config map)
The token needed to access the disk images in a CDI compatible format
The VM yaml
DataVolume yaml (optional if not part of the VM definition)
virtctl provides an additional argument to the download command called --manifest that will retrieve the appropriate information from the export server, and either save it to a file with the --output argument or write to standard out. By default this output will not contain the header secret as it contains the token in plaintext. To get the header secret you specify the --include-secret argument. The default output format is yaml but it is possible to get json output as well.
Assume there is a running VirtualMachineExport called example-export, the same namespace exists in the target cluster, and the kubeconfig of the target cluster is named kubeconfig-target. To clone the VM into the target cluster, run the following commands:
The first command generates the yaml and writes it to import.yaml. The second command applies the generated yaml to the target cluster. It is possible to combine the two commands writing to standard out with the first command, and piping it into the second command. Use this option if the export token should not be written to a file anywhere. This will create the VM in the target cluster, and provides CDI in the target cluster with everything required to import the disk images.
After the import completes you should be able to start the VM in the target cluster.
"},{"location":"storage/export_api/#download-a-vm-volume-locally-using-virtctl-vmexport","title":"Download a VM volume locally using virtctl vmexport","text":"
Several steps from the previous section can be simplified considerably by using the vmexport command.
Again, let's assume we have an Ingress or Route in our cluster that exposes the export proxy, and that we have a Virtual Machine in the cluster with one disk like this:
Once we meet these requirements, the process of downloading the volume locally can be accomplished by different means:
"},{"location":"storage/export_api/#performing-each-step-separately","title":"Performing each step separately","text":"
We can download the volume by performing every single step in a different command. We start by creating the export object:
# We use an arbitrary name for the VMExport object, but specify our VM name in the flag.\n\n$ virtctl vmexport create vmexportname --vm=example-vm\n
Then, we download the volume in the specified output:
# Since our virtual machine only has one volume, there's no need to specify the volume name with the --volume flag.\n\n# After the download, the VMExport object is deleted by default, so we use the optional --keep-vme flag to keep it and delete it manually afterwards.\n\n$ virtctl vmexport download vmexportname --output=/tmp/disk.img --keep-vme\n
Lastly, we delete the VMExport object:
$ virtctl vmexport delete vmexportname\n
"},{"location":"storage/export_api/#performing-one-single-step","title":"Performing one single step","text":"
All the previous steps can be combined into a single command:
# Since we are using a create flag (--vm) with download, the command creates the object assuming the VMExport doesn't exist.\n\n# Also, since we are not using --keep-vme, the VMExport object is deleted after the download.\n\n$ virtctl vmexport download vmexportname --vm=example-vm --output=/tmp/disk.img\n
After the download finishes, we can find our disk in /tmp/disk.img.
"},{"location":"storage/guestfs/","title":"Usage of libguestfs-tools and virtctl guestfs","text":"
Libguestfs tools are a set of utilities for accessing and modifying VM disk images. The command virtctl guestfs helps to deploy an interactive container with the libguestfs-tools and the PVC attached to it. This command is particularly useful if the users need to modify, inspect or debug VM disks on a PVC.
$ virtctl guestfs -h\nCreate a pod with libguestfs-tools, mount the pvc and attach a shell to it. The pvc is mounted under the /disks directory inside the pod for filesystem-based pvcs, or as /dev/vda for block-based pvcs\n\nUsage:\n virtctl guestfs [flags]\n\nExamples:\n # Create a pod with libguestfs-tools, mount the pvc and attach a shell to it:\n virtctl guestfs <pvc-name>\n\nFlags:\n -h, --help help for guestfs\n --image string libguestfs-tools container image\n --kvm Use kvm for the libguestfs-tools container (default true)\n --pull-policy string pull policy for the libguestfs image (default \"IfNotPresent\")\n\nUse \"virtctl options\" for a list of global command-line options (applies to all commands).\n
By default virtctl guestfs sets up kvm for the interactive container. This considerably speeds up the execution of the libguestfs-tools since they use QEMU. If the cluster doesn't have any kvm supporting nodes, the user must disable kvm by setting the option --kvm=false. If not set, the libguestfs-tools pod will remain pending because it cannot be scheduled on any node.
The command automatically uses the image exposed by KubeVirt under the http endpoint /apis/subresources.kubevirt.io/<kubevirt-version>/guestfs, but it can be configured to use a custom image by using the option --image. Users can also override the pull policy of the image by setting the option --pull-policy.
The command checks if a PVC is used by another pod in which case it will fail. However, once libguestfs-tools has started, the setup doesn't prevent a new pod starting and using the same PVC. The user needs to verify that there are no active virtctl guestfs pods before starting the VM which accesses the same PVC.
Currently, virtctl guestfs supports only a single PVC. Future versions might support multiple PVCs attached to the interactive pod.
"},{"location":"storage/guestfs/#examples-and-use-cases","title":"Examples and use-cases","text":"
Generally, the user can take advantage of the virtctl guestfs command for all typical usage of libguestfs-tools. It is strongly recommended to consult the official documentation. This command simply aims to help in configuring the correct containerized environment in the Kubernetes cluster where KubeVirt is installed.
For all the examples, the user has to start the interactive container by referencing the PVC in the virtctl guestfs command. This will deploy the interactive pod and attach the stdin and stdout.
Example:
$ virtctl guestfs pvc-test\nUse image: registry:5000/kubevirt/libguestfs-tools@sha256:6644792751b2ba9442e06475a809448b37d02d1937dbd15ad8da4d424b5c87dd \nThe PVC has been mounted at /disk \nWaiting for container libguestfs still in pending, reason: ContainerCreating, message: \nWaiting for container libguestfs still in pending, reason: ContainerCreating, message: \nbash-5.0#\n
Once the libguestfs-tools pod has been deployed, the user can access the disk and execute the desired commands. Later, once the user has completed the operations on the disk, simply exit the container and the pod will be automatically terminated.
Inspect the disk filesystem to retrieve the version of the OS on the disk:
KubeVirt now supports hotplugging volumes into a running Virtual Machine Instance (VMI). The volume must be either a block volume or contain a disk image. When a VM that has hotplugged volumes is rebooted, the hotplugged volumes will be attached to the restarted VM. If the volumes are persisted they will become part of the VM spec, and will not be considered hotplugged. If they are not persisted, the volumes will be reattached as hotplugged volumes.
Hotplug volume support must be enabled in the feature gates to be available. To do so, add HotplugVolumes to the feature gates field in the KubeVirt CR.
In order to hotplug a volume, you must first prepare a volume. This can be done by using a DataVolume (DV). In the example we will use a blank DV in order to add some extra storage to a running VMI.
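A blank DataVolume of this kind could look like the manifest printed by the sketch below. The name example-hotplug-volume, the 5Gi size and the function name render_blank_dv are illustrative choices, and the API version cdi.kubevirt.io/v1beta1 assumes a reasonably recent CDI installation.

```shell
# Print a blank DataVolume that can later be hotplugged into a running VMI.
# $1: DataVolume name, $2: requested size (for example 5Gi).
# Assumption: CDI serves the cdi.kubevirt.io/v1beta1 API.
render_blank_dv() {
  cat <<EOF
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: $1
spec:
  source:
    blank: {}
  pvc:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: $2
EOF
}

render_blank_dv example-hotplug-volume 5Gi
```

No volumeMode is set, so the storage default applies.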
In this example we are using ReadWriteOnce accessMode, and the default Filesystem volume mode. Volume hotplugging supports all combinations of block volume mode and ReadWriteMany/ReadWriteOnce/ReadOnlyMany accessModes, if your storage supports the combination."},{"location":"storage/hotplug_volumes/#addvolume","title":"Addvolume","text":"
Now let's assume we have started a VMI like the Fedora VMI in examples and the name of the VMI is 'vmi-fedora'. We can add the above blank volume to this running VMI by using the 'addvolume' command available with virtctl.
This will hotplug the volume into the running VMI, and set the serial of the disk to the volume name. In this example it is set to example-hotplug-volume.
The bus of a hotplugged disk is specified as scsi. Why is it not specified as virtio instead, like regular disks? The reason is a limitation of virtio disks: each disk uses a PCIe slot in the virtual machine and there is a maximum of 32 slots. This means there is a low limit on the maximum number of disks you can hotplug, especially given that other devices also need PCIe slots. Another issue is that these slots need to be reserved ahead of time. So if the number of hotplugged disks is not known ahead of time, it is impossible to properly reserve the required number of slots. To work around this issue, each VM has a virtio-scsi controller, which allows the use of a scsi bus for hotplugged disks. This controller allows for hotplugging of over 4 million disks. virtio-scsi is very close in performance to virtio.
The serial will be used in the guest so you can identify the disk inside the guest by the serial. For instance in Fedora the disk by id will contain the serial.
$ virtctl console vmi-fedora\n\nFedora 32 (Cloud Edition)\nKernel 5.6.6-300.fc32.x86_64 on an x86_64 (ttyS0)\n\nSSH host key: SHA256:c8ik1A9F4E7AxVrd6eE3vMNOcMcp6qBxsf8K30oC/C8 (ECDSA)\nSSH host key: SHA256:fOAKptNAH2NWGo2XhkaEtFHvOMfypv2t6KIPANev090 (ED25519)\neth0: 10.244.196.144 fe80::d8b7:51ff:fec4:7099\nvmi-fedora login:fedora\nPassword:fedora\n[fedora@vmi-fedora ~]$ ls /dev/disk/by-id\nscsi-0QEMU_QEMU_HARDDISK_1234567890\n[fedora@vmi-fedora ~]$ \n
As you can see the serial is part of the disk name, so you can uniquely identify it.
The format and length of serials are specified according to the libvirt documentation:
If present, this specify serial number of virtual hard drive. For example, it may look like <serial>WD-WMAP9A966149</serial>. Not supported for scsi-block devices, that is those using disk type 'block' using device 'lun' on bus 'scsi'. Since 0.7.1\n\n Note that depending on hypervisor and device type the serial number may be truncated silently. IDE/SATA devices are commonly limited to 20 characters. SCSI devices depending on hypervisor version are limited to 20, 36 or 247 characters.\n\n Hypervisors may also start rejecting overly long serials instead of truncating them in the future so it's advised to avoid the implicit truncation by testing the desired serial length range with the desired device and hypervisor combination.\n
"},{"location":"storage/hotplug_volumes/#supported-disk-types","title":"Supported Disk types","text":"
KubeVirt supports hotplugging disk devices of type disk and lun. As with other volumes, using type disk will expose the hotplugged volume as a regular disk, while using lun allows additional functionalities like the execution of iSCSI commands.
You can specify the desired type by using the --disk-type parameter, for example:
# Allowed values are lun and disk. If no option is specified, we use disk by default.\n$ virtctl addvolume vmi-fedora --volume-name=example-lun-hotplug --disk-type=lun\n
"},{"location":"storage/hotplug_volumes/#retain-hotplugged-volumes-after-restart","title":"Retain hotplugged volumes after restart","text":"
In many cases it is desirable to keep hotplugged volumes after a VM restart. It may also be desirable to be able to unplug these volumes after the restart. The persist option makes it impossible to unplug the disks after a restart. If you don't specify persist the default behaviour is to retain hotplugged volumes as hotplugged volumes after a VM restart. This makes the persist flag mostly obsolete unless you want to make a volume permanent on restart.
In some cases you want a hotplugged volume to become part of the standard disks after a restart of the VM, for instance if you added some permanent storage to the VM. We also assume that the running VMI has a matching VM that defines its specification. You can call the addvolume command with the --persist flag. This will update the VM domain disks section in addition to updating the VMI domain disks. This means that when you restart the VM, the disk is already defined in the VM, and thus in the new VMI.
VMI objects have a new status.VolumeStatus field. This is an array containing each disk, hotplugged or not. For example, after hotplugging the volume in the addvolume example, the VMI status will contain this:
Vda is the container disk that contains the Fedora OS, vdb is the cloudinit disk. As you can see those just contain the name and target used when assigning them to the VM. The target is the value passed to QEMU when specifying the disks. The value is unique for the VM and does NOT represent the naming inside the guest. For instance for a Windows Guest OS the target has no meaning. The same will be true for hotplugged volumes. The target is just a unique identifier meant for QEMU, inside the guest the disk can be assigned a different name.
The hotplugVolume has some extra information that regular volume statuses do not have. The attachPodName is the name of the pod that was used to attach the volume to the node the VMI is running on. If this pod is deleted it will also stop the VMI as we cannot guarantee the volume will remain attached to the node. The other fields are similar to conditions and indicate the status of the hot plug process. Once a Volume is ready it can be used by the VM.
Currently Live Migration is enabled for any VMI that has volumes hotplugged into it.
Note: There is a known issue where the migration may fail for VMIs with hotplugged block volumes if the target node uses CPU manager with static policy and runc prior to version v1.0.0.
KubeVirt leverages the VolumeSnapshot functionality of Kubernetes CSI drivers for capturing persistent VirtualMachine state. So, you should make sure that your VirtualMachine uses DataVolumes or PersistentVolumeClaims backed by a StorageClass that supports VolumeSnapshots and a VolumeSnapshotClass is properly configured for that StorageClass.
KubeVirt looks for Kubernetes Volume Snapshot related APIs/resources in the v1 version. To make sure that KubeVirt's snapshot controller is able to snapshot the VirtualMachine and referenced volumes as expected, Kubernetes Volume Snapshot APIs must be served from v1 version.
To list VolumeSnapshotClasses:
kubectl get volumesnapshotclass\n
Make sure the provisioner property of your StorageClass matches the driver property of the VolumeSnapshotClass
Even if you have no VolumeSnapshotClasses in your cluster, VirtualMachineSnapshots are not totally useless. They will still back up your VirtualMachine configuration.
Snapshot/Restore support must be enabled in the feature gates to be available. To do so, add Snapshot to the feature gates field in the KubeVirt CR.
"},{"location":"storage/snapshot_restore_api/#snapshot-a-virtualmachine","title":"Snapshot a VirtualMachine","text":"
Snapshotting a VirtualMachine is supported for online and offline VMs.
When snapshotting a running VM, the controller will check for the qemu guest agent in the VM. If the agent exists, it will freeze the VM filesystems before taking the snapshot and unfreeze them after the snapshot. It is recommended to take online snapshots with the guest agent for a better snapshot. If the agent is not present, a best effort snapshot will be taken.
Note: To check whether your VM has a qemu-guest-agent, look for 'AgentConnected' in the VM status.
There will be an indication in the vmSnapshot status if the snapshot was taken online and with or without guest agent participation.
Note: Online snapshot with hotplugged disks is supported, but only persistent hotplugged disks will be included in the snapshot.
To snapshot a VirtualMachine named larry, apply the following yaml.
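A minimal sketch of such a manifest is printed by the function below. The snapshot name snap-larry and the function name render_vm_snapshot are placeholders, and the API version snapshot.kubevirt.io/v1beta1 is an assumption about a recent KubeVirt release (older releases served v1alpha1).

```shell
# Print a VirtualMachineSnapshot for a VirtualMachine.
# $1: snapshot name, $2: VirtualMachine name.
# Assumption: snapshot.kubevirt.io/v1beta1 is served by the cluster.
render_vm_snapshot() {
  cat <<EOF
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineSnapshot
metadata:
  name: $1
spec:
  source:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: $2
EOF
}

render_vm_snapshot snap-larry larry
```

The output can be piped into kubectl apply -f - in the namespace of the VirtualMachine.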
You can check the vmSnapshot phase in the vmSnapshot status. It can be one of the following:
InProgress
Succeeded
Failed.
The vmSnapshot has a default deadline of 5 minutes. If the vmSnapshot has not successfully completed before the deadline, it will be marked as Failed. The VM will be unfrozen and the created snapshot content will be cleaned up if necessary. The vmSnapshot object will remain in Failed state until deleted by the user. To change the default deadline, add 'FailureDeadline' to the VirtualMachineSnapshot spec with a new value. The allowed format is a duration string, which is a possibly signed sequence of decimal numbers, each with optional fraction and a unit suffix, such as \"300ms\", \"-1.5h\" or \"2h45m\".
Keep VirtualMachineSnapshots (and their corresponding VirtualMachineSnapshotContents) around as long as you may want to restore from them again.
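For illustration, a restore of the VirtualMachine larry from a snapshot could be requested with a VirtualMachineRestore object like the one printed below. The snapshot name snap-larry and the function name render_vm_restore are placeholders, and the API version snapshot.kubevirt.io/v1beta1 is an assumption about a recent KubeVirt release.

```shell
# Print a VirtualMachineRestore that restores a VirtualMachine from a snapshot.
# $1: restore name, $2: VirtualMachine name, $3: VirtualMachineSnapshot name.
# Assumption: snapshot.kubevirt.io/v1beta1 is served by the cluster.
render_vm_restore() {
  cat <<EOF
apiVersion: snapshot.kubevirt.io/v1beta1
kind: VirtualMachineRestore
metadata:
  name: $1
spec:
  target:
    apiGroup: kubevirt.io
    kind: VirtualMachine
    name: $2
  virtualMachineSnapshotName: $3
EOF
}

render_vm_restore restore-larry larry snap-larry
```

The target section names the VirtualMachine to restore, and virtualMachineSnapshotName refers to an existing snapshot in the same namespace.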
Feel free to delete restore-larry as it is not needed once the restore is complete.
"},{"location":"storage/volume_migration/","title":"Migration update volume strategy and volume migration","text":"
Storage migration is possible while the VM is running by using the update volume strategy. Storage migration can be useful in the cases where the users need to change the underlying storage, for example, if the storage class has been deprecated, or there is a new more performant driver available.
This feature doesn't handle the volume creation or cover migration between storage classes, but rather implements a basic API which can be used by overlaying tools to perform more advanced migration planning.
If Migration is specified as updateVolumesStrategy, KubeVirt will try to migrate the storage from the old volume set to the new one when the VirtualMachine spec is updated. The migration considers all the changed volumes present in a single update. A single update may contain modifications to more than one volume, but sequential changes to the volume set will be handled as separate migrations.
Updates are declarative and GitOps compatible. For example, a new version of the VM specification with the new volume set and the migration volume update strategy can be directly applied using kubectl apply or by interactively editing the VM with kubectl edit.
Example: Original VM with a datavolume and datavolume template:
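A sketch of what such a starting VM could look like follows. All names (vm-dv, src-dv, datavolumedisk1), the storage class, the image URL and the sizes are assumptions made for illustration.

```yaml
# Sketch: VM whose disk is backed by a DataVolume created from a template.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-dv                       # hypothetical VM name
spec:
  dataVolumeTemplates:
  - metadata:
      name: src-dv                  # hypothetical source DataVolume
    spec:
      storage:
        resources:
          requests:
            storage: 10Gi
        storageClassName: local     # hypothetical storage class
      source:
        registry:
          url: docker://quay.io/containerdisks/fedora:latest
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk1
        resources:
          requests:
            memory: 1Gi
      volumes:
      - dataVolume:
          name: src-dv
        name: datavolumedisk1
```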
The destination volume may be of a different type or size than the source. It is possible to migrate from and to a block volume as well as a filesystem volume. The destination volume should be equal to or larger than the source volume. However, the additional difference in the size of the destination volume is not instantly visible within the VM and must be manually resized because the guest is unaware of the migration.
The volume migration depends on the VolumeMigration and VolumesUpdateStrategy feature gates and the LiveMigrate workloadUpdateStrategy. To fully enable the feature, add the following to the KubeVirt CR:
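A sketch of the relevant KubeVirt CR settings, assuming KubeVirt is installed in the kubevirt namespace. The vmRolloutStrategy line is an assumption that is not stated in the text above; check whether your release requires it.

```yaml
# Sketch: enable the feature gates and the LiveMigrate workload update method.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt               # adjust to your installation
spec:
  configuration:
    developerConfiguration:
      featureGates:
      - VolumesUpdateStrategy
      - VolumeMigration
    vmRolloutStrategy: LiveUpdate   # assumption, verify for your release
  workloadUpdateStrategy:
    workloadUpdateMethods:
    - LiveMigrate
```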
The volume migration progress can be monitored by watching the corresponding VirtualMachineInstanceMigration object using the label kubevirt.io/volume-update-in-progress: <vm-name>. Example:
Updating a datavolume that is referenced by a datavolume template requires special caution. The volumes section must include a reference to the name of the datavolume template. This means that the datavolume templates must either be entirely deleted or updated as well.
Example of updating the datavolume for the original VM in the first example:
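The update could be sketched as follows: the datavolume template and the volume reference are changed together, and the Migration strategy is set. The names vm-dv, src-dv, dst-dv and datavolumedisk1, the storage class and the sizes are assumptions for illustration. Here the old template named src-dv is replaced by a blank destination named dst-dv.

```yaml
# Sketch: updated VM spec that migrates datavolumedisk1 from src-dv to dst-dv.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-dv                               # hypothetical VM name
spec:
  updateVolumesStrategy: Migration
  dataVolumeTemplates:
  - metadata:
      name: dst-dv                          # replaces the src-dv template
    spec:
      storage:
        resources:
          requests:
            storage: 10Gi                   # equal to or larger than the source
        storageClassName: new-storage-class # hypothetical storage class
      source:
        blank: {}                           # destination starts empty
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk1
        resources:
          requests:
            memory: 1Gi
      volumes:
      - dataVolume:
          name: dst-dv                      # volume now points at the destination
        name: datavolumedisk1
```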
Only certain types of disks and volumes can be migrated. For an unsupported volume type, the RestartRequired condition is set and the volumes are replaced upon VM restart. Currently, volume migration is supported between PersistentVolumeClaims and DataVolumes. Additionally, volume migration is forbidden if the disk is:
* shareable, since data consistency cannot be guaranteed with multiple writers
* hotpluggable, this case isn't currently supported
* filesystem, since virtiofs doesn't currently support live migration
* lun, since the original disk might support the SCSI protocol while the destination PVC class does not. This case isn't currently supported.
Currently, KubeVirt only enables live migration between separate nodes. Volume migration relies on live migration; hence, live migrating storage on the same node is also not possible. Volume migration is possible between local storage, for example between two PVCs with RWO access mode, but they need to be located on two different hosts.
"},{"location":"user_workloads/accessing_virtual_machines/","title":"Accessing Virtual Machines","text":""},{"location":"user_workloads/accessing_virtual_machines/#graphical-and-serial-console-access","title":"Graphical and Serial Console Access","text":"
Once a virtual machine is started you are able to connect to the consoles it exposes. Usually there are two types of consoles:
Serial Console
Graphical Console (VNC)
Note: You need to have virtctl installed to gain access to the VirtualMachineInstance.
"},{"location":"user_workloads/accessing_virtual_machines/#accessing-the-serial-console","title":"Accessing the Serial Console","text":"
The serial console of a virtual machine can be accessed by using the console command:
virtctl console testvm\n
"},{"location":"user_workloads/accessing_virtual_machines/#accessing-the-graphical-console-vnc","title":"Accessing the Graphical Console (VNC)","text":"
To access the graphical console of a virtual machine the VNC protocol is typically used. This requires remote-viewer to be installed. Once the tool is installed, you can access the graphical console using:
virtctl vnc testvm\n
If you only want to open a vnc-proxy without executing the remote-viewer command, it can be accomplished with:
virtctl vnc --proxy-only testvm\n
This would print the port number on your machine where you can manually connect using any VNC viewer.
If the connection fails, you can use the -v flag to get more verbose output from both virtctl and the remote-viewer tool to troubleshoot the problem.
virtctl vnc testvm -v 4\n
Note: If you are using virtctl via SSH on a remote machine, you need to forward the X session to your machine. Look up the -X and -Y flags of ssh if you are not familiar with that. As an alternative you can proxy the API server port with SSH to your machine (either direct or in combination with kubectl proxy).
A common operational pattern used when managing virtual machines is to inject SSH public keys into the virtual machines at boot. This allows automation tools (like Ansible) to provision the virtual machine. It also gives operators a way of gaining secure and passwordless access to a virtual machine.
KubeVirt provides multiple ways to inject SSH public keys into a virtual machine.
In general, these methods fall into two categories:
- Static key injection, which places keys on the virtual machine the first time it is booted.
- Dynamic key injection, which allows keys to be dynamically updated both at boot and during runtime.
Once an SSH public key is injected into the virtual machine, it can be accessed via virtctl.
"},{"location":"user_workloads/accessing_virtual_machines/#static-ssh-public-key-injection-via-cloud-init","title":"Static SSH public key injection via cloud-init","text":"
Users creating virtual machines can provide startup scripts to their virtual machines, allowing multiple customization operations.
One option for injecting public SSH keys into a VM is via a cloud-init startup script. However, there are more flexible options available.
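As a sketch of the cloud-init option, the volumes section of a VirtualMachine could carry the key directly in the user data. The volume name cloudinitdisk and the truncated key are placeholders.

```yaml
# Sketch: fragment of spec.template.spec of a VirtualMachine.
volumes:
- name: cloudinitdisk
  cloudInitNoCloud:
    userData: |-
      #cloud-config
      ssh_authorized_keys:
        - ssh-rsa AAAA...
```

With this approach the key is part of the user data itself, which is the coupling the access credentials API described next avoids.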
The virtual machine's access credential API allows statically injecting SSH public keys at startup time independently of the cloud-init user data by placing the SSH public key into a Kubernetes Secret. This allows keeping the application data in the cloud-init user data separate from the credentials used to access the virtual machine.
A Kubernetes Secret can be created from an SSH public key like this:
# Place SSH public key into a Secret\nkubectl create secret generic my-pub-key --from-file=key1=id_rsa.pub\n
The Secret containing the public key is then assigned to a virtual machine using the access credentials API with the noCloud propagation method.
KubeVirt injects the SSH public key into the virtual machine by using the generated cloud-init metadata instead of the user data. This separates the application user data and user credentials.
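A minimal sketch of a VirtualMachine using this method, modeled on the qemuGuestAgent example in this page. The VM name, image and password are illustrative values.

```yaml
# Sketch: reference the Secret my-pub-key using propagation method noCloud.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: testvm
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: containerdisk
          - disk:
              bus: virtio
            name: cloudinitdisk
        resources:
          requests:
            memory: 1024M
      accessCredentials:
      - sshPublicKey:
          source:
            secret:
              secretName: my-pub-key
          propagationMethod:
            noCloud: {}
      volumes:
      - containerDisk:
          image: quay.io/containerdisks/fedora:latest
        name: containerdisk
      - cloudInitNoCloud:
          userData: |-
            #cloud-config
            password: fedora
            chpasswd: { expire: False }
        name: cloudinitdisk
```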
"},{"location":"user_workloads/accessing_virtual_machines/#dynamic-ssh-public-key-injection-via-qemu-guest-agent","title":"Dynamic SSH public key injection via qemu-guest-agent","text":"
KubeVirt allows the dynamic injection of SSH public keys into a VirtualMachine with the access credentials API.
Utilizing the qemuGuestAgent propagation method, configured Secrets are attached to a VirtualMachine when the VM is started. This allows for dynamic injection of SSH public keys at runtime by updating the attached Secrets.
Please note that new Secrets cannot be attached to a running VM: You must restart the VM to attach the new Secret.
Note: This requires the qemu-guest-agent to be installed within the guest.
Note: When using qemuGuestAgent propagation, the /home/$USER/.ssh/authorized_keys file will be owned by the guest agent. Changes to the file not made by the guest agent will be lost.
Note: More information about the motivation behind the access credentials API can be found in the pull request description that introduced the API.
In the example below the Secret containing the SSH public key is attached to the virtual machine via the access credentials API with the qemuGuestAgent propagation method. This allows updating the contents of the Secret at any time, which will result in the changes getting applied to the running virtual machine immediately. The Secret may also contain multiple SSH public keys.
# Place SSH public key into a secret\nkubectl create secret generic my-pub-key --from-file=key1=id_rsa.pub\n
Now reference this secret in the VirtualMachine spec with the access credentials API using qemuGuestAgent propagation.
# Create a VM referencing the Secret using propagation method qemuGuestAgent\nkubectl create -f - <<EOF\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nmetadata:\n name: testvm\nspec:\n runStrategy: Always\n template:\n spec:\n domain:\n devices:\n disks:\n - disk:\n bus: virtio\n name: containerdisk\n - disk:\n bus: virtio\n name: cloudinitdisk\n rng: {}\n resources:\n requests:\n memory: 1024M\n terminationGracePeriodSeconds: 0\n accessCredentials:\n - sshPublicKey:\n source:\n secret:\n secretName: my-pub-key\n propagationMethod:\n qemuGuestAgent:\n users:\n - fedora\n volumes:\n - containerDisk:\n image: quay.io/containerdisks/fedora:latest\n name: containerdisk\n - cloudInitNoCloud:\n userData: |-\n #cloud-config\n password: fedora\n chpasswd: { expire: False }\n # Disable SELinux for now, so qemu-guest-agent can write the authorized_keys file\n # The selinux-policy is too restrictive currently, see open bugs:\n # - https://bugzilla.redhat.com/show_bug.cgi?id=1917024\n # - https://bugzilla.redhat.com/show_bug.cgi?id=2028762\n # - https://bugzilla.redhat.com/show_bug.cgi?id=2057310\n bootcmd:\n - setenforce 0\n name: cloudinitdisk\nEOF\n
"},{"location":"user_workloads/accessing_virtual_machines/#accessing-the-vmi-using-virtctl","title":"Accessing the VMI using virtctl","text":"
The user can create a websocket backed network tunnel to a port inside the instance by using the virtualmachineinstances/portforward subresource of the VirtualMachineInstance.
One use-case for this subresource is to forward SSH traffic into the VirtualMachineInstance either from the CLI or a web-UI.
To connect to a VirtualMachineInstance from your local machine, virtctl provides a lightweight SSH client with the ssh command, that uses port forwarding. Refer to the command's help for more details.
virtctl ssh\n
To transfer files from or to a VirtualMachineInstance, virtctl also provides a lightweight SCP client with the scp command. Its usage is similar to the ssh command. Refer to the command's help for more details.
virtctl scp\n
"},{"location":"user_workloads/accessing_virtual_machines/#using-virtctl-as-proxy","title":"Using virtctl as proxy","text":"
If you prefer to use your local OpenSSH client, there are two ways of doing that in combination with virtctl.
Note: Most of this applies to the virtctl scp command too.
The virtctl ssh command has a --local-ssh option. With this option virtctl wraps the local OpenSSH client transparently to the user. The executed SSH command can be viewed by increasing the verbosity (-v 3).
virtctl ssh --local-ssh -v 3 testvm\n
The virtctl port-forward command provides an option to tunnel a single port to your local stdout/stdin. This allows the command to be used in combination with the OpenSSH client's ProxyCommand option.
This allows you to simply call ssh user@vmi/testvmi.mynamespace and your SSH config and virtctl will do the rest. Using this method it becomes easy to set up different identities for different namespaces inside your SSH config.
This feature can also be used with Ansible to automate configuration of virtual machines running on KubeVirt. You can put the snippet above into its own file (e.g. ~/.ssh/virtctl-proxy-config) and add the following lines to your .ansible.cfg:
Note that all port forwarding traffic will be sent over the Kubernetes control plane. A large number of connections and a high volume of traffic can increase pressure on the API server. If you regularly need many connections or a lot of traffic, consider using a dedicated Kubernetes Service instead.
"},{"location":"user_workloads/accessing_virtual_machines/#rbac-permissions-for-consolevncssh-access","title":"RBAC permissions for Console/VNC/SSH access","text":""},{"location":"user_workloads/accessing_virtual_machines/#using-default-rbac-cluster-roles","title":"Using default RBAC cluster roles","text":"
Every KubeVirt installation starting with version v0.5.1 ships a set of default RBAC cluster roles that can be used to grant users access to VirtualMachineInstances.
The kubevirt.io:admin and kubevirt.io:edit cluster roles have console, VNC and SSH (port-forwarding) access permissions built into them. Binding either of these roles to a user gives that user the ability to use virtctl to access the console, VNC and SSH.
The default KubeVirt cluster roles grant access to more than just the console, VNC and port-forwarding. The ClusterRole below demonstrates how to craft a custom role that only allows access to the console, VNC and port-forwarding.
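A sketch of such a role follows. The role name is a hypothetical choice, and the verbs reflect how these subresources are commonly accessed (get for console and VNC, update for port-forwarding); verify them against your KubeVirt release.

```yaml
# Sketch: custom ClusterRole limited to console, VNC and port-forwarding.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: allow-console-vnc-port-forward-access   # hypothetical name
rules:
- apiGroups:
  - subresources.kubevirt.io
  resources:
  - virtualmachineinstances/console
  - virtualmachineinstances/vnc
  verbs:
  - get
- apiGroups:
  - subresources.kubevirt.io
  resources:
  - virtualmachineinstances/portforward
  verbs:
  - update
```

Binding this ClusterRole with a RoleBinding limits the access to one namespace, while a ClusterRoleBinding grants it cluster-wide.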
KubeVirt does not come with a UI; it only extends the Kubernetes API with virtualization functionality.
"},{"location":"user_workloads/boot_from_external_source/","title":"Booting From External Source","text":"
When installing a new guest virtual machine OS, it is often useful to boot directly from a kernel and initrd stored in the host physical machine OS, allowing command line arguments to be passed directly to the installer.
Booting from an external source is supported in KubeVirt starting from version v0.42.0-rc.0. This enables the capability to define a Virtual Machine that will use a custom kernel / initrd binary, with possible custom arguments, during its boot process.
The binaries are provided through a container image. The container is pulled from the container registry and resides on the local node hosting the VMs.
Some use cases for this may be:
- For a kernel developer it may be very convenient to launch VMs that are defined to boot from the latest kernel binary, which changes often.
- Initrd can be set with files that need to reside in memory during the whole life cycle of the VM.
initrdPath and kernelPath define the path for the binaries inside the container.
Kernel and Initrd binaries must be owned by qemu user & group.
To change ownership: chown qemu:qemu <binary>, where <binary> is the binary file.
kernelArgs can only be provided if a kernel binary is provided (i.e. kernelPath not defined). These arguments will be passed to the default kernel the VM boots from.
imagePullSecret and imagePullPolicy are optional.
If imagePullPolicy is Always and the container image is updated, then the VM will be booted into the new kernel when the VM restarts.
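The notes above can be sketched in a manifest like the following. The VM name, container image, binary paths and kernel arguments are all hypothetical values; only the field names come from the text.

```yaml
# Sketch: VM booting from a kernel and initrd shipped in a container image.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: ext-kernel-boot-vm                  # hypothetical name
spec:
  runStrategy: Manual
  template:
    spec:
      domain:
        devices: {}
        firmware:
          kernelBoot:
            container:
              image: my.registry/kernel-initrd-binaries:latest   # hypothetical image
              kernelPath: /boot/vmlinuz-virt                     # path inside the image
              initrdPath: /boot/initramfs-virt                   # path inside the image
              imagePullPolicy: Always
            kernelArgs: console=ttyS0
        resources:
          requests:
            memory: 1Gi
```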
All KubeVirt system-components expose Prometheus metrics at their /metrics REST endpoint.
You can consult the complete and up-to-date metric list at kubevirt/monitoring.
"},{"location":"user_workloads/component_monitoring/#custom-service-discovery","title":"Custom Service Discovery","text":"
Prometheus supports service discovery based on Pods and Endpoints out of the box. Both can be used to discover KubeVirt services.
All Pods which expose metrics are labeled with prometheus.kubevirt.io and contain a port-definition which is called metrics. In the KubeVirt release-manifests, the default metrics port is 8443.
The above labels and port information are collected by a Service called kubevirt-prometheus-metrics. Kubernetes automatically creates a corresponding Endpoint with the same name:
By watching this endpoint for added and removed IPs to subsets.addresses and appending the metrics port from subsets.ports, it is possible to always get a complete list of ready-to-be-scraped Prometheus targets.
"},{"location":"user_workloads/component_monitoring/#integrating-with-the-prometheus-operator","title":"Integrating with the prometheus-operator","text":"
The prometheus-operator can make use of the kubevirt-prometheus-metrics service to automatically create the appropriate Prometheus config.
KubeVirt's virt-operator checks if the ServiceMonitor custom resource exists when creating an install strategy for deployment. KubeVirt will automatically create a ServiceMonitor resource in the monitorNamespace, as well as an appropriate role and rolebinding in KubeVirt's namespace.
Three settings are exposed in the KubeVirt custom resource to direct KubeVirt to create these resources correctly:
monitorNamespace: The namespace that prometheus-operator runs in. Defaults to openshift-monitoring.
monitorAccount: The serviceAccount that prometheus-operator runs with. Defaults to prometheus-k8s.
serviceMonitorNamespace: The namespace that the serviceMonitor runs in. Defaults to monitorNamespace.
Please note that if you decide to set serviceMonitorNamespace, then this namespace must be included in the serviceMonitorNamespaceSelector field of the Prometheus spec.
If the prometheus-operator for a given deployment uses these defaults, then these values can be omitted.
An example of the KubeVirt resource depicting these default values:
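A sketch of such a resource, spelling out the defaults named above; the metadata values assume an installation in the kubevirt namespace.

```yaml
# Sketch: KubeVirt CR with the monitoring defaults written out explicitly.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt               # adjust to your installation
spec:
  monitorNamespace: openshift-monitoring
  monitorAccount: prometheus-k8s
```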
"},{"location":"user_workloads/component_monitoring/#integrating-with-the-okd-cluster-monitoring-operator","title":"Integrating with the OKD cluster-monitoring-operator","text":"
After the cluster-monitoring-operator is up and running, KubeVirt will detect the existence of the ServiceMonitor resource. Because the definition contains the openshift.io/cluster-monitoring label, it will automatically be picked up by the cluster monitor.
"},{"location":"user_workloads/component_monitoring/#metrics-about-virtual-machines","title":"Metrics about Virtual Machines","text":"
The endpoints report metrics related to the runtime behaviour of the Virtual Machines. All the relevant metrics are prefixed with kubevirt_vmi.
The metrics have labels that allow you to map them to the VMI objects they refer to. At minimum, the labels will expose the node, name and namespace of the related VMI object.
Please note the domain label in the above example. This label is deprecated and it will be removed in a future release. You should identify the VMI using the node, namespace, name labels instead.
"},{"location":"user_workloads/component_monitoring/#important-queries","title":"Important Queries","text":""},{"location":"user_workloads/component_monitoring/#detecting-connection-issues-for-the-rest-client","title":"Detecting connection issues for the REST client","text":"
Use the following query to get a counter for all REST calls which indicate connection issues:
rest_client_requests_total{code=\"<error>\"}\n
If this counter is continuously increasing, it is an indicator that the corresponding KubeVirt component has general issues connecting to the apiserver.
"},{"location":"user_workloads/creating_it_pref/","title":"Creating Instance Types and Preferences by using virtctl","text":"
As of KubeVirt v1.0, you can use virtctl subcommands to create instance types and preferences.
The virtctl subcommand create instancetype allows easy creation of an instance type manifest from the command line. The command also provides several flags that can be used to create your desired manifest.
There are two required flags that need to be specified:
--cpu: the number of vCPUs to be requested
--memory: the amount of memory to be requested
Additionally, there are several optional flags that can be used, such as specifying a list of GPUs for passthrough, choosing the desired IOThreadsPolicy, or simply providing the name of our instance type.
By default, the command creates cluster-wide instance types. To create the namespaced version instead, provide the --namespaced flag. The namespace name can be specified by using the --namespace flag.
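For orientation, a manifest of the kind this command produces for two vCPUs and 256Mi of memory might look like the sketch below. The name and the values are assumptions for illustration, and the exact output of your virtctl version may differ.

```yaml
# Sketch: cluster-wide instance type with the two required settings.
apiVersion: instancetype.kubevirt.io/v1beta1
kind: VirtualMachineClusterInstancetype
metadata:
  name: example-instancetype        # hypothetical name
spec:
  cpu:
    guest: 2                        # from --cpu
  memory:
    guest: 256Mi                    # from --memory
```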
For a complete list of flags and their descriptions, use the following command:
The virtctl subcommand create preference allows easy creation of a preference manifest from the command line. This command serves as a starting point to create the basic structure of a preference manifest, as it does not allow specifying all the options that are supported in preferences.
The current set of flags allows us, for example, to specify the preferred CPU topology, machine type or a storage class.
By default, the command creates cluster-wide preferences. To create the namespaced version instead, provide the --namespaced flag. The namespace name can be specified by using the --namespace flag.
For a complete list of flags and their descriptions, use the following command:
"},{"location":"user_workloads/creating_vms/","title":"Creating VirtualMachines by using virtctl","text":"
The virtctl subcommand create vm allows easy creation of VirtualMachine manifests from the command line. By default it leverages instance types and preferences, including their inference (see Using instance types and preferences), and it provides several flags to control details of the created virtual machine.
For example there are flags to specify the name or run strategy of a virtual machine or flags to add volumes to a virtual machine. Instance types and preferences can either be specified directly or it is possible to let KubeVirt infer those from the volume used to boot the virtual machine.
For a full set of flags and their description use the following command:
virtctl create vm -h\n
"},{"location":"user_workloads/creating_vms/#creating-virtualmachines-on-a-cluster","title":"Creating VirtualMachines on a cluster","text":"
The output of virtctl create vm can be piped directly into kubectl to create a VirtualMachine on a cluster, e.g.:
# Create a VM with name my-vm on the cluster\nvirtctl create vm --name my-vm | kubectl create -f -\nvirtualmachine.kubevirt.io/my-vm created\n
"},{"location":"user_workloads/creating_vms/#using-instance-types-and-preferences","title":"Using instance types and preferences","text":"
Instance types and preferences can be used with the appropriate flags. If they are not otherwise specified, instance types and preferences are inferred from the boot volume of a virtual machine by default. For more information about inference, see below.
The following example creates a VM specifying an instance type and preference by using the appropriate flags:
virtctl create vm --instancetype my-instancetype --preference my-preference\n
The type of the instance type or preference (namespaced or cluster scope) can be controlled by prefixing the instance type or preference name with the corresponding CRD name, e.g.:
# Using a cluster scoped instance type and a namespaced preference\nvirtctl create vm \\\n --instancetype virtualmachineclusterinstancetype/my-instancetype \\\n --preference virtualmachinepreference/my-preference\n
If a prefix was not supplied the cluster scoped resources will be used by default.
"},{"location":"user_workloads/creating_vms/#inference-of-instance-type-andor-preference","title":"Inference of instance type and/or preference","text":"
To explicitly infer instance types and/or preferences from the volume used to boot the virtual machine add the following flags:
virtctl create vm --infer-instancetype --infer-preference\n
The implicit default is to always try to infer an instance type and preference from the boot volume. This feature makes use of the IgnoreInferFromVolumeFailure policy, which suppresses failures on inference of instance types and preferences. If one of the above switches has been explicitly specified, the RejectInferFromVolumeFailure policy is used instead. This way users are made aware of potential issues during the virtual machine creation.
To infer an instance type or preference from a volume other than the one used to boot the virtual machine, use the --infer-instancetype-from and --infer-preference-from flags to specify any of the virtual machine's volumes.
# This virtual machine will boot from volume-a, but the instance type and\n# preference are inferred from volume-b.\nvirtctl create vm \\\n --volume-import=type:pvc,src:my-ns/my-pvc-a,name:volume-a \\\n --volume-import=type:pvc,src:my-ns/my-pvc-b,name:volume-b \\\n --infer-instancetype-from volume-b \\\n --infer-preference-from volume-b\n
"},{"location":"user_workloads/creating_vms/#boot-order-of-added-volumes","title":"Boot order of added volumes","text":"
Please note that volumes of different kinds currently have the following fixed boot order regardless of the order their flags were specified on the command line:
Containerdisks
Directly used PVCs
DataSources
Cloned PVCs
Blank volumes
Imported volumes (through the --volume-import flag)
If multiple volumes of the same kind were specified their order is determined by the order in which their flags were specified.
"},{"location":"user_workloads/creating_vms/#generating-cloud-init-user-data","title":"Generating cloud-init user data","text":"
To generate cloud-init user data with virtctl create vm the following flags can be used.
Note
Generating cloud-init user data is mutually exclusive with specifying custom cloud-init user data, as explained below.
--password-file: Specify a file to read the password for the virtual machine's main user from. In the generated cloud-init user data, it sets the value of the password parameter to the value read from the file and the value of the chpasswd parameter to { expire: False }.
--ssh-key: Specify one or more SSH authorized keys for the virtual machine's main user. It sets the ssh_authorized_keys parameter in the generated cloud-init user data.
--ga-manage-ssh: When this flag is set, a command enabling the qemu-guest-agent to manage SSH authorized keys is added to the generated cloud-init user data. The command is added to the runcmd parameter, which is required on SELinux-enabled distributions that would otherwise not allow the qemu-guest-agent to manage SSH authorized keys in the home directories of users.
By passing the --ga-manage-ssh flag explicitly, the qemu-guest-agent is able to manage the credentials read from the Secret my-keys specified as source parameter to the --access-cred flag. Note that if --ga-manage-ssh was not explicitly set to false, this is also the default behavior.
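To make the effect of these flags concrete, the generated user data could resemble the sketch below. The user name, the password value, the truncated key and the exact runcmd entry are assumptions; the command actually generated by your virtctl version may differ.

```yaml
#cloud-config
# Sketch of generated cloud-init user data (illustrative values only).
user: cloud-user                    # hypothetical main user
password: example-password          # value read from the password file
chpasswd: { expire: False }
ssh_authorized_keys:
  - ssh-ed25519 AAAA...
runcmd:
  # Assumption: SELinux boolean that lets qemu-guest-agent manage SSH keys
  - [ setsebool, -P, virt_qemu_ga_manage_ssh, 'on' ]
```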
"},{"location":"user_workloads/creating_vms/#specifying-custom-cloud-init-user-data","title":"Specifying custom cloud-init user data","text":"
To pass custom cloud-init user data to virtctl it needs to be encoded into a base64 string.
Note
Specifying custom cloud-init user data is mutually exclusive with generating cloud-init user data, as explained above.
Here is an example of how to do it:
# Put your cloud-init user data into a file.\n# This will add an authorized key to the default user.\n# To get the default username read the documentation for the cloud image\n$ cat cloud-init.txt\n#cloud-config\nssh_authorized_keys:\n - ssh-rsa AAAA...\n\n# Base64 encode the contents of the file without line wraps and store it in a variable\n$ CLOUD_INIT_USERDATA=$(base64 -w 0 cloud-init.txt)\n\n# Show the contents of the variable\n$ echo $CLOUD_INIT_USERDATA\nI2Nsb3VkLWNvbmZpZwpzc2hfYXV0aG9yaXplZF9rZXlzOgogIC0gc3NoLXJzYSBBQUFBLi4uCg==\n
You can now use this variable as an argument to the --cloud-init-user-data flag:
virtctl create vm --cloud-init-user-data $CLOUD_INIT_USERDATA\n
"},{"location":"user_workloads/creating_vms/#adding-access-credentials-to-a-virtual-machine","title":"Adding access credentials to a virtual machine","text":"
By using the --access-cred flag, the virtctl create vm command can configure access credentials in a created virtual machine. It supports SSH authorized key and password access credentials and can configure them to be injected either through the qemu-guest-agent or through cloud-init metadata. The supported parameters of the flag depend on the chosen type and method. The flag can be passed multiple times to configure more than one access credential.
This flag interacts with the flags used to generate cloud-init user data, namely it inherits the same --user for SSH key injection, and it enables qemu-guest-agent to manage SSH authorized keys (--ga-manage-ssh), if it is not explicitly disabled by the user.
Create a manifest for a VirtualMachine with a random name:
virtctl create vm\n
Create a manifest for a VirtualMachine with a specified name and RunStrategy Always:
virtctl create vm --name=my-vm --run-strategy=Always\n
Create a manifest for a VirtualMachine with a specified VirtualMachineClusterInstancetype:
virtctl create vm --instancetype=my-instancetype\n
Create a manifest for a VirtualMachine with a specified VirtualMachineInstancetype (namespaced):
virtctl create vm --instancetype=virtualmachineinstancetype/my-instancetype\n
Create a manifest for a VirtualMachine with a specified VirtualMachineClusterPreference:
virtctl create vm --preference=my-preference\n
Create a manifest for a VirtualMachine with a specified VirtualMachinePreference (namespaced):
virtctl create vm --preference=virtualmachinepreference/my-preference\n
Create a manifest for a VirtualMachine with specified memory and an ephemeral containerdisk volume:
virtctl create vm --memory=1Gi \\\n --volume-containerdisk=src:my.registry/my-image:my-tag\n
Create a manifest for a VirtualMachine with a cloned DataSource in namespace and specified size:
virtctl create vm --volume-import=type:ds,src:my-ns/my-ds,size:50Gi\n
Create a manifest for a VirtualMachine with a cloned DataSource and inferred instance type and preference:
virtctl create vm --volume-import=type:ds,src:my-annotated-ds \\\n --infer-instancetype --infer-preference\n
Create a manifest for a VirtualMachine with multiple volumes and specified boot order:
virtctl create vm --volume-containerdisk=src:my.registry/my-image:my-tag \\\n --volume-import=type:ds,src:my-ds,bootorder:1\n
Create a manifest for a VirtualMachine with multiple volumes and inferred instance type and preference with specified volumes:
virtctl create vm --volume-import=type:ds,src:my-annotated-ds \\\n --volume-pvc=my-annotated-pvc --infer-instancetype=my-annotated-ds \\\n --infer-preference=my-annotated-pvc\n
Create a manifest for a VirtualMachine with a cloned PVC:
virtctl create vm --volume-import=type:pvc,src:my-ns/my-pvc\n
Create a manifest for a VirtualMachine with a directly used PVC:
virtctl create vm --volume-pvc=src:my-pvc\n
Create a manifest for a VirtualMachine with a cloned DataSource and a blank volume:
virtctl create vm --volume-import=type:ds,src:my-ns/my-ds \\\n --volume-import=type:blank,size:50Gi\n
Create a manifest for a VirtualMachine with a specified VirtualMachineCluster{Instancetype,Preference} and cloned DataSource:
virtctl create vm --instancetype=my-instancetype --preference=my-preference \\\n --volume-import=type:ds,src:my-ds\n
Create a manifest for a VirtualMachine with a specified VirtualMachineCluster{Instancetype,Preference} and two cloned DataSources (flag can be provided multiple times):
virtctl create vm --instancetype=my-instancetype --preference=my-preference \\\n --volume-import=type:ds,src:my-ds1 --volume-import=type:ds,src:my-ds2\n
Create a manifest for a VirtualMachine with a specified VirtualMachineCluster{Instancetype,Preference} and directly used PVC:
virtctl create vm --instancetype=my-instancetype --preference=my-preference \\\n --volume-pvc=my-pvc\n
Create a manifest for a VirtualMachine with a specified DataVolumeTemplate:
virtctl create vm \\\n --volume-import=type:pvc,name:my-pvc,namespace:default,size:256Mi\n
Create a manifest for a VirtualMachine with a generated cloud-init config setting the user and adding an ssh authorized key:
virtctl create vm --user=cloud-user --ssh-key=\"ssh-ed25519 AAAA....\"\n
Create a manifest for a VirtualMachine with a generated cloud-init config setting the user and setting the password from a file:
virtctl create vm --user=cloud-user --password-file=/path/to/file\n
Create a manifest for a VirtualMachine with SSH public keys injected into the VM from a secret called my-keys to the user also specified in the cloud-init config:
virtctl create vm --user=cloud-user --access-cred=type:ssh,src:my-keys\n
Create a manifest for a VirtualMachine with SSH public keys injected into the VM from a secret called my-keys to a user specified as param:
virtctl create vm --access-cred=type:ssh,src:my-keys,user:myuser\n
Create a manifest for a VirtualMachine with password injected into the VM from a secret called my-pws:
virtctl create vm --access-cred=type:password,src:my-pws\n
Create a manifest for a VirtualMachine with a Containerdisk and a Sysprep volume (source ConfigMap needs to exist):
virtctl create vm --memory=1Gi \\\n --volume-containerdisk=src:my.registry/my-image:my-tag --sysprep=src:my-cm\n
Creating a VirtualMachine with the following settings and using a secret for configuring access credentials:
Instancetype: u1.small
Preference: fedora
Using the quay.io/containerdisks/fedora containerdisk as first volume
Adding a second blank volume with a size of 1Gi
The main user is named myuser
Logins with the main user are possible with the specified authorized key in the access credentials
# First create the secret with the public key:\nkubectl create secret generic my-keys --from-file=$HOME/.ssh/id_ed25519.pub\n\n# Then create the VM on the cluster\nvirtctl create vm --name my-vm --instancetype=u1.small --preference=fedora \\\n --volume-containerdisk=src:quay.io/containerdisks/fedora \\\n --volume-import=type:blank,size:1Gi --user=myuser \\\n --access-cred=src:my-keys | kubectl create -f -\n\n# Login via SSH once the VM is ready\nvirtctl ssh -i $HOME/.ssh/id_ed25519 myuser@my-vm\n
The kubevirt/common-instancetypes project provides a set of instancetypes and preferences to help create KubeVirt VirtualMachines.
Beginning with the 1.1 release of KubeVirt, cluster-wide resources can be deployed directly through KubeVirt, without another operator. This allows deployment of a set of default instancetypes and preferences alongside KubeVirt.
"},{"location":"user_workloads/deploy_common_instancetypes/#enable-automatic-deployment-of-common-instancetypes","title":"Enable automatic deployment of common-instancetypes","text":"
To enable the deployment of cluster-wide common-instancetypes through the KubeVirt virt-operator, the CommonInstancetypesDeploymentGate feature gate needs to be enabled.
For customization purposes or to install namespaced resources, common-instancetypes can also be deployed by hand.
To install all resources provided by the kubevirt/common-instancetypes project without further customizations, simply apply with kustomize enabled (-k flag):
Guest Agent (GA) is an optional component that can run inside of Virtual Machines. The GA provides plenty of additional runtime information about the running operating system (OS). More technical detail about available GA commands is available here.
"},{"location":"user_workloads/guest_agent_information/#guest-agent-info-in-virtual-machine-status","title":"Guest Agent info in Virtual Machine status","text":"
GA presence in the Virtual Machine is signaled with a condition in the VirtualMachineInstance status. The condition indicates that the GA is connected and can be used.
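As an illustration, the condition usually looks roughly like the following fragment of a VirtualMachineInstance status. The condition type is AgentConnected. The timestamp is a made-up placeholder, and the exact set of fields may differ between KubeVirt versions:

```yaml
# Sketch of the relevant part of a VMI status; the timestamp is illustrative only
status:
  conditions:
  - lastProbeTime: '2020-02-28T10:22:59Z'
    lastTransitionTime: null
    status: 'True'
    type: AgentConnected
```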
When the Guest Agent is not present in the Virtual Machine, the Guest Agent information is not shown. No error is reported because the Guest Agent is an optional component.
The infoSource field indicates where the info is gathered from. Valid values:
domain: the info is based on the domain spec
guest-agent: the info is based on Guest Agent report
domain, guest-agent: the info is based on both the domain spec and the Guest Agent report
"},{"location":"user_workloads/guest_agent_information/#guest-agent-info-available-through-the-api","title":"Guest Agent info available through the API","text":"
The data shown in the VirtualMachineInstance status is a subset of the information available. The rest of the data is available via the REST API exposed by the Kubernetes API server.
There are three new subresources added to the VirtualMachineInstance object:
- guestosinfo\n- userlist\n- filesystemlist\n
The complete GA data is returned via the guestosinfo subresource available behind the API endpoint.
"},{"location":"user_workloads/guest_operating_system_information/#use-with-presets","title":"Use with presets","text":"
A VirtualMachineInstancePreset representing an operating system with a kubevirt.io/os label can be applied to any VirtualMachineInstance that has a matching kubevirt.io/os label.
Default presets for the OS identifiers above are included in the current release.
"},{"location":"user_workloads/guest_operating_system_information/#windows-server-2012r2-virtualmachineinstancepreset-example","title":"Windows Server 2012R2 VirtualMachineInstancePreset Example","text":"
KubeVirt supports quite a lot of so-called \"HyperV enlightenments\", which are optimizations for Windows Guests. Some of these optimizations may require up-to-date host kernel support to work properly, or to deliver the maximum performance gains.
KubeVirt can perform extra checks on the hosts before running Hyper-V enabled VMs, to make sure the host has no known issues with Hyper-V support and properly exposes all the required features, so that optimal performance can be expected. These checks are disabled by default for backward compatibility and because they depend on node-feature-discovery and on extra configuration.
To enable strict host checking, the user may expand the featureGates field in the KubeVirt CR by adding HypervStrictCheck to it.
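A minimal sketch of a KubeVirt CR with this feature gate enabled is shown here. It assumes that the CR is named kubevirt and lives in the kubevirt namespace. Adjust both to match your installation, and keep any feature gates that are already listed:

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt        # assumed CR name
  namespace: kubevirt   # assumed installation namespace
spec:
  configuration:
    developerConfiguration:
      featureGates:
        # feature gate names are case sensitive
        - HypervStrictCheck
```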
In KubeVirt, a Hook Sidecar container is a sidecar container (a secondary container that runs along with the main application container within the same Pod) used to apply customizations before the Virtual Machine is initialized. This ability is provided since configurable elements in the VMI specification do not cover all of the libvirt domain XML elements.
The sidecar containers communicate with the main container over a socket with a gRPC protocol. There are two main sidecar hooks:
onDefineDomain: This hook helps to customize libvirt's XML and return the new XML over gRPC for the VM creation.
preCloudInitIso: This hook helps to customize the cloud-init configuration. It operates on and returns JSON formatted cloud-init data.
To run a VM with custom modifications, the sidecar-shim-image takes care of implementing the communication with the main container.
The image contains the sidecar-shim binary built using sidecar_shim.go which should be kept as the entrypoint of the container. This binary will search in $PATH for binaries named after the hook names (e.g. onDefineDomain and preCloudInitIso) and run them. Users must provide the necessary arguments as command line options (flags).
In the case of onDefineDomain, the arguments will be the VMI information as a JSON string (e.g. --vmi vmiJSON) and the current domain XML (e.g. --domain domainXML). It outputs the modified domain XML on the standard output.
In the case of preCloudInitIso, the arguments will be the VMI information as a JSON string (e.g. --vmi vmiJSON) and the CloudInitData (e.g. --cloud-init cloudInitJSON). It outputs the modified CloudInitData (as JSON) on the standard output.
Shell or python scripts can be used as alternatives to the binary, by making them available at the expected location (/usr/bin/onDefineDomain or /usr/bin/preCloudInitIso depending upon the hook).
A prebuilt image named sidecar-shim capable of running Shell or Python scripts is shipped as part of KubeVirt releases.
"},{"location":"user_workloads/hook-sidecar/#go-python-shell-pick-any-one","title":"Go, Python, Shell - pick any one","text":"
Although a binary doesn't strictly need to be generated from Go code, and a script doesn't strictly need to be one among Shell or Python, for the purpose of this guide, we will use those as examples.
Example Go code modifying the SMBIOS system information can be found in the KubeVirt repo. The binary generated from this code, when available under /usr/bin/onDefineDomain in the sidecar-shim-image, is run right before VMI creation and the baseboard manufacturer value is modified to reflect what's provided in the smbios.vm.kubevirt.io/baseBoardManufacturer annotation in the VMI spec.
"},{"location":"user_workloads/hook-sidecar/#shell-or-python-script","title":"Shell or Python script","text":"
If you prefer writing a shell or python script instead of a Go program, create a Kubernetes ConfigMap and use annotations to make sure the script is run before the VMI creation. The flow would be as below:
Create a ConfigMap containing the shell or python script you want to run
Create a VMI containing the annotation hooks.kubevirt.io/hookSidecars and mention the ConfigMap information in it.
In this case a predefined image can be used to handle the communication with the main container.
"},{"location":"user_workloads/hook-sidecar/#configmap-with-shell-script","title":"ConfigMap with shell script","text":"
"},{"location":"user_workloads/hook-sidecar/#configmap-with-python-script","title":"ConfigMap with python script","text":"
apiVersion: v1\nkind: ConfigMap\nmetadata:\n  name: my-config-map\ndata:\n  my_script.sh: |\n    #!/usr/bin/env python3\n\n    import xml.etree.ElementTree as ET\n    import sys\n\n    def main(s):\n        # write to a temporary file\n        f = open(\"/tmp/orig.xml\", \"w\")\n        f.write(s)\n        f.close()\n\n        # parse xml from file\n        xml = ET.parse(\"/tmp/orig.xml\")\n        # get the root element\n        root = xml.getroot()\n        # find the baseBoard element\n        baseBoard = root.find(\"sysinfo\").find(\"baseBoard\")\n\n        # prepare new element to be inserted into the xml definition\n        element = ET.Element(\"entry\", {\"name\": \"manufacturer\"})\n        element.text = \"Radical Edward\"\n        # insert the element\n        baseBoard.insert(0, element)\n\n        # write to a new file\n        xml.write(\"/tmp/new.xml\")\n        # print file contents to stdout\n        f = open(\"/tmp/new.xml\")\n        print(f.read())\n        f.close()\n\n    if __name__ == \"__main__\":\n        main(sys.argv[4])\n
After creating one of the above ConfigMaps, create the VMI using the manifest in this example. Of importance here is the ConfigMap information stored in the annotations:
The name field indicates the name of the ConfigMap on the cluster which contains the script you want to execute. The key field indicates the key in the ConfigMap which contains the script to be executed. Finally, hookPath indicates the path where you want the script to be mounted. It can be either /usr/bin/onDefineDomain or /usr/bin/preCloudInitIso, depending on the hook you want to execute. An optional value can be specified with the \"image\" key if a custom image is needed. If it is omitted, the default sidecar-shim image built together with the other KubeVirt images is used. Unless overridden with a custom value, the default sidecar-shim image is also updated along with the other images, as described in Updating KubeVirt Workloads.
Whether you used the Go binary or a Shell/Python script from the above examples, you should see that the newly created VMI has the modified baseboard manufacturer information. After creating the VMI, verify that it is in the Running state, then connect to its console and check whether the desired change to the baseboard manufacturer is reflected:
# Once the VM is ready, connect to its display and login using name and password \"fedora\"\ncluster/virtctl.sh vnc vmi-with-sidecar-hook-configmap\n\n# Check whether the base board manufacturer value was successfully overwritten\nsudo dmidecode -s baseboard-manufacturer\n
"},{"location":"user_workloads/instancetypes/","title":"Instance types and preferences","text":"
FEATURE STATE:
instancetype.kubevirt.io/v1alpha1 (Experimental) as of the v0.56.0 KubeVirt release
instancetype.kubevirt.io/v1alpha2 (Experimental) as of the v0.58.0 KubeVirt release
instancetype.kubevirt.io/v1beta1 as of the v1.0.0 KubeVirt release
KubeVirt's VirtualMachine API contains many advanced options for tuning the performance of a VM that go beyond what typical users need to be aware of. Users have previously been unable to simply define the storage/network they want assigned to their VM and then declare in broad terms what quality of resources and kind of performance characteristics they need for their VM.
Instance types and preferences provide a way to define a set of resource, performance and other runtime characteristics, allowing users to reuse these definitions across multiple VirtualMachines.
KubeVirt provides two CRDs for instance types, a cluster-wide VirtualMachineClusterInstancetype and a namespaced VirtualMachineInstancetype. These CRDs encapsulate the following resource-related characteristics of a VirtualMachine through a shared VirtualMachineInstancetypeSpec:
CPU : Required number of vCPUs presented to the guest
Memory : Required amount of memory presented to the guest
GPUs : Optional list of vGPUs to passthrough
HostDevices : Optional list of HostDevices to passthrough
IOThreadsPolicy : Optional IOThreadsPolicy to be used
LaunchSecurity: Optional LaunchSecurity to be used
Anything provided within an instance type cannot be overridden within the VirtualMachine. For example, as CPU and Memory are both required attributes of an instance type, if a user makes any requests for CPU or Memory resources within the underlying VirtualMachine, the instance type will conflict and the request will be rejected during creation.
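As a minimal sketch, a namespaced instance type that only sets the two required attributes could look like this. The name example-instancetype and the chosen values are purely illustrative:

```yaml
apiVersion: instancetype.kubevirt.io/v1beta1
kind: VirtualMachineInstancetype
metadata:
  name: example-instancetype   # illustrative name
spec:
  cpu:
    guest: 1        # number of vCPUs presented to the guest
  memory:
    guest: 128Mi    # amount of memory presented to the guest
```

A VirtualMachine referencing this instance type should then leave its own CPU and memory settings unset, since conflicting values cause the request to be rejected.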
KubeVirt also provides two further preference-based CRDs, again a cluster-wide VirtualMachineClusterPreference and a namespaced VirtualMachinePreference. These CRDs encapsulate the preferred value of any remaining attributes of a VirtualMachine required to run a given workload, again through a shared VirtualMachinePreferenceSpec.
Unlike instance types, preferences only represent the preferred values and as such, they can be overridden by values in the VirtualMachine provided by the user.
In the example shown below, a user has provided a VirtualMachine with a disk bus already defined within a DiskTarget and has also selected a set of preferences with DevicePreference and preferredDiskBus, so the user's original choice within the VirtualMachine and DiskTarget is used:
$ kubectl apply -f - << EOF\n---\napiVersion: instancetype.kubevirt.io/v1beta1\nkind: VirtualMachinePreference\nmetadata:\n name: example-preference-disk-virtio\nspec:\n devices:\n preferredDiskBus: virtio\n---\napiVersion: kubevirt.io/v1\nkind: VirtualMachine\nmetadata:\n name: example-preference-user-override\nspec:\n preference:\n kind: VirtualMachinePreference\n name: example-preference-disk-virtio\n runStrategy: Halted\n template:\n spec:\n domain:\n memory:\n guest: 128Mi\n devices:\n disks:\n - disk:\n bus: sata\n name: containerdisk\n - disk: {}\n name: cloudinitdisk\n resources: {}\n terminationGracePeriodSeconds: 0\n volumes:\n - containerDisk:\n image: registry:5000/kubevirt/cirros-container-disk-demo:devel\n name: containerdisk\n - cloudInitNoCloud:\n userData: |\n #!/bin/sh\n\n echo 'printed from cloud-init userdata'\n name: cloudinitdisk\nEOF\nvirtualmachinepreference.instancetype.kubevirt.io/example-preference-disk-virtio created\nvirtualmachine.kubevirt.io/example-preference-user-override configured\n\n\n$ virtctl start example-preference-user-override\nVM example-preference-user-override was scheduled to start\n\n# We can see the original request from the user within the VirtualMachine lists `containerdisk` with a `SATA` bus\n$ kubectl get vms/example-preference-user-override -o json | jq .spec.template.spec.domain.devices.disks\n[\n {\n \"disk\": {\n \"bus\": \"sata\"\n },\n \"name\": \"containerdisk\"\n },\n {\n \"disk\": {},\n \"name\": \"cloudinitdisk\"\n }\n]\n\n# This is still the case in the VirtualMachineInstance with the remaining disk using the `preferredDiskBus` from the preference of `virtio`\n$ kubectl get vmis/example-preference-user-override -o json | jq .spec.domain.devices.disks\n[\n {\n \"disk\": {\n \"bus\": \"sata\"\n },\n \"name\": \"containerdisk\"\n },\n {\n \"disk\": {\n \"bus\": \"virtio\"\n },\n \"name\": \"cloudinitdisk\"\n }\n]\n
A preference can optionally include a PreferredCPUTopology that defines how the guest visible CPU topology of the VirtualMachineInstance is constructed from vCPUs supplied by an instance type.
The allowed values for PreferredCPUTopology include:
sockets (default) - Provides vCPUs as sockets to the guest
cores - Provides vCPUs as cores to the guest
threads - Provides vCPUs as threads to the guest
spread - Spreads vCPUs across sockets and cores by default. See the following SpreadOptions section for more details.
any - Provides vCPUs as sockets to the guest, this is also used to express that any allocation of vCPUs is required by the preference. Useful when defining a preference that isn't used alongside an instance type.
Note that support for the original preferSockets, preferCores, preferThreads and preferSpread values for PreferredCPUTopology is deprecated as of v1.4.0 ahead of removal in a future release.
When spread is provided as the value of PreferredCPUTopology we can further customize how vCPUs are spread across the guest visible CPU topology using SpreadOptions:
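A minimal sketch of such a preference follows. The name example-preference-spread and the chosen values are illustrative. The across field is assumed to select which topology levels vCPUs are spread over, and ratio the ratio used between them. Check the API reference of your KubeVirt version for the accepted values and defaults:

```yaml
apiVersion: instancetype.kubevirt.io/v1beta1
kind: VirtualMachinePreference
metadata:
  name: example-preference-spread   # illustrative name
spec:
  cpu:
    preferredCPUTopology: spread
    spreadOptions:
      across: SocketsCoresThreads   # spread vCPUs over sockets, cores and threads
      ratio: 4                      # illustrative ratio
```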
The previous instance type and preference CRDs are matched to a given VirtualMachine through the use of a matcher. Each matcher consists of the following:
Name (string): Name of the resource being referenced
Kind (string): Optional, defaults to the cluster wide CRD kinds of VirtualMachineClusterInstancetype or VirtualMachineClusterPreference if not provided
RevisionName (string) : Optional, name of a ControllerRevision containing a copy of the VirtualMachineInstancetypeSpec or VirtualMachinePreferenceSpec taken when the VirtualMachine is first created. See the Versioning section below for more details on how and why this is captured.
InferFromVolume (string): Optional, see the Inferring defaults from a Volume section below for more details.
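As a sketch, the matchers sit directly in the spec of a VirtualMachine. The fragment reuses the u1.small instance type and fedora preference names mentioned elsewhere in this guide, and assumes they exist on the cluster as cluster-wide resources:

```yaml
# Fragment of a VirtualMachine spec; the remaining fields are omitted
spec:
  instancetype:
    name: u1.small
    kind: VirtualMachineClusterInstancetype   # optional, this is the default kind
  preference:
    name: fedora                              # kind defaults to VirtualMachineClusterPreference
```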
"},{"location":"user_workloads/instancetypes/#creating-instancetypes-preferences-and-virtualmachines","title":"Creating InstanceTypes, Preferences and VirtualMachines","text":"
It is possible to streamline the creation of instance types, preferences, and virtual machines using the virtctl command-line tool. To read more about it, see Creating VirtualMachines.
Versioning of these resources is required to ensure the eventual VirtualMachineInstance created when starting a VirtualMachine does not change between restarts if any referenced instance type or set of preferences are updated during the lifetime of the VirtualMachine.
This is currently achieved by using ControllerRevision to retain a copy of the VirtualMachineInstancetype or VirtualMachinePreference at the time the VirtualMachine is created. References to these ControllerRevisions are then retained in the InstancetypeMatcher and PreferenceMatcher within the VirtualMachine for future use.
Users can opt in to moving to a newer generation of an instance type or preference by removing the referenced revisionName from the appropriate matcher within the VirtualMachine object. This will result in fresh ControllerRevisions being captured and used.
The following example creates a VirtualMachine using an initial version of the csmall instance type before increasing the number of vCPUs provided by the instance type:
In order for this change to be picked up within the VirtualMachine, we need to stop the running VirtualMachine and clear the revisionName referenced by the InstancetypeMatcher:
As you can see above, the InstancetypeMatcher now references a new ControllerRevision containing generation 2 of the instance type. We can now start the VirtualMachine again and see the new number of vCPUs being used by the VirtualMachineInstance:
$ virtctl start vm-cirros-csmall\nVM vm-cirros-csmall was scheduled to start\n\n$ kubectl get vmi/vm-cirros-csmall -o json | jq .spec.domain.cpu\n{\n \"cores\": 1,\n \"model\": \"host-model\",\n \"sockets\": 2,\n \"threads\": 1\n}\n
The inferFromVolume attribute of both the InstancetypeMatcher and PreferenceMatcher allows a user to request that defaults are inferred from a volume. When requested, KubeVirt will look for the following labels on the underlying PVC, DataSource or DataVolume to determine the default name and kind:
instancetype.kubevirt.io/default-instancetype
instancetype.kubevirt.io/default-instancetype-kind (optional, defaults to VirtualMachineClusterInstancetype)
instancetype.kubevirt.io/default-preference
instancetype.kubevirt.io/default-preference-kind (optional, defaults to VirtualMachineClusterPreference)
These values are then written into the appropriate matcher by the mutation webhook and used during validation before the VirtualMachine is formally accepted.
The validation can be controlled by the value provided to inferFromVolumeFailurePolicy in either the InstancetypeMatcher or PreferenceMatcher of a VirtualMachine.
The default value of Reject will cause the request to be rejected on failure to find the referenced Volume or labels on an underlying resource.
If Ignore was provided, the respective InstancetypeMatcher or PreferenceMatcher will be cleared on a failure instead.
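The following fragment sketches how this could look in a VirtualMachine. The volume name rootdisk is hypothetical and has to match the name of a volume defined in the VirtualMachine whose underlying PVC, DataSource or DataVolume carries the labels listed above:

```yaml
# Fragment of a VirtualMachine spec; rootdisk is a hypothetical volume name
spec:
  instancetype:
    inferFromVolume: rootdisk
    # clear the matcher instead of rejecting the request on failure
    inferFromVolumeFailurePolicy: Ignore
  preference:
    inferFromVolume: rootdisk
    # no policy given, so the default of Reject applies
```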
Various examples are available within the kubevirt repo under /examples. The following uses an example VirtualMachine provided by the containerdisk/fedora repo and replaces much of the DomainSpec with the equivalent instance type and preferences:
This version captured complete VirtualMachine{Instancetype,ClusterInstancetype,Preference,ClusterPreference} objects within the created ControllerRevisions
This version is backwardly compatible with instancetype.kubevirt.io/v1alpha1.
The following instance type attribute has been added:
Spec.Memory.OvercommitPercent
The following preference attributes have been added:
Spec.CPU.PreferredCPUFeatures
Spec.Devices.PreferredInterfaceMasquerade
Spec.PreferredSubdomain
Spec.PreferredTerminationGracePeriodSeconds
Spec.Requirements
This version is backward compatible with instancetype.kubevirt.io/v1alpha1 and instancetype.kubevirt.io/v1alpha2 objects. No modifications are required to existing VirtualMachine{Instancetype,ClusterInstancetype,Preference,ClusterPreference} or ControllerRevisions.
As with the migration to kubevirt.io/v1, it is recommended that previous users of instancetype.kubevirt.io/v1alpha1 or instancetype.kubevirt.io/v1alpha2 use kube-storage-version-migrator to upgrade any stored objects to instancetype.kubevirt.io/v1beta1.
Every VirtualMachineInstance represents a single virtual machine instance. In general, the management of VirtualMachineInstances is kept similar to how Pods are managed: Every VM that is defined in the cluster is expected to be running, just like Pods. Deleting a VirtualMachineInstance is equivalent to shutting it down, which also mirrors how Pods behave.
"},{"location":"user_workloads/lifecycle/#launching-a-virtual-machine","title":"Launching a virtual machine","text":"
In order to start a VirtualMachineInstance, you just need to create a VirtualMachineInstance object using kubectl:
Note: Stopping a VirtualMachineInstance implies that it will be deleted from the cluster. You will not be able to start this VirtualMachineInstance object again.
"},{"location":"user_workloads/lifecycle/#starting-and-stopping-a-virtual-machine","title":"Starting and stopping a virtual machine","text":"
Virtual machines, in contrast to VirtualMachineInstances, have a running state. Thus, on a VirtualMachine you can define whether it should be running or not. VirtualMachineInstances, if they are defined in the cluster, are always running and consuming resources.
virtctl is used in order to start and stop a VirtualMachine:
$ virtctl start my-vm\n$ virtctl stop my-vm\n
Note: You can force stop a VM (which is like pulling the power cord, with all its implications like data inconsistencies or [in the worst case] data loss) by running:
$ virtctl stop my-vm --grace-period 0 --force\n
"},{"location":"user_workloads/lifecycle/#pausing-and-unpausing-a-virtual-machine","title":"Pausing and unpausing a virtual machine","text":"
Note: Pausing in this context refers to libvirt's virDomainSuspend command: \"The process is frozen without further access to CPU resources and I/O but the memory used by the domain at the hypervisor level will stay allocated\"
To pause a virtual machine, you need the virtctl command line tool. Its pause command works on either VirtualMachines or VirtualMachineInstances:
$ virtctl pause vm testvm\n# OR\n$ virtctl pause vmi testvm\n
Paused VMIs have a Paused condition in their status:
$ kubectl get vmi testvm -o=jsonpath='{.status.conditions[?(@.type==\"Paused\")].message}'\nVMI was paused by user\n
Unpausing works similarly to pausing:
$ virtctl unpause vm testvm\n# OR\n$ virtctl unpause vmi testvm\n
"},{"location":"user_workloads/liveness_and_readiness_probes/","title":"Liveness and Readiness Probes","text":"
Liveness and Readiness Probes can be configured on VirtualMachineInstances in a similar fashion to how they are configured on containers.
Liveness Probes will effectively stop the VirtualMachineInstance if they fail, which will allow higher level controllers, like VirtualMachine or VirtualMachineInstanceReplicaSet to spawn new instances, which will hopefully be responsive again.
Readiness Probes are an indicator for Services and Endpoints if the VirtualMachineInstance is ready to receive traffic from Services. If Readiness Probes fail, the VirtualMachineInstance will be removed from the Endpoints which back services until the probe recovers.
Watchdogs focus on ensuring that an Operating System is still responsive. They complement the probes which are more workload centric. Watchdogs require kernel support from the guest and additional tooling like the commonly used watchdog binary.
Exec probes are Liveness or Readiness probes specifically intended for VMs. These probes run a command inside the VM and determine the VM ready/live state based on its success. For running commands inside the VMs, the qemu-guest-agent package is used. A command supplied to an exec probe will be wrapped by virt-probe in the operator and forwarded to the guest.
"},{"location":"user_workloads/liveness_and_readiness_probes/#define-a-http-liveness-probe","title":"Define a HTTP Liveness Probe","text":"
The following VirtualMachineInstance configures a HTTP Liveness Probe via spec.livenessProbe.httpGet, which will query port 1500 of the VirtualMachineInstance, after an initial delay of 120 seconds. The VirtualMachineInstance itself installs and runs a minimal HTTP server on port 1500 via cloud-init.
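The probe-related part of such a VirtualMachineInstance can be sketched as follows. The port and initial delay come from the description above. The periodSeconds and timeoutSeconds values are assumptions chosen for illustration, and the domain, volumes and cloud-init configuration are omitted:

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  livenessProbe:
    initialDelaySeconds: 120   # give the guest time to boot and start the server
    periodSeconds: 20          # assumed probing interval
    httpGet:
      port: 1500
    timeoutSeconds: 10         # assumed timeout per probe
```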
"},{"location":"user_workloads/liveness_and_readiness_probes/#define-a-tcp-liveness-probe","title":"Define a TCP Liveness Probe","text":"
The following VirtualMachineInstance configures a TCP Liveness Probe via spec.livenessProbe.tcpSocket, which will query port 1500 of the VirtualMachineInstance, after an initial delay of 120 seconds. The VirtualMachineInstance itself installs and runs a minimal HTTP server on port 1500 via cloud-init.
Note that in the case of Readiness Probes, it is also possible to set a failureThreshold and a successThreshold to only flip between ready and non-ready state if the probe succeeded or failed multiple times.
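A minimal sketch of a TCP Readiness Probe using both thresholds is shown here. The threshold and period values are illustrative assumptions:

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  readinessProbe:
    tcpSocket:
      port: 1500
    initialDelaySeconds: 120
    periodSeconds: 20      # assumed probing interval
    failureThreshold: 3    # consecutive failures before flipping to not ready
    successThreshold: 3    # consecutive successes before flipping back to ready
```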
Some context is needed to understand the limitations imposed by a dual-stack network configuration on readiness - or liveness - probes. Users must be fully aware that a dual-stack configuration is currently only available when using a masquerade binding type. Furthermore, it must be recalled that accessing a VM using masquerade binding type is performed via the pod IP address; in dual-stack mode, both IPv4 and IPv6 addresses can be used to reach the VM.
Dual-stack networking configurations have a limitation when using HTTP / TCP probes: you cannot probe the VMI by its IPv6 address. The reason for this is that the host field for both the HTTP and TCP probe actions defaults to the pod's IP address, which is currently always the IPv4 address.
Since the pod's IP address is not known before creating the VMI, it is not possible to pre-provision the probe's host field.
"},{"location":"user_workloads/liveness_and_readiness_probes/#defining-a-watchdog","title":"Defining a Watchdog","text":"
A watchdog is a more VM-centric approach that focuses on the responsiveness of the Operating System. One can configure the i6300esb watchdog device:
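A sketch of the relevant part of the VirtualMachineInstance spec is given here. The device name mywatchdog is an arbitrary, illustrative choice:

```yaml
# Fragment of a VirtualMachineInstance spec
spec:
  domain:
    devices:
      watchdog:
        name: mywatchdog     # illustrative device name
        i6300esb:
          action: poweroff   # executed when the guest stops sending heartbeats
```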
The example above configures it with the poweroff action. It defines what will happen if the OS can't respond anymore. Other possible actions are reset and shutdown. The VM in this example will have the device exposed as /dev/watchdog. This device can then be used by the watchdog binary. For example, if root executes this command inside the VM:
the watchdog will send a heartbeat every two seconds to /dev/watchdog and after four seconds without a heartbeat the defined action will be executed. In this case a hard poweroff.
Guest-Agent probes are based on qemu-guest-agent guest-ping. This will ping the guest and return an error if the guest is not up and running. To easily define this on VM spec, specify guestAgentPing: {} in VM's spec.template.spec.readinessProbe. virt-controller will translate this into a corresponding command wrapped by virt-probe.
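A minimal sketch of such a readiness probe on a VirtualMachine follows. The delay, period and threshold values are assumptions and should be tuned to how long the guest and its agent need to come online:

```yaml
# Fragment of a VirtualMachine spec
spec:
  template:
    spec:
      readinessProbe:
        guestAgentPing: {}
        initialDelaySeconds: 120   # assumed boot time budget
        periodSeconds: 10          # assumed probing interval
        failureThreshold: 10       # assumed tolerance while the agent starts
```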
Note: You can only define one type of probe, i.e. either a guest-agent exec probe or a guest-agent ping probe.
Important: If the qemu-guest-agent is not installed and enabled inside the VM, the probe will fail. Many images don't enable the agent by default so make sure you either run one that does or enable it.
Make sure to provide enough delay and failureThreshold for the VM and the agent to be online.
In the following example the Fedora image does have qemu-guest-agent available by default. Nevertheless, in case qemu-guest-agent is not installed, it will be installed and enabled via cloud-init as shown in the example below. Also, cloud-init assigns the proper SELinux context, i.e. virt_qemu_ga_exec_t, to the /tmp/healthy.txt file. Otherwise, SELinux will deny the attempts to open the /tmp/healthy.txt file causing the probe to fail.
Note that, in the above example if SELinux is not installed in your container disk image, the command chcon should be removed from the VM manifest shown below. Otherwise, the chcon command will fail.
The .status.ready field will switch to true indicating that probes are returning successfully:
A VirtualMachinePool tries to ensure that a specified number of VirtualMachine replicas and their respective VirtualMachineInstances are in the ready state at any time. In other words, a VirtualMachinePool makes sure that a VirtualMachine or a set of VirtualMachines is always up and ready.
No state is kept and no guarantees are made about the maximum number of VirtualMachineInstance replicas running at any time. For example, the VirtualMachinePool may decide to create new replicas if possibly still running VMs are entering an unknown state.
The VirtualMachinePool allows us to specify a VirtualMachineTemplate in spec.virtualMachineTemplate. It consists of ObjectMetadata in spec.virtualMachineTemplate.metadata, and a VirtualMachineSpec in spec.virtualMachineTemplate.spec. The specification of the virtual machine is equal to the specification of the virtual machine in the VirtualMachine workload.
spec.replicas can be used to specify how many replicas are wanted. If unspecified, the default value is 1. This value can be updated anytime. The controller will react to the changes.
spec.selector is used by the controller to keep track of managed virtual machines. The selector specified there must be able to match the virtual machine labels as specified in spec.virtualMachineTemplate.metadata.labels. If the selector does not match these labels, or they are empty, the controller will simply do nothing except log an error. The user is responsible for avoiding the creation of other virtual machines or VirtualMachinePools which may conflict with the selector and the template labels.
"},{"location":"user_workloads/pool/#creating-a-virtualmachinepool","title":"Creating a VirtualMachinePool","text":"
VirtualMachinePool is part of the KubeVirt API pool.kubevirt.io/v1alpha1.
The example below shows how to create a simple VirtualMachinePool:
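A manifest along the following lines would fit. It is a sketch reconstructed from the kubectl describe output shown further down in this section, so every field mirrors a value that appears there:

```yaml
apiVersion: pool.kubevirt.io/v1alpha1
kind: VirtualMachinePool
metadata:
  name: vm-pool-cirros
spec:
  replicas: 3
  selector:
    matchLabels:
      kubevirt.io/vmpool: vm-pool-cirros
  virtualMachineTemplate:
    metadata:
      labels:
        # must be matched by spec.selector
        kubevirt.io/vmpool: vm-pool-cirros
    spec:
      running: true
      template:
        metadata:
          labels:
            kubevirt.io/vmpool: vm-pool-cirros
        spec:
          domain:
            devices:
              disks:
              - disk:
                  bus: virtio
                name: containerdisk
            resources:
              requests:
                memory: 128Mi
          terminationGracePeriodSeconds: 0
          volumes:
          - containerDisk:
              image: kubevirt/cirros-container-disk-demo:latest
            name: containerdisk
```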
Saving this manifest into vm-pool-cirros.yaml and submitting it to Kubernetes will create three virtual machines based on the template.
$ kubectl create -f vm-pool-cirros.yaml\nvirtualmachinepool.pool.kubevirt.io/vm-pool-cirros created\n$ kubectl describe vmpool vm-pool-cirros\nName: vm-pool-cirros\nNamespace: default\nLabels: <none>\nAnnotations: <none>\nAPI Version: pool.kubevirt.io/v1alpha1\nKind: VirtualMachinePool\nMetadata:\n Creation Timestamp: 2023-02-09T18:30:08Z\n Generation: 1\n Manager: kubectl-create\n Operation: Update\n Time: 2023-02-09T18:30:08Z\n API Version: pool.kubevirt.io/v1alpha1\n Fields Type: FieldsV1\n fieldsV1:\n f:status:\n .:\n f:labelSelector:\n f:readyReplicas:\n f:replicas:\n Manager: virt-controller\n Operation: Update\n Subresource: status\n Time: 2023-02-09T18:30:44Z\n Resource Version: 6606\n UID: ba51daf4-f99f-433c-89e5-93f39bc9989d\nSpec:\n Replicas: 3\n Selector:\n Match Labels:\n kubevirt.io/vmpool: vm-pool-cirros\n Virtual Machine Template:\n Metadata:\n Creation Timestamp: <nil>\n Labels:\n kubevirt.io/vmpool: vm-pool-cirros\n Spec:\n Running: true\n Template:\n Metadata:\n Creation Timestamp: <nil>\n Labels:\n kubevirt.io/vmpool: vm-pool-cirros\n Spec:\n Domain:\n Devices:\n Disks:\n Disk:\n Bus: virtio\n Name: containerdisk\n Resources:\n Requests:\n Memory: 128Mi\n Termination Grace Period Seconds: 0\n Volumes:\n Container Disk:\n Image: kubevirt/cirros-container-disk-demo:latest\n Name: containerdisk\nStatus:\n Label Selector: kubevirt.io/vmpool=vm-pool-cirros\n Ready Replicas: 2\n Replicas: 3\nEvents:\n Type Reason Age From Message\n ---- ------ ---- ---- -------\n Normal SuccessfulCreate 17s virtualmachinepool-controller Created VM default/vm-pool-cirros-0\n Normal SuccessfulCreate 17s virtualmachinepool-controller Created VM default/vm-pool-cirros-2\n Normal SuccessfulCreate 17s virtualmachinepool-controller Created VM default/vm-pool-cirros-1\n
Replicas is 3 and Ready Replicas is 2. This means that when the status was retrieved, three virtual machines had already been created, but only two were running and ready.
"},{"location":"user_workloads/pool/#scaling-via-the-scale-subresource","title":"Scaling via the Scale Subresource","text":"
Note: This requires KubeVirt 0.59 or newer.
The VirtualMachinePool supports the scale subresource. As a consequence, it can be scaled via kubectl, for example with kubectl scale vmpool vm-pool-cirros --replicas 5.
"},{"location":"user_workloads/pool/#removing-a-virtualmachine-from-virtualmachinepool","title":"Removing a VirtualMachine from VirtualMachinePool","text":"
It is also possible to remove a VirtualMachine from its VirtualMachinePool.
In this scenario, the ownerReferences field needs to be removed from the VirtualMachine. This can be done with either kubectl edit or kubectl patch. With kubectl patch, it looks like this:
kubectl patch vm vm-pool-cirros-0 --type merge --patch '{\"metadata\":{\"ownerReferences\":null}}'\n
Note: You may want to update your VirtualMachine labels as well to avoid impact on selectors.
"},{"location":"user_workloads/pool/#using-the-horizontal-pod-autoscaler","title":"Using the Horizontal Pod Autoscaler","text":"
Note: This requires KubeVirt 0.59 or newer.
The HorizontalPodAutoscaler (HPA) can be used with a VirtualMachinePool. Simply reference it in the spec of the autoscaler:
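As a sketch, an autoscaler targeting the vm-pool-cirros pool could look like the following. The replica bounds and the CPU target are illustrative values chosen for this example, not recommendations.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: vm-pool-cirros
spec:
  minReplicas: 3                        # illustrative lower bound
  maxReplicas: 10                       # illustrative upper bound
  targetCPUUtilizationPercentage: 50    # illustrative target
  scaleTargetRef:
    apiVersion: pool.kubevirt.io/v1alpha1
    kind: VirtualMachinePool
    name: vm-pool-cirros
```

The scaleTargetRef is the only pool-specific part: it points the autoscaler at the VirtualMachinePool instead of a Deployment.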
"},{"location":"user_workloads/pool/#exposing-a-virtualmachinepool-as-a-service","title":"Exposing a VirtualMachinePool as a Service","text":"
A VirtualMachinePool may be exposed as a service. When this is done, one of the VirtualMachine replicas will be picked for the actual delivery of the service.
For example, exposing SSH port (22) as a ClusterIP service:
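A minimal sketch of such a Service, assuming the kubevirt.io/vmpool label from the pool template is used as the selector; the Service name is taken from the file name mentioned below and is otherwise arbitrary.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vm-pool-cirros-ssh
spec:
  type: ClusterIP
  selector:
    kubevirt.io/vmpool: vm-pool-cirros
  ports:
  - protocol: TCP
    port: 2222          # port the Service listens on
    targetPort: 22      # SSH port inside the VM
```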
Saving this manifest into vm-pool-cirros-ssh.yaml and submitting it to Kubernetes will create the ClusterIP service listening on port 2222 and forwarding to port 22.
Using dataVolumeTemplates within spec.virtualMachineTemplate.spec results in the creation of unique persistent storage for each VM within a VirtualMachinePool. The DataVolumeTemplate name has the VM's sequential postfix appended to it when the VM is created from spec.virtualMachineTemplate.spec.dataVolumeTemplates. This makes each VM a completely unique stateful workload.
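The following sketch shows where dataVolumeTemplates sits in a pool manifest. All names (vm-pool-stateful, pool-data, datadisk), the blank source and the storage size are hypothetical values chosen for illustration; a real pool would normally also include a bootable disk.

```yaml
apiVersion: pool.kubevirt.io/v1alpha1
kind: VirtualMachinePool
metadata:
  name: vm-pool-stateful            # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      kubevirt.io/vmpool: vm-pool-stateful
  virtualMachineTemplate:
    metadata:
      labels:
        kubevirt.io/vmpool: vm-pool-stateful
    spec:
      running: true
      dataVolumeTemplates:
      - metadata:
          name: pool-data           # hypothetical; a per-VM postfix is appended
        spec:
          source:
            blank: {}               # empty disk, used here only to keep the sketch short
          pvc:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi
      template:
        metadata:
          labels:
            kubevirt.io/vmpool: vm-pool-stateful
        spec:
          domain:
            devices:
              disks:
              - name: datadisk
                disk:
                  bus: virtio
            resources:
              requests:
                memory: 128Mi
          volumes:
          - name: datadisk
            dataVolume:
              name: pool-data
```

Under these assumptions, each VM in the pool would be expected to get its own DataVolume whose name is pool-data followed by that VM's sequential postfix.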
"},{"location":"user_workloads/pool/#using-unique-cloudinit-and-configmap-volumes-with-virtualmachinepools","title":"Using Unique CloudInit and ConfigMap Volumes with VirtualMachinePools","text":"
By default, any secret or configMap references in the volumes section of spec.virtualMachineTemplate.spec.template are used as is, without any modification to the naming. This means that if you specify a secret in a CloudInitNoCloud volume, every VM instance spawned from the VirtualMachinePool with this volume gets the exact same secret for its cloud-init user data.
This default behavior can be modified by setting the AppendPostfixToSecretReferences and AppendPostfixToConfigMapReferences booleans to true on the VMPool spec. When these booleans are enabled, the VM's sequential postfix is appended to the referenced secret and configMap names. This makes it possible to pre-generate unique per-VM secret and configMap data for a VirtualMachinePool ahead of time, in a way that is predictably assigned to the VMs within the VirtualMachinePool.
VirtualMachineInstancePresets are deprecated as of the v0.57.0 release and will be removed in a future release.
Users should instead look to use Instancetypes and preferences as a replacement.
VirtualMachineInstancePresets are an extension to general VirtualMachineInstance configuration behaving much like PodPresets from Kubernetes. When a VirtualMachineInstance is created, any applicable VirtualMachineInstancePresets will be applied to the existing spec for the VirtualMachineInstance. This allows for re-use of common settings that should apply to multiple VirtualMachineInstances.
"},{"location":"user_workloads/presets/#create-a-virtualmachineinstancepreset","title":"Create a VirtualMachineInstancePreset","text":"
You can describe a VirtualMachineInstancePreset in a YAML file. For example, the vmi-preset.yaml file below describes a VirtualMachineInstancePreset that requests a VirtualMachineInstance be created with a resource request for 64M of RAM.
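A minimal sketch of what vmi-preset.yaml could contain; the preset name and the selector label are hypothetical and only serve the example.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstancePreset
metadata:
  name: small-qemu                 # hypothetical name
spec:
  selector:
    matchLabels:
      kubevirt.io/size: small      # hypothetical label
  domain:
    resources:
      requests:
        memory: 64M
```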
As with most Kubernetes resources, a VirtualMachineInstancePreset requires apiVersion, kind and metadata fields.
Additionally, VirtualMachineInstancePresets need a spec section. While not technically required to satisfy syntax, it is strongly recommended to include a selector in the spec section; otherwise, a VirtualMachineInstancePreset will match all VirtualMachineInstances in a namespace.
KubeVirt uses Kubernetes Labels and Selectors to determine which VirtualMachineInstancePresets apply to a given VirtualMachineInstance, similarly to how PodPresets work in Kubernetes. If a setting from a VirtualMachineInstancePreset is applied to a VirtualMachineInstance, the VirtualMachineInstance will be marked with an Annotation upon completion.
Any domain structure can be listed in the spec of a VirtualMachineInstancePreset, e.g. Clock, Features, Memory, CPU, or Devices such as network interfaces. All elements of the spec section of a VirtualMachineInstancePreset will be applied to the VirtualMachineInstance.
Once a VirtualMachineInstancePreset is successfully applied to a VirtualMachineInstance, the VirtualMachineInstance will be marked with an annotation to indicate that it was applied. If a conflict occurs while a VirtualMachineInstancePreset is being applied, that portion of the VirtualMachineInstancePreset will be skipped.
Any valid label can be matched against, but as a rule of thumb it is suggested to use os/shortname, e.g. kubevirt.io/os: rhel7.
"},{"location":"user_workloads/presets/#updating-a-virtualmachineinstancepreset","title":"Updating a VirtualMachineInstancePreset","text":"
If a VirtualMachineInstancePreset is modified, changes will not be applied to existing VirtualMachineInstances. This applies both to the Selector indicating which VirtualMachineInstances should be matched and to the Domain section which lists the settings that should be applied to a VirtualMachineInstance.
VirtualMachineInstancePresets use a similar conflict resolution strategy to Kubernetes PodPresets. If a portion of the domain spec is present in both a VirtualMachineInstance and a VirtualMachineInstancePreset and both resources have identical information, then creation of the VirtualMachineInstance will continue normally. If, however, there is a difference between the resources, an Event will be created indicating which DomainSpec element of which VirtualMachineInstancePreset was overridden. For example, if both the VirtualMachineInstance and the VirtualMachineInstancePreset define a CPU but use a different number of cores, KubeVirt will note the difference.
If any settings from the VirtualMachineInstancePreset were successfully applied, the VirtualMachineInstance will be annotated.
In the event that there is a difference between the Domains of a VirtualMachineInstance and VirtualMachineInstancePreset, KubeVirt will create an Event. kubectl get events can be used to show all Events. For example:
$ kubectl get events\n ....\n Events:\n FirstSeen LastSeen Count From SubobjectPath Reason Message\n 2m 2m 1 myvmi.1515bbb8d397f258 VirtualMachineInstance Warning Conflict virtualmachineinstance-preset-controller Unable to apply VirtualMachineInstancePreset 'example-preset': spec.cpu: &{6} != &{4}\n
When multiple VirtualMachineInstancePresets match a particular VirtualMachineInstance and specify the same settings within a Domain, those settings must match. If two VirtualMachineInstancePresets have conflicting settings (e.g. for the number of CPU cores requested), an error will occur and the VirtualMachineInstance will enter the Failed state. A Warning event will be emitted explaining which settings of which VirtualMachineInstancePresets were problematic.
The main use case for VirtualMachineInstancePresets is to create re-usable settings that can be applied across various machines. Multiple methods are available to match the labels of a VirtualMachineInstance using selectors.
matchLabels: Each VirtualMachineInstance can use a specific label shared by all instances.
matchExpressions: Logical operators for sets can be used to match multiple labels.
Using matchLabels, the label used in the VirtualMachineInstancePreset must match one of the labels of the VirtualMachineInstance:
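As an illustration, the preset below would match the VirtualMachineInstance that follows it, because one of the instance's labels equals the label in the preset's selector. Resource names and the kubevirt.io/flavor label are hypothetical.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstancePreset
metadata:
  name: rhel7-preset               # hypothetical name
spec:
  selector:
    matchLabels:
      kubevirt.io/os: rhel7
  domain:
    resources:
      requests:
        memory: 512M
---
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: rhel7-vmi                  # hypothetical name
  labels:
    kubevirt.io/os: rhel7          # matched by the preset selector
    kubevirt.io/flavor: default    # hypothetical extra label, ignored by the selector
spec:
  domain:
    devices: {}
    resources:
      requests:
        memory: 512M
```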
Since VirtualMachineInstancePresets use Selectors that indicate which VirtualMachineInstances their settings should apply to, there needs to exist a mechanism by which VirtualMachineInstances can opt out of VirtualMachineInstancePresets altogether. This is done using an annotation:
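A sketch of a VirtualMachineInstance that opts out, assuming the exclusion annotation key is virtualmachineinstancepresets.admission.kubevirt.io/exclude; the instance name is hypothetical. Annotation values are strings, hence the quoting.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: myvmi-no-presets           # hypothetical name
  annotations:
    virtualmachineinstancepresets.admission.kubevirt.io/exclude: 'true'
spec:
  domain:
    devices: {}
    resources:
      requests:
        memory: 64M
```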
This is an example of a merge conflict. In this case, the VirtualMachineInstance and the VirtualMachineInstancePreset request a different number of CPU cores.
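A pair of manifests that would produce such a conflict might look as follows. The names example-preset and myvmi and the core counts 4 and 6 are borrowed from the event output shown earlier; which side requests which count, and the selector label, are assumptions made for this sketch.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstancePreset
metadata:
  name: example-preset
spec:
  selector:
    matchLabels:
      kubevirt.io/os: win10        # hypothetical label
  domain:
    cpu:
      cores: 4
---
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: myvmi
  labels:
    kubevirt.io/os: win10
spec:
  domain:
    cpu:
      cores: 6                     # differs from the preset, so the preset's CPU setting is skipped
    devices: {}
    resources:
      requests:
        memory: 64M
```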
"},{"location":"user_workloads/presets/#matching-multiple-virtualmachineinstances-using-matchlabels","title":"Matching Multiple VirtualMachineInstances Using MatchLabels","text":"
These VirtualMachineInstances have multiple labels, one that is unique and one that is shared.
Note: This example breaks from the convention of using os-shortname as a Label for demonstration purposes.
"},{"location":"user_workloads/presets/#matching-multiple-virtualmachineinstances-using-matchexpressions","title":"Matching Multiple VirtualMachineInstances Using MatchExpressions","text":"
This VirtualMachineInstancePreset has a matchExpression that will match two labels: kubevirt.io/os: win10 and kubevirt.io/os: win7.
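A sketch of such a preset, using the In set operator so that either label value matches; the preset name and the memory request are hypothetical.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstancePreset
metadata:
  name: windows-vmis               # hypothetical name
spec:
  selector:
    matchExpressions:
    - key: kubevirt.io/os
      operator: In
      values:
      - win10
      - win7
  domain:
    resources:
      requests:
        memory: 128M
```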
A VirtualMachineInstanceReplicaSet tries to ensure that a specified number of VirtualMachineInstance replicas are running at any time. In other words, a VirtualMachineInstanceReplicaSet makes sure that a VirtualMachineInstance or a homogeneous set of VirtualMachineInstances is always up and ready. It is very similar to a Kubernetes ReplicaSet.
No state is kept and no guarantees about the maximum number of VirtualMachineInstance replicas which are up are given. For example, the VirtualMachineInstanceReplicaSet may decide to create new replicas if possibly still running VMs are entering an unknown state.
The VirtualMachineInstanceReplicaSet allows us to specify a VirtualMachineInstanceTemplate in spec.template. It consists of ObjectMetadata in spec.template.metadata, and a VirtualMachineInstanceSpec in spec.template.spec. The specification of the virtual machine is equal to the specification of the virtual machine in the VirtualMachineInstance workload.
spec.replicas can be used to specify how many replicas are wanted. If unspecified, the default value is 1. This value can be updated anytime. The controller will react to the changes.
spec.selector is used by the controller to keep track of managed virtual machines. The selector specified there must be able to match the virtual machine labels as specified in spec.template.metadata.labels. If the selector does not match these labels, or they are empty, the controller will do nothing except log an error. The user is responsible for not creating other virtual machines or VirtualMachineInstanceReplicaSets which conflict with the selector and the template labels.
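The fields described above fit together roughly as in this sketch. The name myvmirs matches the scale example later in this section; the myvmi label and the cirros container disk are illustrative choices.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceReplicaSet
metadata:
  name: myvmirs
spec:
  replicas: 3                  # defaults to 1 if omitted
  selector:
    matchLabels:
      myvmi: myvmi             # must match spec.template.metadata.labels
  template:
    metadata:
      labels:
        myvmi: myvmi
    spec:
      domain:
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
        resources:
          requests:
            memory: 64M
      volumes:
      - name: containerdisk
        containerDisk:
          image: kubevirt/cirros-container-disk-demo:latest
```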
"},{"location":"user_workloads/replicaset/#exposing-a-virtualmachineinstancereplicaset-as-a-service","title":"Exposing a VirtualMachineInstanceReplicaSet as a Service","text":"
A VirtualMachineInstanceReplicaSet can be exposed as a service. When this is done, one of the VirtualMachineInstance replicas will be picked for the actual delivery of the service.
For example, exposing SSH port (22) as a ClusterIP service using virtctl on a VirtualMachineInstanceReplicaSet:
All service exposure options that apply to a VirtualMachineInstance apply to a VirtualMachineInstanceReplicaSet. See Exposing VirtualMachineInstance for more details.
"},{"location":"user_workloads/replicaset/#when-to-use-a-virtualmachineinstancereplicaset","title":"When to use a VirtualMachineInstanceReplicaSet","text":"
Note: The base assumption is that referenced disks are read-only or that the VMIs are writing internally to a tmpfs. The most obvious volume sources for VirtualMachineInstanceReplicaSets which KubeVirt supports are referenced below. If other types are used, data corruption is possible.
Using VirtualMachineInstanceReplicaSet is the right choice when one wants many identical VMs and does not care about maintaining any disk state after the VMs are terminated.
Volume types which work well in combination with a VirtualMachineInstanceReplicaSet are:
cloudInitNoCloud
ephemeral
containerDisk
emptyDisk
configMap
secret
any other type, if the VMI writes internally to a tmpfs
This use-case involves small and fast booting VMs with little provisioning performed during initialization.
In this scenario, migrations are not important. Redistributing VM workloads between Nodes can be achieved simply by deleting managed VirtualMachineInstances which are running on an overloaded Node. Such a VirtualMachineInstance can be evicted by directly deleting the VirtualMachineInstance (KubeVirt-aware workload redistribution) or by deleting the corresponding Pod in which the virtual machine runs (Kubernetes-only-aware workload redistribution).
In this use-case one has big and slow booting VMs, and complex or resource intensive provisioning is done during boot. More specifically, the timespan between the creation of a new VM and it entering the ready state is long.
In this scenario, one still does not care about the state, but since re-provisioning VMs is expensive, migrations are important. Workload redistribution between Nodes can be achieved by migrating VirtualMachineInstances to different Nodes. A workload redistributor needs to be aware of KubeVirt and create migrations, instead of evicting VirtualMachineInstances by deletion.
Note: The simplest way to get a migratable ephemeral VirtualMachineInstance is to use local storage based on ContainerDisks in combination with a file-based backing store. However, migratable backing store support has not officially landed in KubeVirt yet and is untested.
Replicas is 3 and Ready Replicas is 2. This means that when the status was retrieved, three virtual machines had already been created, but only two were running and ready.
"},{"location":"user_workloads/replicaset/#scaling-via-the-scale-subresource","title":"Scaling via the Scale Subresource","text":"
Note: This requires the CustomResourceSubresources feature gate to be enabled for clusters prior to 1.11.
The VirtualMachineInstanceReplicaSet supports the scale subresource. As a consequence it is possible to scale it via kubectl:
$ kubectl scale vmirs myvmirs --replicas 5\n
"},{"location":"user_workloads/replicaset/#using-the-horizontal-pod-autoscaler","title":"Using the Horizontal Pod Autoscaler","text":"
Note: This requires a cluster version of 1.11 or newer.
The HorizontalPodAutoscaler (HPA) can be used with a VirtualMachineInstanceReplicaSet. Simply reference it in the spec of the autoscaler:
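As a sketch, an autoscaler for the myvmirs replica set from the scale example could look like this. The autoscaler name, the replica bounds and the CPU target are hypothetical values.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: myhpa                           # hypothetical name
spec:
  minReplicas: 3                        # illustrative lower bound
  maxReplicas: 10                       # illustrative upper bound
  targetCPUUtilizationPercentage: 50    # illustrative target
  scaleTargetRef:
    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstanceReplicaSet
    name: myvmirs
```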
KubeVirt supports assigning a startup script to a VirtualMachineInstance, which is executed automatically when the VM initializes.
These scripts are commonly used to automate injection of users and SSH keys into VMs in order to provide remote access to the machine. For example, a startup script can be used to inject credentials into a VM that allows an Ansible job running on a remote host to access and provision the VM.
Startup scripts are not limited to any specific use case though. They can be used to run any arbitrary script in a VM on boot.
cloud-init is a widely adopted project used for early initialization of a VM. Used by cloud providers such as AWS and GCP, cloud-init has established itself as the de facto method of providing startup scripts to VMs.
Cloud-init documentation can be found here: Cloud-init Documentation.
KubeVirt supports cloud-init's NoCloud and ConfigDrive datasources which involve injecting startup scripts into a VM instance through the use of an ephemeral disk. VMs with the cloud-init package installed will detect the ephemeral disk and execute custom userdata scripts at boot.
Ignition is an alternative to cloud-init which allows for configuring the VM disk on first boot. You can find the Ignition documentation here. You can also find a comparison between cloud-init and Ignition here.
Ignition can be used with KubeVirt by using the cloudInitConfigDrive volume.
We need to make sure the base VM does not restart, which can be done by setting the VM run strategy to RerunOnFailure.
VM runStrategy:
spec:\n runStrategy: RerunOnFailure\n
More information can be found here:
Sysprep Process Overview
Sysprep (Generalize) a Windows installation
Note
It is important that no answer file is detected when the Sysprep Tool is triggered, because Windows Setup searches for answer files at the beginning of each configuration pass and caches them. If that happens, the OS will use the cached answer file when it starts, ignoring the one provided through the Sysprep API. More information can be found here.
Sysprep is supported by providing an answer file named autounattend.xml on attached media. The answer file can be stored in a ConfigMap or a Secret under the key autounattend.xml.
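A rough sketch of how the pieces could be wired together, assuming a sysprep volume source that references the ConfigMap and is attached as a CD-ROM. The ConfigMap name sysprep-config, the disk name and the SATA bus are assumptions for this example, and the answer file body is left as a placeholder.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sysprep-config
data:
  autounattend.xml: |
    <!-- answer file contents go here -->
---
# Fragment of a VirtualMachineInstance spec that attaches the answer file.
spec:
  domain:
    devices:
      disks:
      - name: sysprep
        cdrom:
          bus: sata
  volumes:
  - name: sysprep
    sysprep:
      configMap:
        name: sysprep-config
```

The second document is only a fragment; it would be merged into a complete Windows VM definition rather than applied on its own.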
The configuration file can be generated with Windows SIM or it can be specified manually according to the information found here:
Answer files (unattend.xml)
Answer File Reference
Answer File Components Reference
Note
There are also many easy to find online tools available for creating an answer file.
KubeVirt supports the cloud-init NoCloud and ConfigDrive data sources which involve injecting startup scripts through the use of a disk attached to the VM.
In order to assign a custom userdata script to a VirtualMachineInstance using this method, users must define a disk and a volume for the NoCloud or ConfigDrive datasource in the VirtualMachineInstance's spec.
Under most circumstances, users should stick to the NoCloud data source, as it is the simplest cloud-init data source. Users should switch the data source to ConfigDrive only if NoCloud is not supported by the cloud-init implementation (e.g. coreos-cloudinit).
Switching the cloud-init data source to ConfigDrive is as easy as changing the volume type in the VirtualMachineInstance's spec from cloudInitNoCloud to cloudInitConfigDrive.
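The difference amounts to a single key in the volumes list. The fragments below sketch both variants side by side; the volume name and the user data are placeholders.

```yaml
# Variant 1: NoCloud data source
volumes:
- name: cloudinitdisk
  cloudInitNoCloud:
    userData: |
      #cloud-config
---
# Variant 2: ConfigDrive data source, identical apart from the volume type
volumes:
- name: cloudinitdisk
  cloudInitConfigDrive:
    userData: |
      #cloud-config
```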
Note: The MAC address of the secondary interface should be predefined and identical in the network interface and the cloud-init networkData.
See the examples below for more complete cloud-init examples.
"},{"location":"user_workloads/startup_scripts/#cloud-init-user-data-as-clear-text","title":"Cloud-init user-data as clear text","text":"
In the example below, an SSH key is stored in the cloudInitNoCloud Volume's userData field as clear text. There is a corresponding disks entry that references the cloud-init volume and assigns it to the VM's device.
# Create a VM manifest with the startup script in\n# a cloudInitNoCloud volume's userData field.\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userData: |\n          #cloud-config\n          ssh_authorized_keys:\n            - ssh-rsa AAAAB3NzaK8L93bWxnyp test@test.com\n\nEND\n\n# Post the Virtual Machine spec to KubeVirt.\n\nkubectl create -f my-vmi.yaml\n
"},{"location":"user_workloads/startup_scripts/#cloud-init-user-data-as-base64-string","title":"Cloud-init user-data as base64 string","text":"
In the example below, a simple bash script is base64 encoded and stored in the cloudInitNoCloud Volume's userDataBase64 field. There is a corresponding disks entry that references the cloud-init volume and assigns it to the VM's device.
Users also have the option of storing the startup script in a Kubernetes Secret and referencing the Secret in the VM's spec. Examples further down in the document illustrate how that is done.
# Create a simple startup script\n\ncat << END > startup-script.sh\n#!/bin/bash\necho \"Hi from startup script!\"\nEND\n\n# Create a VM manifest with the startup script base64 encoded into\n# a cloudInitNoCloud volume's userDataBase64 field.\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userDataBase64: $(cat startup-script.sh | base64 -w0)\nEND\n\n# Post the Virtual Machine spec to KubeVirt.\n\nkubectl create -f my-vmi.yaml\n
"},{"location":"user_workloads/startup_scripts/#cloud-init-userdata-as-k8s-secret","title":"Cloud-init UserData as k8s Secret","text":"
Users who wish to not store the cloud-init userdata directly in the VirtualMachineInstance spec have the option to store the userdata into a Kubernetes Secret and reference that Secret in the spec.
Multiple VirtualMachineInstance specs can reference the same Kubernetes Secret containing cloud-init userdata.
Below is an example of how to create a Kubernetes Secret containing a startup script and reference that Secret in the VM's spec.
# Create a simple startup script\n\ncat << END > startup-script.sh\n#!/bin/bash\necho \"Hi from startup script!\"\nEND\n\n# Store the startup script in a Kubernetes Secret\nkubectl create secret generic my-vmi-secret --from-file=userdata=startup-script.sh\n\n# Create a VM manifest and reference the Secret's name in the cloudInitNoCloud\n# Volume's secretRef field\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        secretRef:\n          name: my-vmi-secret\nEND\n\n# Post the VM\nkubectl create -f my-vmi.yaml\n
"},{"location":"user_workloads/startup_scripts/#injecting-ssh-keys-with-cloud-inits-cloud-config","title":"Injecting SSH keys with Cloud-init's Cloud-config","text":"
In the examples so far, the cloud-init userdata script has been a bash script. Cloud-init has its own configuration format, cloud-config, that can handle some common tasks such as user creation and SSH key injection.
More cloud-config examples can be found here: Cloud-init Examples
Below is an example of using cloud-config to inject an SSH key for the default user (fedora in this case) of a Fedora Atomic disk image.
# Create the cloud-init cloud-config userdata.\ncat << END > startup-script\n#cloud-config\npassword: atomic\nchpasswd: { expire: False }\nssh_pwauth: False\nssh_authorized_keys:\n  - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6zdgFiLr1uAK7PdcchDd+LseA5fEOcxCCt7TLlr7Mx6h8jUg+G+8L9JBNZuDzTZSF0dR7qwzdBBQjorAnZTmY3BhsKcFr8Gt4KMGrS6r3DNmGruP8GORvegdWZuXgASKVpXeI7nCIjRJwAaK1x+eGHwAWO9Z8ohcboHbLyffOoSZDSIuk2kRIc47+ENRjg0T6x2VRsqX27g6j4DfPKQZGk0zvXkZaYtr1e2tZgqTBWqZUloMJK8miQq6MktCKAS4VtPk0k7teQX57OGwD6D7uo4b+Cl8aYAAwhn0hc0C2USfbuVHgq88ESo2/+NwV4SQcl3sxCW21yGIjAGt4Hy7J fedora@localhost.localdomain\nEND\n\n# Create the VM spec\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: sshvmi\nspec:\n  terminationGracePeriodSeconds: 0\n  domain:\n    resources:\n      requests:\n        memory: 1024M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/fedora-atomic-registry-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userDataBase64: $(cat startup-script | base64 -w0)\nEND\n\n# Post the VirtualMachineInstance spec to KubeVirt.\nkubectl create -f my-vmi.yaml\n\n# Connect to VM with passwordless SSH key\nssh -i <insert private key here> fedora@<insert ip here>\n
"},{"location":"user_workloads/startup_scripts/#inject-ssh-key-using-a-custom-shell-script","title":"Inject SSH key using a Custom Shell Script","text":"
Depending on the boot image in use, users may have a mixed experience using cloud-init's cloud-config to create users and inject SSH keys.
Below is an example of creating a user and injecting SSH keys for that user using a script instead of cloud-config.
# The heredoc delimiter is quoted so that \$NEW_USER and \$SSH_PUB_KEY are written\n# to the script literally instead of being expanded by the local shell.\ncat << 'END' > startup-script.sh\n#!/bin/bash\nexport NEW_USER=\"foo\"\nexport SSH_PUB_KEY=\"ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6zdgFiLr1uAK7PdcchDd+LseA5fEOcxCCt7TLlr7Mx6h8jUg+G+8L9JBNZuDzTZSF0dR7qwzdBBQjorAnZTmY3BhsKcFr8Gt4KMGrS6r3DNmGruP8GORvegdWZuXgASKVpXeI7nCIjRJwAaK1x+eGHwAWO9Z8ohcboHbLyffOoSZDSIuk2kRIc47+ENRjg0T6x2VRsqX27g6j4DfPKQZGk0zvXkZaYtr1e2tZgqTBWqZUloMJK8miQq6MktCKAS4VtPk0k7teQX57OGwD6D7uo4b+Cl8aYAAwhn0hc0C2USfbuVHgq88ESo2/+NwV4SQcl3sxCW21yGIjAGt4Hy7J $NEW_USER@localhost.localdomain\"\n\nsudo adduser -U -m $NEW_USER\necho \"$NEW_USER:atomic\" | chpasswd\nsudo mkdir /home/$NEW_USER/.ssh\nsudo echo \"$SSH_PUB_KEY\" > /home/$NEW_USER/.ssh/authorized_keys\nsudo chown -R ${NEW_USER}: /home/$NEW_USER/.ssh\nEND\n\n# Create the VM spec\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: sshvmi\nspec:\n  terminationGracePeriodSeconds: 0\n  domain:\n    resources:\n      requests:\n        memory: 1024M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/fedora-atomic-registry-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userDataBase64: $(cat startup-script.sh | base64 -w0)\nEND\n\n# Post the VirtualMachineInstance spec to KubeVirt.\nkubectl create -f my-vmi.yaml\n\n# Connect to VM with passwordless SSH key\nssh -i <insert private key here> foo@<insert ip here>\n
A cloud-init network version 1 configuration can be set to configure the network at boot.
Cloud-init user-data must be set for cloud-init to parse network-config even if it is just the user-data config header:
#cloud-config\n
"},{"location":"user_workloads/startup_scripts/#cloud-init-network-config-as-clear-text","title":"Cloud-init network-config as clear text","text":"
In the example below, a simple cloud-init network-config is stored in the cloudInitNoCloud Volume's networkData field as clear text. There is a corresponding disks entry that references the cloud-init volume and assigns it to the VM's device.
# Create a VM manifest with the network-config in\n# a cloudInitNoCloud volume's networkData field.\n# Each disk name must match the name of the volume it refers to.\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userData: \"#cloud-config\"\n        networkData: |\n          network:\n            version: 1\n            config:\n            - type: physical\n              name: eth0\n              subnets:\n                - type: dhcp\n\nEND\n\n# Post the Virtual Machine spec to KubeVirt.\n\nkubectl create -f my-vmi.yaml\n
"},{"location":"user_workloads/startup_scripts/#cloud-init-network-config-as-base64-string","title":"Cloud-init network-config as base64 string","text":"
In the example below, a simple network-config is base64 encoded and stored in the cloudInitNoCloud Volume's networkDataBase64 field. There is a corresponding disks entry that references the cloud-init volume and assigns it to the VM's device.
Users also have the option of storing the network-config in a Kubernetes Secret and referencing the Secret in the VM's spec. Examples further down in the document illustrate how that is done.
# Create a simple network-config\n\ncat << END > network-config\nnetwork:\n  version: 1\n  config:\n  - type: physical\n    name: eth0\n    subnets:\n      - type: dhcp\nEND\n\n# Create a VM manifest with the networkData base64 encoded into\n# a cloudInitNoCloud volume's networkDataBase64 field.\n# Each disk name must match the name of the volume it refers to.\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userData: \"#cloud-config\"\n        networkDataBase64: $(cat network-config | base64 -w0)\nEND\n\n# Post the Virtual Machine spec to KubeVirt.\n\nkubectl create -f my-vmi.yaml\n
"},{"location":"user_workloads/startup_scripts/#cloud-init-network-config-as-k8s-secret","title":"Cloud-init network-config as k8s Secret","text":"
Users who wish to not store the cloud-init network-config directly in the VirtualMachineInstance spec have the option to store the network-config into a Kubernetes Secret and reference that Secret in the spec.
Multiple VirtualMachineInstance specs can reference the same Kubernetes Secret containing cloud-init network-config.
Below is an example of how to create a Kubernetes Secret containing a network-config and reference that Secret in the VM's spec.
# Create a simple network-config\n\ncat << END > network-config\nnetwork:\n  version: 1\n  config:\n  - type: physical\n    name: eth0\n    subnets:\n      - type: dhcp\nEND\n\n# Store the network-config in a Kubernetes Secret\nkubectl create secret generic my-vmi-secret --from-file=networkdata=network-config\n\n# Create a VM manifest and reference the Secret's name in the cloudInitNoCloud\n# Volume's networkDataSecretRef field.\n# Each disk name must match the name of the volume it refers to.\n\ncat << END > my-vmi.yaml\napiVersion: kubevirt.io/v1\nkind: VirtualMachineInstance\nmetadata:\n  name: myvmi\nspec:\n  terminationGracePeriodSeconds: 5\n  domain:\n    resources:\n      requests:\n        memory: 64M\n    devices:\n      disks:\n      - name: containerdisk\n        disk:\n          bus: virtio\n      - name: cloudinitdisk\n        disk:\n          bus: virtio\n  volumes:\n    - name: containerdisk\n      containerDisk:\n        image: kubevirt/cirros-container-disk-demo:latest\n    - name: cloudinitdisk\n      cloudInitNoCloud:\n        userData: \"#cloud-config\"\n        networkDataSecretRef:\n          name: my-vmi-secret\nEND\n\n# Post the VM\nkubectl create -f my-vmi.yaml\n
Depending on the operating system distribution in use, cloud-init output is often printed to the console output on boot up. When developing userdata scripts, users can connect to the VM's console during boot up to debug.
Example of connecting to console using virtctl:
virtctl console <name of vmi>\n
"},{"location":"user_workloads/startup_scripts/#device-role-tagging","title":"Device Role Tagging","text":"
KubeVirt provides a mechanism for users to tag devices such as Network Interfaces with a specific role. The tag will be matched to the hardware address of the device and this mapping exposed to the guest OS via cloud-init.
This additional metadata helps guest OS users with multiple network interfaces identify the devices that have a specific role, such as a network device dedicated to a specific service or a disk intended to be used by a specific application (database, webcache, etc.).
This functionality already exists in platforms such as OpenStack. KubeVirt will provide the data in a similar format, known to users and services like cloud-init.
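As a rough sketch of what tagging can look like, the snippet below writes a VMI fragment in which the second interface carries a tag. It assumes the tag field on the interface and a cloudInitConfigDrive volume as the channel that carries the metadata to the guest; the names tagged-vmi, ptp and ptp-conf are placeholders for illustration, and the boot disk is omitted for brevity.

```shell
# Write a VMI fragment whose second interface carries a role tag.
# The names tagged-vmi, ptp and ptp-conf are placeholders, not values from this guide.
cat << 'END' > tagged-interface-vmi.yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: tagged-vmi
spec:
  domain:
    resources:
      requests:
        memory: 64M
    devices:
      interfaces:
      - name: default
        masquerade: {}
      - name: ptp
        bridge: {}
        tag: ptp            # role tag matched to this interface's hardware address
      disks:
      - name: cloudinitdisk
        disk:
          bus: virtio
  networks:
  - name: default
    pod: {}
  - name: ptp
    multus:
      networkName: ptp-conf
  volumes:
  - name: cloudinitdisk
    cloudInitConfigDrive:
      userData: |
        #cloud-config
END

# With cluster access, create it with:
# kubectl create -f tagged-interface-vmi.yaml
```

Inside the guest, the tag would then be expected to show up next to the interface's MAC address in the cloud-init metadata.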
"},{"location":"user_workloads/startup_scripts/#sysprep-examples","title":"Sysprep Examples","text":""},{"location":"user_workloads/startup_scripts/#sysprep-in-a-configmap","title":"Sysprep in a ConfigMap","text":"
In the example below, a ConfigMap containing an autounattend.xml file is used to modify the Windows ISO image downloaded from Microsoft. It creates a base Windows installation with the virtio drivers installed and all the commands in post-install.ps1 executed. For the manifests below to work, a win10-iso DataVolume must exist.
"},{"location":"user_workloads/startup_scripts/#launching-a-vm-from-template","title":"Launching a VM from template","text":"
In the example above, once the sysprep command in post-install.ps1 has run and the VM is in the shutdown state, a new VM can be launched from the base win10-template with the additional changes specified in the unattend.xml in sysprep-config below. The new VM can take up to 5 minutes to reach the running state, since Windows goes through OOBE setup in the background with the customizations specified in that unattend.xml file.
By deploying KubeVirt on top of OpenShift the user can benefit from the OpenShift Template functionality.
"},{"location":"user_workloads/templates/#virtual-machine-templates","title":"Virtual machine templates","text":""},{"location":"user_workloads/templates/#what-is-a-virtual-machine-template","title":"What is a virtual machine template?","text":"
The KubeVirt project provides a set of templates to create VMs to handle common usage scenarios. These templates provide a combination of some key factors that can be further customized and processed to produce a Virtual Machine object. The key factors which define a template are:
Workload: Most virtual machines should be server or desktop to have maximum flexibility; the highperformance workload trades some of this flexibility for better performance.
Guest Operating System (OS): This ensures that the emulated hardware is compatible with the guest OS. It also maximizes the stability of the VM and enables performance optimizations.
Size (flavor): Defines the amount of resources (CPU, memory) to allocate to the VM.
More documentation is available in the common templates subproject.
"},{"location":"user_workloads/templates/#accessing-the-virtual-machine-templates","title":"Accessing the virtual machine templates","text":"
If you installed KubeVirt using a supported method, you should find the common templates preinstalled in the cluster. Should you want to upgrade the templates, or install them from scratch, you can use one of the supported releases.
You can edit the fields of the templates which define the amount of resources which the VMs will receive.
Each template can list a different set of fields that are to be considered editable. The fields are used as hints for the user interface, and also for other components in the cluster.
The editable fields are taken from annotations in the template. Here is a snippet presenting a couple of the most commonly found editable fields:
Each entry in the editable field list must be a jsonpath. The jsonpath root is the objects: element of the template. The field that is actually editable is the last entry (the \"leaf\") of the path. For example, the following minimal snippet highlights the fields which you can edit:
objects:\n  spec:\n    template:\n      spec:\n        domain:\n          cpu:\n            sockets:\n              VALUE # this is editable\n            cores:\n              VALUE # this is editable\n            threads:\n              VALUE # this is editable\n          resources:\n            requests:\n              memory:\n                VALUE # this is editable\n
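For illustration, an annotation that marks exactly these four leaves as editable might look like the fragment written below. The annotation key template.kubevirt.io/editable and the objects[0] indexing are assumptions based on the common templates conventions; check them against the templates installed in your cluster.

```shell
# Write a template metadata fragment listing the editable jsonpaths.
# The annotation key and the objects[0] index are assumptions for this sketch.
cat << 'END' > template-editable-annotation.yaml
metadata:
  annotations:
    template.kubevirt.io/editable: |
      /objects[0].spec.template.spec.domain.cpu.sockets
      /objects[0].spec.template.spec.domain.cpu.cores
      /objects[0].spec.template.spec.domain.cpu.threads
      /objects[0].spec.template.spec.domain.resources.requests.memory
END
```

Each line is one jsonpath rooted at the objects: element, and its last segment is the editable leaf.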
"},{"location":"user_workloads/templates/#relationship-between-templates-and-vms","title":"Relationship between templates and VMs","text":"
Once processed, the templates produce VM objects to be used in the cluster. The VMs produced from templates will have a vm.kubevirt.io/template label, whose value will be the name of the parent template, for example fedora-desktop-medium:
In addition, these VMs can include an optional label vm.kubevirt.io/template-namespace, whose value will be the namespace of the parent template, for example:
Please note that after the generation step VM and template objects have no relationship with each other besides the aforementioned label. Changes in templates do not automatically affect VMs or vice versa.
The templates provided by the KubeVirt project provide a set of conventions and annotations that augment the basic features of OpenShift templates. You can customize your KubeVirt-provided templates by editing these annotations, or you can add them to your existing templates to make them consumable by the KubeVirt services.
Here's a description of the KubeVirt annotations. Unless otherwise specified, the following keys are meant to be top-level entries of the template metadata, like
All the following annotations are prefixed with defaults.template.kubevirt.io, which is omitted below for brevity. So the actual annotations you should use will look like
Unless otherwise specified, all annotations are meant to be safe defaults, both for performance and compatibility, and hints for the CNV-aware UI and tooling.
The default values for network, nic, volume, disk are meant to be the name of a section later in the document that the UI will find and consume to find the default values for the corresponding types. For example, considering the annotation defaults.template.kubevirt.io/disk: my-disk: we assume that later in the document there exists an element called my-disk that the UI can use to find the data it needs. The names don't actually matter as long as they are legal for Kubernetes and consistent with the content of the document.
VMs can be created through the OpenShift Cluster Console UI. This UI supports creating VMs from templates and supports template features such as flavors and workload profiles. To create a VM from a template, choose Workloads in the left panel >> choose Virtualization >> press the blue \"Create Virtual Machine\" button >> choose \"Create from wizard\". You should then see the \"Create Virtual Machine\" window.
The common-templates subproject provides officially prepared, useful templates. You can also create templates by hand. You can find an example below, in the \"Example template\" section.
Note that the template above defines free parameters (NAME, SRC_PVC_NAME, SRC_PVC_NAMESPACE, CLOUD_USER_PASSWORD) and that the NAME parameter has no default value specified.
An OpenShift template has to be converted into a JSON file via the oc process command, which also allows you to set the template parameters.
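One way to set the parameters, sketched here under assumptions, is to keep them in a parameter file and hand that file to oc process. The VM name testvm is a placeholder, the template path is the one used elsewhere on this page, and the remaining values simply mirror the parameter names listed above.

```shell
# Write the template parameters to a file, one KEY=VALUE per line.
# testvm is a placeholder VM name chosen for this sketch.
cat << 'END' > template-params.env
NAME=testvm
SRC_PVC_NAME=fedora
SRC_PVC_NAMESPACE=kubevirt-os-images
END

# With cluster access, process the template and create the resulting object:
# oc process -f dist/templates/fedora-desktop-large.yaml --param-file=template-params.env | oc create -f -
```

Parameters left out of the file, such as CLOUD_USER_PASSWORD, fall back to the default or generated value declared by the template.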
A complete example can be found in the KubeVirt repository.
The command above creates a Kubernetes object according to the specification given by the template (in this example, an instance of the VirtualMachine object).
You can list the available parameters with the following command:
$ oc process -f dist/templates/fedora-desktop-large.yaml --parameters\nNAME                  DESCRIPTION                                          GENERATOR    VALUE\nNAME                  VM name                                              expression   fedora-[a-z0-9]{16}\nSRC_PVC_NAME          Name of the PVC to clone                                          fedora\nSRC_PVC_NAMESPACE     Namespace of the source PVC                                       kubevirt-os-images\nCLOUD_USER_PASSWORD   Randomized password for the cloud-init user fedora   expression   [a-z0-9]{4}-[a-z0-9]{4}-[a-z0-9]{4}\n
"},{"location":"user_workloads/templates/#starting-virtual-machine-from-the-created-object","title":"Starting virtual machine from the created object","text":"
The created object is now a regular VirtualMachine object and can be controlled through the Kubernetes API resources from now on. The preferred way to do this from within the OpenShift environment is to use the oc patch command.
Also keep the virtctl tool in mind. In practice, it can be more convenient than using the Kubernetes API directly. Example:
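A minimal sketch of such a patch follows, assuming a VM named testvm, a VirtualMachine spec that still honours the running field (newer releases favour runStrategy), and a client recent enough to support the --patch-file option.

```shell
# Write a merge patch that asks KubeVirt to start the VM.
# testvm, used in the command below, is a placeholder VM name.
cat << 'END' > start-vm-patch.yaml
spec:
  running: true
END

# With cluster access, apply it with:
# oc patch virtualmachine testvm --type merge --patch-file start-vm-patch.yaml
```

Setting running back to false in the same way stops the VM again.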
$ virtctl start testvm\nVM testvm was scheduled to start\n
As soon as the VM starts, Kubernetes creates a new type of object, a VirtualMachineInstance. It has the same name as the VirtualMachine. Example (partial output, the full output is too long):
"},{"location":"user_workloads/templates/#cloud-init-script-and-parameters","title":"Cloud-init script and parameters","text":"
KubeVirt VM templates, just like KubeVirt VM/VMI YAML configs, support cloud-init scripts.
"},{"location":"user_workloads/templates/#hack-use-pre-downloaded-image","title":"Hack - use pre-downloaded image","text":"
KubeVirt VM templates, just like KubeVirt VM/VMI YAML configs, can use a pre-downloaded VM image, which is especially useful for debugging, development, and testing. No special parameters are required in the VM template or VM/VMI YAML config. The main idea is to create a Kubernetes PersistentVolume and PersistentVolumeClaim corresponding to an existing image in the file system. Example:
KubeVirt VM templates use dataVolumeTemplates. Before using DataVolumes, CDI has to be installed in the cluster. After that, the source DataVolume can be created.
You can follow the Virtual Machine Lifecycle Guide for further reference.
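The idea can be sketched with a hostPath PersistentVolume and a matching claim. All names, the storage class manual, the 10G size and the path /mnt/images/testvm are assumptions for illustration; point the path at the directory on the node that actually holds the image.

```shell
# Write a PersistentVolume backed by a directory that already contains the image,
# plus a PersistentVolumeClaim that binds to it. All names and the path are placeholders.
cat << 'END' > predownloaded-image-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: testvm-image-pv
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10G
  accessModes:
  - ReadWriteOnce
  hostPath:
    path: /mnt/images/testvm
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testvm-image-pvc
spec:
  storageClassName: manual
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10G
END

# With cluster access, create both objects with:
# kubectl create -f predownloaded-image-pv.yaml
```

The VM or template then refers to the claim by name through a persistentVolumeClaim volume.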
"},{"location":"user_workloads/virtctl_client_tool/","title":"Download and Install the virtctl Command Line Interface","text":""},{"location":"user_workloads/virtctl_client_tool/#download-the-virtctl-client-tool","title":"Download the virtctl client tool","text":"
Basic VirtualMachineInstance operations can be performed with the stock kubectl utility. However, the virtctl binary utility is required to use advanced features such as:
Serial and graphical console access
It also provides convenience commands for:
Starting and stopping VirtualMachineInstances
Live migrating VirtualMachineInstances and canceling live migrations
Uploading virtual machine disk images
There are two ways to get it:
the most recent version of the tool can be retrieved from the official release page
it can be installed as a kubectl plugin using krew
This example uses a fedora cloud image in combination with cloud-init and an ephemeral empty disk with a capacity of 2Gi. For the sake of simplicity, the volume sources in this example are ephemeral and don't require a provisioner in your cluster.
In KubeVirt, the VM rollout strategy defines how changes to a VM object affect a running guest. In other words, it defines when and how changes to a VM object get propagated to its corresponding VMI object.
There are currently 2 rollout strategies: LiveUpdate and Stage. Only 1 can be specified and the default is Stage.
As long as the VMLiveUpdateFeatures feature gate is not enabled, the VM rollout strategy is ignored and defaults to \"Stage\". The feature gate is set in the KubeVirt custom resource (CR) as follows:
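A sketch of the relevant part of the CR is written to a file below. It assumes the CR is named kubevirt and lives in the kubevirt namespace; adjust both to match your installation.

```shell
# Write the part of the KubeVirt CR that enables the feature gate.
# The CR name and namespace (both kubevirt) are assumptions about the installation.
cat << 'END' > kubevirt-live-update-gate.yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
      - VMLiveUpdateFeatures
END

# Merge the featureGates entry into the existing CR, for example with:
# kubectl edit kubevirt kubevirt -n kubevirt
```

Editing the CR in place, rather than replacing it, keeps any feature gates that are already enabled.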
The LiveUpdate VM rollout strategy tries to propagate VM object changes to running VMIs as soon as possible. For example, changing the number of CPU sockets will trigger a CPU hotplug.
Enable the LiveUpdate VM rollout strategy in the KubeVirt CR:
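As a sketch, the strategy can be set with a small merge patch. The field name vmRolloutStrategy under spec.configuration, and a CR named kubevirt in the kubevirt namespace, are assumptions to verify against your KubeVirt version.

```shell
# Write a merge patch that selects the LiveUpdate rollout strategy.
# The CR name and namespace used in the command below are assumptions.
cat << 'END' > rollout-strategy-patch.yaml
spec:
  configuration:
    vmRolloutStrategy: LiveUpdate
END

# With cluster access, apply it with:
# kubectl patch kubevirt kubevirt -n kubevirt --type merge --patch-file rollout-strategy-patch.yaml
```

Replacing LiveUpdate with Stage in the same patch switches back to the default behaviour.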
Any change made to a VM object when the rollout strategy is Stage will trigger the RestartRequired VM condition. When the rollout strategy is LiveUpdate, only non-propagatable changes will trigger the condition.
Once the RestartRequired condition is set on a VM object, no further changes can be propagated, even if the strategy is set to LiveUpdate. Changes will become effective on next reboot, and the condition will be removed.
The current implementation has the following limitations:
Once the RestartRequired condition is set, the only way to get rid of it is to restart the VM. In the future, we plan on implementing a way to get rid of it by reverting the VM template spec to its last non-RestartRequired state.
Cluster defaults are excluded from this logic. It means that changing a cluster-wide setting that impacts VM specs will not be live-updated, regardless of the rollout strategy.
The RestartRequired condition comes with a message stating what kind of change triggered the condition (CPU/memory/other). That message pertains only to the first change that triggered the condition. Additional changes that would usually trigger the condition will just get staged and no additional RestartRequired condition will be added.
The purpose of this document is to explain how to install virtio drivers for Microsoft Windows running in a fully virtualized guest.
"},{"location":"user_workloads/windows_virtio_drivers/#do-i-need-virtio-drivers","title":"Do I need virtio drivers?","text":"
Yes. Without the virtio drivers, you cannot use paravirtualized hardware properly. The hardware would either not work at all or suffer a severe performance penalty.
For more information about VirtIO and paravirtualization, see VirtIO and paravirtualization.
For more details on configuring your VirtIO driver please refer to Installing VirtIO driver on a new Windows virtual machine and Installing VirtIO driver on an existing Windows virtual machine.
"},{"location":"user_workloads/windows_virtio_drivers/#which-drivers-i-need-to-install","title":"Which drivers I need to install?","text":"
There are usually up to 8 possible devices that are required to run Windows smoothly in a virtualized environment. KubeVirt currently supports only:
viostor, the block driver, applies to SCSI Controller in the Other devices group.
viorng, the entropy source driver, applies to PCI Device in the Other devices group.
NetKVM, the network driver, applies to Ethernet Controller in the Other devices group. Available only if a virtio NIC is configured.
Other virtio drivers that exist and might be supported in the future:
Balloon, the balloon driver, applies to PCI Device in the Other devices group.
vioserial, the paravirtual serial driver, applies to PCI Simple Communications Controller in the Other devices group.
vioscsi, the SCSI block driver, applies to SCSI Controller in the Other devices group.
qemupciserial, the emulated PCI serial driver, applies to PCI Serial Port in the Other devices group.
qxl, the paravirtual video driver, applies to Microsoft Basic Display Adapter in the Display adapters group.
pvpanic, the paravirtual panic driver, applies to Unknown device in the Other devices group.
Note
Some drivers are required in the installation phase. When you are installing Windows onto virtio block storage, you have to provide an appropriate virtio driver. Namely, choose the viostor driver for your version of Microsoft Windows; for example, do not install the XP driver when you run Windows 10.
Other drivers can be installed after the successful Windows installation. Again, please install only drivers matching your Windows version.
"},{"location":"user_workloads/windows_virtio_drivers/#how-to-install-during-windows-install","title":"How to install during Windows install?","text":"
To install the drivers before Windows starts its installation, make sure the virtio-win package is attached to your VirtualMachine as a SATA CD-ROM. In the Windows installer, choose the advanced install and load a driver. Then navigate to the loaded virtio CD-ROM and install either viostor or vioscsi, depending on which one you have set up.
Step by step screenshots:
"},{"location":"user_workloads/windows_virtio_drivers/#how-to-install-after-windows-install","title":"How to install after Windows install?","text":"
After Windows is installed, go to Device Manager. There you should see undetected devices in the \"available devices\" section. You can install the virtio drivers one by one by going through this list.
For more details on how to choose a proper driver and how to install the driver, please refer to the Windows Guest Virtual Machines on Red Hat Enterprise Linux 7.
"},{"location":"user_workloads/windows_virtio_drivers/#how-to-obtain-virtio-drivers","title":"How to obtain virtio drivers?","text":"
The virtio Windows drivers are distributed in the form of a containerDisk, which can simply be mounted to the VirtualMachine. The container image containing the disk is located at https://quay.io/repository/kubevirt/virtio-container-disk?tab=tags and the image can be pulled like any other docker container image:
However, pulling the image manually is not required; Kubernetes downloads it if it is not present when deploying the VirtualMachine.
"},{"location":"user_workloads/windows_virtio_drivers/#attaching-to-virtualmachine","title":"Attaching to VirtualMachine","text":"
KubeVirt distributes virtio drivers for Microsoft Windows in the form of a container disk. The package contains the virtio drivers and the QEMU guest agent. The disk was tested on Microsoft Windows Server 2012. Supported Windows versions are XP and up.
The package is intended to be used as a CD-ROM attached to the virtual machine running Microsoft Windows. It can be used as a SATA CD-ROM during the install phase or to provide drivers in an existing Windows installation.
Attaching the virtio-win package can be done simply by adding a ContainerDisk to your VirtualMachine.
spec:\n  domain:\n    devices:\n      disks:\n        - name: virtiocontainerdisk\n          # Any other disk you want to use must go before virtioContainerDisk.\n          # KubeVirt boots from disks in the order they are defined.\n          # Therefore virtioContainerDisk must come after the bootable disk.\n          # The other option is to choose the boot order explicitly:\n          #  - https://kubevirt.io/api-reference/v0.13.2/definitions.html#_v1_disk\n          # NOTE: You either specify bootOrder explicitly or sort the items in\n          #       disks. You cannot do both at the same time.\n          # bootOrder: 2\n          cdrom:\n            bus: sata\n  volumes:\n    - containerDisk:\n        image: quay.io/kubevirt/virtio-container-disk\n      name: virtiocontainerdisk\n
Once you are done installing the virtio drivers, you can remove the virtio container disk by simply removing the disk from the YAML specification and restarting the VirtualMachine.