
with containerd v2 'kubeadm init' reports 'detected that the sandbox image "" of the container runtime is inconsistent with that used by kubeadm' #3146

Closed
robertdahlem opened this issue Jan 5, 2025 · 25 comments · Fixed by kubernetes/kubernetes#129594
Labels
area/cri kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.
Milestone
v1.33

Comments

@robertdahlem

What happened?

Installing Kubernetes v1.32 with containerd v2.0.1, runc v1.2.3 and cni_plugins v1.6.1. /etc/containerd/config.toml is just the output of /usr/local/bin/containerd config default.

kubeadm init succeeds, but reports:

W0105 15:55:18.584951 7068 checks.go:846] detected that the sandbox image "" of the container runtime is inconsistent with that used by kubeadm.It is recommended to use "registry.k8s.io/pause:3.10" as the CRI sandbox image.

It seems to read something that containerd v2 no longer provides.

What did you expect to happen?

kubeadm init succeeds with no warning regarding the sandbox image.

How can we reproduce it (as minimally and precisely as possible)?

Install containerd, runc, cni_plugins and kubeadm.

/root/kubernetes.init.conf:

apiVersion: kubeadm.k8s.io/v1beta4
kind: InitConfiguration
---
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
networking:
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/16
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: nftables

# kubeadm init --config=/root/kubernetes.init.conf

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.32.0

Cloud provider

None. Installing on VMs.

OS version

# On Linux:
$ cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="9.4 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.4"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.4 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9"
BUG_REPORT_URL="https://issues.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.4
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.4"
$ uname -a
Linux kubmaster1.my.domain 5.14.0-427.42.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Oct 18 14:35:40 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux



### Install tools



### Container runtime (CRI) and version (if applicable)

containerd v2.0.1


### Related plugins (CNI, CSI, ...) and versions (if applicable)

runc v1.2.3

cni_plugins v1.6.1
@robertdahlem robertdahlem added the kind/bug Categorizes issue or PR as related to a bug. label Jan 5, 2025
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 5, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@thisisharrsh

Hi @robertdahlem, thank you for raising this issue; quite a few users are running into it.

@thisisharrsh

As a quick resolution, we can fix the /etc/containerd/config.toml file so that it refers to the correct sandbox image.

@thisisharrsh

We can then update the configuration and restart the containerd service, for example:
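
For reference, a minimal sketch of the relevant section in a containerd 2.x /etc/containerd/config.toml, assuming the default layout quoted later in this thread (in containerd 1.x the equivalent setting is sandbox_image under [plugins."io.containerd.grpc.v1.cri"]):

[plugins.'io.containerd.cri.v1.images'.pinned_images]
  sandbox = 'registry.k8s.io/pause:3.10'

After editing, restarting containerd (e.g. systemctl restart containerd) makes the new value take effect.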

@pacoxu
Member

pacoxu commented Jan 6, 2025

cc @neolit123
/area kubeadm
/sig cluster-lifecycle

We may fix the warning and do a cherry-pick.

@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 6, 2025
@thisisharrsh

In the code linked below, we could add logic that attempts to configure the runtime with the correct sandbox image:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/preflight/checks.go#L841C2-L848C3

@pacoxu
Member

pacoxu commented Jan 6, 2025

For containerd 1.7,

[root@prod-master1 ~]# crictl info | jq  .config.sandboxImage
"10.5.14.100/registry.k8s.io/pause:3.9"

For containerd 2.0,

[root@paco ~]# crictl info | jq  .config.sandboxImage
null

https://github.com/kubernetes/kubernetes/blob/9fc9ddc7bceca86e805f674caff7d7acf31fad6c/cmd/kubeadm/app/util/runtime/runtime.go#L274-L298
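
To make the difference concrete, here is a small self-contained sketch (illustrative only, not the actual kubeadm code) of reading sandboxImage out of the "config" document that the verbose CRI status returns, i.e. the same JSON that crictl info prints. With a containerd 2.0-style config the field is simply absent, which is why kubeadm ends up with an empty string:

package main

import (
	"encoding/json"
	"fmt"
)

// sandboxImageFromInfo extracts the sandbox image from the verbose CRI status
// info map, where the runtime config is exposed as a JSON string under "config".
func sandboxImageFromInfo(info map[string]string) (string, error) {
	raw, ok := info["config"]
	if !ok {
		return "", fmt.Errorf("no 'config' entry in CRI status info")
	}
	var cfg struct {
		SandboxImage string `json:"sandboxImage"`
	}
	if err := json.Unmarshal([]byte(raw), &cfg); err != nil {
		return "", fmt.Errorf("failed to parse CRI info config: %w", err)
	}
	if cfg.SandboxImage == "" {
		return "", fmt.Errorf("no 'sandboxImage' field in CRI info config")
	}
	return cfg.SandboxImage, nil
}

func main() {
	// containerd 1.7-style info: the field is present.
	v17 := map[string]string{"config": `{"sandboxImage": "registry.k8s.io/pause:3.9"}`}
	// containerd 2.0-style info: the field is gone, so the lookup fails.
	v20 := map[string]string{"config": `{}`}
	for _, info := range []map[string]string{v17, v20} {
		fmt.Println(sandboxImageFromInfo(info))
	}
}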

@thisisharrsh

related PR.

@carlory
Member

carlory commented Jan 6, 2025

related-to: containerd/containerd#11117

@pacoxu
Member

pacoxu commented Jan 6, 2025

related-to: containerd/containerd#11117

We may wait for containerd/containerd#11114 then. That seems to be the valid fix.

Since crictl info does not expose the sandboxImage, a workaround is using crictl inspecti -o json registry.k8s.io/pause:3.10 | jq .status.pinned to filter the pinned image. This is a way to check if the current configured pause image is pinned in container runtime. But, as the check is in preflight, the image is not pulled yet. This is a little tricky and may not be the correct solution.

crictl inspecti -o json
...
    {
      "repoTags":  [
        "registry.k8s.io/pause:3.10",
      ],
      ...
      "pinned":  true
    },

@neolit123
Member

cc @SataQiu who added the check.

We may fix the warning and do a cherry-pick.

it's just a warning though.

It seems to read something that containerd v2 no longer provides.

if the read fails, kubeadm should report the failure instead of "".

As in the code file for this, we can add the logic for attempting to configure the runtime with the correct sandbox image. https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/preflight/checks.go#L841C2-L848C3

no, we should not configure runtimes from kubeadm.
if the warning has become problematic, better to remove it.

@neolit123
Member

/transfer kubeadm

@k8s-ci-robot k8s-ci-robot transferred this issue from kubernetes/kubernetes Jan 6, 2025
@neolit123
Member

neolit123 commented Jan 6, 2025

related-to: containerd/containerd#11117

We may wait for containerd/containerd#11114 then. That seems to be the valid fix.

Since crictl info does not expose the sandboxImage, a workaround is using crictl inspecti -o json registry.k8s.io/pause:3.10 | jq .status.pinned to filter the pinned image. This is a way to check if the current configured pause image is pinned in container runtime. But, as the check is in preflight, the image is not pulled yet. This is a little tricky and may not be the correct solution.

crictl inspecti -o json
...
    {
      "repoTags":  [
        "registry.k8s.io/pause:3.10",
      ],
      ...
      "pinned":  true
    },

that does seem like the problem.
the bug is in containerd.

@neolit123 neolit123 added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. area/cri and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Jan 6, 2025
@SataQiu
Member

SataQiu commented Jan 6, 2025

This is just a warning to alert the user to check the sandboxImage configuration of the container runtime. The actual kubeadm execution process is not affected.

@afbjorklund

afbjorklund commented Jan 6, 2025

the bug is in containerd.

containerd is just dumping the config, like it has always done. the "bug" is that CRI is expecting things in the map

The feature already broke with CRI-O and Docker before, since they did not dump the containerd 1.x config either...

kubernetes/kubernetes#115610 (comment)

containerd 2.x continues to dump the runtime config; it is just that the sandbox image has moved to the image config now:

# containerd 1.x location:
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.8"

# containerd 2.x location:
[plugins.'io.containerd.cri.v1.images'.pinned_images]
  sandbox = 'registry.k8s.io/pause:3.10'

I don't think CRI exposes any method to peek at the other config, so the workaround just copies the value over...

For a manual check, the user could run containerd config dump and look for pinned_images in the images plugin section.

@neolit123
Member

The Kubernetes documentation still recommends to use pause version 3.2, it probably should tell how to check?

https://kubernetes.io/docs/setup/production-environment/container-runtimes/#override-pause-image-containerd

There are other random values for the other runtimes, so it probably needs a documentation constant defined...

Currently we are checking the output of kubeadm config images list, and looking for the "pause" image in there

that's a topic for a k/website ticket owned by sig-node.

@neolit123
Member

Merged, that's all we need to do in kubeadm.
Note it will not be backported to older releases since it's just a warning.

cri/containerd bugs should be tracked in their respective repos.

@neolit123 neolit123 added this to the v1.33 milestone Jan 13, 2025
@afbjorklund

afbjorklund commented Jan 13, 2025

It seems like the error can "escape" into the reporting if there is a need to pull an image:

[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action beforehand using 'kubeadm config images pull'
I0113 17:25:14.767420  230946 checks.go:832] using image pull policy: IfNotPresent
I0113 17:25:14.768108  230946 checks.go:863] image exists: registry.k8s.io/kube-apiserver:v1.32.0
I0113 17:25:14.768313  230946 checks.go:863] image exists: registry.k8s.io/kube-controller-manager:v1.32.0
I0113 17:25:14.768502  230946 checks.go:863] image exists: registry.k8s.io/kube-scheduler:v1.32.0
I0113 17:25:14.768656  230946 checks.go:871] pulling: registry.k8s.io/kube-proxy:v1.32.0
I0113 17:25:21.973237  230946 checks.go:871] pulling: registry.k8s.io/coredns/coredns:v1.12.0
I0113 17:25:24.852871  230946 checks.go:871] pulling: registry.k8s.io/pause:3.10
I0113 17:25:26.408445  230946 checks.go:871] pulling: registry.k8s.io/etcd:3.5.16-0
error execution phase preflight: [preflight] Some fatal errors occurred:
	[ERROR ImagePull]: failed to check if image registry.k8s.io/kube-proxy:v1.32.0 exists: no 'sandboxImage' field in CRI info config
	[ERROR ImagePull]: failed to check if image registry.k8s.io/coredns/coredns:v1.12.0 exists: no 'sandboxImage' field in CRI info config
	[ERROR ImagePull]: failed to check if image registry.k8s.io/pause:3.10 exists: no 'sandboxImage' field in CRI info config
	[ERROR ImagePull]: failed to check if image registry.k8s.io/etcd:3.5.16-0 exists: no 'sandboxImage' field in CRI info config
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

@neolit123 neolit123 reopened this Jan 13, 2025
@neolit123
Member

@SataQiu @pacoxu the above doesn't seem right. If the image pull is failing, it should not be reporting about the sandbox image.

@afbjorklund

afbjorklund commented Jan 13, 2025

For some reason it is re-using the previous "err" variable, and I don't fully see how the actual error is transported:

                case v1.PullIfNotPresent:
                        if ipc.runtime.ImageExists(image) {
                                klog.V(1).Infof("image exists: %s", image)
                                continue
                        }
                        if err != nil {
                                errorList = append(errorList, errors.Wrapf(err, "failed to check if image %s exists", image))
                        }
                        fallthrough // Proceed with pulling the image if it does not exist

i.e. ImageExists should probably return an error?

        if err != nil {
                klog.Warningf("Failed to get image status, image: %q, error: %v", image, err)
                return false
        }
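
As an illustration only (simplified names, not the real kubeadm code), a self-contained toy showing how re-checking that stale err from the earlier SandboxImage() call produces exactly the misleading messages seen above:

package main

import (
	"errors"
	"fmt"
)

// sandboxImage stands in for ipc.runtime.SandboxImage(); with containerd 2.0
// it fails and leaves a non-nil err behind.
func sandboxImage() (string, error) {
	return "", errors.New("no 'sandboxImage' field in CRI info config")
}

// imageExists stands in for ipc.runtime.ImageExists(); it swallows any error.
func imageExists(image string) bool { return false }

func main() {
	_, err := sandboxImage() // err stays non-nil for the rest of the function
	var errorList []error
	for _, image := range []string{"registry.k8s.io/kube-proxy:v1.32.0", "registry.k8s.io/pause:3.10"} {
		if imageExists(image) {
			continue
		}
		// BUG: this err is the stale one from sandboxImage(), not from the
		// presence check, so every missing image gets blamed on it.
		if err != nil {
			errorList = append(errorList, fmt.Errorf("failed to check if image %s exists: %w", image, err))
		}
		// (the real code then falls through to pulling the image)
	}
	for _, e := range errorList {
		fmt.Println(e)
	}
}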

@neolit123
Member

neolit123 commented Jan 13, 2025

this should be removed

			if err != nil {
				errorList = append(errorList, errors.Wrapf(err, "failed to check if image %s exists", image))
			}

it uses the prior error from criSandboxImage, err := ipc.runtime.SandboxImage()

sending fix in a bit.

edit: here it is:

@neolit123
Member

neolit123 commented Jan 14, 2025

this should be removed

			if err != nil {
				errorList = append(errorList, errors.Wrapf(err, "failed to check if image %s exists", image))
			}

it uses the prior error from criSandboxImage, err := ipc.runtime.SandboxImage()

sending fix in a bit.

edit: here it is:

cherry picks for 1.31 and 1.32

@afbjorklund

afbjorklund commented Jan 14, 2025

The bug was introduced here: kubernetes/kubernetes@7d1bfd9, when the error checking was removed:

@@ -857,8 +857,7 @@
        for _, image := range ipc.imageList {
                switch policy {
                case v1.PullIfNotPresent:
-                       ret, err := ipc.runtime.ImageExists(image)
-                       if ret && err == nil {
+                       if ipc.runtime.ImageExists(image) {
                                klog.V(1).Infof("image exists: %s", image)
                                continue
                        }

(the error was never being returned from the impl, anyway)

@@ -188,9 +225,11 @@
 }
 
 // ImageExists checks to see if the image exists on the system
-func (runtime *CRIRuntime) ImageExists(image string) (bool, error) {
-       err := runtime.crictl("inspecti", image).Run()
-       return err == nil, nil
+func (runtime *CRIRuntime) ImageExists(image string) bool {
+       ctx, cancel := defaultContext()
+       defer cancel()
+       _, err := runtime.impl.ImageStatus(ctx, runtime.imageService, &runtimeapi.ImageSpec{Image: image}, false)
+       return err == nil
 }

The pull should fail with a proper error later on, if the runtime is down.
