Fix kubernetes_daemon_set not waiting for rollout. Fixes #2092 #2419

Open

sbocinec wants to merge 4 commits into main from ds-fix-wait-for-rollout

Conversation

sbocinec (Contributor) commented Feb 9, 2024

Description

Fixes the `kubernetes_daemon_set_v1` resource so that it actually waits for the rollout. The current check when `wait_for_rollout = true` is ineffective and does not wait for the rollout at all, because it checks the wrong status field of the DaemonSet.

❗ What is worse, there is currently no way to correctly manage a DaemonSet using the kubernetes provider, as `kubernetes_manifest` is affected by a nasty bug too, so every DaemonSet change ends up in a failure.

Being impacted by the issue, I prepared reproducer TF code here. While troubleshooting by applying a new resource and changing an existing one, I noticed that the current code checks the wrong field, so `wait_for_rollout = true` has no effect.

Additionally, the majority of the DaemonSet resource tests had to be fixed: after fixing the rollout check they started failing, because the original test manifests never rolled out successfully.

⚠️ This fix might break many existing DaemonSets, as `wait_for_rollout` defaults to `true` in the resource. Until now, waiting for rollout was ineffective; once fixed, apply waits until all DaemonSet pods are in the Ready state. I expect this might cause issues for users of this resource, as their existing TF code might start to time out on apply. This is also visible in the number of test cases that had to be updated.

Troubleshooting - apply of a new resource

Plan: 1 to add, 0 to change, 0 to destroy.

kubernetes_daemon_set_v1.i-am-not-waiting-for-rollout: Creating...
kubernetes_daemon_set_v1.i-am-not-waiting-for-rollout: Creation complete after 2s [id=default/i-am-not-waiting-for-rollout]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

The apply is almost instant. While the apply is running, this is how the DaemonSet status looks:

$ while true; do kubectl get ds/i-am-not-waiting-for-rollout --output="jsonpath={.status}"; echo; done
Error from server (NotFound): daemonsets.apps "i-am-not-waiting-for-rollout" not found
...                                                                                          
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberMisscheduled":0,"numberReady":0,"numberUnavailable":2,"observedGeneration":1,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberMisscheduled":0,"numberReady":0,"numberUnavailable":2,"observedGeneration":1,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":1,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":1,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":1,"updatedNumberScheduled":2}

As you can see, CurrentNumberScheduled is not the right property to check when we need to wait for the rollout, as this number does not represent the DaemonSet rollout status. We must check numberReady here.
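
To make the corrected condition concrete, here is a minimal client-go sketch (the helper name `daemonSetRolledOut` and its wiring are my own illustration, not the PR's code) that treats the rollout as finished only once the controller has observed the latest generation and all desired pods are updated and Ready:

```go
package waitexample

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// daemonSetRolledOut reports whether a DaemonSet rollout is complete.
// Hypothetical illustration of the corrected condition; the PR implements
// the equivalent check inside the provider's wait function.
func daemonSetRolledOut(ctx context.Context, c kubernetes.Interface, ns, name string) (bool, error) {
	ds, err := c.AppsV1().DaemonSets(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// The status counters are only meaningful once the controller has
	// observed the latest spec generation.
	if ds.Status.ObservedGeneration < ds.Generation {
		return false, nil
	}
	desired := ds.Status.DesiredNumberScheduled
	// Every desired pod must be scheduled from the updated template and be
	// Ready. currentNumberScheduled alone says nothing about readiness,
	// which is why the old check returned immediately.
	if ds.Status.UpdatedNumberScheduled != desired || ds.Status.NumberReady != desired {
		fmt.Printf("waiting: %d/%d updated, %d/%d ready\n",
			ds.Status.UpdatedNumberScheduled, desired, ds.Status.NumberReady, desired)
		return false, nil
	}
	return true, nil
}
```

The update trace in the next section shows why the generation and updated-count checks matter too: immediately after a change, the status still reports numberReady equal to the desired count for the previous generation.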

Troubleshooting - change of an existing resource

Apply of a change:

Plan: 0 to add, 1 to change, 0 to destroy.

kubernetes_daemon_set_v1.i-am-not-waiting-for-rollout: Modifying... [id=default/i-am-not-waiting-for-rollout]
kubernetes_daemon_set_v1.i-am-not-waiting-for-rollout: Modifications complete after 2s [id=default/i-am-not-waiting-for-rollout]

Apply complete! Resources: 0 added, 1 changed, 0 destroyed.

DS status output during the change (note that observedGeneration jumps from 5 to 6 and updatedNumberScheduled temporarily drops to 1 while pods are replaced):

{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":5,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":1}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":1}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":1}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":1,"numberMisscheduled":0,"numberReady":1,"numberUnavailable":1,"observedGeneration":6,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":6,"updatedNumberScheduled":2}
{"currentNumberScheduled":2,"desiredNumberScheduled":2,"numberAvailable":2,"numberMisscheduled":0,"numberReady":2,"observedGeneration":6,"updatedNumberScheduled":2}

Acceptance tests

  • Have you added an acceptance test for the functionality being added?
  • Have you run the acceptance tests on this branch?

Output from acceptance testing:

$ make testacc TESTARGS='-run=TestAccKubernetesDaemonSetV1* -v' 
==> Checking that code complies with gofmt requirements...
go vet ./...
TF_ACC=1 go test "/home/stano/workspace/projects/tf/terraform-provider-kubernetes/kubernetes" -v -vet=off -run=TestAccKubernetesDaemonSetV1* -v -parallel 8 -timeout 3h
=== RUN   TestAccKubernetesDaemonSetV1_minimal
=== PAUSE TestAccKubernetesDaemonSetV1_minimal
=== RUN   TestAccKubernetesDaemonSetV1_basic
=== PAUSE TestAccKubernetesDaemonSetV1_basic
=== RUN   TestAccKubernetesDaemonSetV1_with_template_metadata
=== PAUSE TestAccKubernetesDaemonSetV1_with_template_metadata
=== RUN   TestAccKubernetesDaemonSetV1_initContainer
=== PAUSE TestAccKubernetesDaemonSetV1_initContainer
=== RUN   TestAccKubernetesDaemonSetV1_noTopLevelLabels
=== PAUSE TestAccKubernetesDaemonSetV1_noTopLevelLabels
=== RUN   TestAccKubernetesDaemonSetV1_with_tolerations
=== PAUSE TestAccKubernetesDaemonSetV1_with_tolerations
=== RUN   TestAccKubernetesDaemonSetV1_with_tolerations_unset_toleration_seconds
=== PAUSE TestAccKubernetesDaemonSetV1_with_tolerations_unset_toleration_seconds
=== RUN   TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_profile
=== PAUSE TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_profile
=== RUN   TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_localhost_profile
=== PAUSE TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_localhost_profile
=== RUN   TestAccKubernetesDaemonSetV1_with_resource_requirements
=== PAUSE TestAccKubernetesDaemonSetV1_with_resource_requirements
=== RUN   TestAccKubernetesDaemonSetV1_minimalWithTemplateNamespace
=== PAUSE TestAccKubernetesDaemonSetV1_minimalWithTemplateNamespace
=== CONT  TestAccKubernetesDaemonSetV1_minimal
=== CONT  TestAccKubernetesDaemonSetV1_with_tolerations_unset_toleration_seconds
=== CONT  TestAccKubernetesDaemonSetV1_noTopLevelLabels
=== CONT  TestAccKubernetesDaemonSetV1_with_template_metadata
=== CONT  TestAccKubernetesDaemonSetV1_with_resource_requirements
=== CONT  TestAccKubernetesDaemonSetV1_minimalWithTemplateNamespace
=== CONT  TestAccKubernetesDaemonSetV1_initContainer
=== CONT  TestAccKubernetesDaemonSetV1_with_tolerations
--- PASS: TestAccKubernetesDaemonSetV1_with_tolerations_unset_toleration_seconds (18.46s)
=== CONT  TestAccKubernetesDaemonSetV1_basic
--- PASS: TestAccKubernetesDaemonSetV1_minimal (18.49s)
=== CONT  TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_localhost_profile
--- PASS: TestAccKubernetesDaemonSetV1_noTopLevelLabels (18.54s)
=== CONT  TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_profile
--- PASS: TestAccKubernetesDaemonSetV1_with_tolerations (19.22s)
--- PASS: TestAccKubernetesDaemonSetV1_initContainer (22.81s)
--- PASS: TestAccKubernetesDaemonSetV1_with_template_metadata (31.53s)
--- PASS: TestAccKubernetesDaemonSetV1_minimalWithTemplateNamespace (32.90s)
--- PASS: TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_localhost_profile (14.42s)
--- PASS: TestAccKubernetesDaemonSetV1_with_container_security_context_seccomp_profile (23.03s)
--- PASS: TestAccKubernetesDaemonSetV1_with_resource_requirements (46.52s)
--- PASS: TestAccKubernetesDaemonSetV1_basic (28.36s)
PASS
ok  	github.com/hashicorp/terraform-provider-kubernetes/kubernetes	46.870s

Release Note

Release note for CHANGELOG:

`resource/kubernetes_daemon_set_v1`: fix an issue with the provider not waiting for the rollout even when `wait_for_rollout` is set to `true`

References

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

sbocinec force-pushed the ds-fix-wait-for-rollout branch from fa9e3fe to 8426fa4 on February 12, 2024 06:15
sbocinec force-pushed the ds-fix-wait-for-rollout branch from 8426fa4 to d330044 on February 19, 2024 15:13
github-actions bot added size/S and removed size/XS labels Feb 19, 2024
sbocinec force-pushed the ds-fix-wait-for-rollout branch from d330044 to 4187873 on February 19, 2024 15:14
sbocinec marked this pull request as ready for review February 19, 2024 15:14
sbocinec requested a review from a team as a code owner February 19, 2024 15:14

resources {
  limits = {
    cpu          = "0.5"
    memory       = "512Mi"
    "nvidia/gpu" = "1"
sbocinec (Contributor, Author) commented:

note: this causes the pods to be unschedulable, so the rollout fails and the test never passes.

sbocinec force-pushed the ds-fix-wait-for-rollout branch from 4187873 to fde82c2 on February 19, 2024 15:18
@@ -912,6 +916,7 @@ func testAccKubernetesDaemonSetV1ConfigWithContainerSecurityContextSeccompProfil
}
}
}
wait_for_rollout = false
sbocinec (Contributor, Author) commented:

note: rollout of this resource fails on K8s nodes that do not have any seccomp profiles locally available (e.g. nodes in a kind cluster), so let's not wait for rollout, as it's not important for this test.

@@ -962,6 +967,7 @@ func testAccKubernetesDaemonSetV1ConfigWithContainerSecurityContextSeccompProfil
}
}
}
wait_for_rollout = false
sbocinec (Contributor, Author) commented:

note: rollout of this resource fails on K8s nodes that do not have any seccomp profiles locally available (e.g. nodes in a kind cluster), so let's not wait for rollout, as it's not important for this test.

@@ -685,6 +685,7 @@ func testAccKubernetesDaemonSetV1ConfigWithTemplateMetadata(depName, imageName s
container {
  image   = "%s"
  name    = "containername"
  command = ["sleep", "infinity"]
sbocinec (Contributor, Author) commented:

note: the container command must be added because by default the container just exits and the test then fails. This is now added to the majority of the test cases for this resource.

sbocinec force-pushed the ds-fix-wait-for-rollout branch from fde82c2 to 5f6ee8a on February 19, 2024 15:27
github-actions bot added size/M and removed size/S labels Feb 19, 2024
sbocinec force-pushed the ds-fix-wait-for-rollout branch 3 times, most recently from 8321d84 to f4cbd9b on February 19, 2024 15:34
alexsomesan force-pushed the ds-fix-wait-for-rollout branch from f4cbd9b to be1c10e on April 4, 2024 13:48
alexsomesan (Member) left a comment:

@sbocinec thank you for contributing this fix and the very detailed diagnosis of the issue.

The changes look good to me. While we wait for the tests to run, would you mind expanding the change log entry a bit to describe the change in behaviour that users can expect after upgrading?

@@ -0,0 +1,4 @@
```release-note:bug
`resource/kubernetes_daemon_set_v1`: fix an issue with the provider not waiting for rollout with `wait_for_rollout = true`.
alexsomesan (Member) commented:

I would suggest a mention of the change in behaviour here, so users are well advised of what to expect when upgrading.

sbocinec (Contributor, Author) commented:

@alexsomesan That's a great point. I'm not sure about the wording of the note. I'm thinking about adding the following to the existing entry: "Note: As the wait_for_rollout is true by default, users might experience the apply operation of the existing code taking longer."

Or should the UPGRADE NOTES section be used instead, as I see it was used in the past, e.g. https://github.com/hashicorp/terraform-provider-kubernetes/blob/main/CHANGELOG.md#161-april-18-2019? In that case, I'm not sure how to add this extra section.


If it is undesirable to change the semantics of the existing wait_for_rollout attribute (due to potential breakage of existing code), perhaps a new really_wait_for_rollout attribute can be introduced instead, and the old one deprecated.

sbocinec added 2 commits April 5, 2024 16:04
This fix ensures the resource correctly waits for the DaemonSet rollout
by using logic similar to that of kubernetes_deployment. Additionally,
waitForDaemonSetReplicasFunc is renamed to waitForDaemonSetPodsFunc to
correctly describe the operation (there are no replicas in a DaemonSet).
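
As a sketch of what "logic similar to kubernetes_deployment" typically looks like, the predicate from the earlier example can be polled with the terraform-plugin-sdk retry helper. The function name and signature below are illustrative assumptions, not the PR's literal diff:

```go
package waitexample

import (
	"context"
	"fmt"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/retry"
	"k8s.io/client-go/kubernetes"
)

// waitForDaemonSetPods polls until the rollout finishes or the timeout
// elapses, reusing daemonSetRolledOut from the earlier sketch. Illustrative
// only; the provider's actual waitForDaemonSetPodsFunc differs in shape but
// follows the same retry pattern.
func waitForDaemonSetPods(ctx context.Context, c kubernetes.Interface, ns, name string, timeout time.Duration) error {
	return retry.RetryContext(ctx, timeout, func() *retry.RetryError {
		done, err := daemonSetRolledOut(ctx, c, ns, name)
		if err != nil {
			// API errors are terminal; give up immediately.
			return retry.NonRetryableError(err)
		}
		if !done {
			// Not ready yet; retry until the timeout is reached.
			return retry.RetryableError(fmt.Errorf("waiting for DaemonSet %s/%s rollout to finish", ns, name))
		}
		return nil
	})
}
```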
alexsomesan force-pushed the ds-fix-wait-for-rollout branch from be1c10e to 080bb15 on April 5, 2024 14:04
atykhyy commented Jul 18, 2024

This is a very useful fix as it obviates the need for nasty hacky workarounds (such as a null resource with a local-exec provisioner running kubectl). Can it be included in the July release?

BBBmau self-assigned this Aug 7, 2024
BBBmau added this to the v2.33.0 milestone Aug 14, 2024
BBBmau (Contributor) commented Aug 30, 2024

moving this to v3.0.0 milestone since this is considered a breaking change due to changing the expected behavior for users.

BBBmau modified the milestones: v2.33.0, v3.0.0 Aug 30, 2024
atykhyy commented Aug 31, 2024

moving this to v3.0.0 milestone since this is considered a breaking change due to changing the expected behavior for users.

How about making this change non-breaking by introducing a differently-named parameter?

If it is undesirable to change the semantics of the existing wait_for_rollout attribute (due to potential breakage of existing code), perhaps a new really_wait_for_rollout attribute can be introduced instead, and the old one deprecated.

Successfully merging this pull request may close these issues:

  • Daemonset wait_for_rollout does not appear to have any impact.