Replies: 3 comments 4 replies
-
I ask because if your success and failure expressions both evaluate to false, the result is indeterminate. The Promotion will idle in the queue for a while and the request will be retried after some time. If you were getting back a 403 or 503 or something, you'd keep retrying and you'd exhaust the timeout every time. Do you have any way of confirming that connectivity to the endpoint in question isn't being obstructed by a network policy or something?
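To illustrate the indeterminate case: below is a minimal sketch of an http promotion step, assuming the `successExpression`/`failureExpression` fields of Kargo's http step and a placeholder URL. It is not the poster's actual configuration.

```yaml
# Hypothetical sketch; URL and status codes are placeholders.
- uses: http
  config:
    method: GET
    url: https://example.com/healthz
    successExpression: response.status == 200
    failureExpression: response.status == 404
    # A 403 or 503 matches neither expression above, so the step's
    # outcome is indeterminate and the request keeps being retried
    # until the timeout is exhausted.
```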
-
The endpoint is reachable, but I believe the retry of the HTTP request from the Kargo http step is not working. Prior to that step I have a git push step, and the HTTP step is executed afterward. The changes pushed to GitHub take a couple of minutes to reconcile. I believe the issue is with the retry mechanism: either it is not working as expected, or there is no option to define a time interval to wait before retrying. The Kargo stage therefore shows the HTTP step as failed. However, when I re-promote the same version after the changes are deployed in Kubernetes, it succeeds.
-
@krancour It took some time to troubleshoot, but I found out that the issue was with the CDN. The version got changed, but due to the cache, the Kargo health check failed. Sorry for the inconvenience.
-
I have the code snippet below in the Kargo stage steps, which performs an HTTP GET to check whether the new version is deployed.
The URL endpoint health check output is:
{"status":"Healthy","version":"v0.1.0"}
When the promotion is done, it triggers the deployment in Kubernetes. After the new version is deployed, this step should pass. It typically takes at most 4-5 minutes for the changes to be reconciled in the Kubernetes cluster. However, with the above configuration, the step shows as failed even though the new version is already deployed and the endpoint is healthy.
How can I configure it to re-run the health check every 30 or 60 seconds to verify that the new version is deployed, and fail only if the timeout is reached? In my case, it does not take 10 minutes to deploy a new version.
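One possible shape for such a step, sketched under the assumption that Kargo's http step exposes the JSON response body as `response.body` in expressions and that promotion steps accept a step-level `retry` block with a `timeout`. The URL and version string are placeholders, not the poster's real values.

```yaml
# Hypothetical sketch, not a verified configuration.
- uses: http
  retry:
    timeout: 10m   # keep retrying the check for up to 10 minutes
  config:
    method: GET
    url: https://example.com/healthz
    successExpression: >-
      response.status == 200 &&
      response.body.status == "Healthy" &&
      response.body.version == "v0.1.0"
    # With no failureExpression, anything short of success is treated
    # as indeterminate and retried until the timeout above is reached.
```

The idea is to avoid a `failureExpression` that matches the "not yet deployed" state, so the step stays indeterminate (and keeps being retried) while reconciliation is in progress, rather than failing outright.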