Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce HTTP retry rate from 5 to 3 #557

Closed
wants to merge 1 commit into from
Closed

Reduce HTTP retry rate from 5 to 3 #557

wants to merge 1 commit into from

Conversation

seanlinsley
Copy link
Member

5 attempts is probably too much, especially when each can take 30 seconds (or is it 120 seconds from Timeout?). This might explain why we've seen the collector take so long to shut down after pressing Control-C.

For context: RetryMax is the max retries after an initial failure, so the actual number of requests will be 1 + RetryMax

@seanlinsley seanlinsley requested a review from a team July 1, 2024 15:36
@lfittl
Copy link
Member

lfittl commented Jul 1, 2024

This might explain why we've seen the collector take so long to shut down after pressing Control-C.

I think even if we're in a wait-and-retry, a context cancellation should be able to stop that and exit immediately - there is definitely a bug there somewhere that needs fixing, separately from this PR.

@@ -405,7 +405,7 @@ func CreateHTTPClient(conf ServerConfig, logger *util.Logger, retry bool) *http.
client := retryablehttp.NewClient()
client.RetryWaitMin = 1 * time.Second
client.RetryWaitMax = 30 * time.Second
client.RetryMax = 4
client.RetryMax = 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we map out the effective difference of how long we waited at most before in total, vs how long we wait now? (based on https://github.com/hashicorp/go-retryablehttp/blob/v0.7.0/client.go#L470)

Mainly I'm wondering if there is an argument that its good to wait a few minutes if our backend services are temporarily unavailable (if the max was always used, previously we waited at most 2 mins, and now we'd wait at most 1 minute -- but I think in practice those numbers would be lower, given the exponential backoff logic that's capped to that max)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I remember it, HTTP retry was added to work around temporary network connectivity issues, not to smooth over API outages (which should be very rare).

@lfittl
Copy link
Member

lfittl commented Jul 11, 2024

This might explain why we've seen the collector take so long to shut down after pressing Control-C.

I think even if we're in a wait-and-retry, a context cancellation should be able to stop that and exit immediately - there is definitely a bug there somewhere that needs fixing, separately from this PR.

@seanlinsley FYI, I think I found the root cause(s) of the cancellation behavior: #563 and #565

So maybe we don't need this PR if we merge those two instead?

@seanlinsley seanlinsley deleted the retry-rate branch July 11, 2024 23:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants