Fixes #979 operator internet checker is catching transient errors #980
Conversation
/azp run e2e
Azure Pipelines successfully started running 1 pipeline(s).
@dkorzuno I am running the e2e tests as there is an e2e test for this feature here: https://github.com/Azure/ARO-RP/blob/master/test/e2e/operator.go#L85-L110
looks really good, let's wait for the e2e
Hmm, e2e not reporting, let's try again /azp run e2e
/azp run e2e
Azure Pipelines successfully started running 1 pipeline(s).
/azp run e2e
Azure Pipelines successfully started running 1 pipeline(s).
Commenter does not have sufficient privileges for PR 980 in repo Azure/ARO-RP
/azp run e2e
Azure Pipelines successfully started running 1 pipeline(s).
/azp run e2e
Azure Pipelines successfully started running 1 pipeline(s).
urlErrors := map[string]string{}
checks := make(map[string]chan error)
A map of channels is overkill and makes the concurrency harder to reason about. Instantiate a single channel used by all the goroutines. Have checkWithRetry return the full "url: error" text so that you don't need to recombine the url into the error.
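A minimal sketch of that single-channel shape, assuming checkWithRetry is changed to return a nil-or-"url: error" result and that urls, client and checkBackoff stand in for whatever the controller actually holds:

```go
// checkURLs fans one goroutine out per URL and collects every result on a
// single shared channel; no per-URL channel bookkeeping is needed.
func checkURLs(client simpleHTTPClient, urls []string) []string {
	ch := make(chan error)
	for _, url := range urls {
		go func(url string) {
			ch <- checkWithRetry(client, url, checkBackoff) // nil on success
		}(url)
	}

	var failures []string
	for range urls {
		if err := <-ch; err != nil {
			failures = append(failures, err.Error()) // already "url: error"
		}
	}
	return failures
}
```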
	client simpleHTTPClient,
	url string,
	backoff wait.Backoff,
	ch chan error,
Use a usual method signature that returns an error, to avoid propagating the channel where it doesn't need to go. That makes unit testing easier and removes the temptation to abuse the channel later in some way; see the sketch below.
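A minimal sketch of that signature, assuming check is the existing single-attempt probe and that fmt, wait (k8s.io/apimachinery/pkg/util/wait) and retry (k8s.io/client-go/util/retry) are imported:

```go
// checkWithRetry retries one URL under the given backoff and returns the final
// error, already prefixed with the URL, instead of pushing it into a channel.
func checkWithRetry(client simpleHTTPClient, url string, backoff wait.Backoff) error {
	err := retry.OnError(backoff, func(error) bool { return true }, func() error {
		return check(client, url) // assumed single-attempt probe
	})
	if err != nil {
		return fmt.Errorf("%s: %v", url, err)
	}
	return nil
}
```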
@@ -24,21 +25,29 @@ import (
	"github.com/Azure/ARO-RP/pkg/operator/controllers"
)

const (
	// schedule the check every 5m
As implemented, doesn't the check take up to 2.5 minutes and then wait another 5? The comment seems misleading, and I'm not sure the implemented behaviour makes sense anyway. I'd be inclined to reschedule after one minute, like everything else does, as sketched below.
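A minimal sketch of that rescheduling, assuming the controller-runtime Reconcile method used elsewhere in the operator; the one-minute interval is illustrative:

```go
// At the end of Reconcile: ask controller-runtime to run the check again in
// one minute, independent of how long the retries themselves took.
return reconcile.Result{RequeueAfter: time.Minute}, nil
```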
	ctx, cancel := context.WithTimeout(context.Background(), requeueInterval/2)
	defer cancel()
this is already being done by checkBackoff - suggest using one mechanism to do this rather than two, to make it easier to understand
	backoff wait.Backoff,
	ch chan error,
) {
	ch <- retry.OnError(backoff, func(_ error) bool { return true }, func() error {
I think wait.ExponentialBackoff would be a better candidate to use than retry.OnError.
> I think wait.ExponentialBackoff would be a better candidate to use than retry.OnError.

Not sure I completely understand the reason for that. The idea is to retry a URL check if there is an error, which precisely matches the retry.OnError documentation. wait.ExponentialBackoff requires the function to return an (ok, err) tuple; what would the ok component be in this case? Also, retry.OnError calls wait.ExponentialBackoff under the hood. It seems to me that if I were to change it to wait.ExponentialBackoff I'd end up re-writing the body of retry.OnError at the call site.
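For comparison, a rough sketch of what the call site would look like on top of wait.ExponentialBackoff directly, assuming the same hypothetical check(client, url) probe; it ends up re-creating what retry.OnError already provides:

```go
// Retry check() under the backoff; the condition never aborts early, so every
// failure is retried until the backoff steps are exhausted.
var lastErr error
err := wait.ExponentialBackoff(backoff, func() (bool, error) {
	lastErr = check(client, url)
	return lastErr == nil, nil
})
if err != nil {
	// err only reports that the retries ran out; surface the real failure.
	return fmt.Errorf("%s: %v", url, lastErr)
}
return nil
```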
It was easier to create a new branch on top of the most recent master than to rebase the current branch. I created another PR #1020 which should address most of the comments.
Which issue this PR addresses:
Fixes #979 by retrying each request multiple times.
Proposed solution
Each URL from the check list is tested concurrently. A failed request is retried with randomized exponential back-off. Both the number of retries and the total run time are capped: currently it retries at most 5 times and takes no longer than 2m30s. An illustrative back-off configuration is sketched below.
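A minimal sketch of such a capped, randomized back-off using k8s.io/apimachinery's wait.Backoff; the constants below are illustrative rather than the exact values in the PR:

```go
// Five attempts with exponentially growing, jittered delays; together with the
// per-request timeout this keeps the whole check under a fixed time budget.
checkBackoff := wait.Backoff{
	Steps:    5,                // retry at most 5 times
	Duration: 10 * time.Second, // initial delay before the second attempt
	Factor:   1.5,              // grow the delay exponentially
	Jitter:   0.5,              // randomize each delay by up to 50%
}
```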
Test plan for issue:
There is a corresponding unit test in internetchecker_controller_test.go which tests different scenarios of the retry logic. As for an integration test, it is possible to change the list of checked URLs by running the command

and then verify the effects by running the operator locally and, after deploying it into the cluster, by following the steps in the operator's README.
Is there any documentation that needs to be updated for this PR?
Probably not.