Fixes #979 operator internet checker is catching transient errors (#980)

Closed
wants to merge 4 commits

Conversation

dkorzuno
Contributor

Which issue this PR addresses:

Fixes #979 by retrying each request multiple times.

Proposed solution

Each URL from the check list is tested concurrently. A failed request is retried with exponential, randomized back-off, and both the number of retries and the total run time are capped: currently at most 5 attempts and no longer than 2m 30s.
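
Roughly, the retry mechanics look like the sketch below (a simplified illustration rather than the exact code in this PR: the helper names, the back-off values, and the plain *http.Client standing in for the simpleHTTPClient interface are assumptions):

package checker

import (
	"context"
	"net/http"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// checkBackoff allows at most 5 attempts, with exponential, jittered delays
// between them. The values here are illustrative, not the PR's constants.
var checkBackoff = wait.Backoff{
	Duration: 5 * time.Second,
	Factor:   2,
	Jitter:   0.5,
	Steps:    5,
}

// checkURL performs a single GET against the URL and reports any error.
func checkURL(ctx context.Context, client *http.Client, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// checkWithRetry retries the check on every error, with exponential
// randomized back-off, until the back-off's attempt budget is exhausted;
// the caller additionally bounds the whole run with a context deadline.
func checkWithRetry(ctx context.Context, client *http.Client, url string, backoff wait.Backoff) error {
	return retry.OnError(backoff, func(error) bool { return true }, func() error {
		return checkURL(ctx, client, url)
	})
}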

Test plan for issue:

There is a corresponding unit test in internetchecker_controller_test.go which covers the different scenarios of the retry logic.

As for the integration test, it is possible to change the list of checked URLs by running the command

oc -n openshift-azure-operator edit clusters.aro.openshift.io/cluster

and then verify the effects by running the operator locally, and again after deploying it into the cluster, following the steps in the operator's README.

Is there any documentation that needs to be updated for this PR?

Probably not.

@asalkeld
Contributor

/azp run e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@asalkeld
Contributor

@dkorzuno I am running the e2e tests as there is an e2e test for this feature here: https://github.com/Azure/ARO-RP/blob/master/test/e2e/operator.go#L85-L110

Contributor

@asalkeld asalkeld left a comment

looks really good, let's wait for the e2e

@ehashman
Contributor

Hmm, e2e not reporting, let's try again

/azp run e2e

@ehashman
Contributor

/azp run e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dkorzuno
Contributor Author

The relevant e2e test passes; however, the e2e check fails because there are issues in other e2e tests. I created issues #984 and #985 for those.

@ehashman
Contributor

/azp run e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

asalkeld previously approved these changes Sep 16, 2020
@azure-pipelines

Commenter does not have sufficient privileges for PR 980 in repo Azure/ARO-RP

@mjudeikis
Contributor

/azp run e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@asalkeld
Contributor

/azp run e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


urlErrors := map[string]string{}
checks := make(map[string]chan error)
Member

A map of channels is overkill and makes the concurrency harder to reason about. Instantiate a single channel used by all the goroutines. Have checkWithRetry return the full "url: error" text so that you don't need to recombine the url into the error.
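
For illustration, the suggested shape could look roughly like this (a sketch only; r.checkURLs, checkBackoff, and the assumption that checkWithRetry now returns a plain error whose text already contains the URL are illustrative, not actual code from this PR):

ch := make(chan error)
for _, url := range r.checkURLs {
	go func(url string) {
		// checkWithRetry returns nil on success, or an error whose text
		// already includes the URL, e.g. "https://example.com: <cause>"
		ch <- checkWithRetry(ctx, client, url, checkBackoff)
	}(url)
}

var failures []string
for range r.checkURLs {
	if err := <-ch; err != nil {
		failures = append(failures, err.Error())
	}
}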

client simpleHTTPClient,
url string,
backoff wait.Backoff,
ch chan error,
Member

Use a conventional method signature that returns an error, to avoid propagating the channel where it doesn't need to go. That makes unit testing easier and removes the temptation to abuse the channel in some other way later.
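
For example, with a plain error-returning signature a unit test can call the helper directly, with no channel plumbing (a sketch reusing the checkWithRetry shape from the earlier sketch; the test name, address, and back-off values are made up):

func TestCheckWithRetryGivesUp(t *testing.T) {
	backoff := wait.Backoff{Duration: time.Millisecond, Factor: 1, Jitter: 0, Steps: 2}
	client := &http.Client{Timeout: 100 * time.Millisecond}

	// nothing listens on this port, so every attempt fails; once the
	// back-off is exhausted the helper returns the last error directly
	err := checkWithRetry(context.Background(), client, "http://127.0.0.1:1", backoff)
	if err == nil {
		t.Fatal("expected an error for an unreachable URL")
	}
}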

@@ -24,21 +25,29 @@ import (
"github.com/Azure/ARO-RP/pkg/operator/controllers"
)

const (
// schedule the check every 5m
Member

As implemented, doesn't the check take up to 2.5 minutes, and then wait another 5? The comment seems misleading, and I'm not sure the implemented behaviour makes sense anyway. I'd be inclined to reschedule after one minute, like everything else does.
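
For comparison, rescheduling like the other controllers would mean ending Reconcile with something roughly like this (a sketch; reconcile.Result is controller-runtime's type, the one-minute value is the suggestion above, and the surrounding code is omitted):

// requeue the next check one minute after this run completes,
// matching what the other controllers do
return reconcile.Result{RequeueAfter: time.Minute}, nil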

Comment on lines +75 to +76
ctx, cancel := context.WithTimeout(context.Background(), requeueInterval/2)
defer cancel()
Member

This is already being done by checkBackoff; suggest using one mechanism rather than two, to make it easier to understand.

backoff wait.Backoff,
ch chan error,
) {
ch <- retry.OnError(backoff, func(_ error) bool { return true }, func() error {
Member

I think wait.ExponentialBackoff would be a better candidate to use than retry.OnError.

Contributor Author

> I think wait.ExponentialBackoff would be a better candidate to use than retry.OnError.

Not sure I completely understand the reason for that.

The idea is to retry a URL check if there is an error, which precisely matches the retry.OnError documentation. wait.ExponentialBackoff requires the function to return an (ok, err) tuple; what would the ok component be in this case? Also, retry.OnError calls wait.ExponentialBackoff under the hood, so if I were to change it to wait.ExponentialBackoff I would end up rewriting the body of retry.OnError at the call site.
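
To make the trade-off concrete, the two call shapes look roughly like this (a sketch reusing the checkURL helper idea from above, not the code in this PR):

// retry.OnError: retry on every error until the back-off budget is spent,
// then return the last error from the check.
err := retry.OnError(backoff, func(error) bool { return true }, func() error {
	return checkURL(ctx, client, url)
})

// wait.ExponentialBackoff: the check has to be mapped onto a (done, err)
// condition, swallowing the per-attempt error so the loop keeps going,
// which is essentially re-implementing retry.OnError at the call site.
var lastErr error
err = wait.ExponentialBackoff(backoff, func() (bool, error) {
	lastErr = checkURL(ctx, client, url)
	return lastErr == nil, nil
})
if err != nil && lastErr != nil {
	err = lastErr
}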

@dkorzuno
Contributor Author

It was easier to create a new branch on top of the most recent master than to rebase the current branch. I created another PR, #1020, which should address most of the comments.

@dkorzuno dkorzuno closed this Sep 25, 2020
@dkorzuno dkorzuno deleted the internet-checker-retries branch September 29, 2020 06:51
Labels: needs-rebase (branch needs a rebase)
Development

Successfully merging this pull request may close these issues: operator internet checker is catching transient errors (#979).
5 participants