Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAYBE fix flaky TestPolicySetsCreate/with_vcs_policy_updated #811

Merged
merged 1 commit into from
Nov 8, 2023

Conversation

nfagerlund
Copy link
Member

@nfagerlund nfagerlund commented Nov 7, 2023

Description

This flake has been extremely rowdy lately. The flake signature is:

  • PoliciesPath is zeroed out
  • After logging that failure, it segfaults on the next line (nil pointer dereference)

It turns out PoliciesPath is kind of a red herring — the backend erases that
value if the VCSRepo is removed, as documented in the API reference. The real
action is in VCSRepo, which comes back as a nil pointer after that update. And,
if you modify the test to do an additional Read on the policy set to compare, it
agrees with what the Update returned — the repo got nilled out for some reason.

I went looking in the backend itself, and it turns out that OAuthClients have
some async cleanup behavior on deletion. And the prior subtest that sets up the
policy set used by the flaky test cleans up its OAuthClient when it's done,
leaving the next subtest to create a new one. My working theory is that there's
a race in the backend: if the Update call with the new VCSRepo values manage to
slip in before the async cleanup for the old OAuthClient is done, it'll get
nilled out instead of creating a new VCSRepo model.

IF this guess is right, then we can avoid the flake by not trying to eagerly
clean up the first OAuthClient before the rest of the subtests run. We're
cleaning up the whole org at the end of the outer test block anyway, so there's
no real benefit to being in a hurry, and an org can have multiple OAuthClient
instances at once.

Testing plan

  1. Dang test stops flaking in CI!!!!

External links

Jira is TF-9569.

Output from tests

see CI

This flake has been extremely rowdy lately. The flake signature is:

- PoliciesPath is zeroed out
- After logging that failure, it segfaults on the next line (nil pointer dereference)

It turns out PoliciesPath is kind of a red herring — the backend erases that
value if the VCSRepo is removed, as documented in the API reference. The real
action is in VCSRepo, which comes back as a nil pointer after that update. And,
if you modify the test to do an additional Read on the policy set to compare, it
agrees with what the Update returned — the repo got nilled out for some reason.

I went looking in the backend itself, and it turns out that OAuthClients have
some async cleanup behavior on deletion. And the prior subtest that sets up the
policy set used by the flaky test cleans up its OAuthClient when it's done,
leaving the next subtest to create a new one. My working theory is that there's
a race in the backend: if the Update call with the new VCSRepo values manage to
slip in before the async cleanup for the old OAuthClient is done, it'll get
nilled out instead of creating a new VCSRepo model.

IF this guess is right, then we can avoid the flake by not trying to eagerly
clean up the first OAuthClient before the rest of the subtests run. We're
cleaning up the whole org at the end of the outer test block anyway, so there's
no real benefit to being in a hurry, and an org can have multiple OAuthClient
instances at once.
@nfagerlund nfagerlund force-pushed the nf/nov23-flaky-policy-sets branch from 754f394 to f8b230b Compare November 8, 2023 02:57
@nfagerlund nfagerlund changed the title Investigate flaky TestPolicySetsCreate/with_vcs_policy_updated test MAYBE fix flaky TestPolicySetsCreate/with_vcs_policy_updated Nov 8, 2023
@nfagerlund nfagerlund marked this pull request as ready for review November 8, 2023 03:02
@nfagerlund nfagerlund requested a review from a team as a code owner November 8, 2023 03:02
@nfagerlund
Copy link
Member Author

I've re-run this test suite like three times and can't get it to flake, whereas it was flaking every other time when I was trying to do anything ELSE in this repo this week.

@nfagerlund nfagerlund merged commit 59b5f61 into main Nov 8, 2023
10 checks passed
@nfagerlund nfagerlund deleted the nf/nov23-flaky-policy-sets branch November 8, 2023 18:19
Copy link

github-actions bot commented Nov 8, 2023

Reminder to the contributor that merged this PR: if your changes have added important functionality or fixed a relevant bug, open a follow-up PR to update CHANGELOG.md with a note on your changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants