[PR #6064/0a5ac4aa backport][3.63] [SAT-29018] Fix/corrupted RA blocks content streaming #6161
Can't backport because it contains migrations.
On a request for on-demand content in the content app, a corrupted Remote that contains the wrong binary (for that content) prevented other Remotes from being attempted on future requests. Now the last failed Remotes are temporarily ignored and others may be picked. Closes pulp#5725 (cherry picked from commit 0a5ac4a)
43a4de6 to 2a44212
2a44212 to 414aaae
.github/workflows/scripts/script.sh
Outdated
# See pulpcore.app.util.ENABLE_6064_BACKPORT_WORKAROUND for context.
# This needs to be set here because it relies on service init.
# Its being tested in only one scenario to have both cases covered.
if [[ "$TEST" == "s3" ]]; then
  cmd_prefix pulpcore-manager backport-patch-6064
fi
This file is managed by the plugin template.
Can you maybe do it in a post_before_script hook?
Oh, I've missed that. Yes, thanks
if pulpcore.app.util.failed_at_exists(connection, RemoteArtifact):
    pulpcore.app.util.ENABLE_6064_BACKPORT_WORKAROUND = True
    RemoteArtifact.add_to_class("failed_at", models.DateTimeField(null=True))
What exactly does this do?
I don't know about the implementation, but the effect is like adding the field dynamically.
For example, Django will be able to use the filter RemoteArtifact.objects.exclude(failed_at__gte=Y). If the field really exists in the database, it succeeds; otherwise it raises a ProgrammingError.
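The behaviour described here (a query on the column works only if the column really exists in the table) can be illustrated with a small stand-in. This sketch uses the standard library's sqlite3 instead of Django and PostgreSQL, so the exception type differs from the ProgrammingError mentioned above, and the table name and the `column_exists` helper are hypothetical, not pulpcore code.

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Return True if `table` has a column named `column` (SQLite-specific check)."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the RemoteArtifact table.
conn.execute("CREATE TABLE remote_artifact (id INTEGER PRIMARY KEY, url TEXT)")

had_column_before = column_exists(conn, "remote_artifact", "failed_at")

# Without the column, a query that references it is rejected by the database.
try:
    conn.execute("SELECT id FROM remote_artifact WHERE failed_at IS NULL")
    query_failed = False
except sqlite3.OperationalError:
    query_failed = True

# Once the column has been added outside of any migration, the same query succeeds.
conn.execute("ALTER TABLE remote_artifact ADD COLUMN failed_at TEXT")
rows_after = conn.execute(
    "SELECT id FROM remote_artifact WHERE failed_at IS NULL"
).fetchall()
```

Note that nothing in this sketch records that the schema was changed, which is the bookkeeping a migration framework normally provides.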
And something in this PR is altering the actual db table, so this can be used?
That feels like reinventing the whole db migrations framework without the safeguards. A subsequent upgrade is then probably going to fail. When we said "You cannot backport a migration.", that meant you cannot add db-altering code to a release branch, on the assumption that all db alterations are done by a migration in the Django framework. My gut feeling is this is way too dangerous.
Can you think of a solution that does not require changing the db schema? We are lucky in that this is kind of ephemeral data.
- Would it help to keep a per-worker list (maybe a bloom filter) in memory?
- Can we repurpose another datetime field that we don't rely on there?
- Can we use redis?
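The first option, a per-worker list kept in memory, could look roughly like the sketch below. It is only an illustration under assumptions: `FailedRemoteCooldown`, `cooldown_seconds`, and the string keys are hypothetical names, not pulpcore code, and a plain dict is used rather than a bloom filter.

```python
import time
from typing import Callable, Dict, Hashable, Iterable, List


class FailedRemoteCooldown:
    """Per-process record of recently failed remote artifacts (hypothetical helper)."""

    def __init__(
        self,
        cooldown_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._failed_at: Dict[Hashable, float] = {}

    def mark_failed(self, key: Hashable) -> None:
        """Remember that the artifact identified by `key` just failed."""
        self._failed_at[key] = self._clock()

    def is_cooling_down(self, key: Hashable) -> bool:
        """Return True while `key` is still inside its cooldown window."""
        failed_at = self._failed_at.get(key)
        if failed_at is None:
            return False
        if self._clock() - failed_at >= self._cooldown:
            # The cooldown has expired; forget the failure.
            del self._failed_at[key]
            return False
        return True

    def usable(self, keys: Iterable[Hashable]) -> List[Hashable]:
        """Filter out the keys that are currently cooling down, keeping order."""
        return [key for key in keys if not self.is_cooling_down(key)]
```

A caller would call `mark_failed` after a corrupted download and pass its candidate list through `usable` before picking the next source. The state lives in one process only, so it is not shared between workers and is lost on restart.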
"A subsequent upgrade is then probably going to fail."
My assumption was that a field addition (one without any other couplings) would be safe. But I can see this is a sensitive area. I'll explore those alternatives.
(I had thought of a per-worker cache, but concluded it would be simpler to use the db, before knowing about the backport problem.)
Thanks for understanding my concerns.
At this point I think PostgreSQL may even refuse to apply the migration on top of this out-of-band change.
If I could choose, I'd prefer the per-worker in-memory caching solution, even if it would only solve the problem halfway.
About the idea of repurposing another field, there are pulp_created and pulp_last_updated.
But I'm afraid of unexpected side effects, like pulp_last_updated being updated by something and cooling down a good remote.
Or something else (thus, unexpected), because those fields have been in the system for so long.
pulp_created | timestamp with time zone | | not null |
pulp_last_updated | timestamp with time zone | | |
This is not about a Remote, but the RemoteArtifact, right? I'm not so concerned as this class is only used internally and never visible to the user. I highly doubt that we have any logic depending on it.
Yes, RAs.
Well, I'll open a PR. If nothing bad happens during tests I think we are good.
Upgrading with the workaround of using another field name worked, but this would be preferable.
3051539 to bb3f1a9
The re-purpose of |