Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: incomplete multi-part checkpoint handling when no hint is provided #641
base: main
Are you sure you want to change the base?
fix: incomplete multi-part checkpoint handling when no hint is provided #641
Changes from 6 commits
9a228ed
ad85ff7
059c2c3
39f34d3
9dd3367
206250d
ec3ca83
537190a
9b170fb
4dd4e5b
961398a
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code above already did this validation "in line" while populating the hashmap:
ParsedLogPath
constructor already ensures that(1..=num_parts).contains(part_num)
. There's no way to get a seemingly-complete incomplete checkpoint by wrongly numbered parts.find
ignores any incomplete checkpoint (any missing part would make the vec smaller thannum_parts
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the breakdown, most of the validation in
validate_checkpoint_parts
is redundant due to the in-line validation you mentioned. For the sake of caution, I would still like to warn the user when encountering scenarios covered currently in thevalidate_checkpoint_parts
function.To do this, I've sprinkled in logging for scenarios below when grouping the checkpoints:
In
group_checkpoint_parts
, when we encounter a multi-part checkpoint which has arrived out of order, we do not try to build a complete checkpoint with it (we do not add it to our hashmap) as we assume that the parts should arrive in order. Thus we are maybe left with an incomplete checkpoint part, as we could still find the remaining sequence of checkpoint parts in our iteration. I've added logging here for encountering an unexpected part number when grouping. We could additionally/alternatively do a second pass after grouping to find & warn for incomplete checkpoint parts as I remembered this was particularly important for this issue in general @OussamaSaoudiThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re invalid file types and UUID checkpoints -- the
is_checkpoint
filter before this code should have already removed those. So any logging about UUID checkpoints would need to happen before that filtering, not here, which makes me question the utility of logging here for users' sake. The checks here are only really needed because rustc demands complete matches and it doesn't grok the concept that some cases could be statically impossible.Meanwhile, why would multiple single-part checkpoints be a problem? The spec allows one classic checkpoint, one multi-part checkpoint with a single part, and any number of uuid checkpoints (tho the latter should be filtered out long before now). Any of them should be equivalent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea the logging would only reveal errors if
.is_checkpoint
did not behave correctly. Will summarize bullet points 1 & 2 with a single warn if file type is neither commit in the filtering you mentioned.I see, the spec indeed allows for single-part checkpoints and we are correctly handling multiple single-part checkpoints as well. Thanks for the heads-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zachschuermann curious as to your thoughts about doing an additional pass after grouping the multi-part checkpoints to find an incomplete multi-part checkpoint for the purpose of warning the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of code to get a complete checkpoint. I wonder if we should extract this into this:
This makes the closure shorter/cleaner. It also lets us document the incomplete checkpoint case that @scovich brought up. It's not an obvious scenario, and I think it warrants some documentation.
Interested to hear your thoughts @zachschuermann
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for helper function, good idea. Based on Rule of 30, you might even consider pulling out a sub-helper to do the grouping on behalf of the helper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: Based on all the other suggestions, I would recommend just one helper, to build the hashmap:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be worth checking that:
MultiPartCheckpoint
0..n
.@zachschuermann what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: it's actually
1..=n
. And yes, we should check. Also have to be careful because technically there could be two incomplete checkpoints with different num_parts for the same version. Also, we MUST accept at most one checkpoint -- even if multiple complete checkpoints are available -- so this function needs to filter, not just check.Unfortunately, the poorly-chosen naming convention for multi-part checkpoint files means they interleave:
... which makes it a lot harder to identify the files of a given checkpoint and also means we can't just return a subslice in case there were multiple checkpoints to choose from.
We'd probably need to build a hash map keyed by number of parts:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this reminds me to use match statements to their full power in the future. Thx for the example Ryan!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah yes, did not consider the multiple incomplete checkpoints. I'll introduce tests to cover some of these scenarios. And thanks a lot for the example!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This would imply either duplicates or part numbers outside the checkpoint's part range. The former should be impossible for a correct listing, and the latter would produce an error in ParsedLogPath::try_from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm considering we have this place and some below, I wonder if we should modify semantics such that the function returns a
Result<bool>
and we could leverage result to returnError::internal
for some unexpected cases?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(maybe, but the whole function shouldn't even exist -- see other comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, @scovich pointed out that much of the logging here is redundant. I've summarized most to a single log when encountering an unknown file type when filtering the batched
log_files
. ref to discussion #641 (comment)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering in what case this would happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In uncareful grouping code, it could happen if there were also 1+
UuidCheckpoint
instances and we blindly appended them all to the Vec fornum_parts: 1
in the hash table. Because single-part checkpoints always compare lexically greater than uuid checkpoints, the single-part would come last and be found here.But our code creates a new Vec every time it encounters a first part num, so that case shouldn't arise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should never happen, and would imply that
is_checkpoint
is incorrect. Is this panic-worthy?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about that as well, but panics are exceptionally unfriendly to whatever engine embeds kernel.
At some point we have to rely on correctness of
is_checkpoint
?