feat: read partition_values in RemoveVisitor and remove break in RowVisitor for RemoveVisitor #633
if let Some(path) = getters[0].get_opt(i, "remove.path")? {
    self.removes.push(Self::visit_remove(i, path, getters)?);
    break;
woah I didn't realize the break was there from the beginning
I mean, this has been dead code for a while, but what in the world? 🤦
good find lol!
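For readers following along, here is a minimal sketch of why that `break` matters. It is not the kernel's actual visitor code: `Row`, `visit_with_break`, and `visit_all` are hypothetical stand-ins, and each row is modeled as an optional remove path rather than going through real getters.

```rust
/// Hypothetical stand-in for one row of a log batch: `Some(path)` when the
/// row holds a remove action, `None` when it holds some other action.
type Row = Option<&'static str>;

/// Mirrors the loop with the `break`: visiting stops after the first remove,
/// so any later remove in the same batch is silently dropped.
fn visit_with_break(rows: &[Row]) -> Vec<String> {
    let mut removes = Vec::new();
    for row in rows {
        if let Some(path) = row {
            removes.push(path.to_string());
            break; // exits the loop; the remaining rows are never inspected
        }
    }
    removes
}

/// Mirrors the loop without the `break`: every remove in the batch is kept.
fn visit_all(rows: &[Row]) -> Vec<String> {
    rows.iter().flatten().map(|path| path.to_string()).collect()
}
```

With a batch that contains two removes separated by another action, `visit_with_break` returns only the first path, while `visit_all` returns both.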
@@ -515,6 +516,13 @@ async fn data_skipping_filter() {
            data_change: true,
            ..Default::default()
        }),
        // Remove action with max value id = 5
        Action::Remove(Remove {
I'd also like to see a case where the remove action is filtered out.
I'm not even 100% sure that data skipping works on removes. If it doesn't work and there's significant work involved in making it work, we can take that up in a followup PR.
Note that the action above with fake_path_2 would've been filtered anyway because it's paired with an add action. So the stats weren't what caused it to be filtered out.
Right, looks like you're correct. After making the path unique for the remove action, the filter does not correctly filter the action. Investigating.
Data skipping does not work reliably on removes. Even if it did work, there's technically a risk that we might filter out a remove that has stats, but fail to filter out an older add for the same file that lacks stats. For example, if the file was imported from a parquet or iceberg table, and we only added stats later.
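A rough sketch of that risk, under simplifying assumptions: the predicate is `id > threshold`, each action carries at most a `max_id` stat, and replay walks the log newest-first while tracking seen paths. `FileAction`, `can_skip`, and `live_files` are hypothetical names for illustration, not kernel APIs.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy)]
enum Kind {
    Add,
    Remove,
}

/// Hypothetical, simplified file action. `max_id` stands in for the stats;
/// `None` means no stats were recorded for the file.
#[derive(Clone, Copy)]
struct FileAction {
    kind: Kind,
    path: &'static str,
    max_id: Option<i64>,
}

/// Data skipping for the predicate `id > threshold`: an action may be skipped
/// only when its stats prove that no row can match.
fn can_skip(action: &FileAction, threshold: i64) -> bool {
    matches!(action.max_id, Some(max) if max <= threshold)
}

/// Replays actions newest-first and returns the paths of adds that survive.
/// `skip_removes` toggles the (unsafe) idea of data skipping on removes.
fn live_files(
    newest_first: &[FileAction],
    threshold: i64,
    skip_removes: bool,
) -> Vec<&'static str> {
    let mut seen = HashSet::new();
    let mut live = Vec::new();
    for action in newest_first {
        let skippable = can_skip(action, threshold);
        match action.kind {
            Kind::Remove => {
                if skip_removes && skippable {
                    continue; // the tombstone is dropped before it is recorded
                }
                seen.insert(action.path);
            }
            Kind::Add => {
                // An add is live only if no newer action touched the same path
                // and its stats do not rule it out.
                if seen.insert(action.path) && !skippable {
                    live.push(action.path);
                }
            }
        }
    }
    live
}
```

Given a newer remove for a path with `max_id = Some(5)`, an older add for the same path with no stats, and a threshold of 10, replay with `skip_removes = false` yields no live files. With `skip_removes = true`, the remove is skipped, the stat-less add cannot be skipped, and the removed file comes back as live.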
👍 Thanks for your input, will note this and move on for now
@scovich For the purposes of CDF: Suppose data skipping failed to filter the add. The rows of the add data file should eventually be filtered by the predicate. So this should be fine right?
Looks like data skipping looks exclusively at add actions, so this is definitely a separate PR. I do still wonder if data skipping could be leveraged for removes.
> I do still wonder if data skipping could be leveraged for removes
It's so tempting... but unfortunately that older add could have been removed by a replace-table operation, and have a completely incompatible schema vs. the table's current schema (on columns other than the predicate). If that happens, the read would fail at parquet read time, before we could apply the predicate to filter out the unwanted rows.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #633      +/-   ##
==========================================
+ Coverage   83.46%   83.69%   +0.23%
==========================================
  Files          75       75
  Lines       16918    16950      +32
  Branches    16918    16950      +32
==========================================
+ Hits        14120    14187      +67
+ Misses       2143     2098      -45
- Partials      655      665      +10

☔ View full report in Codecov by Sentry.
Just a small nit on that note. Besides that, LGTM!
looking good! just a few comments!
a59b383 to c1bf5e1
LGTM!
c1bf5e1 to 9823136
In light of the fact that This PR should now only:
lgtm! thanks for iterating
7d4643e to c9a110a
c9a110a to bd932d6
@@ -375,7 +375,7 @@ pub struct Add {
     /// in the added file must be contained in one or more remove actions in the same version.
     pub data_change: bool,

-    /// Contains [statistics] (e.g., count, min/max values for columns) about the data in this logical file.
+    /// Contains [statistics] (e.g., count, min/max values for columns) about the data in this logical file encoded as a JSON string.
late drive-by -- long line? (surprised format check didn't complain)
What changes are proposed in this pull request?
- Update RemoveVisitor to read partition_values
- Remove the break in RowVisitor for RemoveVisitor, which prevented reading multiple Remove actions in a commit

How was this change tested?
- … partition_values field from Remove actions