Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: read partition_values in RemoveVisitor and remove break in RowVisitor for RemoveVisitor #633

Merged
merged 6 commits into from
Jan 14, 2025

Conversation

sebastiantia
Copy link
Collaborator

@sebastiantia sebastiantia commented Jan 9, 2025

What changes are proposed in this pull request?

  1. This PR updates RemoveVisitor to read partition_values
  2. This PR removes the stray break in the implementation of RowVisitor for RemoveVisitor which prevented reading multiple Remove actions in a commit

How was this change tested?

  • Introduced test to parse the partition_values field from Remove actions

@sebastiantia sebastiantia changed the title feat: add stat to Removeaction feat: add stat to Remove action Jan 9, 2025
kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
if let Some(path) = getters[0].get_opt(i, "remove.path")? {
self.removes.push(Self::visit_remove(i, path, getters)?);
break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woah I didn't realize the break was there from the beginning

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, this has been dead code for a while, but what in the world? 🤦

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good find lol!

kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
@@ -515,6 +516,13 @@ async fn data_skipping_filter() {
data_change: true,
..Default::default()
}),
// Remove action with max value id = 5
Action::Remove(Remove {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also like to see a case where the remove action is filtered out.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not even 100% sure that data skipping works on removes. If it doesn't work and there's significant work involved in making it work, we can take that up in a followup PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the action above with fake_path_2 would've been filtered anyway because it's paired with an add action. So the stats wasn't what caused it to be filtered out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, looks like you're correct. After making the path unique for the remove action, the filter does not correctly filter the action. Investigating

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Data skipping does not work reliably on removes. Even if it did work, there's technically a risk that we might filter out a remove that has stats, but fail to filter out an older add for the same file that lacks stats. For example, if the file was imported from a parquet or iceberg table, and we only added stats later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Thanks for your input, will note this and move on for now

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich For the purposes of CDF: Suppose data skipping failed to filter the add. The rows of the add data file should eventually be filtered by the predicate. So this should be fine right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like dataskipping looks exclusively at add actions, so this definitely is a separate PR. I do still wonder if data skipping could be leveraged for removes

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do still wonder if data skipping could be leveraged for removes

It's so tempting... but unfortunately that older add could have been removed by a replace-table operation, and have a completely incompatible schema vs. the table's current schema (on columns other than the predicate). If that happens, the read would fail at parquet read time, before we could apply the predicate to filter out the unwanted rows.

Copy link

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 92.30769% with 3 lines in your changes missing coverage. Please review.

Project coverage is 83.69%. Comparing base (9c43bf4) to head (bd932d6).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/actions/visitors.rs 92.30% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #633      +/-   ##
==========================================
+ Coverage   83.46%   83.69%   +0.23%     
==========================================
  Files          75       75              
  Lines       16918    16950      +32     
  Branches    16918    16950      +32     
==========================================
+ Hits        14120    14187      +67     
+ Misses       2143     2098      -45     
- Partials      655      665      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small nit on that note. Besides that, LGTM!

kernel/src/table_changes/log_replay/tests.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good! just a few comments!

kernel/src/actions/mod.rs Outdated Show resolved Hide resolved
kernel/src/actions/mod.rs Outdated Show resolved Hide resolved
kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
if let Some(path) = getters[0].get_opt(i, "remove.path")? {
self.removes.push(Self::visit_remove(i, path, getters)?);
break;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good find lol!

kernel/src/actions/visitors.rs Show resolved Hide resolved
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
kernel/src/actions/visitors.rs Outdated Show resolved Hide resolved
@zachschuermann zachschuermann changed the title feat: add stat to Remove action feat: add stat to Remove action and parse partition values for Remove Jan 13, 2025
@sebastiantia sebastiantia force-pushed the add_stat_to_remove_action branch from c1bf5e1 to 9823136 Compare January 13, 2025 18:15
@sebastiantia
Copy link
Collaborator Author

sebastiantia commented Jan 14, 2025

In light of the fact that DataSkipping is not performed on Remove actions (prev discussion ref), we will be holding off from adding the stat field on Remove actions as we do not yet have a use-case it. The PR description and title should reflect this now.

This PR should now only:

  • handle partition values in Remove actions
  • remove the stray break in the implementation of RowVisitor forRemoveVisitor

@sebastiantia sebastiantia changed the title feat: add stat to Remove action and parse partition values for Remove feat: handle partition_values in Remove actions and remove break in RemoveVisitor Jan 14, 2025
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! thanks for iterating

@sebastiantia sebastiantia force-pushed the add_stat_to_remove_action branch from 7d4643e to c9a110a Compare January 14, 2025 21:21
@sebastiantia sebastiantia force-pushed the add_stat_to_remove_action branch from c9a110a to bd932d6 Compare January 14, 2025 21:24
@sebastiantia sebastiantia changed the title feat: handle partition_values in Remove actions and remove break in RemoveVisitor feat: reading partition_values in RemoveVisitor and remove break in RowVisitor for RemoveVisitor Jan 14, 2025
@sebastiantia sebastiantia changed the title feat: reading partition_values in RemoveVisitor and remove break in RowVisitor for RemoveVisitor feat: read partition_values in RemoveVisitor and remove break in RowVisitor for RemoveVisitor Jan 14, 2025
@sebastiantia sebastiantia merged commit c1c1dbe into delta-io:main Jan 14, 2025
21 checks passed
@sebastiantia sebastiantia deleted the add_stat_to_remove_action branch January 14, 2025 21:36
@@ -375,7 +375,7 @@ pub struct Add {
/// in the added file must be contained in one or more remove actions in the same version.
pub data_change: bool,

/// Contains [statistics] (e.g., count, min/max values for columns) about the data in this logical file.
/// Contains [statistics] (e.g., count, min/max values for columns) about the data in this logical file encoded as a JSON string.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

late drive-by -- long line? (surprised format check didn't complain)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants