-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add table statistics #1285
Add table statistics #1285
Conversation
6f0bee0
to
a70edb2
Compare
a70edb2
to
384e229
Compare
I plan to move the set/remove statistics methods from the Transaction class to another class, such as ManageSnapshot. In the meantime, I’d like to confirm with everyone if I’m heading in the right direction with the current implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Added a few comments. I think it would also be helpful to include integration tests
9b15c86
to
d16ef47
Compare
@kevinjqliu could you please review it once more? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments.
Do you know which engine currently can generate puffin files? would be great to add an integration with a spark generated puffin file
@kevinjqliu As far as I know, only Trino can generate them. What kind of test would you like to have? I believe we are covering all relevant cases for this PR. If PyIceberg could generate or read puffin files, then I agree it would be useful to add tests to check compatibility between engines. However, I think it only makes sense to test puffin files during reading, as testing generation would mean verifying the implementation of something that isn’t our responsibility. In this case, it’s just a metadata update. What do you think? |
d16ef47
to
11120bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for working on this!
Regarding the integration tests, since we're manipulating table metadata to add/remove table stats, it would be great to verify that another source can interact with these stats. Not a hard blocker
@ndrluis do you mind resolving the merge conflict here? |
11120bf
to
adfc971
Compare
@kevinjqliu Done! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ndrluis Thanks for working on this, and sorry for leaving this hanging for so long. I have some small comments, but it looks good to me 👍
if update.snapshot_id != update.statistics.snapshot_id: | ||
raise ValueError("Snapshot id in statistics does not match the snapshot id in the update") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit of an awkward check, but something that we have to live with I guess.
2034836
to
ba64764
Compare
8fe9992
to
217bb95
Compare
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
217bb95
to
45c60cb
Compare
45c60cb
to
fb9b2a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, thanks @ndrluis 🙌
The Java expire snapshot process expires table statistics and partition statistics. I am implementing a statistics table to make our expire snapshot compatible with the Java implementation.