feat: incremental Snapshot update #651

Open
wants to merge 4 commits into
base: main

Conversation

@roeap (Collaborator) commented Jan 18, 2025

Note

Removed the breaking-change label, since calling code should not require updates.

What changes are proposed in this pull request?

This PR proposes a new API on Snapshot for updating a snapshot to a more recent version.

    /// Update the `Snapshot` to the target version.
    ///
    /// # Parameters
    /// - `target_version`: desired version of the `Snapshot` after update, defaults to latest.
    /// - `engine`: Implementation of [`Engine`] apis.
    ///
    /// # Returns
    /// - boolean flag indicating if the `Snapshot` was updated.
    pub fn update(
        &mut self,
        target_version: impl Into<Option<Version>>,
        engine: &dyn Engine,
    ) -> DeltaResult<bool>

This allows engines like DuckDB or delta-rs to hold on to a Snapshot and avoid potentially expensive operations on the object store by reading only the new log entries.
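
A minimal sketch of how the `impl Into<Option<Version>>` parameter behaves for callers, using a toy in-memory model rather than the kernel's types. `ToySnapshot`, `ToyLog`, the `String` error type, and the choice to reject downgrades are all assumptions made for illustration; the proposed `Snapshot::update` takes an `engine` and reads the log instead.

```rust
/// Toy stand-in for the kernel's `Version` alias (assumed to be `u64` here).
type Version = u64;

/// Hypothetical in-memory "log": it only knows the latest committed version.
struct ToyLog {
    latest: Version,
}

/// Hypothetical snapshot that only tracks its own version.
struct ToySnapshot {
    version: Version,
}

impl ToySnapshot {
    /// Mirrors the shape of the proposed signature: `None` means "latest".
    /// Returns `Ok(true)` if the snapshot moved to a newer version.
    fn update(
        &mut self,
        target_version: impl Into<Option<Version>>,
        log: &ToyLog,
    ) -> Result<bool, String> {
        // A bare `Version`, `Some(v)`, and `None` all convert into `Option<Version>`.
        let target = target_version.into().unwrap_or(log.latest);
        if target > log.latest {
            return Err(format!("version {} does not exist yet", target));
        }
        if target < self.version {
            // Assumption: this toy model refuses to move backwards.
            return Err(format!(
                "cannot update from {} back to {}",
                self.version, target
            ));
        }
        if target == self.version {
            // Already at the requested version, nothing to do.
            return Ok(false);
        }
        self.version = target;
        Ok(true)
    }
}
```

With this shape, `snapshot.update(1, &log)`, `snapshot.update(Some(1), &log)` and `snapshot.update(None, &log)` are all accepted, which is why existing callers that pass an `Option<Version>` would not need to change.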

This PR affects the following public APIs

  • adds a new update method to Snapshot.
  • exposes enable_type_widening on TableProperties (this made it possible to use the type widening table for testing)
  • extends permissible inputs to Snapshot::try_new

How was this change tested?

  • added new tests for both LogSegment and Snapshot::update

@github-actions github-actions bot added the breaking-change Change that will require a version bump label Jan 18, 2025
Signed-off-by: Robert Pack <[email protected]>

codecov bot commented Jan 18, 2025

Codecov Report

Attention: Patch coverage is 90.34483% with 14 lines in your changes missing coverage. Please review.

Project coverage is 83.69%. Comparing base (8494126) to head (9dcc067).

Files with missing lines                      Patch %   Lines
kernel/src/snapshot.rs                        82.25%    2 Missing and 9 partials ⚠️
kernel/src/log_segment.rs                     96.96%    1 Missing ⚠️
kernel/src/log_segment/tests.rs               97.87%    0 Missing and 1 partial ⚠️
kernel/src/table_properties/deserialize.rs    0.00%     0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #651      +/-   ##
==========================================
+ Coverage   83.66%   83.69%   +0.03%     
==========================================
  Files          75       75              
  Lines       16949    17087     +138     
  Branches    16949    17087     +138     
==========================================
+ Hits        14180    14301     +121     
- Misses       2085     2088       +3     
- Partials      684      698      +14     


@roeap roeap removed the breaking-change Change that will require a version bump label Jan 18, 2025
@github-actions github-actions bot added the breaking-change Change that will require a version bump label Jan 18, 2025
@roeap roeap removed the breaking-change Change that will require a version bump label Jan 18, 2025
@@ -166,21 +166,59 @@ impl LogSegment {
.filter_ok(|x| x.is_commit())
.try_collect()?;

// Return a FileNotFound error if there are no new commits.
// Callers can then decide how to handle this case.
require!(

Collaborator commented:

Splitting it up makes it clearer that we want nonempty 👌

// Callers can then decide how to handle this case.
require!(
!ascending_commit_files.is_empty(),
Error::file_not_found("No new commits.")

Collaborator commented:

I think the error should be more like "No commits in range", or "Failed to find commits in the range". "New commits" makes sense for an update operation, but CDF could operate over old commits.

/// The other LogSegment must be contiguous with this one, i.e. the end
/// version of this LogSegment must be one less than the start version of
/// the other LogSegment and contain only commit files.
pub(crate) fn extend(&mut self, other: &LogSegment) -> DeltaResult<()> {

Collaborator commented:

This passes by reference. I'm not sure when we'd need to keep such a log segment around other than to extend an existing one. Should we pass ownership and avoid the clone?
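
To make the ownership suggestion concrete, here is a rough sketch on a toy type, not the kernel's `LogSegment`. `ToySegment`, its field names, and the `String` error are hypothetical; the point is only that taking `other` by value lets the file list be moved into `self` instead of cloned.

```rust
/// Toy stand-in for a log segment; the field names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct ToySegment {
    /// First commit version covered by this segment.
    start_version: u64,
    /// Last commit version covered by this segment (inclusive).
    end_version: u64,
    /// Commit file names in ascending version order.
    commit_files: Vec<String>,
}

impl ToySegment {
    /// Extend `self` with `other`, taking ownership of `other` so that its
    /// file list is moved rather than cloned.
    fn extend(&mut self, other: ToySegment) -> Result<(), String> {
        // Contiguity check: `other` must start right after `self` ends.
        if self.end_version + 1 != other.start_version {
            return Err(format!(
                "segments are not contiguous: end {} then start {}",
                self.end_version, other.start_version
            ));
        }
        self.end_version = other.end_version;
        // `Vec::extend` consumes `other.commit_files`; each `String` is moved,
        // so no per-file clone is needed.
        self.commit_files.extend(other.commit_files);
        Ok(())
    }
}
```

A caller that still needs the second segment afterwards can pass `other.clone()` explicitly, which keeps the cost visible at the call site.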


let (metadata, protocol) = log_segment.read_metadata_opt(engine)?;
if let Some(p) = &protocol {
p.ensure_read_supported()?;

Collaborator commented:

Should the column mapping mode also be checked in the case where only Protocol is updated? The column_mapping_mode function takes protocol as a parameter, so I imagine whatever column_mapping_mode it outputs should be re-validated.

Collaborator commented:

In fact this is another use case for my TableConfiguration PR where we do all the validations in one place without having to think about these cases.
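
As a rough illustration of "all the validations in one place", the sketch below uses toy types only. `ToyTableConfiguration`, `ToyProtocol`, `ToyMetadata`, `with_protocol`, and the `MAX_READER_VERSION` limit are hypothetical and are not taken from the TableConfiguration PR; the one protocol fact relied on is that column mapping requires reader version 2 or higher.

```rust
/// Toy column mapping modes; names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyColumnMappingMode {
    Disabled,
    Id,
    Name,
}

/// Hypothetical, heavily reduced protocol action.
#[derive(Debug, Clone, PartialEq)]
struct ToyProtocol {
    min_reader_version: u32,
}

/// Hypothetical, heavily reduced metadata action.
#[derive(Debug, Clone, PartialEq)]
struct ToyMetadata {
    column_mapping_mode: ToyColumnMappingMode,
}

/// Assumption for this sketch: the reader understands versions up to 3.
const MAX_READER_VERSION: u32 = 3;

/// Validated pair of metadata and protocol; `try_new` is the only constructor.
#[derive(Debug, PartialEq)]
struct ToyTableConfiguration {
    metadata: ToyMetadata,
    protocol: ToyProtocol,
}

impl ToyTableConfiguration {
    /// Runs every check, regardless of which of the two inputs changed.
    fn try_new(metadata: ToyMetadata, protocol: ToyProtocol) -> Result<Self, String> {
        if protocol.min_reader_version > MAX_READER_VERSION {
            return Err(format!(
                "unsupported reader version {}",
                protocol.min_reader_version
            ));
        }
        // Column mapping needs reader version 2 or higher.
        if metadata.column_mapping_mode != ToyColumnMappingMode::Disabled
            && protocol.min_reader_version < 2
        {
            return Err("column mapping requires reader version >= 2".to_string());
        }
        Ok(Self { metadata, protocol })
    }

    /// Protocol-only update: reuse the current metadata, but rerun every check.
    fn with_protocol(&self, protocol: ToyProtocol) -> Result<Self, String> {
        Self::try_new(self.metadata.clone(), protocol)
    }
}
```

Because both the metadata-changed and protocol-changed paths go through `try_new`, a protocol-only update cannot skip the column mapping check.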

assert_eq!(snapshot.protocol().min_reader_version(), 1);
assert_eq!(snapshot.table_properties.enable_type_widening, None);

snapshot.update(1, &engine).unwrap();

Collaborator commented:

Such a clean n simple api to update! 👌

@@ -137,6 +137,9 @@ pub struct TableProperties {
/// whether to enable row tracking during writes.
pub enable_row_tracking: Option<bool>,

/// whether to enable type widening

Collaborator commented:

I don't see enableTypeWidening in the protocol. How come the protocol doesn't have it when other docs do?
