You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This connector should be documented more clearly for dangerous caveats, namely:
It stores its Kafka topic offsets as ancillary data within the Iceberg snapshot data (snapshot summary). Committing new data to your tables requires the kafka offset to be >= the offsets stored within the connector-created snapshot summary data stored in iceberg.
This can be seen in the Coordinator#commitToTable() method
`
This is very problematic in the following scenarios:
If one does not persist Kafka state across invocations (e.g. docker/k8s without specifying volumes for kafka data) then upon restart this connector will not begin committing data until after the offset is >= to that stored in the iceberg snapshot summary. So, if you commit up to, say, offset 10,000 for a given table, then restart kafka, there won't be a new iceberg commit made for that table until after you get past 10,000. Worse still, this is only filtered on commit (i.e. after your actual data has been written to iceberg), meaning you'll end up with 10,000+ orphaned files in iceberg.
When using this connector, it should probably be the only ingestion source to iceberg. If you have multiple ingestion sources to iceberg, these sources won't be using the same semantics as this connector, and clearly won't be writing this extra snapshot summary data. This is less of an issue because the connector sees the absence of this special snapshot summary (i.e. minOffset == null) as an affirmative signal for commit, but this is definitely a design-smell
It looks like a fair number of the currently outstanding issues for this repo center on these design anomalies. It would definitely be wise to document this.
One solution for those environments not persisting kafka state would be to have a boolean setting that could be set such that the connector disregards the iceberg persisted offsets for its initial commit (this would solve for (1) above, but not (2)).
Otherwise, this connector is a great solution!
The text was updated successfully, but these errors were encountered:
The better design would be to switch to uuid v7 as commit id. In iceberg snapshot store just uuid, without offset.
Then filter only datafiles events with uuid greater than stored.
This connector should be documented more clearly for dangerous caveats, namely:
It stores its Kafka topic offsets as ancillary data within the Iceberg snapshot data (snapshot summary). Committing new data to your tables requires the kafka offset to be >= the offsets stored within the connector-created snapshot summary data stored in iceberg.
This can be seen in the Coordinator#commitToTable() method
`
`
This is very problematic in the following scenarios:
It looks like a fair number of the currently outstanding issues for this repo center on these design anomalies. It would definitely be wise to document this.
One solution for those environments not persisting kafka state would be to have a boolean setting that could be set such that the connector disregards the iceberg persisted offsets for its initial commit (this would solve for (1) above, but not (2)).
Otherwise, this connector is a great solution!
The text was updated successfully, but these errors were encountered: