PyIceberg Production Use case survey #1202

Open
kevinjqliu opened this issue Sep 24, 2024 · 9 comments

@kevinjqliu
Contributor

Feature Request / Improvement

As part of the journey toward version 1.0, we want to capture how this library is used in "production" environments.

Would love to hear from current users (and potential users) on different use cases. This will better inform the future roadmap.

Please include use cases in this issue, or if necessary I can start a Google Survey.

@kevinjqliu kevinjqliu pinned this issue Sep 24, 2024
@mariotaddeucci

Hey, I'm actually using it in production for small datasets in combination with DuckDB, especially to avoid small files when web scraping.

For ingestion, I read many raw files (JSON, CSV, and Parquet), all of them keyed with a ULID (a sortable ID is necessary), and write them with overwrite, specifying this key as the overwrite filter.
DuckDB generates a record_batch_reader, which makes it possible to create the table and schema without loading everything into memory. After creating the table, it is necessary to convert the data into an Arrow table to write the final Iceberg table.

Because of the sortable ID, it's possible to use the filter predicate to overwrite the data between the lower and upper bounds of the dataset being ingested.
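A minimal sketch of the bounded overwrite described above, assuming a string key column (here hypothetically named `id`) holding canonical ULIDs, which sort lexicographically. The helper names `key_bounds` and `overwrite_key_range` are illustrative, not part of PyIceberg; only `Table.overwrite` with `overwrite_filter` and the expression classes come from the library.

```python
from typing import Iterable, Tuple


def key_bounds(keys: Iterable[str]) -> Tuple[str, str]:
    """Return the (lower, upper) bounds of a batch of sortable keys such as ULIDs."""
    keys = list(keys)
    if not keys:
        raise ValueError("cannot compute bounds of an empty batch")
    return min(keys), max(keys)


def overwrite_key_range(table, arrow_table, key_column: str = "id") -> None:
    """Replace every existing row whose key falls inside the batch's key range.

    Assumption: `table` is a pyiceberg Table and `arrow_table` a pyarrow.Table.
    """
    # Imported lazily so key_bounds stays usable without pyiceberg installed.
    from pyiceberg.expressions import And, GreaterThanOrEqual, LessThanOrEqual

    lower, upper = key_bounds(arrow_table.column(key_column).to_pylist())
    table.overwrite(
        arrow_table,
        overwrite_filter=And(
            GreaterThanOrEqual(key_column, lower),
            LessThanOrEqual(key_column, upper),
        ),
    )
```

Note that this deletes every existing row inside the key range, including rows that are absent from the incoming batch, so it only behaves like an upsert when each batch covers its range completely.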

Table maintenance still uses Spark for expiring snapshots.

To avoid small files, after a certain period I reload the entire dataset using DuckDB's native Iceberg reader and overwrite it fully (a workaround for the rewrite files procedure).

I would love to expand it to more scenarios, but some features are needed first, such as:

  • Allow writing from a record_batch_reader, so there is no need to load a full Arrow table into memory.
  • Expire snapshots from PyIceberg; that would make maintenance easier, with no external engine or tool.
  • Maybe a simple optimization like bin-packing; it is not the best, but it is better than reading everything and overwriting it.
  • Maybe an integration with DuckDB that just takes the latest metadata location and creates a view on it using DuckDB's native Iceberg reader.
  • A true merge operation, to avoid errors when doing upserts and to make it unnecessary to use the upper and lower bounds of the DataFrame key as the overwrite filter.
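The DuckDB view idea from the list above could look roughly like the sketch below. It assumes DuckDB's `iceberg` extension and its `iceberg_scan` table function, plus the `metadata_location` attribute of a PyIceberg table. The helper names are hypothetical, and object-store credentials are not handled here.

```python
def iceberg_view_sql(view_name: str, metadata_location: str) -> str:
    """Build a DuckDB statement that exposes one Iceberg metadata file as a view."""
    # Keep the view name to a simple identifier so it can be inlined safely.
    if not view_name.replace("_", "").isalnum():
        raise ValueError(f"unsupported view name: {view_name!r}")
    escaped = metadata_location.replace("'", "''")  # escape single quotes for SQL
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM iceberg_scan('{escaped}')"
    )


def create_duckdb_view(con, table, view_name: str) -> None:
    """Assumption: `con` is a duckdb connection and `table` a pyiceberg Table."""
    con.execute("INSTALL iceberg")
    con.execute("LOAD iceberg")
    con.execute(iceberg_view_sql(view_name, table.metadata_location))
```

The view is pinned to the metadata file that was current when it was created, so it would need to be recreated after each new commit to the table.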

These pipelines are moving off the Spark server and running in isolated containers.

@andreapiso

Using pyiceberg alongside Trino. Our ETL is in Trino, and pyiceberg is great for assets where we are doing things like grabbing data from APIs. Instead of storing files and crawling them with something like Glue into Iceberg tables, we can write that data directly into Iceberg so that our Trino pipelines can process it directly. Super convenient!
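A rough sketch of the API-to-Iceberg pattern described above, assuming the API returns a JSON array of flat objects and that the target table already exists. The function names and field list are hypothetical; `Table.append`, `Table.schema().as_arrow()`, and `pyarrow.Table.from_pylist` are the only library calls used.

```python
import json
from typing import Any, Dict, List


def normalize_records(payload: str, fields: List[str]) -> List[Dict[str, Any]]:
    """Parse a JSON array and keep only the expected fields (missing ones become None)."""
    return [{f: rec.get(f) for f in fields} for rec in json.loads(payload)]


def append_records(table, records: List[Dict[str, Any]]) -> None:
    """Append records to an existing table.

    Assumption: `table` is a pyiceberg Table whose schema matches the record fields.
    """
    import pyarrow as pa  # imported lazily; only needed for the actual write

    batch = pa.Table.from_pylist(records, schema=table.schema().as_arrow())
    table.append(batch)
```

Once the append commits, the new snapshot is visible to other engines reading the same catalog, so no separate crawl step is needed.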

@djouallah

I use it mainly for testing XTable conversion from Iceberg to Delta. It is by far the easiest way to generate Iceberg tables :)

@emorfam

emorfam commented Nov 13, 2024

Currently using PyIceberg for monitoring metadata statistics of Iceberg tables in a custom application (e.g. file count, record count, data distribution across partitions). We periodically compute these statistics, write them to Postgres, and hook that up to Grafana. This gives us a better idea of how to optimize Iceberg tables further (e.g. partition layout).
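As an illustration of this kind of metadata monitoring, the sketch below aggregates per-partition rows into table-level statistics. It assumes the rows come from `table.inspect.partitions()`, which returns a PyArrow table with `record_count` and `file_count` columns. The function names and the chosen metrics are hypothetical, and writing the result to Postgres is left out.

```python
from typing import Any, Dict, List


def summarize_partitions(rows: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate per-partition metadata rows into table-level statistics."""
    total_files = sum(r["file_count"] for r in rows)
    total_records = sum(r["record_count"] for r in rows)
    largest = max((r["record_count"] for r in rows), default=0)
    return {
        "partition_count": len(rows),
        "file_count": total_files,
        "record_count": total_records,
        "avg_records_per_file": total_records / total_files if total_files else 0.0,
        # Share of all records held by the biggest partition (a simple skew signal).
        "largest_partition_share": largest / total_records if total_records else 0.0,
    }


def collect_table_stats(table) -> Dict[str, float]:
    """Assumption: `table` is a pyiceberg Table; only metadata is read."""
    return summarize_partitions(table.inspect.partitions().to_pylist())
```

A low `avg_records_per_file` hints at a small-files problem, while a high `largest_partition_share` suggests the partition layout is skewed.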

In the long run we would like to use PyIceberg as a low-cost alternative to Glue streaming (possibly with AWS Lambda or Quix-Streams inside of Fargate). This is especially interesting for applications that are low in data volume but have stricter requirements on data timeliness than batch jobs. Here are some example use cases:

  • Processing assembly-trees in manufacturing that change over time.
  • Ingesting sensor data from production plants that can contain duplicate messages.

MERGE support would be really helpful here. I guess the biggest obstacle will be handling the amount of data that is loaded from the target table during the MERGE operation (e.g. with push-down predicates).
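Independently of MERGE support, duplicate messages within a single incoming batch can be collapsed before the write. The sketch below keeps the latest message per key; the field names `message_id` and `event_time` are hypothetical, and duplicates that already exist in the target table are not handled.

```python
from typing import Any, Dict, Iterable, List


def dedupe_latest(
    messages: Iterable[Dict[str, Any]],
    key: str = "message_id",
    order_by: str = "event_time",
) -> List[Dict[str, Any]]:
    """Keep one message per key, preferring the one with the greatest `order_by` value."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for msg in messages:
        current = latest.get(msg[key])
        # On ties, the message that arrives later in the batch wins.
        if current is None or msg[order_by] >= current[order_by]:
            latest[msg[key]] = msg
    return list(latest.values())
```

The set of keys in the deduplicated batch is also what a push-down predicate would be built from, so that only the matching rows of the target table need to be read.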

Thanks for the great work that the Iceberg community is doing.

@randypitcherii

I use it to mirror tables from one catalog to another all the time. I have scheduled production jobs that do this mirroring before and after my dbt builds.

Pyiceberg is just the best library.

@manuzhang
Contributor

Maybe we can enable Discussions like https://github.com/apache/iceberg-rust/discussions for this purpose

@nickdelnano

nickdelnano commented Jan 7, 2025

👋 We have a few use cases at Yelp using pyiceberg. We use a lot of Spark, Flink, and Athena but have some types of use cases where pyiceberg is a good fit. One example is daily batches that fetch data from 3rd party APIs and need to store the output in the lake so that other engines can read it. These apps are often a single process and Spark isn’t needed.

This pattern of use cases previously used our streaming ingest pipeline via Kafka and Flink, but pyiceberg is simpler, costs less and supports schema evolution better.

@vikramsg

vikramsg commented Jan 8, 2025

> These apps used to write messages to Kafka and then Flink would stream them to our data lake, but pyiceberg is simpler, costs less and supports schema evolution better.

So, do you now let the app directly write to Iceberg?

@nickdelnano

nickdelnano commented Jan 8, 2025

@vikramsg I updated my comment a bit, but yes, that's right.
