PyIceberg Production Use case survey #1202
Comments
Hey, I'm actually using it in production for small datasets in combination with DuckDB, especially to avoid small files from web scraping. For ingestion, I read many raw files (JSON, CSV, and Parquet), key all of them with a ULID (a sortable ID is necessary), and call overwrite with this key as the overwrite filter. Because the ID is sortable, the filter predicate can overwrite the data between the lower and upper bounds of the dataset being ingested. Table maintenance still uses Spark for expiring snapshots. To avoid small files, after a certain period I reload the entire dataset using DuckDB's native Iceberg read and overwrite it fully (a workaround for the rewrite files procedure). I would love to expand it to more scenarios, but some features are necessary, like
These pipelines are moving off the Spark server and running in isolated containers. |
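The bounded overwrite described in the comment above could look roughly like the sketch below. It only builds the row filter from the smallest and largest ID in a batch; the helper name `ulid_range_filter`, the column name `event_id`, and the sample IDs are hypothetical, and the filter is written as a string on the assumption that it is passed to PyIceberg's `Table.overwrite(..., overwrite_filter=...)`.

```python
from typing import Iterable


def ulid_range_filter(ids: Iterable[str], column: str = "id") -> str:
    """Build a row filter spanning the lowest to the highest ID in a batch.

    ULIDs sort lexicographically by creation time, so min() and max() on the
    string values give the bounds of the batch.
    """
    id_list = list(ids)
    if not id_list:
        raise ValueError("cannot build a range filter for an empty batch")
    lower, upper = min(id_list), max(id_list)
    return f"{column} >= '{lower}' and {column} <= '{upper}'"


# Hypothetical batch of ULID keys, deliberately out of order.
batch_ids = [
    "01HZX3K8Q2V7M4N5P6R7S8T9AB",
    "01HZX3K8Q2V7M4N5P6R7S8T9AA",
    "01HZX3K8Q2V7M4N5P6R7S8T9AC",
]
row_filter = ulid_range_filter(batch_ids, column="event_id")

# Assumed usage with a loaded PyIceberg table and a PyArrow batch:
# table.overwrite(arrow_batch, overwrite_filter=row_filter)
```

Note that an overwrite with this filter replaces every existing row whose key falls inside the range, not only the rows whose keys appear in the new batch, so it fits best when each batch owns a contiguous ID range.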
Using pyiceberg alongside Trino. Our ETL is in Trino, and pyiceberg is great for assets where we are doing things like grabbing data from APIs. Instead of storing files and crawling them into Iceberg tables with something like Glue, we can write that data directly into Iceberg so that our Trino pipelines can process it right away. Super convenient! |
I use it mainly for testing XTable conversion from Iceberg to Delta. It is by far the easiest way to generate Iceberg tables :) |
Currently using PyIceberg for monitoring metadata statistics of Iceberg tables in a custom application (e.g. file count, record count, data distribution across partitions). We periodically compute these statistics, write them to Postgres, and hook that up to Grafana. This gives us a better idea of how to optimize Iceberg tables further (e.g. partition layout). In the long run we would like to use PyIceberg as a low-cost alternative to Glue streaming (possibly with AWS Lambda or Quix-Streams inside of Fargate). This is especially interesting for applications that are low-volume in data but have stricter timeliness requirements than batch jobs. Here are some example use cases:
Thanks for the great work that the Iceberg community is doing. |
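The per-partition statistics mentioned in the comment above (file count, record count, data distribution) could be aggregated along these lines. This is a minimal sketch, not the commenter's application: the helper `summarize_by_partition` and the sample rows are hypothetical, and the row shape (`partition`, `record_count`, `file_size_in_bytes`) is an assumption about what a file-level metadata listing such as `table.inspect.files().to_pylist()` provides.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping, Tuple

PartitionKey = Tuple[Tuple[str, Any], ...]


def summarize_by_partition(
    file_rows: Iterable[Mapping[str, Any]],
) -> Dict[PartitionKey, Dict[str, int]]:
    """Aggregate file count, record count, and bytes for each partition."""
    stats: Dict[PartitionKey, Dict[str, int]] = defaultdict(
        lambda: {"file_count": 0, "record_count": 0, "total_bytes": 0}
    )
    for row in file_rows:
        # Unpartitioned tables are assumed to carry an empty or missing partition.
        partition = row.get("partition") or {}
        key = tuple(sorted(partition.items()))
        entry = stats[key]
        entry["file_count"] += 1
        entry["record_count"] += row["record_count"]
        entry["total_bytes"] += row["file_size_in_bytes"]
    return dict(stats)


# Hypothetical rows standing in for a real metadata listing, e.g.
# rows = table.inspect.files().to_pylist()  (assumed column names)
sample_rows = [
    {"partition": {"dt": "2024-01-01"}, "record_count": 100, "file_size_in_bytes": 2048},
    {"partition": {"dt": "2024-01-01"}, "record_count": 50, "file_size_in_bytes": 1024},
    {"partition": {"dt": "2024-01-02"}, "record_count": 10, "file_size_in_bytes": 512},
]
partition_stats = summarize_by_partition(sample_rows)
```

The resulting dictionary can be flattened into rows and written to a store such as Postgres for dashboards; many small files with few records in one partition is the kind of signal that would suggest compaction or a different partition layout.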
I use it to mirror tables from one catalog to another all the time. I have scheduled production jobs that do this mirroring before and after my dbt builds. Pyiceberg is just the best library. |
Maybe we can enable Discussions like https://github.com/apache/iceberg-rust/discussions for this purpose |
👋 We have a few use cases at Yelp using pyiceberg. We use a lot of Spark, Flink, and Athena but have some types of use cases where pyiceberg is a good fit. One example is daily batches that fetch data from 3rd party APIs and need to store the output in the lake so that other engines can read it. These apps are often a single process and Spark isn’t needed. This pattern of use cases previously used our streaming ingest pipeline via Kafka and Flink, but pyiceberg is simpler, costs less and supports schema evolution better. |
So, do you now let the app directly write to Iceberg? |
@vikramsg I updated my comment a bit, but yes, that's right |
Feature Request / Improvement
As part of the journey toward version 1.0, we want to capture how this library is used in "production" environments.
Would love to hear from current users (and potential users) on different use cases. This will better inform the future roadmap.
Please include use cases in this issue, or if necessary I can start a Google Survey.