From ef829853957d1644dd42530623423bf3d304e3e8 Mon Sep 17 00:00:00 2001 From: Sreesh Maheshwar Date: Sat, 18 Jan 2025 13:57:03 +0000 Subject: [PATCH 1/6] Documentation for Location Providers --- mkdocs/docs/configuration.md | 59 ++++++++++++++++++++++++++++++++++++ pyiceberg/table/locations.py | 7 ++++- 2 files changed, 65 insertions(+), 1 deletion(-) diff --git a/mkdocs/docs/configuration.md b/mkdocs/docs/configuration.md index 06eaac1bed..46a27f0177 100644 --- a/mkdocs/docs/configuration.md +++ b/mkdocs/docs/configuration.md @@ -54,6 +54,8 @@ Iceberg tables support table properties to configure table behavior. ### Write options +***TODO:*** Add LocationProvider-related properties here. + | Key | Options | Default | Description | | -------------------------------------- | --------------------------------- | ------- | ------------------------------------------------------------------------------------------- | | `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec. | @@ -195,6 +197,63 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya +## Location Providers + +Iceberg works with the concept of a LocationProvider that determines the file paths for a table's data. PyIceberg +introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via +table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider), +which generates file paths that are optimised for object storage. + +### SimpleLocationProvider + +The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. 
For example, +a non-partitioned table might have a data file with location: + +```txt +s3://my-bucket/my_table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet +``` + +When data is partitioned, the files under a given partition are grouped into a subdirectory, with that partition key +and value as the directory name. For example, a table partitioned over a string column `category` might have a data file +with location: + +```txt +s3://my-bucket/my_table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet +``` + +The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`. + +### ObjectStoreLocationProvider + +When several files are stored under the same prefix, cloud object stores such as S3 often [throttling requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3), +resulting in slowdowns. + +The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories, +into file paths, to distribute files across a larger number of object store prefixes. + +Partitions are included in file paths just before the file name, in a similar manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider). +A table partitioned over a string column `category` might have a data file with location: (note the additional binary directories) + +```txt +s3://my-bucket/my_table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet +``` + +The `write.object-storage.enabled` table property determines whether the ObjectStoreLocationProvider is enabled for a +table. It is used by default. + +When the ObjectStoreLocationProvider is used, the table property `write.object-storage.partitioned-paths`, which +defaults to `true`, can be set to `false` as an additional optimisation. 
This omits partition keys and values from data +file paths *entirely* to further reduce key size. With it disabled, the same data file above would instead be written +to: (note the absence of `category=orders`) + +```txt +s3://my-bucket/my_table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet +``` + +### Loading a Custom LocationProvider + +***TODO***. Maybe link to code reference for LocationProvider? + ## Catalogs PyIceberg currently has native catalog type support for REST, SQL, Hive, Glue and DynamoDB. diff --git a/pyiceberg/table/locations.py b/pyiceberg/table/locations.py index 046ee32527..53b41d1e61 100644 --- a/pyiceberg/table/locations.py +++ b/pyiceberg/table/locations.py @@ -30,7 +30,12 @@ class LocationProvider(ABC): - """A base class for location providers, that provide data file locations for write tasks.""" + """A base class for location providers, that provide data file locations for a table's write tasks. + + Args: + table_location (str): The table's base storage location. + table_properties (Properties): The table's properties. + """ table_location: str table_properties: Properties From 3b9457010d9e2df68af6e87af5213b9c4fe46d09 Mon Sep 17 00:00:00 2001 From: Sreesh Maheshwar Date: Sat, 18 Jan 2025 16:13:40 +0000 Subject: [PATCH 2/6] Finish docs --- mkdocs/docs/configuration.md | 61 +++++++++++++++++++++++++----------- 1 file changed, 42 insertions(+), 19 deletions(-) diff --git a/mkdocs/docs/configuration.md b/mkdocs/docs/configuration.md index 46a27f0177..13d4cd914a 100644 --- a/mkdocs/docs/configuration.md +++ b/mkdocs/docs/configuration.md @@ -54,17 +54,18 @@ Iceberg tables support table properties to configure table behavior. ### Write options -***TODO:*** Add LocationProvider-related properties here. 
- -| Key | Options | Default | Description | -| -------------------------------------- | --------------------------------- | ------- | ------------------------------------------------------------------------------------------- | -| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec. | -| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg | -| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group | -| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk | -| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | -| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | -| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | +| Key | Options | Default | Description | +|------------------------------------------|-----------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------| +| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec. | +| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. 
If not set, it is up to PyIceberg | +| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group | +| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk | +| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | +| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | +| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | +| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | +| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | +| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation | ### Table behavior options @@ -210,7 +211,7 @@ The SimpleLocationProvider places file names underneath a `data` directory in th a non-partitioned table might have a data file with location: ```txt -s3://my-bucket/my_table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet +s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet ``` When data is partitioned, the files under a given partition are grouped into a subdirectory, with that partition key @@ -218,7 +219,7 @@ and value as the directory name. 
For example, a table partitioned over a string with location: ```txt -s3://my-bucket/my_table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet +s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet ``` The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`. @@ -231,28 +232,50 @@ resulting in slowdowns. The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories, into file paths, to distribute files across a larger number of object store prefixes. -Partitions are included in file paths just before the file name, in a similar manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider). -A table partitioned over a string column `category` might have a data file with location: (note the additional binary directories) +Paths contain partitions just before the file name, and a `data` directory beneath the table's location, in a similar +manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider). For example, a table partitioned over a string +column `category` might have a data file with location: (note the additional binary directories) ```txt -s3://my-bucket/my_table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet +s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet ``` The `write.object-storage.enabled` table property determines whether the ObjectStoreLocationProvider is enabled for a table. It is used by default. +#### Partition Exclusion + When the ObjectStoreLocationProvider is used, the table property `write.object-storage.partitioned-paths`, which -defaults to `true`, can be set to `false` as an additional optimisation. 
This omits partition keys and values from data +defaults to `true`, can be set to `false` as an additional optimisation for object stores. This omits partition keys and values from data file paths *entirely* to further reduce key size. With it disabled, the same data file above would instead be written to: (note the absence of `category=orders`) ```txt -s3://my-bucket/my_table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet +s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet ``` ### Loading a Custom LocationProvider -***TODO***. Maybe link to code reference for LocationProvider? +Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base +class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The +table property `write.py-location-provider.impl` should be set to the fully-qualified name of the custom +LocationProvider (i.e. `module.CustomLocationProvider`). Recall that a LocationProvider is configured per-table, +permitting different location provision for different tables. + +An example, custom `LocationProvider` implementation is shown below. 
+ +```py +import uuid + +class UUIDLocationProvider(LocationProvider): + def __init__(self, table_location: str, table_properties: Properties): + super().__init__(table_location, table_properties) + + def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str: + # Can use any custom method to generate a file path given the partitioning information and file name + prefix = f"{self.table_location}/{uuid.uuid4()}" + return f"{prefix}/{partition_key.to_path()}/{data_file_name}" if partition_key else f"{prefix}/{data_file_name}" +``` ## Catalogs From 3ee2695ba6ab3ec7cb75d1dce269114a4fb2e82e Mon Sep 17 00:00:00 2001 From: Sreesh Maheshwar Date: Sun, 19 Jan 2025 11:13:36 +0000 Subject: [PATCH 3/6] Minor spelling fixes --- mkdocs/docs/configuration.md | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/mkdocs/docs/configuration.md b/mkdocs/docs/configuration.md index 13d4cd914a..2e766f9882 100644 --- a/mkdocs/docs/configuration.md +++ b/mkdocs/docs/configuration.md @@ -200,10 +200,10 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya ## Location Providers -Iceberg works with the concept of a LocationProvider that determines the file paths for a table's data. PyIceberg +Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider), -which generates file paths that are optimised for object storage. +which generates file paths that are optimized for object storage. 
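To make the difference concrete, the path construction performed by the two built-in providers described below can be sketched roughly as follows. This is illustrative only: the hash function (`zlib.crc32`), the 20-bit width, and the function names are stand-ins, not PyIceberg's actual implementation.

```python
import zlib


def simple_data_location(table_location: str, file_name: str, partition_path: str = "") -> str:
    # SimpleLocationProvider-style layout: <table location>/data/[<partition>/]<file name>
    parts = [table_location, "data"]
    if partition_path:
        parts.append(partition_path)
    parts.append(file_name)
    return "/".join(parts)


def object_store_data_location(table_location: str, file_name: str, partition_path: str = "") -> str:
    # ObjectStoreLocationProvider-style layout: a deterministic hash, rendered as
    # binary directories, spreads files across many object store prefixes.
    hash_bits = zlib.crc32(file_name.encode("utf-8")) & 0xFFFFF  # keep 20 bits, as in the examples
    binary = format(hash_bits, "020b")
    hash_dirs = "/".join((binary[0:4], binary[4:8], binary[8:12], binary[12:20]))
    parts = [table_location, "data", hash_dirs]
    if partition_path:
        parts.append(partition_path)
    parts.append(file_name)
    return "/".join(parts)


print(simple_data_location("s3://bucket/ns/table", "file.parquet", "category=orders"))
# s3://bucket/ns/table/data/category=orders/file.parquet
print(object_store_data_location("s3://bucket/ns/table", "file.parquet", "category=orders"))
```

In this sketch the hash is derived from the file name alone, so a given file always lands under the same prefix while distinct files fan out across up to 2^20 prefixes.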
### SimpleLocationProvider @@ -214,7 +214,7 @@ a non-partitioned table might have a data file with location: s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet ``` -When data is partitioned, the files under a given partition are grouped into a subdirectory, with that partition key +When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key and value as the directory name. For example, a table partitioned over a string column `category` might have a data file with location: @@ -222,17 +222,18 @@ with location: s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet ``` -The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `false`. +The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table +property to `False`. ### ObjectStoreLocationProvider -When several files are stored under the same prefix, cloud object stores such as S3 often [throttling requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3), +When several files are stored under the same prefix, cloud object stores such as S3 often [throttle requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3), resulting in slowdowns. The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories, into file paths, to distribute files across a larger number of object store prefixes. -Paths contain partitions just before the file name, and a `data` directory beneath the table's location, in a similar +Paths contain partitions just before the file name and a `data` directory beneath the table's location, in a similar manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider). 
For example, a table partitioned over a string column `category` might have a data file with location: (note the additional binary directories) @@ -246,9 +247,9 @@ table. It is used by default. #### Partition Exclusion When the ObjectStoreLocationProvider is used, the table property `write.object-storage.partitioned-paths`, which -defaults to `true`, can be set to `false` as an additional optimisation for object stores. This omits partition keys and values from data -file paths *entirely* to further reduce key size. With it disabled, the same data file above would instead be written -to: (note the absence of `category=orders`) +defaults to `True`, can be set to `False` as an additional optimization for object stores. This omits partition keys and +values from data file paths *entirely* to further reduce key size. With it disabled, the same data file above would +instead be written to: (note the absence of `category=orders`) ```txt s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet From 6be752a2d28be51f451246cd45462403355181d6 Mon Sep 17 00:00:00 2001 From: Sreesh Maheshwar Date: Sun, 19 Jan 2025 22:09:29 +0000 Subject: [PATCH 4/6] Address some comments --- mkdocs/docs/configuration.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/mkdocs/docs/configuration.md b/mkdocs/docs/configuration.md index 2e766f9882..5911cef2be 100644 --- a/mkdocs/docs/configuration.md +++ b/mkdocs/docs/configuration.md @@ -54,18 +54,18 @@ Iceberg tables support table properties to configure table behavior. 
### Write options -| Key | Options | Default | Description | -|------------------------------------------|-----------------------------------|---------|-------------------------------------------------------------------------------------------------------------------------------------| -| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec. | -| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg | -| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group | -| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk | -| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | -| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | -| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. 
| -| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths | -| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | -| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation | +| Key | Options | Default | Description | +|------------------------------------------|--------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec. | +| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg | +| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group | +| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk | +| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | +| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | +| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. 
| +| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths. Note: the default value of `True` differs from the Java implementation | +| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | +| `write.py-location-provider.impl` | String, e.g. `mymodule.myLocationProvider` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation | ### Table behavior options From 76f397b35abaa1555ede59ad5c5a4fce8c5f1374 Mon Sep 17 00:00:00 2001 From: Sreesh Maheshwar Date: Sun, 19 Jan 2025 22:37:28 +0000 Subject: [PATCH 5/6] Address all comments --- mkdocs/docs/configuration.md | 81 ++++++++++++++++++++---------------- 1 file changed, 44 insertions(+), 37 deletions(-) diff --git a/mkdocs/docs/configuration.md b/mkdocs/docs/configuration.md index 5911cef2be..cd6e4a2146 100644 --- a/mkdocs/docs/configuration.md +++ b/mkdocs/docs/configuration.md @@ -54,18 +54,18 @@ Iceberg tables support table properties to configure table behavior. ### Write options -| Key | Options | Default | Description | -|------------------------------------------|--------------------------------------------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec. | -| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. 
If not set, it is up to PyIceberg | -| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group | -| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk | -| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | -| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | -| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | -| `write.object-storage.enabled` | Boolean | True | Enables the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider) that adds a hash component to file paths. Note: the default value of `True` differs from the Java implementation | -| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | -| `write.py-location-provider.impl` | String, e.g. `mymodule.myLocationProvider` | null | Optional, [custom LocationProvider](configuration.md#loading-a-custom-locationprovider) implementation | +| Key | Options | Default | Description | +|------------------------------------------|-----------------------------------|---------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec. | +| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. 
If not set, it is up to PyIceberg | +| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group | +| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk | +| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk | +| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group | +| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. | +| `write.object-storage.enabled` | Boolean | True | Enables the [`ObjectStoreLocationProvider`](configuration.md#objectstorelocationprovider) that adds a hash component to file paths. Note: the default value of `True` differs from Iceberg's Java implementation | +| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled | +| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom `LocationProvider`](configuration.md#loading-a-custom-locationprovider) implementation | ### Table behavior options @@ -200,53 +200,58 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya ## Location Providers -Iceberg works with the concept of a LocationProvider that determines file paths for a table's data. PyIceberg -introduces a pluggable LocationProvider module; the LocationProvider used may be specified on a per-table basis via -table properties. PyIceberg defaults to the [ObjectStoreLocationProvider](configuration.md#objectstorelocationprovider), -which generates file paths that are optimized for object storage. 
+Apache Iceberg uses the concept of a `LocationProvider` to manage file paths for a table's data. In PyIceberg, the +`LocationProvider` module is designed to be pluggable, allowing customization for specific use cases. The +`LocationProvider` for a table can be specified through table properties. -### SimpleLocationProvider +PyIceberg defaults to the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider), which generates +file paths that are optimized for object storage. -The SimpleLocationProvider places file names underneath a `data` directory in the table's storage location. For example, -a non-partitioned table might have a data file with location: +### Simple Location Provider + +The `SimpleLocationProvider` places a table's file names underneath a `data` directory in the table's base storage +location (this is `table.metadata.location` - see the [Iceberg table specification](https://iceberg.apache.org/spec/#table-metadata)). +For example, a non-partitioned table might have a data file with location: ```txt s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet ``` -When data is partitioned, files under a given partition are grouped into a subdirectory, with that partition key -and value as the directory name. For example, a table partitioned over a string column `category` might have a data file -with location: +When the table is partitioned, files under a given partition are grouped into a subdirectory, with that partition key +and value as the directory name - this is known as the *Hive-style* partition path format. 
For example, a table +partitioned over a string column `category` might have a data file with location: ```txt s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet ``` -The SimpleLocationProvider is enabled for a table by explicitly setting its `write.object-storage.enabled` table +The `SimpleLocationProvider` is enabled for a table by explicitly setting its `write.object-storage.enabled` table property to `False`. -### ObjectStoreLocationProvider +### Object Store Location Provider -When several files are stored under the same prefix, cloud object stores such as S3 often [throttle requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3), -resulting in slowdowns. +PyIceberg offers the `ObjectStoreLocationProvider`, and an optional [partition-exclusion](configuration.md#partition-exclusion) +optimization, designed for tables stored in object storage. For additional context and motivation concerning these configurations, +see their [documentation for Iceberg's Java implementation](https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout). -The ObjectStoreLocationProvider counteracts this by injecting deterministic hashes, in the form of binary directories, +When several files are stored under the same prefix, cloud object stores such as S3 often [throttle requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3), +resulting in slowdowns. The `ObjectStoreLocationProvider` counteracts this by injecting deterministic hashes, in the form of binary directories, into file paths, to distribute files across a larger number of object store prefixes. -Paths contain partitions just before the file name and a `data` directory beneath the table's location, in a similar -manner to the [SimpleLocationProvider](configuration.md#simplelocationprovider). 
For example, a table
partitioned over a string column `category` might have a data file with location: (note the additional binary directories)

```txt
s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

The `write.object-storage.enabled` table property determines whether the `ObjectStoreLocationProvider` is enabled for a
table. It is enabled by default.

#### Partition Exclusion

When the `ObjectStoreLocationProvider` is used, the table property `write.object-storage.partitioned-paths`, which
defaults to `True`, can be set to `False` as an additional optimization for object stores. This omits partition keys and
values from data file paths *entirely* to further reduce key size. With it disabled, the same data file above would
instead be written to: (note the absence of `category=orders`)

```txt
s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

### Loading a Custom LocationProvider

Similar to FileIO, a custom LocationProvider may be provided for a table by concretely subclassing the abstract base
class [LocationProvider](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider). The
table property `write.py-location-provider.impl` should be set to the fully-qualified name of the custom
LocationProvider (i.e.
`module.CustomLocationProvider`). Recall that a LocationProvider is configured per-table,
permitting different location provision for different tables.

Similar to FileIO, a custom `LocationProvider` may be provided for a table by concretely subclassing the abstract base
class [`LocationProvider`](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider).

The table property `write.py-location-provider.impl` should be set to the fully-qualified name of the custom
`LocationProvider` (e.g. `mymodule.MyLocationProvider`). Recall that a `LocationProvider` is configured per-table,
permitting different location provision for different tables. Note also that Iceberg's Java implementation uses a
different table property, `write.location-provider.impl`, for custom Java implementations.

An example custom `LocationProvider` implementation is shown below.

From 5ee6deca17d248cb5410c136dc03f0c3c889a227 Mon Sep 17 00:00:00 2001
From: Sreesh Maheshwar
Date: Sun, 19 Jan 2025 22:49:56 +0000
Subject: [PATCH 6/6] Fix all hyperlinks

---
 mkdocs/docs/configuration.md | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/mkdocs/docs/configuration.md b/mkdocs/docs/configuration.md
index cd6e4a2146..e076afdb93 100644
--- a/mkdocs/docs/configuration.md
+++ b/mkdocs/docs/configuration.md
@@ -54,18 +54,18 @@ Iceberg tables support table properties to configure table behavior.

 ### Write options

| Key                                      | Options                           | Default | Description                                                                                                                                                                                                           |
|------------------------------------------|-----------------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec.
|
-| `write.parquet.compression-level`        | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg                                         |
-| `write.parquet.row-group-limit`          | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group                                                 |
-| `write.parquet.page-size-bytes`          | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk                        |
-| `write.parquet.page-row-limit`           | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk                                        |
-| `write.parquet.dict-size-bytes`          | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group                                                                   |
-| `write.metadata.previous-versions-max`   | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit.                            |
-| `write.object-storage.enabled`           | Boolean                           | True    | Enables the [`ObjectStoreLocationProvider`](configuration.md#objectstorelocationprovider) that adds a hash component to file paths. Note: the default value of `True` differs from Iceberg's Java implementation |
-| `write.object-storage.partitioned-paths` | Boolean                           | True    | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
-| `write.py-location-provider.impl`        | String of form `module.ClassName` | null    | Optional, [custom `LocationProvider`](configuration.md#loading-a-custom-locationprovider) implementation           |
+| Key                                      | Options                           | Default | Description                                                                                                        |
+|------------------------------------------|-----------------------------------|---------|--------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression codec.                                                                                |
+| `write.parquet.compression-level`        | Integer                           | null    | Parquet compression level for the codec. If not set, the codec's default level is used                             |
+| `write.parquet.row-group-limit`          | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group                                                 |
+| `write.parquet.page-size-bytes`          | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk                        |
+| `write.parquet.page-row-limit`           | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk                                        |
+| `write.parquet.dict-size-bytes`          | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group                                                                   |
+| `write.metadata.previous-versions-max`   | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit.                            |
+| `write.object-storage.enabled`           | Boolean                           | True    | Enables the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider) that adds a hash component to file paths. Note: the default value of `True` differs from Iceberg's Java implementation |
+| `write.object-storage.partitioned-paths` | Boolean                           | True    | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
+| `write.py-location-provider.impl`        | String of form `module.ClassName` | null    | Optional [custom `LocationProvider`](configuration.md#loading-a-custom-location-provider) implementation           |

### Table behavior options
@@ -260,7 +260,7 @@ instead be written to: (note the absence of `category=orders`)
s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

-### Loading a Custom LocationProvider
+### Loading a Custom Location Provider

Similar to FileIO, a custom `LocationProvider` may be provided for a table by concretely subclassing the abstract base
class [`LocationProvider`](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider).
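As an illustration of the shape such a subclass might take, here is a minimal sketch in plain Python. To keep it self-contained it defines a stand-in base class rather than importing PyIceberg, so the constructor arguments and the `new_data_location` method name are assumptions modelled on PyIceberg's built-in providers, and the `UUIDLocationProvider` scheme itself is hypothetical:

```python
import uuid
from typing import Optional


class LocationProviderBase:
    """Stand-in for pyiceberg.table.locations.LocationProvider (assumed API)."""

    def __init__(self, table_location: str, table_properties: dict):
        self.table_location = table_location.rstrip("/")
        self.table_properties = table_properties


class UUIDLocationProvider(LocationProviderBase):
    """Hypothetical provider: writes each data file under a fresh UUID
    directory, spreading files across many object store prefixes."""

    def new_data_location(self, data_file_name: str, partition_key: Optional[str] = None) -> str:
        # Any deterministic or random scheme may be used here; a UUID per
        # file maximises prefix spread at the cost of directory locality.
        return f"{self.table_location}/data/{uuid.uuid4()}/{data_file_name}"


provider = UUIDLocationProvider("s3://bucket/ns/table/", {})
print(provider.new_data_location("my-file.parquet"))
```

Registering such a provider would then be a matter of setting `write.py-location-provider.impl` to the class's fully-qualified name, as described above.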