ZCollection #995

fbriol · 2022-03-30T07:43:06Z

fbriol
Mar 30, 2022

To process the measurement of the future SWOT mission, we decided to use the Zarr storage format. This data format allows us to obtain excellent processing performance when parallelizing processing on a laptop or a cluster with Dask to scale our processing.

Zarr allows resizing the shape of the stored tensors to concatenate new data into a tensor without rewriting the entire stored data. On the other hand, if we perform an update that requires reorganizing the data (insertion of new data), it will be necessary to copy the existing data after modifying the tensor shape to update the data we want.

This problem also exists when using the Parquet format to store tabular data. The PyArrow library has solved this problem by introducing a partitioned dataset or multiple files.

The ZCollection library does the same using Zarr as the storage format.

We have implemented partitioning by date (hour, day, month, etc.) or by sequence (to divide the satellite measurement by complete orbit). A collection partitioned by date, with a monthly resolution, may look like on the disk:

    collection/
    ├── year=2022
    │    ├── month=01/
    │    │    ├── time/
    │    │    │    ├── 0.0
    │    │    │    ├── .zarray
    │    │    │    └── .zattrs
    │    │    ├── var1/
    │    │    │    ├── 0.0
    │    │    │    ├── .zarray
    │    │    │    └── .zattrs
    │    │    ├── .zattrs
    │    │    ├── .zgroup
    │    │    └── .zmetadata
    │    └── month=02/
    │         ├── time/
    │         │    ├── 0.0
    │         │    ├── .zarray
    │         │    └── .zattrs
    │         ├── var1/
    │         │    ├── 0.0
    │         │    ├── .zarray
    │         │    └── .zattrs
    │         ├── .zattrs
    │         ├── .zgroup
    │         └── .zmetadata
    └── .zcollection

It's possible to set the partition update strategy. Now, two options exist:

overwrite the old partition with the new one or,
update the time series only with the updated measurements and keep the old ones to complete the partition data.

It's possible to create views on a reference collection, to add and modify variables contained in a reference collection, accessible in reading only.

This library can store data on POSIX, S3, or any other file system supported by "fsspec."

It has examples of using the library available here.

martindurant · 2022-03-30T17:19:44Z

martindurant
Mar 30, 2022
Maintainer

I wonder how this interacts with the possibility of implementing irregular chunk sizes for zarr (as dask.array does, but without depending on it)? This feature will be required for using zarr as a storage format for awkward and sparse data types. At first glance, I would say that the two are orthogonal and complementary, but I am not certain.

The PyArrow library has solved this problem [for parquet] by introducing a partitioned dataset of multiple files

Clarification: this was done by Hive well before pyarrow existed, and s supported by all parquet readers such as spark and fastparquet .

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ZCollection #995

{{title}}

Replies: 1 comment

{{title}}

Select a reply

ZCollection #995

fbriol Mar 30, 2022

Replies: 1 comment

martindurant Mar 30, 2022 Maintainer

fbriol
Mar 30, 2022

martindurant
Mar 30, 2022
Maintainer