ZCollection #995
fbriol
started this conversation in
Show and tell
ZCollection
#995
Replies: 1 comment
-
I wonder how this interacts with the possibility of implementing irregular chunk sizes for zarr (as dask.array does, but without depending on it)? This feature will be required for using zarr as a storage format for awkward and sparse data types. At first glance, I would say that the two are orthogonal and complementary, but I am not certain.
Clarification: this was done by Hive well before pyarrow existed, and s supported by all parquet readers such as spark and fastparquet . |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
To process the measurement of the future SWOT mission, we decided to use the Zarr storage format. This data format allows us to obtain excellent processing performance when parallelizing processing on a laptop or a cluster with Dask to scale our processing.
Zarr allows resizing the shape of the stored tensors to concatenate new data into a tensor without rewriting the entire stored data. On the other hand, if we perform an update that requires reorganizing the data (insertion of new data), it will be necessary to copy the existing data after modifying the tensor shape to update the data we want.
This problem also exists when using the Parquet format to store tabular data. The PyArrow library has solved this problem by introducing a partitioned dataset or multiple files.
The ZCollection library does the same using Zarr as the storage format.
We have implemented partitioning by date (hour, day, month, etc.) or by sequence (to divide the satellite measurement by complete orbit). A collection partitioned by date, with a monthly resolution, may look like on the disk:
It's possible to set the partition update strategy. Now, two options exist:
It's possible to create views on a reference collection, to add and modify variables contained in a reference collection, accessible in reading only.
This library can store data on POSIX, S3, or any other file system supported by "fsspec."
It has examples of using the library available here.
Beta Was this translation helpful? Give feedback.
All reactions