Some quick benchmarking of the sharded format #1338
-
It would be very informative for someone to run these tests using a non-local storage system (e.g. Lustre, NFS, S3). Testing with a local filesystem largely eliminates the latency problem associated with small files, which is vastly amplified with network or internet storage.
-
Great benchmarks @rabernat, thanks for sharing this data. My opinion on the concurrency issue is that low-level synchronous IO in zarr, like the loop you linked to, should probably be replaced with an asynchronous API even for unsharded, "vanilla" stores. If zarr already implements batched asynchronous IO, then it shouldn't be too hard to augment that with the sharding information.
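To make that concrete, here is a minimal sketch of what a batched asynchronous store interface could look like; the AsyncStore protocol, the get_items method, and the in-memory example are illustrative assumptions, not an existing zarr API.

```python
import asyncio
from typing import Mapping, Protocol, Sequence


class AsyncStore(Protocol):
    """Hypothetical batched async store interface (not an existing zarr API)."""

    async def get_items(self, keys: Sequence[str]) -> Mapping[str, bytes]:
        """Fetch many chunk payloads concurrently, returning key -> bytes."""
        ...


class InMemoryAsyncStore:
    """Toy implementation; a real store would await network IO instead."""

    def __init__(self, data: dict):
        self._data = data

    async def get_items(self, keys: Sequence[str]) -> Mapping[str, bytes]:
        async def one(key: str):
            return key, self._data[key]

        return dict(await asyncio.gather(*(one(k) for k in keys)))
```

A sharding-aware store could implement get_items by grouping the requested keys per shard and issuing one request per shard instead of one per chunk.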
-
I was very excited to see that #1111 has landed, and I finally decided to take sharding for a spin. My goal was to understand how the sharded format performs under various read / write scenarios. @jstriebel this is all probably very obvious to you, but hopefully a fresh (naive?) perspective is helpful.
My baseline expectation was that the sharded format would be slower for large read / write operations but faster for small random access.
Experiments
These were all run on my Mac M2 laptop with an SSD.
Baseline: No Sharding, 32 MB chunks
This creates a 1.6 GB array with 32 MB chunks. Pretty standard for our use cases.
I tested three operations: write the whole array, read the whole array, and randomly access a single element.
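For context, a minimal sketch of this baseline benchmark; the shape, dtype, seed, and store path are illustrative assumptions, chosen so that ~200 million float64 values give a ~1.6 GB array with 32 MB chunks.

```python
import numpy as np
import zarr

# Illustrative sizes: 200M float64 values ≈ 1.6 GB total, 4M values per chunk ≈ 32 MB.
shape, chunks = (200_000_000,), (4_000_000,)

z = zarr.open("baseline.zarr", mode="w", shape=shape, chunks=chunks, dtype="f8")
data = np.random.default_rng(0).random(shape)

z[:] = data            # write the whole array
_ = z[:]               # read the whole array
_ = z[123_456_789]     # randomly access a single element
```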
Small Chunks (320 KB)
This is also a standard Zarr array, but with 100x smaller chunks. I expected this to be slower for large I/O operations, but faster for a small read.
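The same sketch with 100x smaller chunks (again illustrative numbers: 40,000 float64 values ≈ 320 KB per chunk):

```python
import zarr

z_small = zarr.open(
    "small_chunks.zarr", mode="w",
    shape=(200_000_000,), chunks=(40_000,), dtype="f8",
)
```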
Sharded
This creates an array with the same small chunks, but packed 100 chunks into a shard.
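A rough sketch of how such an array might be created with the experimental sharding transformer introduced in #1111; the import path, constructor arguments, and the need for the experimental v3 API flag are assumptions on my part and may differ from the actual API.

```python
import zarr
# Assumed experimental import path; at the time of #1111 this likely requires
# the zarr v3 experimental API (e.g. setting ZARR_V3_EXPERIMENTAL_API=1).
from zarr._storage.v3_storage_transformers import ShardingStorageTransformer

# Pack 100 of the small chunks into each shard (argument names are assumed).
sharding = ShardingStorageTransformer("indexed", chunks_per_shard=(100,))

z_sharded = zarr.create(
    shape=(200_000_000,), chunks=(40_000,), dtype="f8",
    store="sharded.zarr", zarr_version=3,
    storage_transformers=[sharding],
)
```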
As predicted, random access in the sharded format is very fast. The full read is also on par with the small-chunk array without shards. But the write is the real problem.
Thoughts on optimizations
Writing is extremely inefficient
Writing the whole array is extremely slow. I understand why this is: each tiny chunk write rewrites the entire shard and regenerates the index! That's because the Array level of Zarr is totally unaware of the existence of the storage transformer. It is trying to write each small chunk in a loop via this code path: zarr-python/zarr/core.py, line 1820 (at commit 4dc6f1f).
It would be much, much better if the higher layers of the stack could recognize that they should batch together all of the chunks in a shard and write them in one go, perhaps in a streaming manner.
A similar optimization could be made for reading; if we know that we want all the chunks in a shard, we only need to read the index once. This could matter a lot for high-latency stores.
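As an illustration of the read side, a minimal sketch of that batching idea; shard_key_for, read_shard, and extract_chunk are hypothetical helpers standing in for whatever the storage transformer actually exposes.

```python
from collections import defaultdict


def read_chunks_by_shard(chunk_keys, shard_key_for, read_shard, extract_chunk):
    """Read many chunks while touching each shard (and its index) only once.

    shard_key_for, read_shard, and extract_chunk are hypothetical helpers:
    map a chunk key to its shard, fetch a whole shard's bytes plus index,
    and slice a single chunk out of an in-memory shard, respectively.
    """
    by_shard = defaultdict(list)
    for key in chunk_keys:
        by_shard[shard_key_for(key)].append(key)

    chunks = {}
    for shard_key, keys in by_shard.items():
        shard = read_shard(shard_key)          # one read: shard bytes + index
        for key in keys:
            chunks[key] = extract_chunk(shard, key)
    return chunks
```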
So the big question is: do we need to make Array aware of the shard / chunk hierarchy to realize such optimizations?
Concurrency and Multithreading
In zarr-python, we leave concurrency to either the outer library (e.g. Dask calling Zarr) or the inner library (e.g. Zarr calling fsspec async wrappers). Now that we have to manage chunks within shards, it seems like we might want to reconsider this design choice. If we manage the concurrency around chunks-within-shards, we might be able to provide some major performance enhancements.
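For example, a hedged sketch of managing that concurrency inside Zarr, fetching all the chunks of one shard concurrently with asyncio; fetch_chunk_bytes and read_shard_chunks are hypothetical helpers, not existing zarr functions.

```python
import asyncio


async def fetch_chunk_bytes(store, key):
    # Hypothetical async getter; in practice this could delegate to an async
    # store backend (e.g. an fsspec async filesystem) or, as here, run the
    # blocking dict-style store access in a thread pool.
    return await asyncio.to_thread(store.__getitem__, key)


async def read_shard_chunks(store, chunk_keys):
    """Fetch all chunks of a shard concurrently instead of one at a time."""
    payloads = await asyncio.gather(
        *(fetch_chunk_bytes(store, k) for k in chunk_keys)
    )
    return dict(zip(chunk_keys, payloads))
```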