Some quick benchmarking of the sharded format #1338
-
It would be very informative for someone to run these tests using a non-local storage system (e.g. Lustre, NFS, S3). Testing with a local filesystem largely eliminates the latency problem associated with small files, which is vastly amplified with network or internet storage.
-
Great benchmarks @rabernat, thanks for sharing this data. My opinion on the concurrency issue is that low-level synchronous IO in zarr, like the loop you linked to, should probably be replaced with an asynchronous API even for unsharded, "vanilla" stores. If zarr already implements batched asynchronous IO, then it shouldn't be too hard to augment that with the sharding information.
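To make that concrete, here is a minimal sketch of what a batched asynchronous store interface could look like; the AsyncStore protocol, the get_items method, and the in-memory example are illustrative assumptions, not an existing zarr API.

```python
import asyncio
from typing import Mapping, Protocol, Sequence


class AsyncStore(Protocol):
    """Hypothetical batched async store interface (not an existing zarr API)."""

    async def get_items(self, keys: Sequence[str]) -> Mapping[str, bytes]:
        """Fetch many chunk payloads concurrently, returning key -> bytes."""
        ...


class InMemoryAsyncStore:
    """Toy implementation; a real store would await network IO instead."""

    def __init__(self, data: dict):
        self._data = data

    async def get_items(self, keys: Sequence[str]) -> Mapping[str, bytes]:
        async def one(key: str):
            return key, self._data[key]

        return dict(await asyncio.gather(*(one(k) for k in keys)))
```

A sharding-aware store could implement get_items by grouping the requested keys per shard and issuing one request per shard instead of one per chunk.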
-
I was very excited to see that #1111 has landed, and I finally decided to take sharding for a spin. My goal was to understand how the sharded format performs under various read / write scenarios. @jstriebel this is all probably very obvious to you, but hopefully a fresh (naive?) perspective is helpful.
My baseline expectation was that the sharded format would be slower for large read / write operations but faster for small random access.
Experiments
These were all run on my Mac M2 laptop with an SSD.
Baseline: No Sharding, 32 MB chunks
This creates a 1.6 GB array with 32 MB chunks. Pretty standard for our use cases.
I tested three operations: write the whole array, read the whole array, and randomly access a single element.
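For context, a minimal sketch of this baseline benchmark; the shape, dtype, seed, and store path are illustrative assumptions, chosen so that ~200 million float64 values give a ~1.6 GB array with 32 MB chunks.

```python
import numpy as np
import zarr

# Illustrative sizes: 200M float64 values ≈ 1.6 GB total, 4M values per chunk ≈ 32 MB.
shape, chunks = (200_000_000,), (4_000_000,)

z = zarr.open("baseline.zarr", mode="w", shape=shape, chunks=chunks, dtype="f8")
data = np.random.default_rng(0).random(shape)

z[:] = data            # write the whole array
_ = z[:]               # read the whole array
_ = z[123_456_789]     # randomly access a single element
```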
Small Chunks (320 KB)
This is also a standard Zarr array, but with 100x smaller chunks. I expected this to be slower for large I/O operations, but faster for a small read.
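The same sketch with 100x smaller chunks (again illustrative numbers: 40,000 float64 values ≈ 320 KB per chunk):

```python
import zarr

z_small = zarr.open(
    "small_chunks.zarr", mode="w",
    shape=(200_000_000,), chunks=(40_000,), dtype="f8",
)
```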
Sharded
This creates an array with the same small chunks, but packed 100 chunks into a shard.
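A rough sketch of how such an array might be created with the experimental sharding transformer introduced in #1111; the import path, constructor arguments, and the need for the experimental v3 API flag are assumptions on my part and may differ from the actual API.

```python
import zarr
# Assumed experimental import path; at the time of #1111 this likely requires
# the zarr v3 experimental API (e.g. setting ZARR_V3_EXPERIMENTAL_API=1).
from zarr._storage.v3_storage_transformers import ShardingStorageTransformer

# Pack 100 of the small chunks into each shard (argument names are assumed).
sharding = ShardingStorageTransformer("indexed", chunks_per_shard=(100,))

z_sharded = zarr.create(
    shape=(200_000_000,), chunks=(40_000,), dtype="f8",
    store="sharded.zarr", zarr_version=3,
    storage_transformers=[sharding],
)
```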
As predicted, random access in the sharded format is very fast. The full read is also on par with the small-chunk array without shards. But the write is the real problem.
Thoughts on optimizations
Writing is extremely inefficient
Writing the whole array is extremely slow. I understand why this is: each tiny chunk write rewrites the entire shard and regenerates the index! That's because the Array level of Zarr is totally unaware of the existence of the storage transformer. It is trying to write each small chunk in a loop via this code path: zarr-python/zarr/core.py, line 1820 (at commit 4dc6f1f).
It would be much, much better if the higher layers of the stack could recognize that they should batch together all of the chunks in a shard and write them in one go, perhaps in a streaming manner.
A similar optimization could be made for reading; if we know that we want all the chunks in a shard, we only need to read the index once. This could matter a lot for high-latency stores.
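As an illustration of the read side, a minimal sketch of that batching idea; shard_key_for, read_shard, and extract_chunk are hypothetical helpers standing in for whatever the storage transformer actually exposes.

```python
from collections import defaultdict


def read_chunks_by_shard(chunk_keys, shard_key_for, read_shard, extract_chunk):
    """Read many chunks while touching each shard (and its index) only once.

    shard_key_for, read_shard, and extract_chunk are hypothetical helpers:
    map a chunk key to its shard, fetch a whole shard's bytes plus index,
    and slice a single chunk out of an in-memory shard, respectively.
    """
    by_shard = defaultdict(list)
    for key in chunk_keys:
        by_shard[shard_key_for(key)].append(key)

    chunks = {}
    for shard_key, keys in by_shard.items():
        shard = read_shard(shard_key)          # one read: shard bytes + index
        for key in keys:
            chunks[key] = extract_chunk(shard, key)
    return chunks
```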
So the big question is: do we need to make Array aware of the shard / chunk hierarchy to realize such optimizations?
Concurrency and Multithreading
In zarr-python, we leave concurrency to either the outer library (e.g. Dask calling Zarr) or the inner library (e.g. Zarr calling fsspec async wrappers). Now that we have to manage chunks within shards, it seems like we might want to reconsider this design choice. If we manage the concurrency around chunks-within-shards, we might be able to provide some major performance enhancements.
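For example, a hedged sketch of managing that concurrency inside Zarr, fetching all the chunks of one shard concurrently with asyncio; fetch_chunk_bytes and read_shard_chunks are hypothetical helpers, not existing zarr functions.

```python
import asyncio


async def fetch_chunk_bytes(store, key):
    # Hypothetical async getter; in practice this could delegate to an async
    # store backend (e.g. an fsspec async filesystem) or, as here, run the
    # blocking dict-style store access in a thread pool.
    return await asyncio.to_thread(store.__getitem__, key)


async def read_shard_chunks(store, chunk_keys):
    """Fetch all chunks of a shard concurrently instead of one at a time."""
    payloads = await asyncio.gather(
        *(fetch_chunk_bytes(store, k) for k in chunk_keys)
    )
    return dict(zip(chunk_keys, payloads))
```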