Non-sequential data retrieval performance (retrieval by index) #1505

MosGeo · 2023-08-19T17:19:53Z

Hi All,

This question is related to #1486 and #1479.

I am interested in using zarr for a problem that requires a lot of non-sequential data retrieval (i.e., certain rows at a time). I ran three tests, which give the different timings below. You can see that direct sequential retrieval is the fastest. Any guidance on how to improve random retrieval by index?

The three scenarios tested:

1. Completely random retrieval by index.
2. Sequential retrieval by index.
3. Sequential retrieval by slicing.

I am interested in the first case in particular. Is zarr the best option for this? How about the storage type? My current preference is SQLiteStore, as I am attaching some other tables to the same database file (for fast metadata retrieval and querying; this might be related to zarr-developers/zarr-specs#154).

Note: the chunking was chosen based on the data retrieval (i.e., I will always retrieve the whole row).

import zarr
import numpy as np

n_experiments = 20000
n_points_per_experiment = 1000

z = zarr.zeros((n_experiments, n_points_per_experiment), chunks=(1, n_points_per_experiment), order='F')
z[:] = 42
z.info

Now, let's create the sampling arrays:

n_samples = 500
indecies_rand = np.random.randint(0, n_points_per_experiment, size=n_samples)
indecies_seq = np.arange(n_samples)

The final results are (not rigorous testing, but it holds overall):

joshmoore · 2023-09-01T08:15:59Z

joshmoore
Sep 1, 2023
Maintainer

Hi @MosGeo. Sorry for having missed this. Without digging deeper, one idea is that the sheer number of chunks in a single folder might be causing you problems on some operating systems. Could you possibly "chunk" your n_experiments to create a deeper hierarchy?

Rather than:

you'd end up with a directory structure something like:
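One possible reading of this suggestion, sketched in plain Python (an assumption, not the layout from the reply itself): split the experiment axis into two axes so that each chunk lands in a subdirectory. GROUP_SIZE, nested_index, and chunk_key are hypothetical names, and the sketch assumes a 3D array with chunks of shape (1, 1, n_points_per_experiment) stored with a "/" separator between chunk indices.

```python
# Hypothetical group size: 20000 experiments -> 200 groups of 100 rows each.
GROUP_SIZE = 100


def nested_index(experiment, group_size=GROUP_SIZE):
    """Map a flat experiment number to (group, position within group)."""
    return divmod(experiment, group_size)


def chunk_key(experiment, group_size=GROUP_SIZE, separator="/"):
    """Build the chunk key for one row, assuming one chunk per row and a
    3D array shaped (n_groups, group_size, n_points_per_experiment)."""
    group, offset = nested_index(experiment, group_size)
    return separator.join(str(part) for part in (group, offset, 0))


print(nested_index(12345))  # (123, 45)
print(chunk_key(12345))     # 123/45/0
```

With this mapping, row 12345 of the original array would be read as element [123, 45, :] of the reshaped one, and no single directory would hold more than GROUP_SIZE entries.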
