Better defaults for chunks parameters #629

alimanfoo · 2024-09-26T16:48:24Z

The chunks parameter is exposed in many functions throughout the API and controls how chunks of data in the underlying zarr storage are mapped onto chunks of computation in dask.

It is a key parameter affecting computational performance and scalability.

Currently the default value of this parameter is "native" in all functions, meaning that zarr chunks are mapped directly to dask chunks.

In some cases this can lead to poor performance when running functions that require a large scan across a SNP genotype array, and when running on a distributed cluster. For example, running a PCA across all Ag3 samples on chromosome 3RL. The main problem is that the zarr genotype data is chunked relatively small, around 30MB per chunk, and this leads to a large number of tasks in the dask graph. The size and/or complexity of this graph then causes problems for the dask scheduler.

Through experimentation I've found that mapping 10 zarr chunks to 1 dask chunk, with the zarr chunks being joined along the first (variants) dimension, leads to much better dask performance on a distributed cluster for these large genotype scans.

However, this only applies to "call" arrays like SNP genotype data. We do not want to do the same for "variant" arrays such as position or alleles, as this leads to memory problems.

Suggest that we provide support for setting separate defaults for "variant" and "call" arrays. The "variant" arrays would retain a default "native" everywhere. The "call" arrays would have a different default when called from functions likely to perform a large scan, such as PCA or GWSS.

The text was updated successfully, but these errors were encountered:

leehart added the performance label Sep 27, 2024

jonbrenas mentioned this issue Jan 8, 2025

Better defaults for chunks parameters #704

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better defaults for chunks parameters #629

Better defaults for chunks parameters #629

alimanfoo commented Sep 26, 2024

Better defaults for chunks parameters #629

Better defaults for chunks parameters #629

Comments

alimanfoo commented Sep 26, 2024