Better defaults for chunks parameters #629

Open
alimanfoo opened this issue Sep 26, 2024 · 0 comments

The chunks parameter is exposed in many functions throughout the API and controls how chunks of data in the underlying zarr storage are mapped onto chunks of computation in dask.

It is a key parameter affecting computational performance and scalability.

Currently the default value of this parameter is "native" in all functions, meaning that zarr chunks are mapped directly to dask chunks.

In some cases this can lead to poor performance when running functions that require a large scan across a SNP genotype array, and when running on a distributed cluster, for example when running a PCA across all Ag3 samples on chromosome 3RL. The main problem is that the zarr genotype data is stored in relatively small chunks, around 30 MB each, which leads to a large number of tasks in the dask graph. The size and/or complexity of this graph then causes problems for the dask scheduler.

Through experimentation I've found that mapping 10 zarr chunks to 1 dask chunk, with the zarr chunks being joined along the first (variants) dimension, leads to much better dask performance on a distributed cluster for these large genotype scans.
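
To make the effect on graph size concrete, here is a minimal sketch in plain Python of joining 10 zarr chunks along the first dimension. The array shape and native chunk shape are made-up numbers chosen so that one int8 chunk is about 30 MB; they are not the real Ag3 values. The helper names `coarsen_chunks` and `n_blocks` are hypothetical and not part of the API.

```python
import math


def coarsen_chunks(native_chunks, factor=10):
    """Return a chunk shape that joins `factor` native chunks along axis 0."""
    first, *rest = native_chunks
    return (first * factor, *rest)


def n_blocks(shape, chunks):
    """Count the blocks that a chunk shape produces over an array shape."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


# Hypothetical genotype array: (variants, samples, ploidy), int8.
shape = (50_000_000, 3_000, 2)
# Hypothetical native zarr chunks: 300_000 * 50 * 2 bytes = 30 MB each.
native = (300_000, 50, 2)
coarse = coarsen_chunks(native, factor=10)

print("native blocks:", n_blocks(shape, native))
print("coarse blocks:", n_blocks(shape, coarse))
```

Each block becomes at least one task per operation in the dask graph, so a tenfold reduction in blocks along the variants dimension shrinks the graph by roughly the same factor. A tuple like `coarse` could then be passed as the `chunks` argument to `dask.array.from_array`, at the cost of each task holding about 10 times more data in memory.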

However, this only applies to "call" arrays like SNP genotype data. We do not want to do the same for "variant" arrays such as position or alleles, as this leads to memory problems.

I suggest that we support setting separate defaults for "variant" and "call" arrays. The "variant" arrays would retain a default of "native" everywhere. The "call" arrays would have a different default when accessed from functions likely to perform a large scan, such as PCA or GWSS.
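
One possible shape for this, as a rough sketch only: every name below (`ChunkDefaults`, `chunk_defaults_for`, the `"native_x10"` spec string, and the function names in `LARGE_SCAN_FUNCTIONS`) is hypothetical and does not exist in the current API.

```python
from dataclasses import dataclass

NATIVE = "native"
# Hypothetical spec meaning "join 10 native chunks along the variants axis".
COARSE_CALL = "native_x10"

# Hypothetical names for functions that scan a whole genotype array.
LARGE_SCAN_FUNCTIONS = frozenset({"pca", "gwss"})


@dataclass(frozen=True)
class ChunkDefaults:
    """Separate default chunk specs for "variant" and "call" arrays."""

    variant: str = NATIVE
    call: str = NATIVE


def chunk_defaults_for(function_name):
    """Pick defaults: only large-scan functions coarsen "call" arrays."""
    if function_name in LARGE_SCAN_FUNCTIONS:
        return ChunkDefaults(call=COARSE_CALL)
    return ChunkDefaults()


print(chunk_defaults_for("pca"))
print(chunk_defaults_for("snp_calls"))
```

Keeping the two defaults in one small object would let each public function pick its own "call" default while the "variant" default stays "native" throughout, and a value passed explicitly by the user could still override either one.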
