The `chunks` parameter is exposed in many functions throughout the API and controls how chunks of data in the underlying zarr storage are mapped onto chunks of computation in dask.
It is a key parameter affecting computational performance and scalability.
Currently the default value of this parameter is "native" in all functions, meaning that zarr chunks are mapped directly to dask chunks.
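For concreteness, a minimal sketch of what the "native" mapping amounts to in plain zarr/dask terms (the store path below is illustrative, not a real dataset path):

```python
import dask.array as da
import zarr

# "native" mapping: dask chunks mirror the zarr chunks exactly.
# The store path is illustrative only.
z = zarr.open_array("snp_genotypes.zarr", mode="r")
gt = da.from_zarr(z)  # no chunks given, so dask chunking follows z.chunks
print(z.chunks, gt.chunks)
```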
In some cases this can lead to poor performance when running functions that require a large scan across a SNP genotype array, particularly on a distributed cluster, for example running a PCA across all Ag3 samples on chromosome 3RL. The main problem is that the zarr genotype data is chunked relatively finely, at around 30 MB per chunk, which leads to a very large number of tasks in the dask graph. The size and/or complexity of this graph then causes problems for the dask scheduler.
Through experimentation I've found that mapping 10 zarr chunks to 1 dask chunk, with the zarr chunks being joined along the first (variants) dimension, leads to much better dask performance on a distributed cluster for these large genotype scans.
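A rough sketch of that experiment in plain zarr/dask terms (the path is illustrative, and the factor of 10 is simply the value that worked well here, not a library default):

```python
import dask.array as da
import zarr

# Illustrative path; the genotype array is chunked ~30 MB in zarr.
z = zarr.open_array("snp_genotypes.zarr", mode="r")
native = z.chunks  # e.g. (variants, samples, ploidy) chunk shape

# Join 10 zarr chunks into 1 dask chunk along the first (variants) dimension,
# keeping the other dimensions at their native chunking. Dask chunk boundaries
# stay aligned with zarr chunk boundaries, so each task still reads whole zarr chunks.
chunks = (native[0] * 10,) + tuple(native[1:])
gt = da.from_zarr(z, chunks=chunks)
```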
However, this only applies to "call" arrays like SNP genotype data. We do not want to do the same for "variant" arrays such as position or alleles, as this leads to memory problems.
I suggest we add support for setting separate defaults for "variant" and "call" arrays. The "variant" arrays would retain a default of "native" everywhere, while the "call" arrays would have a different default when called from functions likely to perform a large scan, such as PCA or GWSS.
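One possible shape for this, purely as a sketch; the helper name, the array-kind labels and the scan flag below are illustrative and not part of the existing API:

```python
def _resolve_chunks(array_kind, chunks, native_chunks, scan=False):
    """Illustrative helper: choose dask chunks for one zarr array.

    array_kind: "variant" (position, alleles, ...) or "call" (genotypes, ...).
    chunks: the user-supplied value, defaulting to "native".
    native_chunks: the zarr array's own chunk shape.
    scan: True when called from a large-scan function such as PCA or GWSS.
    """
    if chunks != "native":
        # an explicit user setting wins for all arrays
        return chunks
    if array_kind == "call" and scan:
        # join 10 zarr chunks along the variants dimension for big scans
        return (native_chunks[0] * 10,) + tuple(native_chunks[1:])
    # "variant" arrays (and non-scan calls) keep the native zarr chunking
    return native_chunks
```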