Out of core PCA #607

Open
alimanfoo opened this issue Sep 17, 2024 · 1 comment
Labels
scalability Issues preventing running functions on larger datasets.

Comments

@alimanfoo
Member

Running a PCA on the whole of the Ag3 dataset is currently fairly challenging. The computation has three main stages:

  1. Scan genotypes to compute allele counts - allows handling of max_missing_an and min_minor_ac parameters.
  2. Scan genotypes and prepare biallelic diplotypes - this is the data that the PCA will actually run on.
  3. Run the SVD.
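For illustration, here is a minimal in-memory sketch of steps 1 and 2 using plain numpy. It is not the actual implementation: the function name `prepare_diplotypes`, the genotype array layout (variants × samples × ploidy, with -1 for a missing call) and the exact filtering semantics of `max_missing_an` and `min_minor_ac` are assumptions made for the example.

```python
import numpy as np


def prepare_diplotypes(gt, max_missing_an=0, min_minor_ac=2):
    """Hypothetical helper: filter variants and build biallelic diplotypes.

    gt is assumed to be an integer array of shape
    (n_variants, n_samples, 2) holding allele indices 0..3,
    with -1 marking a missing allele call.
    """
    gt = np.asarray(gt)
    flat = gt.reshape(gt.shape[0], -1)

    # Step 1: scan genotypes to compute allele counts per variant.
    ac = np.stack([(flat == a).sum(axis=1) for a in range(4)], axis=1)
    an_missing = (flat < 0).sum(axis=1)

    # Keep variants with exactly two observed alleles that pass the
    # missingness and minor allele count thresholds (assumed semantics).
    n_alleles_seen = (ac > 0).sum(axis=1)
    minor_ac = np.sort(ac, axis=1)[:, -2]
    keep = (
        (n_alleles_seen == 2)
        & (an_missing <= max_missing_an)
        & (minor_ac >= min_minor_ac)
    )

    # Step 2: scan genotypes again and count copies of the minor allele
    # carried by each sample (0, 1 or 2). Missing calls count as 0 here.
    order = np.argsort(ac, axis=1, kind="stable")
    minor = order[:, -2]
    dip = (gt[keep] == minor[keep][:, None, None]).sum(axis=2)
    return dip.astype(np.int8), keep


gt = np.array([
    [[0, 0], [0, 1], [1, 1], [0, 1]],    # biallelic, kept
    [[0, 0], [0, 0], [0, 0], [0, 0]],    # invariant, dropped
    [[0, 1], [0, 0], [0, 0], [0, 0]],    # singleton, dropped
    [[0, 2], [2, 2], [0, 0], [-1, -1]],  # missing calls, dropped
    [[0, 2], [0, 1], [0, 0], [0, 0]],    # three alleles, dropped
    [[0, 0], [0, 2], [0, 2], [0, 0]],    # biallelic, kept
])
dip, keep = prepare_diplotypes(gt, max_missing_an=0, min_minor_ac=2)
```

Both passes are independent per variant, which is why they map well onto chunked arrays and a dask cluster.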

The first two steps can be run on a dask cluster, which helps to scale out the computation. However, step 3 currently runs the SVD via scipy, which is in-core, i.e., the whole matrix has to fit in memory on a single machine. The computation is parallelised over threads via the linear algebra backend (e.g., BLAS), but I had to use a machine with 64 vCPUs to get the SVD to finish in a reasonable time (~13 minutes).

Dask also has several out-of-core implementations of SVD, and sgkit implements some of these too, so we could add an option to use one of these for step 3.
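As a rough sketch of what a dask-based step 3 might look like, the example below uses `dask.array.linalg.svd_compressed`, dask's randomized (approximate) SVD. Everything about the data is an assumption: the array is synthetic low-rank data standing in for the diplotypes, and the chunk sizes, number of components and number of power iterations are arbitrary choices, not tuned values.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)
n_variants, n_samples, n_components = 2000, 50, 3

# Synthetic stand-in for the diplotype matrix (variants x samples):
# low-rank structure plus a little noise. Not real genotype data.
x_np = (
    rng.normal(size=(n_variants, n_components))
    @ rng.normal(size=(n_components, n_samples))
    + 0.01 * rng.normal(size=(n_variants, n_samples))
)

# Chunk along the variants axis so each block can be processed separately.
x = da.from_array(x_np, chunks=(500, n_samples))

# Centre each variant before the decomposition.
x = x - x.mean(axis=1, keepdims=True)

# Randomized SVD; only the top k components are computed.
u, s, vt = da.linalg.svd_compressed(x, k=n_components, n_power_iter=2, seed=0)
s, vt = da.compute(s, vt)

# Sample coordinates on the principal components, shape (n_samples, k).
coords = vt.T * s
```

Because `svd_compressed` is approximate, results would need checking against the current in-core SVD on a dataset small enough to run both. `dask.array.linalg.svd` is an exact alternative, subject to its own constraints on how the array is chunked.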

@alimanfoo
Member Author

alimanfoo commented Sep 17, 2024

This is also related to the limit on the number of elements supported by the in-core SVD, described here. If the SVD were computed via dask, this limit would probably not apply.

@leehart leehart added the scalability Issues preventing running functions on larger datasets. label Sep 19, 2024