estimating breadth of coverage with strobealign --aemb #456

Open
wwood opened this issue Nov 12, 2024 · 3 comments

@wwood

wwood commented Nov 12, 2024

Hi,

Just wondering whether this might be possible? Or no?

I ran into this while trying to use strobealign --aemb to calculate the coverage of a genome in CoverM - by default, CoverM requires 10% of the genome to be covered by at least one read, otherwise the coverage is set to 0.

Thanks.

@ksahlin
Owner

ksahlin commented Nov 12, 2024

Hi @wwood ,

By breadth of coverage, I assume you mean the fraction of positions along a sequence with at least X reads mapped.

It is not possible with strobealign --aemb. The only way I see to compute it is to run strobealign in default mode (to get a SAM file) and then run a downstream tool on the output (maybe CoverM has such a feature?).

I have heard this request from another user as well, so it is perhaps something to consider, although storing a coverage array per contig would add significant memory overhead.

CC @luispedro, as aemb is really his brainchild.

@wwood
Author

wwood commented Nov 12, 2024

Thanks for the quick response - yes, you are right about what I was after.

And yes, CoverM can already do this (with -m covered_fraction or covered_bases). But as you say, that requires running without --aemb so that SAM is output, which is a bit slower.

Instead of a full array of per-base coverages, you could use the mosdepth data structure, which is more efficient: you only need to increment a counter where a read starts mapping and decrement one where it stops, then take a running sum to recover depth. That makes it faster, though not less memory intensive. A minimal sketch is below.
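
A minimal sketch of the delta-array idea (C++ for concreteness; the names are made up and this is not mosdepth's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// mosdepth-style delta array: O(1) work per alignment, then one
// linear prefix-sum pass to recover per-base depth.
struct ContigDepth {
    std::vector<int32_t> delta;  // contig length + 1 entries, all zero

    explicit ContigDepth(std::size_t contig_len) : delta(contig_len + 1, 0) {}

    // Record an alignment covering [start, end) in O(1).
    void add_read(std::size_t start, std::size_t end) {
        ++delta[start];
        --delta[end];
    }

    // Breadth: fraction of positions with depth >= min_depth.
    double covered_fraction(int32_t min_depth = 1) const {
        std::size_t len = delta.size() - 1;
        int64_t depth = 0;
        std::size_t covered = 0;
        for (std::size_t i = 0; i < len; ++i) {
            depth += delta[i];  // running sum = depth at position i
            if (depth >= min_depth) ++covered;
        }
        return len == 0 ? 0.0 : static_cast<double>(covered) / len;
    }
};
```

The array is still one integer per reference base, which is where the memory concern above comes from.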

I idly wonder whether that data structure could somehow be compressed for metagenomic read mapping, where there are many (and many low-coverage) contigs. For a start, the mosdepth array could be run-length encoded, but that would mean losing constant-time insertion. So maybe some tradeoff, like splitting each contig into 10kb segments, or whitelisting contigs that pass a threshold so there's no further need to keep a coverage array for them.

Another option would be to bin the contigs into, say, 500bp segments and track how many of those segments are covered by a read. It's not exactly the same, but it would be a much less memory-intensive approximation than a full coverage array, and everything stays constant time. For entire genomes of 500kb+ it would probably be near-identical. A rough sketch is below.
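
A rough sketch of the binning idea (again C++, with made-up names; marking segments is effectively O(1) for short reads):

```cpp
#include <cstddef>
#include <vector>

// Binned breadth approximation: one flag per 500 bp segment instead
// of a per-base coverage array.
struct BinnedBreadth {
    static constexpr std::size_t kBin = 500;
    std::vector<bool> hit;  // packed bitset, one flag per segment

    explicit BinnedBreadth(std::size_t contig_len)
        : hit((contig_len + kBin - 1) / kBin, false) {}

    // Mark every segment touched by an alignment covering [start, end),
    // with end > start. A short read touches at most two segments.
    void add_read(std::size_t start, std::size_t end) {
        for (std::size_t b = start / kBin; b <= (end - 1) / kBin; ++b)
            hit[b] = true;
    }

    // Approximate breadth: fraction of segments hit by at least one read.
    double covered_fraction() const {
        std::size_t n = 0;
        for (bool h : hit) n += h;
        return hit.empty() ? 0.0 : static_cast<double>(n) / hit.size();
    }
};
```

At one bit per 500 bases, this is several orders of magnitude smaller than a per-base integer array.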

Anyway, curious what you think Luis.

@luispedro
Contributor

I think it is conceptually not too difficult to extend AEMB to produce some estimate of breadth, particularly if you are worried about extreme cases (eliminating very-low-coverage contigs/genomes) rather than a precise estimate of coverage (being able to distinguish 10% coverage from 8% or 15%).

You could probably just check whether the strobemers in the reads cover enough of the strobemers from the reference (IIRC, KrakenUniq is based on a similar idea, but it's been a while since I looked at it). Roughly the direction I mean is sketched below.
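
A toy illustration only: plain k-mers stand in for strobemers here, and none of this is strobealign's actual seeding code.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Collect all k-mers of a sequence (stand-in for strobemer extraction).
static std::unordered_set<std::string> seeds(const std::string& seq,
                                             std::size_t k) {
    std::unordered_set<std::string> s;
    for (std::size_t i = 0; i + k <= seq.size(); ++i)
        s.insert(seq.substr(i, k));
    return s;
}

// Fraction of reference seeds also seen in the reads: a cheap proxy
// for breadth that needs no per-base coverage array at all.
double seed_breadth(const std::unordered_set<std::string>& ref_seeds,
                    const std::unordered_set<std::string>& read_seeds) {
    if (ref_seeds.empty()) return 0.0;
    std::size_t hits = 0;
    for (const auto& s : ref_seeds)
        if (read_seeds.count(s) > 0) ++hits;
    return static_cast<double>(hits) / ref_seeds.size();
}
```

That should be enough to flag near-zero-breadth contigs, even if it can't distinguish 10% coverage from 15%.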

I am happy to point people in the right direction, but I don't have the free cycles to write the code or test the idea.
