-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
estimating breadth of coverage with strobealign --aemb #456
Comments
Hi @wwood , With breadth of coverage I assume you mean getting the fraction of positions over a sequence with at least X reads mapped. It is not possible with I have heard this request from another user as well. Perhaps something to consider, although it would require significant memory overhead to store coverage arrays per contig. CC @luispedro, as aemb is really his brainchild |
Thanks for the quick response - yes you are right about what I was after. And yes, CoverM can already do this (with You can use the mosdepth data structure which is more efficient instead of an array of per-base coverages (need only to increment where the read starts mapping and decrementing where it stops, which makes it faster but not less memory intensive). I idly wonder whether that data structure could be compressed for metagenomic read mapping where there is many (and many low coverage) contigs somehow. For a start the mosdepth array could be run-length encoded, but that would mean losing the constant time insertion. So maybe some tradeoff like splitting up each 10kb segment or something, or whitelisting contigs pass a threshold so there's no further need to keep a coverage array. Another option would be to bin the contigs into say 500bp segments, and track how many of those segments are covered by a read. Not exactly the same, but would be an approximation for less memory intensity than a full coverage array and everything is constant time. For entire genomes 500kb+ it would probably be near-identical. Anyway, curious what you think Luis. |
I think it is not too difficult conceptually to extend AEMB to have some estimate of breadth. Particularly, if you are worried about extreme cases (eliminating very low coverage contigs/genomes) rather than a precise estimate of coverage (being able to distinguish 10% coverage from 8% or 15%) You can probably just check if the strobemers in the reads cover enough of the strobemers from the reference (IIRC, kraken-uniq is based on a similar idea, but it's been a while since I looked at that). I am happy to point people in the right direction, but I don't have the free cycles to write the code or test the idea, though. |
Hi,
Just wondering whether this might be possible? Or no?
I ran into this trying to use
strobealign --aemb
for calculating the coverage of a genome in CoverM - by default it requires 10% of the genome to be covered by at least 1 read, otherwise coverage is set to 0.Thanks.
The text was updated successfully, but these errors were encountered: