estimating breadth of coverage with strobealign --aemb #456

Open
wwood opened this issue Nov 12, 2024 · 3 comments

@wwood

wwood commented Nov 12, 2024

Hi,

Just wondering whether this might be possible? Or no?

I ran into this while trying to use strobealign --aemb to calculate the coverage of a genome in CoverM - by default, CoverM requires 10% of the genome to be covered by at least one read, otherwise the coverage is set to 0.

Thanks.

@ksahlin
Owner

ksahlin commented Nov 12, 2024

Hi @wwood ,

By breadth of coverage, I assume you mean the fraction of positions along a sequence with at least X reads mapped.

It is not possible with strobealign --aemb. The only way I see to compute it is to run strobealign in default mode (to get a SAM file) and then run a downstream tool on the output (maybe CoverM has such a feature?).

I have heard this request from another user as well, so it is perhaps something to consider, although storing a coverage array per contig would add significant memory overhead.

CC @luispedro, as aemb is really his brainchild.

@wwood
Author

wwood commented Nov 12, 2024

Thanks for the quick response - yes, you are right about what I was after.

And yes, CoverM can already do this (with -m covered_fraction or covered_bases). But as you say, that requires running without --aemb so that SAM is output, which is a bit slower.

Instead of a full array of per-base coverages, you could use the mosdepth data structure, which is more efficient: you only need to increment a counter where a read starts mapping and decrement one where it stops, then take a running sum to recover depth. That makes it faster, though not less memory intensive. A minimal sketch is below.
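
A minimal sketch of the delta-array idea (C++ for concreteness; the names are made up and this is not mosdepth's actual code):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// mosdepth-style delta array: O(1) work per alignment, then one
// linear prefix-sum pass to recover per-base depth.
struct ContigDepth {
    std::vector<int32_t> delta;  // contig length + 1 entries, all zero

    explicit ContigDepth(std::size_t contig_len) : delta(contig_len + 1, 0) {}

    // Record an alignment covering [start, end) in O(1).
    void add_read(std::size_t start, std::size_t end) {
        ++delta[start];
        --delta[end];
    }

    // Breadth: fraction of positions with depth >= min_depth.
    double covered_fraction(int32_t min_depth = 1) const {
        std::size_t len = delta.size() - 1;
        int64_t depth = 0;
        std::size_t covered = 0;
        for (std::size_t i = 0; i < len; ++i) {
            depth += delta[i];  // running sum = depth at position i
            if (depth >= min_depth) ++covered;
        }
        return len == 0 ? 0.0 : static_cast<double>(covered) / len;
    }
};
```

The array is still one integer per reference base, which is where the memory concern above comes from.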

I idly wonder whether that data structure could somehow be compressed for metagenomic read mapping, where there are many (and many low-coverage) contigs. For a start, the mosdepth array could be run-length encoded, but that would mean losing constant-time insertion. So maybe some tradeoff, like splitting each contig into 10kb segments, or whitelisting contigs that pass a threshold so there's no further need to keep a coverage array for them.

Another option would be to bin the contigs into, say, 500bp segments and track how many of those segments are covered by a read. It's not exactly the same, but it would be a much less memory-intensive approximation than a full coverage array, and everything stays constant time. For entire genomes of 500kb+ it would probably be near-identical. A rough sketch is below.
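
A rough sketch of the binning idea (again C++, with made-up names; marking segments is effectively O(1) for short reads):

```cpp
#include <cstddef>
#include <vector>

// Binned breadth approximation: one flag per 500 bp segment instead
// of a per-base coverage array.
struct BinnedBreadth {
    static constexpr std::size_t kBin = 500;
    std::vector<bool> hit;  // packed bitset, one flag per segment

    explicit BinnedBreadth(std::size_t contig_len)
        : hit((contig_len + kBin - 1) / kBin, false) {}

    // Mark every segment touched by an alignment covering [start, end),
    // with end > start. A short read touches at most two segments.
    void add_read(std::size_t start, std::size_t end) {
        for (std::size_t b = start / kBin; b <= (end - 1) / kBin; ++b)
            hit[b] = true;
    }

    // Approximate breadth: fraction of segments hit by at least one read.
    double covered_fraction() const {
        std::size_t n = 0;
        for (bool h : hit) n += h;
        return hit.empty() ? 0.0 : static_cast<double>(n) / hit.size();
    }
};
```

At one bit per 500 bases, this is several orders of magnitude smaller than a per-base integer array.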

Anyway, curious what you think Luis.

@luispedro
Contributor

I think it is conceptually not too difficult to extend AEMB to produce some estimate of breadth, particularly if you are worried about extreme cases (eliminating very-low-coverage contigs/genomes) rather than a precise estimate of coverage (being able to distinguish 10% coverage from 8% or 15%).

You could probably just check whether the strobemers in the reads cover enough of the strobemers from the reference (IIRC, KrakenUniq is based on a similar idea, but it's been a while since I looked at it). Roughly the direction I mean is sketched below.
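
A toy illustration only: plain k-mers stand in for strobemers here, and none of this is strobealign's actual seeding code.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Collect all k-mers of a sequence (stand-in for strobemer extraction).
static std::unordered_set<std::string> seeds(const std::string& seq,
                                             std::size_t k) {
    std::unordered_set<std::string> s;
    for (std::size_t i = 0; i + k <= seq.size(); ++i)
        s.insert(seq.substr(i, k));
    return s;
}

// Fraction of reference seeds also seen in the reads: a cheap proxy
// for breadth that needs no per-base coverage array at all.
double seed_breadth(const std::unordered_set<std::string>& ref_seeds,
                    const std::unordered_set<std::string>& read_seeds) {
    if (ref_seeds.empty()) return 0.0;
    std::size_t hits = 0;
    for (const auto& s : ref_seeds)
        if (read_seeds.count(s) > 0) ++hits;
    return static_cast<double>(hits) / ref_seeds.size();
}
```

That should be enough to flag near-zero-breadth contigs, even if it can't distinguish 10% coverage from 15%.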

I am happy to point people in the right direction, but I don't have the free cycles to write the code or test the idea.
