docs: add link to preprint
mbhall88 committed Dec 3, 2024
1 parent b0cba22 commit 17285e8
Showing 4 changed files with 27 additions and 13 deletions.
20 changes: 15 additions & 5 deletions README.md
@@ -2,13 +2,14 @@

[![check](https://github.com/mbhall88/lrge/actions/workflows/check.yml/badge.svg)](https://github.com/mbhall88/lrge/actions/workflows/check.yml)
[![test](https://github.com/mbhall88/lrge/actions/workflows/test.yml/badge.svg)](https://github.com/mbhall88/lrge/actions/workflows/test.yml)
[![DOI:10.1101/2024.11.27.625777](https://img.shields.io/badge/citation-10.1101/2024.11.27.625777-blue)][doi]

**L**ong **R**ead-based **G**enome size **E**stimation from overlaps

LRGE (pronounced "large") is a command line tool for estimating genome size from long read overlaps. The tool is built
on top of the [`liblrge`][liblrge] Rust library, which is also available as a standalone library for use in other projects.

> PREPRINT/PAPER COMING SOON
> Hall, M. B.; Coin, L. J. M. Genome Size Estimation from Long Read Overlaps. bioRxiv 2024, 2024.11.27.625777. doi:[10.1101/2024.11.27.625777][doi].
## Table of Contents

@@ -169,7 +170,7 @@ $ cat size.txt
By default, LRGE uses the [two-set strategy](#two-set-strategy) with 10,000 target reads (`-T`) and 5,000 query reads
(`-Q`). You can use the [all-vs-all strategy](#all-vs-all-strategy) by specifying the number of reads to use with the `-n` flag.

In the paper, we ran LRGE on three eukaryotic genomes: *Arabidopsis thaliana* (125 Mbp), *Drosophila melanogaster*
In [the paper][doi], we ran LRGE on three eukaryotic genomes: *Arabidopsis thaliana* (125 Mbp), *Drosophila melanogaster*
(143 Mbp), and *Saccharomyces cerevisiae* (12 Mbp). We used 50,000 query and 100,000 target reads for *A. thaliana* and
*D. melanogaster*, and 10,000 query and 20,000 target reads for *S. cerevisiae*.

@@ -433,15 +434,24 @@ You can find the full details of how we compared methods in the [workflow](./pap

## Citation

If you use LRGE in your research, please cite the following paper:
If you use LRGE in your research, please cite the following [paper][doi]:

```bibtex
COMING SOON
@article{hall_genome_2024,
title = {Genome size estimation from long read overlaps},
url = {https://biorxiv.org/content/early/2024/12/02/2024.11.27.625777.abstract},
doi = {10.1101/2024.11.27.625777},
journal = {bioRxiv},
author = {Hall, Michael B and Coin, Lachlan J M},
month = jan,
year = {2024},
pages = {2024.11.27.625777},
}
```

[apptainer]: https://github.com/apptainer/apptainer
[docker]: https://docs.docker.com/
[doi]: https://doi.org/TODO
[doi]: https://doi.org/10.1101/2024.11.27.625777
[ghcr]: https://github.com/mbhall88/lrge/pkgs/container/lrge
[liblrge]: https://www.docs.rs/liblrge
[quay.io]: https://quay.io/repository/mbhall88/lrge
12 changes: 7 additions & 5 deletions liblrge/src/estimate.rs
@@ -26,19 +26,18 @@ pub trait Estimate {
/// A `Vec<f32>` containing the generated estimates. These estimates may be finite or infinite.
fn generate_estimates(&mut self) -> crate::Result<(Vec<f32>, u32)>;

// todo add link to paper
/// Generate an estimate of the genome size, taking the median of the per-read estimates.
///
/// # Arguments
///
/// * `finite`: Whether to consider only finite estimates. We found setting this to `true` gave
/// more accurate results (see the paper).
/// more accurate results (see [the paper][doi]).
/// * `lower_quant`: The lower percentile to calculate. If `None`, this will not be calculated.
/// This value should be between 0 and 0.5. So, for the 25th percentile, you would pass `0.25`.
/// * `upper_quant`: The upper percentile to calculate. If `None`, this will not be calculated.
/// This value should be between 0.5 and 1.0. So, for the 75th percentile, you would pass `0.75`.
///
/// In our analysis, we found that the 15th and 65th percentiles gave the highest confidence (~92%).
/// In [our analysis][doi], we found that the 15th and 65th percentiles gave the highest confidence (~92%).
/// If you want to use our most current recommended values, you can use the constants [`LOWER_QUANTILE`]
/// and [`UPPER_QUANTILE`]. You can of course use any values you like.
///
@@ -51,6 +50,8 @@ pub trait Estimate {
///
/// The estimate will be `None` if there are no finite estimates when `finite` is `true`, or if
/// there are no estimates at all.
///
/// [doi]: https://doi.org/10.1101/2024.11.27.625777
fn estimate(
&mut self,
finite: bool,
@@ -130,13 +131,14 @@ fn calculate_quantile(data: &[f32], quantile: f32) -> Option<f32> {
}
}

// todo add link to paper
/// Estimate genome size using the formula from Equation 3 in the paper.
/// Estimate genome size using the formula from Equation 3 in [the paper][doi].
///
/// # Returns
///
/// A floating point number representing the estimated genome size. If the number of overlaps is 0,
/// this function will return [`f32::INFINITY`].
///
/// [doi]: https://doi.org/10.1101/2024.11.27.625777
pub(crate) fn per_read_estimate(
read_len: usize,
avg_target_len: f32,
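The doc comments in this hunk describe taking the median of the per-read estimates, optionally keeping only finite values, and optionally reporting a lower and an upper quantile. A minimal sketch of that idea follows. It is not liblrge's implementation: the names `quantile` and `summarise` are hypothetical, and linear interpolation between neighbouring ranks is an assumption (the crate's own `calculate_quantile` may use a different rule).

```rust
/// Quantile of an ascending-sorted slice, with `q` in [0, 1].
/// Assumption: linear interpolation between the two nearest ranks.
fn quantile(sorted: &[f32], q: f32) -> Option<f32> {
    if sorted.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let pos = q * (sorted.len() - 1) as f32;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    // Equal neighbours (including two infinities) need no interpolation,
    // which also avoids computing `inf - inf`.
    if sorted[lo] == sorted[hi] {
        return Some(sorted[lo]);
    }
    let frac = pos - lo as f32;
    Some(sorted[lo] + (sorted[hi] - sorted[lo]) * frac)
}

/// Hypothetical helper: returns (lower quantile, median, upper quantile)
/// of the per-read estimates. When `finite` is true, infinite estimates
/// are dropped before any statistic is computed.
fn summarise(
    estimates: &[f32],
    finite: bool,
    lower_quant: Option<f32>,
    upper_quant: Option<f32>,
) -> (Option<f32>, Option<f32>, Option<f32>) {
    let mut data: Vec<f32> = estimates
        .iter()
        .copied()
        .filter(|x| !x.is_nan() && (!finite || x.is_finite()))
        .collect();
    data.sort_by(|a, b| a.total_cmp(b));

    let lower = lower_quant.and_then(|q| quantile(&data, q));
    let median = quantile(&data, 0.5);
    let upper = upper_quant.and_then(|q| quantile(&data, q));
    (lower, median, upper)
}
```

With this sketch, `summarise(&estimates, true, Some(0.15), Some(0.65))` would mirror the 15th and 65th percentiles mentioned in the doc comment, and the median comes back as `None` when no estimates survive the filter.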
2 changes: 1 addition & 1 deletion liblrge/src/lib.rs
@@ -124,7 +124,7 @@
//!
//! [log]: https://crates.io/crates/log
//! [env_logger]: https://crates.io/crates/env_logger
// todo add link to paper
//! [doi]: https://doi.org/10.1101/2024.11.27.625777
#[deny(missing_docs)]
pub mod ava;
pub mod error;
6 changes: 4 additions & 2 deletions paper/README.md
@@ -1,4 +1,4 @@
This directory contains all the code and data used in the paper. The code is organised in the following way:
This directory contains all the code and data used in [the paper][doi]. The code is organised in the following way:

- [`workflow/`](./workflow): Contains the code to reproduce the analyses of the paper. It requires [Snakemake](https://snakemake.readthedocs.io/en/stable/) to run.
- [`config/`](./config): Contains the configuration files for the workflow as well as the metadata for the samples. The final dataset used in the
@@ -8,4 +8,6 @@ This directory contains all the code and data used in the paper. The code is org
- [`scripts/`](./scripts): Miscellaneous scripts used for the paper, but not directly part of the workflow.
- [`notebooks/`](./notebooks): Jupyter notebooks used for the paper. These are not part of the workflow, but were used to generate figures and tables.
- [`results/`](./results): Contains the results of the workflow. The final estimates used in the paper are available in [`results/estimates/estimates.tsv`](./results/estimates/estimates.tsv).
The figures and tables for the paper are available in [`results/figures/`](./results/figures) and [`results/tables/`](./results/tables), respectively.
The figures and tables for the paper are available in [`results/figures/`](./results/figures) and [`results/tables/`](./results/tables), respectively.

[doi]: https://doi.org/10.1101/2024.11.27.625777
