docs: add link to preprint

mbhall88 · Dec 3, 2024 · 17285e8
1 parent b0cba22

Showing 4 changed files with 27 additions and 13 deletions.
diff --git a/README.md b/README.md
@@ -2,13 +2,14 @@
 
 [![check](https://github.com/mbhall88/lrge/actions/workflows/check.yml/badge.svg)](https://github.com/mbhall88/lrge/actions/workflows/check.yml)
 [![test](https://github.com/mbhall88/lrge/actions/workflows/test.yml/badge.svg)](https://github.com/mbhall88/lrge/actions/workflows/test.yml)
+[![DOI:10.1101/2024.11.27.625777](https://img.shields.io/badge/citation-10.1101/2024.11.27.625777-blue)][doi]
 
 **L**ong **R**ead-based **G**enome size **E**stimation from overlaps
 
 LRGE (pronounced "large") is a command line tool for estimating genome size from long read overlaps. The tool is built 
 on top of the [`liblrge`][liblrge] Rust library, which is also available as a standalone library for use in other projects.
 
-> PREPRINT/PAPER COMING SOON
+> Hall, M. B.; Coin, L. J. M. Genome Size Estimation from Long Read Overlaps. bioRxiv 2024, 2024.11.27.625777. doi:[10.1101/2024.11.27.625777][doi].
 
 ## Table of Contents
 
@@ -169,7 +170,7 @@ $ cat size.txt
 By default, LRGE uses the [two-set strategy](#two-set-strategy) with 10,000 target reads (`-T`) and 5,000 query reads 
 (`-Q`). You can use the [all-vs-all strategy](#all-vs-all-strategy) by specifying the number of reads to use with the `-n` flag.
 
-In the paper, we ran LRGE on three eukaryotic genomes: *Arabidopsis thaliana* (125 Mbp), *Drosophila melanogaster* 
+In [the paper][doi], we ran LRGE on three eukaryotic genomes: *Arabidopsis thaliana* (125 Mbp), *Drosophila melanogaster* 
 (143 Mbp), and *Saccharomyces cerevisiae* (12 Mbp). We used 50,000 query and 100,000 target reads for *A. thaliana* and 
 *D. melanogaster*, and 10,000 query and 20,000 target reads for *S. cerevisiae*.
 
@@ -433,15 +434,24 @@ You can find the full details of how we compared methods in the [workflow](./pap
 
 ## Citation
 
-If you use LRGE in your research, please cite the following paper:
+If you use LRGE in your research, please cite the following [paper][doi]:
 
 ```bibtex
-COMING SOON
+@article{hall_genome_2024,
+	title = {Genome size estimation from long read overlaps},
+	url = {https://biorxiv.org/content/early/2024/12/02/2024.11.27.625777.abstract},
+	doi = {10.1101/2024.11.27.625777},
+	journal = {bioRxiv},
+	author = {Hall, Michael B and Coin, Lachlan J M},
+	month = jan,
+	year = {2024},
+	pages = {2024.11.27.625777},
+}
 ```
 
 [apptainer]: https://github.com/apptainer/apptainer
 [docker]: https://docs.docker.com/
-[doi]: https://doi.org/TODO
+[doi]: https://doi.org/10.1101/2024.11.27.625777
 [ghcr]: https://github.com/mbhall88/lrge/pkgs/container/lrge
 [liblrge]: https://www.docs.rs/liblrge
 [quay.io]: https://quay.io/repository/mbhall88/lrge
diff --git a/liblrge/src/estimate.rs b/liblrge/src/estimate.rs
@@ -26,19 +26,18 @@ pub trait Estimate {
     /// A `Vec<f32>` containing the generated estimates. These estimates may be finite or infinite.
     fn generate_estimates(&mut self) -> crate::Result<(Vec<f32>, u32)>;
 
-    // todo add link to paper
     /// Generate an estimate of the genome size, taking the median of the per-read estimates.
     ///
     /// # Arguments
     ///
     /// * `finite`: Whether to consider only finite estimates. We found setting this to `true` gave
-    ///   more accurate results (see the paper).
+    ///   more accurate results (see [the paper][doi]).
     /// * `lower_quant`: The lower percentile to calculate. If `None`, this will not be calculated.
     ///   This value should be between 0 and 0.5. So, for the 25th percentile, you would pass `0.25`.
     /// * `upper_quant`: The upper percentile to calculate. If `None`, this will not be calculated.
     ///   This value should be between 0.5 and 1.0. So, for the 75th percentile, you would pass `0.75`.
     ///
-    /// In our analysis, we found that the 15th and 65th percentiles gave the highest confidence (~92%).
+    /// In [our analysis][doi], we found that the 15th and 65th percentiles gave the highest confidence (~92%).
     /// If you want to use our most current recommended values, you can use the constants [`LOWER_QUANTILE`]
     /// and [`UPPER_QUANTILE`]. You can of course use any values you like.
     ///
@@ -51,6 +50,8 @@ pub trait Estimate {
     ///
     /// The estimate will be `None` if there are no finite estimates when `finite` is `true`, or if
     /// there are no estimates at all.
+    ///
+    /// [doi]: https://doi.org/10.1101/2024.11.27.625777
     fn estimate(
         &mut self,
         finite: bool,
@@ -130,13 +131,14 @@ fn calculate_quantile(data: &[f32], quantile: f32) -> Option<f32> {
     }
 }
 
-// todo add link to paper
-/// Estimate genome size using the formula from Equation 3 in the paper.
+/// Estimate genome size using the formula from Equation 3 in [the paper][doi].
 ///
 /// # Returns
 ///
 /// A floating point number representing the estimated genome size. If the number of overlaps is 0,
 /// this function will return [`f32::INFINITY`].
+///
+/// [doi]: https://doi.org/10.1101/2024.11.27.625777
 pub(crate) fn per_read_estimate(
     read_len: usize,
     avg_target_len: f32,
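
The doc comments on `Estimate::estimate` in this file describe taking the median of the per-read estimates, optionally keeping only finite values and reporting a lower and an upper quantile. The following is a minimal sketch of that idea, not the code from `liblrge`: the names `quantile_sketch` and `summarise_sketch` are hypothetical, and linear interpolation between ranks is an assumption, since the body of `calculate_quantile` is not shown in this diff.

```rust
/// Hypothetical helper (not liblrge's `calculate_quantile`): quantile of `data`
/// by linear interpolation between the two nearest ranks (an assumed method).
fn quantile_sketch(data: &[f32], q: f32) -> Option<f32> {
    if data.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let pos = q * (sorted.len() - 1) as f32;
    let (lo, hi) = (pos.floor() as usize, pos.ceil() as usize);
    // Equal neighbours (including two infinities) need no interpolation,
    // which also avoids computing `inf - inf`.
    if sorted[lo] == sorted[hi] {
        return Some(sorted[lo]);
    }
    Some(sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f32))
}

/// Hypothetical summary returning `(lower, median, upper)`.
/// With `finite == true`, infinite per-read estimates are dropped first.
fn summarise_sketch(
    estimates: &[f32],
    finite: bool,
    lower_quant: Option<f32>,
    upper_quant: Option<f32>,
) -> (Option<f32>, Option<f32>, Option<f32>) {
    let kept: Vec<f32> = estimates
        .iter()
        .copied()
        .filter(|x| !x.is_nan() && (!finite || x.is_finite()))
        .collect();
    let lower = lower_quant.and_then(|q| quantile_sketch(&kept, q));
    let median = quantile_sketch(&kept, 0.5);
    let upper = upper_quant.and_then(|q| quantile_sketch(&kept, q));
    (lower, median, upper)
}
```

In this sketch, an input with no usable values yields `None` for every field, which mirrors the documented behaviour that the estimate is `None` when there are no finite estimates and `finite` is `true`.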

diff --git a/liblrge/src/lib.rs b/liblrge/src/lib.rs
@@ -124,7 +124,7 @@
 //!
 //! [log]: https://crates.io/crates/log
 //! [env_logger]: https://crates.io/crates/env_logger
-// todo add link to paper
+//! [doi]: https://doi.org/10.1101/2024.11.27.625777
 #[deny(missing_docs)]
 pub mod ava;
 pub mod error;
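
This commit repeats the `[doi]: ...` definition in each doc comment that uses `[the paper][doi]`. A likely reason is that rustdoc renders each item's documentation as a separate Markdown document, so a reference-style link definition in one doc comment is not visible from another. A minimal sketch of the pattern, using hypothetical items `first_item` and `second_item` that are not part of `liblrge`:

```rust
/// Doubles `x`. The text here links to [the paper][doi].
///
/// [doi]: https://doi.org/10.1101/2024.11.27.625777
pub fn first_item(x: u32) -> u32 {
    x * 2
}

/// Adds one to `x`. This doc comment also links to [the paper][doi], so it
/// carries its own copy of the definition below.
///
/// [doi]: https://doi.org/10.1101/2024.11.27.625777
pub fn second_item(x: u32) -> u32 {
    x + 1
}
```

Running `cargo doc` on a crate laid out like this should render both links; dropping the definition from one of the comments would typically leave that `[the paper][doi]` as plain bracketed text.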

diff --git a/paper/README.md b/paper/README.md
@@ -1,4 +1,4 @@
-This directory contains all the code and data used in the paper. The code is organised in the following way:
+This directory contains all the code and data used in [the paper][doi]. The code is organised in the following way:
 
 - [`workflow/`](./workflow): Contains the code to reproduce the analyses of the paper. It requires [Snakemake](https://snakemake.readthedocs.io/en/stable/) to run.
 - [`config/`](./config): Contains the configuration files for the workflow as well as the metadata for the samples. The final dataset used in the 
@@ -8,4 +8,6 @@ This directory contains all the code and data used in the paper. The code is org
 - [`scripts/`](./scripts): Miscellaneous scripts used for the paper, but not directly part of the workflow.
 - [`notebooks/`](./notebooks): Jupyter notebooks used for the paper. These are not part of the workflow, but were used to generate figures and tables.
 - [`results/`](./results): Contains the results of the workflow. The final estimates used in the paper are available in [`results/estimates/estimates.tsv`](./results/estimates/estimates.tsv). 
-    The figures and tables for the paper are available in [`results/figures/`](./results/figures) and [`results/tables/`](./results/tables), respectively.
+    The figures and tables for the paper are available in [`results/figures/`](./results/figures) and [`results/tables/`](./results/tables), respectively.
+
+[doi]: https://doi.org/10.1101/2024.11.27.625777