Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GC content parameter --mingc --maxgc #30

Merged
merged 4 commits into from
Apr 26, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
Rust implementation of [NanoFilt](https://github.com/wdecoster/nanofilt)+[NanoLyse](https://github.com/wdecoster/nanolyse), both originally written in Python. This tool, intended for long read sequencing such as PacBio or ONT, filters and trims a fastq file.
Filtering is done on average read quality and minimal or maximal read length, and applying a headcrop (start of read) and tailcrop (end of read) while printing the reads passing the filter.

Compared to the Python implementation the scope is to deliver the same results, almost the same functionality, at much faster execution times. At the moment this tool does not support filtering using a sequencing_summary file or filtering on GC content. If those features are of interest then please reach out.
Compared to the Python implementation the scope is to deliver the same results, almost the same functionality, at much faster execution times. At the moment this tool does not support filtering using a sequencing_summary file. If those features are of interest then please reach out.

## Installation

Expand Down Expand Up @@ -31,6 +31,8 @@ OPTIONS:
--threads Number of parallel threads to use [default: 4]
--contam Fasta file with reference to check potential contaminants against [default None]
-i, --input Input filename [default: read from stdin]
--maxgc Sets a maximum GC content [default: 1.0]
--mingc Sets a minimum GC content [default: 0.0]
```

EXAMPLES:
Expand All @@ -41,6 +43,8 @@ chopper -q 10 -l 500 -i reads.fastq > filtered_reads.fastq
chopper -q 10 -l 500 -i reads.fastq.gz | gzip > filtered_reads.fastq.gz
```

Note that the tool may be substantially slower in the third example above, and piping while decompressing is recommended (as in the first example).

## CITATION

If you use this tool, please consider citing our [publication](https://academic.oup.com/bioinformatics/article/39/5/btad311/7160911).
39 changes: 38 additions & 1 deletion src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,14 @@ struct Cli {
/// Input filename [default: read from stdin]
#[arg(short = 'i', long = "input", value_parser)]
input: Option<String>,

/// Filter max GC content
#[arg(long, value_parser, default_value_t = 1.0)]
maxgc: f64,

/// Filter min GC content
#[arg(long, value_parser, default_value_t = 0.0)]
mingc: f64,
}


Expand Down Expand Up @@ -128,13 +136,23 @@ where
if !record.is_empty() {
let read_len = record.seq().len();
// If a read is shorter than what is to be cropped the read is dropped entirely (filtered out)

// Check if gc content filter exist, if no gc content filter is set pass the 0.5 to pass all the follwoing filter
let read_gc = if args.mingc != 0.0 || args.maxgc != 1.0 {
cal_gc(record.seq())
} else {
0.5
};

if args.headcrop + args.tailcrop < read_len {
let average_quality = ave_qual(
&record.qual().iter().map(|i| i - 33).collect::<Vec<u8>>(),
);
if (!args.inverse
&& average_quality >= args.minqual
&& average_quality <= args.maxqual
&& read_gc >= args.mingc
&& read_gc <= args.maxgc
&& read_len >= args.minlength
&& read_len <= args.maxlength
&& !is_contamination(&record.seq(), &aligner))
Expand Down Expand Up @@ -170,13 +188,23 @@ where
if !record.is_empty() {
let read_len = record.seq().len();
// If a read is shorter than what is to be cropped the read is dropped entirely (filtered out)

// Check if gc content filter exist, if no gc content filter is set pass the 0.5 to pass all the follwoing filter
let read_gc = if args.mingc != 0.0 || args.maxgc != 1.0 {
cal_gc(record.seq())
} else {
0.5
};

if args.headcrop + args.tailcrop < read_len {
let average_quality = ave_qual(
&record.qual().iter().map(|i| i - 33).collect::<Vec<u8>>(),
);
if (!args.inverse
&& average_quality >= args.minqual
&& average_quality <= args.maxqual
&& read_gc >= args.mingc
&& read_gc <= args.maxgc
&& read_len >= args.minlength
&& read_len <= args.maxlength)
|| (args.inverse
Expand Down Expand Up @@ -253,6 +281,11 @@ fn is_contamination(readseq: &&[u8], contam: &Aligner) -> bool {
}
}

fn cal_gc(readseq: &[u8]) -> f64 {
let gc_count = readseq.iter().filter(|&&base| base == b'G' || base == b'g' || base == b'C' || base == b'c').count();
(gc_count as f64) / (readseq.len() as f64)
}

#[test]
fn test_ave_qual() {
assert_eq!(ave_qual(&[10]), 10.0);
Expand Down Expand Up @@ -289,6 +322,8 @@ fn test_filter() {
contam: None,
inverse: false,
input: None,
mingc: 0.0,
maxgc: 1.0,
},
);
}
Expand Down Expand Up @@ -331,7 +366,9 @@ fn test_filter_with_contam() {
threads: 1,
contam: Some("test-data/random_contam.fa".to_owned()),
inverse: false,
input: None,
input: None,
mingc: 0.0,
maxgc: 1.0,
},
);
}
Expand Down
44 changes: 44 additions & 0 deletions test-data/testGC.fastq
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
@GC0
AAAAAAAAAA
+
OOOOOOOOOO
@GC10
GAAAAAAAAA
+
OOOOOOOOOO
@GC20
GCAAAAAAAA
+
OOOOOOOOOO
@GC30
GCGAAAAAAA
+
OOOOOOOOOO
@GC40
GCGCAAAAAA
+
OOOOOOOOOO
@GC50
GCGCGAAAAA
+
OOOOOOOOOO
@GC60
GCGCGCAAAA
+
OOOOOOOOOO
@GC40
GCGCGCGAAA
+
OOOOOOOOOO
@GC80
GCGCGCGCAA
+
OOOOOOOOOO
@GC90
GCGCGCGCGA
+
OOOOOOOOOO
@GC100
GCGCGCGCGC
+
OOOOOOOOOO
Loading