
2. Trimming and filtering

Tobias Hofmann edited this page Feb 17, 2016 · 3 revisions

clean_reads.py

Trimmomatic

In this step we remove adapter contamination from the read data and apply a quality filter, so that only clean reads of sufficient quality are processed further.
For this purpose the pipeline uses Trimmomatic, which must be installed before you run this script. You need to tell the script where Trimmomatic is installed on your local machine or on the computer cluster you are using. The common path for Trimmomatic in the anaconda environment is:
/usr/local/anaconda/jar/trimmomatic.jar
This path is the default for the script clean_reads.py. If your path differs from that default, you can set it by adding the flag --trimmomatic to the command, followed by the path to the Trimmomatic jar file.

Config file

In order to perform the adapter trimming, you need to provide the adapter sequences that were used during library preparation and a list of all sample IDs with their respective barcode sequences. Put all of this information into a config file, which you pass to the script with the --config flag.
The config file is a simple text file that contains the following sections:

General set-up

[adapters]
In this section you give the full sequence of the two Illumina adapters (i7 and i5) and mark the position where the sample-specific barcode is inserted with an asterisk [*].
[names]
This section contains one line for each sample that you want to process. Each line contains the sample ID as it appears in the fastq filename, followed by a colon [:] and then the delimiter that follows the sample ID in the fastq filename (in most cases an underscore [_]).
[barcodes]
In the third and last section you provide each sample's barcodes, one barcode per line. Each line names the adapter (i7 or i5) into which the barcode is inserted, followed by a hyphen [-] and the sample ID, then a colon [:] and the respective barcode. In other words: adapter-sampleID:barcodesequence.
This layout works for both single- and double-indexed libraries.
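
The delimiter in the [names] section matters because sample IDs can be prefixes of one another. The following Python sketch illustrates the idea. It assumes that the sample ID sits at the start of the fastq filename, which is an assumption and not necessarily how clean_reads.py matches files. The function name and the file names are hypothetical.

```python
import os


def fastq_files_for_sample(filenames, sample_id, delimiter):
    """Return the files whose basename starts with sample_id + delimiter.

    Assumption: the sample ID is the first part of the filename.
    """
    prefix = sample_id + delimiter
    return sorted(
        name for name in filenames
        if os.path.basename(name).startswith(prefix)
    )


# Hypothetical file names, for illustration only
files = ["A1_R1.fastq", "A11_R1.fastq", "A11_R2.fastq", "A12_R1.fastq"]

print(fastq_files_for_sample(files, "A1", "_"))   # ['A1_R1.fastq']
print(fastq_files_for_sample(files, "A11", "_"))  # ['A11_R1.fastq', 'A11_R2.fastq']
```

Without the delimiter, the ID A1 would also match the files of A11 and A12.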

Example

This is an example of a config file for two samples (A11 and A12) from a double-indexed library:

[adapters]
i7:GATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTTG
i5:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT*GTGTAGATCTCGGTGGTCGCCGTATCATT
[names]
A11:_
A12:_
[barcodes]
i5-A11:CGACACTT
i5-A12:ACCGACAA
i7-A11:AGCGTGTA
i7-A12:AACGCCTT
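
To make the role of the asterisk concrete, here is a minimal Python 3 sketch that reads a config in this layout and builds the full, sample-specific adapter sequences. It is an illustration under assumptions, not the code used in clean_reads.py: the function name build_sample_adapters is hypothetical, and parsing with the standard configparser module is a choice made here.

```python
import configparser

EXAMPLE_CONFIG = """\
[adapters]
i7:GATCGGAAGAGCACACGTCTGAACTCCAGTCAC*ATCTCGTATGCCGTCTTCTGCTTG
i5:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT*GTGTAGATCTCGGTGGTCGCCGTATCATT
[names]
A11:_
A12:_
[barcodes]
i5-A11:CGACACTT
i5-A12:ACCGACAA
i7-A11:AGCGTGTA
i7-A12:AACGCCTT
"""


def build_sample_adapters(config_text):
    """Map each sample ID to its adapters with the barcode inserted at '*'."""
    parser = configparser.ConfigParser(delimiters=(":",), interpolation=None)
    parser.optionxform = str  # keep keys such as 'A11' case-sensitive
    parser.read_string(config_text)

    adapters = dict(parser["adapters"])
    result = {sample: {} for sample in parser["names"]}
    for key, barcode in parser["barcodes"].items():
        # Keys follow the pattern adapter-sampleID, e.g. 'i7-A11'
        adapter_name, sample = key.split("-", 1)
        full_sequence = adapters[adapter_name].replace("*", barcode)
        result.setdefault(sample, {})[adapter_name] = full_sequence
    return result


sample_adapters = build_sample_adapters(EXAMPLE_CONFIG)
print(sample_adapters["A11"]["i7"])
```

For a single-indexed library, the [barcodes] section would contain entries for only one of the two adapters, and the same function applies unchanged.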

Run the script

You can check the available options/flags by calling the help-function for the script:
python2.7 clean_reads.py -h

Part of the screen output of the help function shows you the correct syntax for this script:
usage: clean_reads.py [-h] --input INPUT --config CONFIG --output OUTPUT [--read_min READ_MIN] [--index {single,double}] [--trimmomatic TRIMMOMATIC]

The flags in square brackets are optional; all other flags are mandatory.
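
As a reading aid, the usage line can be reconstructed as a Python argparse definition. This is a sketch derived only from the usage line and the defaults named on this page, not the script's actual source. The default for --read_min is not stated here, so it is left as None, which is an assumption.

```python
import argparse


def build_parser():
    """Reconstruct the command-line interface shown in the usage line."""
    parser = argparse.ArgumentParser(prog="clean_reads.py")
    # Mandatory flags (shown without square brackets in the usage line)
    parser.add_argument("--input", required=True)
    parser.add_argument("--config", required=True)
    parser.add_argument("--output", required=True)
    # Optional flags (shown in square brackets)
    parser.add_argument("--read_min", type=int, default=None)  # default is an assumption
    parser.add_argument("--index", choices=["single", "double"], default="single")
    parser.add_argument(
        "--trimmomatic",
        default="/usr/local/anaconda/jar/trimmomatic.jar",
    )
    return parser


args = build_parser().parse_args(
    ["--input", "fastq/", "--config", "config.txt", "--output", "cleaned/"]
)
print(args.index, args.trimmomatic)
```

Omitting one of the three mandatory flags makes argparse exit with an error message, which matches the distinction between bracketed and unbracketed flags.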

Now run the script, making sure that you give the correct paths to the input folder (--input), the config file (--config) and the output folder (--output). If required, add the Trimmomatic path with the --trimmomatic flag (see the Trimmomatic section above) and the desired read threshold (--read_min). If you are using double-indexed reads, specify that by setting the --index flag to double. The default setting is single-indexed (--index single).

Example:

python2.7 clean_reads.py --input path/to/fastq-folder/ --config path/to/config.txt --output path/to/cleaned-reads-folder --read_min 150000 --index double --trimmomatic /installation/path/of/trimmomatic.jar