RNA-Seq Quantification Pipelines

This repository contains two flexible RNA-Seq quantification pipelines using Kallisto and Salmon, as well as comparison scripts to analyze outputs from both tools.

Overview

These pipelines automate the process of RNA-Seq quantification using either Kallisto or Salmon. They include steps for:

Cleaning and checking input CDS FASTA files
Building indices
Quantifying gene expression
Generating quality control reports
Creating TPM (Transcripts Per Million) and CPM (Counts Per Million) matrices
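
What "cleaning and checking" involves is defined by the pipeline scripts themselves. As a rough illustration only, the sketch below shows one plausible set of checks, assuming that cleaning means keeping just the first whitespace-delimited token of each header, uppercasing sequences, and dropping empty or duplicate records. The function names `parse_fasta` and `clean_cds` and these rules are hypothetical and are not taken from the scripts.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def clean_cds(lines: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    """Return (cleaned records keyed by ID, list of warnings).

    Assumed rules (hypothetical): short IDs, uppercase sequences,
    no empty records, no duplicate IDs.
    """
    cleaned: Dict[str, str] = {}
    warnings: List[str] = []
    for header, seq in parse_fasta(lines):
        parts = header.split()
        if not parts:
            warnings.append("record with empty header skipped")
            continue
        seq_id, seq = parts[0], seq.upper()
        if not seq:
            warnings.append(f"{seq_id}: empty sequence skipped")
            continue
        if seq_id in cleaned:
            warnings.append(f"{seq_id}: duplicate ID skipped")
            continue
        if set(seq) - set("ACGTN"):
            # Kept, but flagged so the user can inspect the record.
            warnings.append(f"{seq_id}: unexpected characters in sequence")
        cleaned[seq_id] = seq
    return cleaned, warnings
```

Calling `clean_cds(open("cds.fasta"))` on an uncompressed file would return the cleaned records together with a list of messages worth reviewing before indexing.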

Additionally, comparison scripts are provided to analyze and visualize the differences between Kallisto and Salmon results.

Requirements

Bash (version 4.0 or later)
Conda or Mamba (for easy installation of Kallisto and Salmon)
Python 3.6+ (for the comparison scripts)
Python libraries: pandas, matplotlib, seaborn, scipy, scikit-learn, numpy, biopython

Installation

Installing Conda and Mamba

If you don't have Conda installed:

Download and install Miniconda from the official website.
After installation, initialize Conda for your shell:
```
conda init
```
Close and reopen your terminal for the changes to take effect.

To install Mamba (a faster alternative to Conda), add it to your base Conda environment:
```
conda install mamba -n base -c conda-forge
```

Setting up the Environment

Choose either Conda or Mamba for the following steps. Mamba is generally faster.

Create a new environment for the RNA-Seq pipelines:

Using Conda:

```
conda create -n rnaseq_pipelines python=3.9
```

Using Mamba:

```
mamba create -n rnaseq_pipelines python=3.9
```

Activate the new environment:
```
conda activate rnaseq_pipelines
```

Install Kallisto and Salmon in the environment:

Using Conda:

```
conda install -c conda-forge -c bioconda kallisto salmon
```

Using Mamba:

```
mamba install -c conda-forge -c bioconda kallisto salmon
```

Install the required Python libraries:

Using Conda:

```
conda install pandas matplotlib seaborn scipy scikit-learn numpy biopython
```

Using Mamba:

```
mamba install pandas matplotlib seaborn scipy scikit-learn numpy biopython
```

Clone this repository:

```
git clone https://github.com/kamalmdmostafa/kallisto_salmon_RNA_Seq.git
cd kallisto_salmon_RNA_Seq
```

Make the pipeline scripts executable:

```
chmod +x complete-kallisto-pipeline.sh complete-salmon-pipeline.sh
```

Now you have an environment with Kallisto, Salmon, and all the necessary Python libraries installed.
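
To confirm that the environment is complete before a long run, a small check such as the following can help. It is a minimal sketch, not part of the repository: the function name `check_environment` is made up for illustration, and it only tests that the executables are on `PATH` and that the Python modules can be located.

```python
import importlib.util
import shutil

REQUIRED_TOOLS = ("kallisto", "salmon")
# Import names differ from package names for two libraries:
# scikit-learn imports as "sklearn" and biopython as "Bio".
REQUIRED_MODULES = ("pandas", "matplotlib", "seaborn", "scipy",
                    "sklearn", "numpy", "Bio")


def check_environment(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return a list describing every missing executable or module."""
    missing = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            missing.append(f"executable: {tool}")
    for module in modules:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(module) is None:
            missing.append(f"python module: {module}")
    return missing
```

An empty list from `check_environment()` suggests the activated environment has everything listed under Requirements.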

Usage

Remember to activate your environment before running the pipelines:

```
conda activate rnaseq_pipelines
```

Kallisto Pipeline

```
./complete-kallisto-pipeline.sh -cds <CDS file> -k <k-mer number> -threads <number of threads> [-o <output directory>] [-r <read file directory>] [-b <bootstrap samples>]
```

Options:

-cds: Path to the CDS FASTA file
-k: K-mer size for indexing (Kallisto requires an odd value no larger than 31)
-threads: Number of threads to use
-o: (Optional) Output directory (default: './kallisto')
-r: (Optional) Directory containing read files (default: current directory)
-b: (Optional) Number of bootstrap samples (default: 100)

Example:

```
./complete-kallisto-pipeline.sh -cds path/to/cds.fasta -k 31 -threads 8 -o /path/to/custom/output -r /path/to/read/files -b 200
```
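
When the pipeline is launched from another script (for example, to process several CDS files), it can be convenient to assemble the command in Python and reject an invalid k-mer size early. The sketch below uses only the options documented above; the helper names `validate_kmer_size` and `build_kallisto_command` are hypothetical, and whether the shell script performs the same check itself is not something this README states.

```python
from typing import List, Optional


def validate_kmer_size(k: int, max_k: int = 31) -> int:
    """Return k if it is odd and within range, otherwise raise ValueError.

    Kallisto expects an odd k-mer size of at most 31 in a standard build.
    """
    if not 1 <= k <= max_k:
        raise ValueError(f"k-mer size must be between 1 and {max_k}, got {k}")
    if k % 2 == 0:
        raise ValueError(f"k-mer size must be odd, got {k}")
    return k


def build_kallisto_command(cds: str, k: int, threads: int,
                           out_dir: Optional[str] = None,
                           reads_dir: Optional[str] = None,
                           bootstraps: Optional[int] = None) -> List[str]:
    """Build the argument list for complete-kallisto-pipeline.sh."""
    cmd = ["./complete-kallisto-pipeline.sh",
           "-cds", cds,
           "-k", str(validate_kmer_size(k)),
           "-threads", str(threads)]
    # Optional flags are added only when a value is supplied, so the
    # script's own defaults apply otherwise.
    if out_dir is not None:
        cmd += ["-o", out_dir]
    if reads_dir is not None:
        cmd += ["-r", reads_dir]
    if bootstraps is not None:
        cmd += ["-b", str(bootstraps)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository directory.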

Salmon Pipeline

```
./complete-salmon-pipeline.sh -cds <CDS file> -k <k-mer number> -threads <number of threads> [-genome <genome file>] [-vb] [-o <output directory>] [-r <read file directory>] [-b <bootstrap samples>]
```

Options:

-cds: Path to the CDS FASTA file
-k: K-mer size for indexing (Salmon expects an odd value; 31 is its default and largest supported size)
-threads: Number of threads to use
-genome: (Optional) Path to the genome FASTA file for decoy-aware indexing
-vb: (Optional) Use Variational Bayesian Optimization in quantification
-o: (Optional) Output directory (default: './salmon')
-r: (Optional) Directory containing read files (default: current directory)
-b: (Optional) Number of bootstrap samples (default: 200)

Example:

```
./complete-salmon-pipeline.sh -cds path/to/cds.fasta -k 31 -threads 8 -genome path/to/genome.fasta -vb -o /path/to/custom/output -r /path/to/read/files -b 300
```
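
For readers unfamiliar with decoy-aware indexing: Salmon's documented approach is to list the genome sequence names in a decoys file and to index a combined FASTA in which the transcripts come first and the genome sequences follow. The sketch below illustrates that preparation step under the assumption of uncompressed FASTA inputs. The function name and the output file names `gentrome.fa` and `decoys.txt` are illustrative, and the pipeline script may prepare these files differently.

```python
from pathlib import Path


def build_decoy_inputs(cds_fasta, genome_fasta, out_dir):
    """Write gentrome.fa (CDS followed by genome) and decoys.txt.

    Returns the two output paths. Assumes plain-text FASTA inputs.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    gentrome = out / "gentrome.fa"
    decoys = out / "decoys.txt"

    def with_newline(line: str) -> str:
        # Guard against a final line that lacks a trailing newline.
        return line if line.endswith("\n") else line + "\n"

    with open(gentrome, "w") as g_out, open(decoys, "w") as d_out:
        # Transcripts must precede the decoy (genome) sequences.
        with open(cds_fasta) as cds:
            for line in cds:
                g_out.write(with_newline(line))
        with open(genome_fasta) as genome:
            for line in genome:
                if line.startswith(">"):
                    name = line[1:].split()
                    if name:
                        # Decoy names are the first token of each header.
                        d_out.write(name[0] + "\n")
                g_out.write(with_newline(line))
    return gentrome, decoys
```

The two files would then typically be passed to `salmon index` through its `-t` (transcripts) and `-d` (decoys) options.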

Comparison Scripts

After running both the Kallisto and Salmon pipelines, you can use the comparison scripts to analyze the results:

Basic comparison script:

```
python3 compare_quantification_tools_kallisto_and_salmon.py
```

Alternative comparison script with additional analyses:

```
python3 compare_quantification_tools_kallisto_and_salmon_alternative.py
```

Both scripts will prompt you to enter:

The path to the first tool's results directory
The path to the second tool's results directory
The path to save the comparison results

The scripts will then generate various comparisons and visualizations. The alternative version produces some extra comparisons and may take longer to run, depending on the number of quantification files in your directories.
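
To give a sense of the kind of comparison involved, the sketch below loads one sample's TPM values from each tool and computes a Pearson correlation on log2(TPM + 1). It relies on the standard per-sample output formats (Kallisto's `abundance.tsv` with `target_id` and `tpm` columns, Salmon's `quant.sf` with `Name` and `TPM` columns) and uses only the standard library. It is not the code of the comparison scripts, which use pandas and scipy, and the function names are hypothetical.

```python
import csv
import math
from typing import Dict, List, Tuple


def read_tpm(path, id_column: str, tpm_column: str) -> Dict[str, float]:
    """Read a tab-separated quantification file into {transcript: TPM}."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[id_column]: float(row[tpm_column]) for row in reader}


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def compare_tpm(kallisto_tsv, salmon_sf) -> Tuple[int, float]:
    """Return (number of shared transcripts, Pearson r on log2(TPM + 1))."""
    k = read_tpm(kallisto_tsv, "target_id", "tpm")
    s = read_tpm(salmon_sf, "Name", "TPM")
    shared = sorted(set(k) & set(s))  # only transcripts seen by both tools
    log_k = [math.log2(k[t] + 1) for t in shared]
    log_s = [math.log2(s[t] + 1) for t in shared]
    return len(shared), pearson(log_k, log_s)
```

The log transform keeps a handful of very highly expressed transcripts from dominating the correlation.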

Output

Each pipeline creates an output directory (the path given with -o, or ./kallisto or ./salmon by default) containing:

Cleaned CDS FASTA file
Index files
Quantification results for each sample
QC report (qc_report.txt)
TPM matrix (kallisto_tpm_matrix.tsv or salmon_tpm_matrix.tsv)
CPM matrix (kallisto_cpm_matrix.tsv or salmon_cpm_matrix.tsv)
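
For reference, the two matrices use different normalizations. The sketch below shows the standard definitions for a single sample: CPM scales raw counts to a total of one million, while TPM first divides each count by the transcript's effective length. Both tools already report TPM per sample, and how the pipeline scripts derive CPM is not described in this README, so treat this as an illustration of the definitions rather than the scripts' code.

```python
from typing import List, Sequence


def cpm(counts: Sequence[float]) -> List[float]:
    """Counts per million: count / total counts * 1e6."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total * 1e6 for c in counts]


def tpm(counts: Sequence[float], eff_lengths: Sequence[float]) -> List[float]:
    """Transcripts per million.

    Each count is divided by the transcript's effective length to get a
    per-base rate; rates are then scaled so they sum to one million.
    """
    if len(counts) != len(eff_lengths):
        raise ValueError("counts and eff_lengths must have the same length")
    rates = [c / length for c, length in zip(counts, eff_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    return [r / total * 1e6 for r in rates]
```

With counts of 10, 30, and 60 and effective lengths of 100, 300, and 200, the first two transcripts have the same TPM even though their counts differ threefold, because the longer transcript is expected to yield proportionally more reads.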

The comparison scripts will generate:

An Excel file with detailed comparison statistics
Various plots including scatter plots, density plots, Bland-Altman plots, and MA plots
Correlation heatmaps
Transcript detection comparison plots
Mapping rate comparison plots
Violin plots for TPM, Count, and EffLength distributions
Cumulative TPM distribution plot
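
Bland-Altman and MA plots are both built from simple per-transcript quantities. The following sketch computes them for two lists of expression values using the textbook definitions; the pseudocount of 1 in the MA calculation is an assumption, and the comparison scripts may use different transforms or thresholds.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def bland_altman(a: Sequence[float], b: Sequence[float]):
    """Return (means, differences, bias, (lower, upper) limits of agreement).

    The limits of agreement are bias +/- 1.96 standard deviations of the
    differences. Requires at least two paired values.
    """
    means = [(x + y) / 2 for x, y in zip(a, b)]
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return means, diffs, bias, (bias - 1.96 * sd, bias + 1.96 * sd)


def ma_values(a: Sequence[float], b: Sequence[float],
              pseudocount: float = 1.0) -> Tuple[List[float], List[float]]:
    """Return (M, A): log2 ratio and mean log2 expression per transcript."""
    m_vals, a_vals = [], []
    for x, y in zip(a, b):
        # The pseudocount avoids taking the logarithm of zero.
        lx, ly = math.log2(x + pseudocount), math.log2(y + pseudocount)
        m_vals.append(lx - ly)
        a_vals.append((lx + ly) / 2)
    return m_vals, a_vals
```

In both plots, points scattered symmetrically around zero on the vertical axis indicate that neither tool systematically reports higher values than the other.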

Troubleshooting

Ensure you have activated the environment (conda activate rnaseq_pipelines) before running the pipelines or comparison scripts.
If you encounter issues with Conda, try using Mamba for faster and more reliable package installation.
Check that input files are in the correct format (FASTA for CDS/genome, gzipped FASTQ for reads).
For any errors, check the log files in the output directory.
If you encounter permission issues, make sure the scripts are executable (chmod +x script_name.sh).
For issues with the comparison scripts, ensure all required Python libraries are installed and up-to-date in your environment.
If the scripts can't find your read files, make sure you're using the -r option to specify the correct directory.
If you're having memory issues during the comparison, try running the script on a subset of your data first.
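
The input-format check mentioned above can be partly automated. The sketch below performs two quick sanity tests: whether a read file starts with the gzip magic bytes and decompresses to something shaped like a FASTQ record, and whether a reference file begins with a FASTA header. The function names are hypothetical, and the checks look only at the first record, so they catch obvious mix-ups rather than validating whole files.

```python
import gzip


def looks_like_gzipped_fastq(path) -> bool:
    """True if the file is gzip-compressed and its first record looks like FASTQ."""
    with open(path, "rb") as raw:
        # Every gzip stream starts with the two bytes 0x1f 0x8b.
        if raw.read(2) != b"\x1f\x8b":
            return False
    with gzip.open(path, "rt") as handle:
        header = handle.readline()
        handle.readline()              # sequence line (not inspected)
        separator = handle.readline()
    # A FASTQ record starts with '@' and has '+' on its third line.
    return header.startswith("@") and separator.startswith("+")


def looks_like_fasta(path) -> bool:
    """True if the first non-blank line of a plain-text file starts with '>'."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False
```

Running these over the directory passed with -r and over the CDS and genome files before starting a pipeline can save a failed run.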

Contact

For any questions or issues, please open an issue on this GitHub repository.
