RNA sequencing (RNA-Seq) employs high-troughtput sequencing technologies to unravel the transcriptomic profile of living organisms. The raw sequencing dataset requires a series of elegant and sofisticated pieces of bioinformatics softwares to gather usefull and insightful information on biological data. Here I'll describe a basic RNA-Seq pipeline using snakemake workflow language for automation.
The softwares required to run this analysis are contained in a conda environment. If you don't have anaconda installed in your machine, a basic tutorial can be foud here. Otherwise, you can simply import an enviroment that I've already made using:
conda env create --file environment.yaml # Create rnaseq environment
conda activate rnaseq # enters the recently created environment
This command will use the enviroment.yaml
file and create a enviroment called rnaseq
containing all the programs you'll need.
The RNA-Seq data that we're going to use is described in Sousa et al., 2019. In this paper, the autors describes DEG of T cells stimulated by different versions of OKT3, an anti-CD3 antibody. This data is pubicly available in SRA with the study code SRP139131
. For matters of simplification we'll only use the control and the T cells treated with OKT3 data. However, feel free to use the entire dataset or any other piece of data that may be of your interest.
Use the following command to create our (sub)directories:
mkdir -p {trimData,rawData,qcData}
Now, inside rawData
directory, download our work data from SRA using the following command (In case of having personal data at dispose, ignore this step):
cat SRR_Acc_List.txt | parallel "fastq-dump --gzip --split-files {}"
The reference genome inside kallisto
directory needs to be unzipped (Once this file is unzip, there will be no need to unzip it again):
gunzip *.gz
To run our analysis we'll need to execute the snakefile
containing all the RNA-Seq steps. In your terminal, use:
snakemake -s snakefile -j 8
snakemake
command will execute our snakefile
file. The number of cores after -j
will depend on your machine or server CPU capability.
- Before executing
snakefile
ensure that all the data and directories needed are as described earlier. - If you try to use your own personal data, be aware that the
config.yaml
file needs to be modified. You should only insert the files name/id, instead of something like{sample_name}_1.fastq.gz
. Any doubt, just look at theconfig.yaml
in this repository. - There is also the need to set the full paths where the directories are in your system in the
config.yaml
file. If it's not set correctly, the input files and their respective outputs won't be found or sent to the right place, acording to thesnakefile
code logic. Any doubts, look theconfig.yaml
in this repository. - If you are interested to only run a specific step of the
snakefile
, just specify the name of the rule you are interest to run. Example:
snakemake --core 8 fastqc