- We are using Python 2.7. You will only need the folder BorreliaPipeline to run the pipeline. This can be done in the following
git
commands:
git init
git remote add origin https://github.com/WGS-TB/MLST
git fetch origin
git checkout origin/master -- BorreliaPipeline
-
SRA toolkit: To download sample reads. Please download the toolkit here: https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/. You will need
fastq-dump
to download the sample reads. Our samples are downloaded usingfastq-dump
v2.8.2. Copyfastq-dump
into your bin folder (ours: /usr/local/bin) or provide the path tofastq-dump
todownload_samples.py
script. The python script calls thefastq-dump
command to download sample reads. -
Bowtie v0.12.7: For read mapping. Please download it here: https://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/. You will need both
bowtie
andbowtie-build
, copy these into your bin folder or provide the path to the folder containingbowtie
andbowtie-build
tomapSamples.py
script. The python script callsbowtie
andbowtie-build
commands to map reads. -
CPLEX Python API: For solving ILP. You will need academic license to download CPLEX. Once you installed CPLEX, follow the instructions here to install CPLEX-Python modules: https://www.ibm.com/support/knowledgecenter/SSSA5P_12.5.1/ilog.odms.cplex.help/CPLEX/GettingStarted/topics/set_up/Python_setup.html
-
ART: For simulating reads. Please download it here: https://www.niehs.nih.gov/research/resources/software/biostatistics/art/. You will need
art_illumina
, copy this into your bin file or provide the path forart_illumina
torun_sim.py
. This script runs the allele simulation. -
Samtools v1.3.1: For extracting information from sam files. Please download it here: https://sourceforge.net/projects/samtools/files/samtools/1.3.1/. You will need
samtools
, copy it into your bin folder or provide the path tosamtools
tomapSamples.py
. The python script calls thesamtools
command while extracting information from sam files after mapping. -
In computing edit distance, we used the package by https://github.com/aflc/editdistance and we hereby acknowledge Hiroyuki Tanaka, where the algorithm is proposed by Heikki Hyyrö, "Explaining and extending the bit-parallel approximate string matching algorithm of Myers", (2001). Possible Python packages you will need: numpy, pandas, matplotlib, linecache, argparse . Please use
pip install
to install these packages if you do not have any of the packages. -
We also use the package PyEMD for EMD calculations, where we hereby acknowledge Gary Doran for his PyEMD package: https://github.com/garydoranjr/pyemd.
-
We are not sure if we cover all the Python packages here, please let us know if we miss any!
-
The required scripts and files are in
pipeline
folder inBorreliaPipeline
folder (except for sample reads as the files are big). -
Run
python download_samples.py [-h] [-c CMD]
in thepipeline
folder to download the samples' reads (downloaded samples are based on the SRR_Acc_List.txt).download_samples.py
has an optional argument-c
, which sets the path forfastq-dump
command needed to download the reads for Borrelia samples. By default, it assumesfastq-dump
is in your bin folder. After running this script, it will create adata
folder which contains folders for each sample. -
Since we are using Bowtie to map the sample reads, we run
python mapSamples.py [-h] [-c NUMOFCORES] [-b BOWTIE] [-s SAMTOOLS]
to map the sample reads.-c
specifies the number of core for running Bowtie, default is 4.-b
specifies the path to the folder containing two commandsbowtie
andbowtie-build
, default assumes these two commands are in the user's bin folder.-s
specifies the path ofsamtools
, default assumes the command is in user's bin folder. It will be easier to understand with a working example:python mapSamples.py -c 2 -b ../bowtieFolder/ -s ../../samtools
, for example. The script will create avariantsAndProp
folder containing a folder for each sample, where each of these folder contains a list ofsampleX_geneY_paired_reads.txt
text files, in which the text file contains information needed for our first stage ILP pipeline. These information are extracted from.sam
file. -
To solve the allele diversity problem, run the following script in the pipeline folder:
python alleleDiversity.py [-h] [-s SAMPLE]
If you want to run on a specific sample, you can indicate using -s
. Otherwise it will run on all samples in variantsAndProp
. For example, you could run python alleleDiversity.py -s SRR2034333
. This script will output some intermediate results to the screen and create 8 CSV files (for each locus) in each of the sample folder in variantsAndProp
, named as geneX_proportions.csv
which contains the alleles and their proportions identified at geneX
.
The script will output some intermediate results for you to keep track.
- To solve the global strain diversity problem, run the following script:
python globalStrainDiversity.py [-h] [-o OUTPUT] [-oc OBJECTIVECOMPONENT] [-timelim TIMELIMIT] [-g GAP] [-pathToDistMat PTDM]
Specify -o
, the folder name for the results to be stored. Default will be strainsAndProp
. The -oc
argument specifies different formulation of the objective function, 'all' includes all components of the objective function, 'noPropAndErr' omits the proportion and error component. -g
specifies the relative gap in % to stop the solver, default is 5%. -timelim
specifies the maximum time limit in seconds before stopping the solver, default is 600 seconds. The program will stop if and only if -g
and -timelim
are both satisfied. For example, if -g
is 5 and timelim
is 600, if the relative gap is above 5% and the 600 seconds have passed, the program will not stop until gap of 5% is reached. -pathToDistMat
is the folder name containing the distance matrices for each gene, which is provided and named 'editDist'.
The script will output some intermediate results for you to keep track. It will create a new folder in your current directory with name specified by -o
. The folder contains a list of sampleX_strainsAndProportions.csv
with sampleX corresponds a particular sample, and the csv file contains the strains and their proportions identified in sampleX.
-
The required scripts and files to run the simulation are in the
alleleSimulation
folder in theBorreliaPipeline
folder. -
In the
alleleSimulation
folder, you only have to run
python run_sim.py [-h] [-i NUMOFITER] [-f SIMULATIONRESULTFOLDER] [-c COVERAGE] [-e EDITDIST] [-fp FULLPAPER] [-b BOWTIE] [-s SAMTOOLS] [-a ART]
-i
specifies the number of simulations on each gene, default is 40. -f
specifies the name of the folder to store the results, default is simulation_results. -c
specifies the coverage to test on, default is 30. -b
specifies the path to the folder containing bowtie
and bowtie-build
commands, default assumes that both commands are in user's bin folder. -s
specifies the path to samtools
, default assumes it's in your bin folder. -a
specifies the path to art_illumina
, default assumes that it's in the user's bin folder. -fp
set to True means we run a 2D parameters experiments(coverage and editDist), which is presented in our paper. In the results folder, it will contain .csv and .png files representing different statistics.
- You will need to install
kallisto
and the link is provided here: https://pachterlab.github.io/kallisto/download. To run simulation using kallisto, run
run_kallisto_sim.py [-h] [-i NUMOFITER] [-f SIMULATIONRESULTFOLDER] [-c COVERAGE] [-a ART] [-k KAL] [-fp FULLPAPER]
Instructions are similar to run_sim.py
, with-k
refers to the path to kallisto
and default assumes it is in user's bin folder.
- To run novel allele experiment, run
python run_loo_sim.py [-h] [-i NUMOFITER] [-f SIMULATIONRESULTFOLDER] [-c COVERAGE] [-b BOWTIE] [-s SAMTOOLS] [-a ART]
Instructions are the same as running run_sim.py
- In the
BorreliaPipeline
folder,strainAndFullPipeline_sim
contains the scripts and files to run the SDP simulation(with no reads involved) and full pipeline simulation for our method.strainEST_compare
contains the scripts to run the full pipeline simulation which compares with strainEST. The link to install strainEST is here: https://github.com/compmetagen/strainest. Details about running these simulations are written in separate README files instead in each of the folder.
A sample script to compute the MST is in the pipeline
folder and can be altered for your own analysis.