seq_simul is a DNA storage channel simulation tool.
Clone repo through ssh
git clone ssh://[email protected]/albert-no/seq_simul.git
Install editdistance
package
conda install -c conda-forge editdistance
Install seqanpy package for alignment
- download seqan 2.4.0 (Download_link)
- unzip the file and export the
include
folder path - download
swig
and export swig - install seqanpy using pip
export SEQAN_INCLUDE_PATH=/PATH/TO/SEQAN/LIBRARY/seqan-library-2.4.0/include
sudo apt install swig
export SWIG=swig
pip install seqanpy
seq_simul
is implemented using Python 3.7 and should be installed before running.
pip install -e .
We use the same oligo design proposed by Jeong et al (https://doi.org/10.1093/bioinformatics/btab246).
There are 18,000 oligos of length 152 in the dataset (stored in original_files directory), where all oligos have balanced GC contents (base G and C proportion is between 45 to 55 percent) and homopolymer run length is limited by 3. We obtain the corresponding FASTQ file from the single experiment, consisting of reads and quality scores. Due to GitHub repository data size limitations, upload them with partition compression and store them in the fastq_files directory. We use reverse reads to train the data.
Processing pipeline generates the data for training simulator.
The inputs of the processing pipeline are oligo text file (oligos.txt
)
which contains the oligo (encoded data)
and FASTQ file (.fastq
) which contains the read and quality-score.
The output of the processing pipeine is data
files.
The data
file stores the matching result.
Each matching result corresponds to 5 lines in data
file.
- read from fastq
- qscore (that corresponds to read)
- matched oligo
- edit-distance between the read and the oligo
- index of matched oligo
The processing pipeline does the following:
- Convert FASTQ file into reads file (extract reads and qscores from FASTQ file)
- Split reads file (split read files into subread files)
- Find edit distance and quality value between split files and oligo text file
- Save results in data file
- Separate data files into train-set and test-set under the defined proportion
- Separate data files in train-set and test-set through edit-distance (0 edit-distance & defined edit-distance range)
- Modify and save the data file which is saved in non-zero edit-distance folder into three errors data files (insertion, deletion, substitution data file)
To run the processing.py
, parameters in main
have to be modified appropriately.
path
is a path to FASTQ (.fastq) and oligo text (.txt) filefname
is a name of FASTQ and oligo text file. For example,fname=constraint
implies that FASTQ file name isconstraint.fastq
and oligo text file name isconstraint.txt
.edit_distance_limit
is a maximum allowable edit-distance between the read and the matched oligo.split_num
is a number of read which is saved in single reads file.
Then, run processing.py
at root to generate data files.
Parameters of get_train_test_set.py
need to be modified as personal use.
folder_path
is a path of matched dataset.division_ratio
is a division proportion of the train-set and test-set. (ifdivision_ratio=0.8
, it return 80% of train-set and 20% of test-set)
Run get_train_test_set.py
returns the proportionally divided train and test folder.
To customize parameters of trim_and_split.py
as follows.
It trims the length of the sequences and save separately within edit distance.
Also, split data into three errors(insertion, substitution, deletion).
Trimmed data are saved in the train, test folders, and error contained data are saved in insertion, deletion, substitution folders which are generated in the train and test folders.
Additionally, for the profile vector that is essential for simulation, proportions of the errors are saved as a log file named data_proportions.log
.
folder_path
is a path of matched data.min_edit_distance
&max_edit_distance
are the range of the edit distance to save separately from zero edit distance.limit_length
is the maximum length of the sequence to save.
Run trim_and_split.py
Basically, simulator is composed of three read generator(insertion, substitution, deletion), and qscore generator. GANs are trained with GRU generator and 1D-CNN discriminator, except substitution which is trained with Transformer generator. The training pipeline trains multiple models based on arguments.
Model parameters such as number of layer in the Generator (G_num_layer
) and the Discriminator (D_num_layer
),
and the size of hidden layer of the Generator (G_hidden_size
) and Discriminator (D_hidden_size
) can be tuned.
For transformer generator, number of head (G_num_head
) and positional drop out (pos_drop_prob
) also can be tuned.
Training parameters such as critic (G_critic
), lambda(lambda_gp
) for WGAN and learning rate for the Generator (G_lr
) and Discriminator (D_lr
),
decaying parameter for ADAM optimizer (b1
, b2
) can be tuned as well.
Detailed and defaulted arguments are at seq_simul/data/default_args.json
.
There are two types of parameters:
- fixed parameters that are fixed during the parameter sweep.
- iterating parameters that varies over training.
Fixed parameters are set via fixed_dict
in read_sweep.py
where keys are name of fixed parameters and values are corresponding fixed values.
Iterating parameters are set via iter_dict
where keys are iterating parameter and values are list of values
Then, read_sweep.py
runs wrapper
which is defined in seq_simul/train/wrapper.py
which trains the models for all possible combinations of iterating parameters.
Importatly, read_only: True
and qscore_only: False
should be stated in fixed_dict
.
After setting parameters, run read_sweep.py
with parameters to run the training pipeline.
python read_sweep.py
Qscore-Generator also has fixed_dict
for fixed parameters,
and iter_dict
for iterating parameters.
Set those dictionaries accordingly,
then run qscore_sweep.py
which runs wrapper
in seq/simul/train/wrapper.py
.
It trains the models for all possible combinations of iterating parameters.
Importatly, read_only: False
and qscore_only: True
should be stated in fixed_dict
.
After setting parameters, run qscore_sweep.py
with parameters to run the training pipeline.
python qscore_sweep.py
Simulator is composed of read-simulator and qscore-simulator.
Read-simulator also composed of insertion-generator, substitution-generator, and deletion-generator.
Based on profile-vector that saved in error_proportion.log
, the input oligo enters the multiple or single generator among three generators.
Qscore-simulator gets the result of read-simulator as input and generate quality score sequences.
- The input of the read-simulator is
.txt
oligo file, and outputs the.data
file. - The input of the qscore-simulator is
.data
file, and outputs the.fastq
file or.data
file.
Your output will be a .fastq
file when using qscore simulator with mode=qscore_fastq
.
The format from top to bottom:
- index
- read sequence (simulated read)
- +
- quality score (generated qscore)
Your output will be a .data
file when using qscore simualtor with mode=qscore_data
- index
- read sequence (simulated read)
- +
- quality score (generated qscore)
- oligo
Some parameters are modified before simulating sequences. Otherwise, model parameters are loaded from json files.
error_proportion_file
: path of the error proportion log filesimulation_fname
: path of input datasimulated_result_path
: name of file saved inresults/simulations/
ins/sub/del/qscore_simulation_folder
: path of folder of trained insertion, substitution, deletion and qscoreins/sub/del/qscore_simulation_fname
: file name of trained insertion, substitution, deletion and qscoreins/sub/del/qscore_epoch_list
: epochs for generating sequences
The above process is implemented in seq_simul/simulator/seq_simulator.py
, run simulator.py
to get the result.
python simulator.py
The result of simulation and the real expriment can be compared via statitstics tool.
There are 2 ways to analyze statistics.
- plot proportion of each index errors : all, insertion, substitution, deletion
- number of errors : all, insertion, substitution, deletion
- number of different base pair : insertion, substitution, deletion
- number of consecutive dashes : insertion, deletion
- plot consecutive dashes : insertion, deletion
- plot positional mean of quality scores.
- plot the distribution of error(insertion, substitution) occurred quality score.
mode
: select mode for analyzing (all|read|qscore)error_name
: error name to get statistics with 4-options(all|insertion|deletion|substitution).original_data_path
: folder path of real.data
fileoriginal_fname
: file name of real.data
generated_result_path
: folder path of generated filegenerated_fname
: file name of generated file (.data
or.fastq
)read_padded_length
: sequence length with pad to load reads.qscore_padded_length
: sequence length with pad to load quality scores.
python stats.py
returns the designated statistics and results are save at results/statistics/
python stat.py
Following direction, test the end-to-end example with mini dataset(/test_constraint
).
- Processing data
python processing.py --path=seq_simul/data/test_constraint/ --fname=test_constraint --split_num=25 --convert_fastq=True
- Get train and test dataset
python get_train_test_set.py --folder_path=seq_simul/data/test_constraint/test_constraint/data/ --division_ratio=0.8
- Get only error contained data and customized edit-distance data
python trim_and_split.py --folder_path=seq_simul/data/test_constraint/test_constraint/data --limit_length=145
- Train read-GAN (insertion|substitution|deletion)
Customize fixed and iterative dictionary in
read_sweep.py
insertiondata_path
:seq_simul/data/test_constraint/test_constraint/data/train/edit_1_5/insertion/
substitutiondata_path
:seq_simul/data/test_constraint/test_constraint/data/train/edit_1_5/substitution/
deletiondata_path
:seq_simul/data/test_constraint/test_constraint/data/train/edit_1_5/deletion/
python read_sweep.py
- Train qscore-GAN (errorness|error-free)
Customize fixed and iterative dictionary in
qscore_sweep.py
errorness quality scoredata_path
:seq_simul/data/test_constraint/test_constraint/data/train/edit_1_5/
error-free quality scoredata_path
:seq_simul/data/test_constraint/test_constraint/data/train/edit_0/
python qscore_sweep.py
- Simulate with trained read-generator and qscore-generator (need json and pth file) and designate multiple epochs for simulating
python simulator.py --mode=read
--error_proportion_file=seq_simul/data/test_constraint/test_constraint/error_proportion.log
--simulation_fname=seq_simul/data/test_constraint/test_constraint/data/test/edit_1_5/test_test_constraint_split_aa.data
--simulated_result_fname=read_simulated
--ins_simulation_folder=results/MMDDhhmm_sweep/trained_parameters
--sub_simulation_folder=results/MMDDhhmm_sweep/trained_parameters
--del_simulation_folder=results/MMDDhhmm_sweep/trained_parameters
--ins_simulation_fname=MMDDhhmm_{param}
--sub_simulation_fname=MMDDhhmm_{param}
--del_simulation_fname=MMDDhhmm_{param}
--ins_epoch_list n1 n2 n3 ..
--sub_epoch_list n1 n2 n3 ..
--del_epoch_list n1 n2 n3 ..
python simulator.py --mode=qscore_fastq
--simulation_fname=results/simulations/read_simulated.data
--simulated_result_fname=qscore_simulated
--qscore_simulation_folder=results/MMDDhhmm_sweep/trained_parameters
--qscore_simulation_fname=MMDDhhmm_{param}
--qscore_epoch_list n1 n2 n3 ...
- get statistics with simulated data file
python stats.py --mode=all (or read|qscore)
--error_name=all (or insertion|deletion|substitution)
--generated_result_path=results/
--generated_fname=simulated.data
User can simulate sequences using parameter of pre-trained generators.(at the folder named pre-trained_parameters
)
python simulator.py --mode=read
--ins_simulation_folder=pre-trained_parameters
--sub_simulation_folder=pre-trained_parameters
--del_simulation_folder=pre-trained_parameters
--ins_simulation_fname=insertion
--sub_simulation_fname=substitution
--del_simulation_fname=deletion
--ins_epoch_list 0 1 2 3 4 5 6 7 8 9
--sub_epoch_list 0 1 2 3 4 5 6 7 8 9
--del_epoch_list 0 1 2 3 4 5 6 7 8 9
python simulator.py --mode=qscore_fastq
--qscore_simulation_errorfree_folder=pre-trained_parameters
--qscore_simulation_errorfree_fname=errorfree
--qscore_errorfree_epoch_list 0
--qscore_simulation_errorness_folder=pre-trained_parameters
--qscore_simulation_errorness_fname=errorness
--qscore_errorness_epoch_list 0
--simulation_fname=results/simulations/simulated.data
To run the unittest, use the following command at root.
python -m unittest