Syotti is a tool to find a small set of short DNA strings ("baits") to fully cover a set of target sequences. The method is described in our paper in Bioinformatics.
Strings A and B of equal length are d-Hamming neighbors if the Hamming distance of A to B or the reverse complement of B is at most d.
The problem is parameterized by two integers d and L. Suppose we have a set of reference sequences S_1, S_2, ... , S_n from the alphabet {A,C,G,T,N}. The special character N is defined to not match with itself. The goal is to find a small set of bait strings B_1, ... , B_m, all of equal length L, such that the baits together cover every position of the reference sequences. A bait is considered to cover all length-L substrings of the reference sequences that are d-hamming neighbors with the bait. Syotti produces an approximate solution for this problem.
Syotti solves the abstract string matching problem defined above. In particular, it does NOT optimize for evenness of coverage as long as the problem as defined above is solved. It also does not take into consideration practical issues such as off-target binding, bait melting temperature or bait-to-bait binding. Postprocessing the baits to suit the requirements of your particular sequencing project is advised.
The toolkit depends on the sdsl-lite library and zlib. Most likely zlib is already installed on your machine. To install sdsl-lite, first make sure you have the CMake build system installed, and then run the following commands:
git submodule init
git submodule update
cd sdsl-lite
sh install.sh
cd ..
After this, you may build the toolkit with:
make syotti
Currently, only FASTA input files are supported. To test the tool, we provide an example input file at testcases/coli3.fna. The example file contains three E. coli genomes. To compute a bait set for the genomes, run the following command:
./bin/syotti design --bait-len 120 --hamming-distance 40 -s testcases/coli3.fna -r -o baits.fna
This sets the bait length to 120 and tolerates 40 mismatches per match. The flag -r
randomizes the order in which the input sequences are processed, which often leads to better results. The baits are written to baits.fna.
Most of the tools here need an FM-index. The tools build it at the start, but they can also re-use a pre-computed FM-index. To build an FM-index, run for example the following:
./bin/syotti index -s testcases/coli3.fna -o coli3.fmi
The full set of input parameters is listed below
Build an FM index.
Usage:
syotti index [OPTION...]
-s, --sequences arg Path to the fasta file of the input sequences.
(default: "")
-o, --output arg Path to the output FM-index file (default: "")
-h, --help Print instructions.
For example, to compute a set of baits of length 120, allowing up to 40 mismatches, covering at least 98 percent of the file testcases/coli3.fna
(found in this repository), using the FM-index stored at coli3.fmi, use the following:
./bin/syotti design -L 120 -d 40 -c 0.98 -s testcases/coli3.fna -f coli3.fmi -o ./my_result_path
IMPORANT: This processes the sequences in the input file in order, and stops when 98% of the total sequence content in the file has been covered by the selected baits. The sequences at the very end of the file may thus end up with 0% coverage. If this is not acceptable, please run with -c 1.0
, or patch gaps afterwards with the fill-gaps subcommand.
The program writes three output files:
File name | Explanation |
---|---|
./my_result_path-baits.fna | The baits in fasta format. |
./my_result_path-cover-fractions.txt | A text file containing one line for each bait. The i-th line contains the fraction of characters covered by baits 0..i. |
./my_result_path-cover-marks.txt | A text file containing one line per input sequence. Line j contains an ascii string of zeroes and ones that has length equal to the length of the j-th input sequence. Character i of the j-th input sequence is covered by the bait set if the bit at position i of the j-th line is a one (indexing starts from zero). This is not an if-and-only-if condition. A position might be covered even if we have marked it with a zero if the FM-index seed-and-extend approximate search missed it. |
If the FM-index is not given, it is built at the start. The full set of input parameters is listed below.
Computes a greedy bait cover.
Usage:
syotti design
-L, --bait-len arg Length of the baits. (default: 120)
-d, --hamming-distance arg Number of allowed mismatches in the baits.
(default: 40)
-s, --sequences arg Path to a fasta file of the input sequences.
(default: "")
--fm-index-out arg The algorithm is based on FM-index, which we
build at the start. Building the index can take a
lot of time and memory. Use this option to save
the FM-index to disk so that you can later run
the algorithm with different parameters re-using
the same FM-index. (optional). (default: "")
-f, --fm-index arg Path to a previously saved FM-index on disk
(--fm-index-out). This option loads the FM index
from disk instead of building it again. (default:
"")
-o, --out arg Filename prefix for the output files. (default:
"")
-r, --randomize Randomize the processing order of the sequences in
the greedy algorithm.
-t, --n-threads arg Maximum number of parallel threads. The program
is not very well optimized for parallel
processing, so don't expect much of a speedup here.
(default: 1)
-c, --cutoff arg Stop the greedy algorithm after this fraction
of positions is covered. For example: 0.99.
(default: 1)
-g, --seed-len arg The length of the seeds in the FM-index
seed-and-extend approximate string search subroutine. A
lower value will find more matches, but will be
slower. (default: 20)
-h, --help Print instructions.
If the cutoff (--cutoff) of the greedy algorithm was set to below 1, there may be gaps in the cover. In this case, we can add in some more baits to ensure that the maximum gap length is at most a given value. For example, to continue from the example of the greedy algorithm above, you may run the following to ensure that the longest gap is at most 10 characters wide:
./bin/syotti fill-gaps -G 10 -d 40 -s testcases/coli3.fna -f coli3.fmi -b my_result_path-baits.fna -c my_result_path-cover-marks.txt -o my_extended_bait_set.fna
NOTE: the cover marks must correspond to the bait set provided. If have modified the bait set somehow, you need to re-compute the cover marks with the examine-command.
The full set of parameters is given below.
Fills gaps in a given bait cover.
Usage:
syotti fill-gaps [OPTION...]
-G, --max-gap arg Maximum allowable gap length. (default: 0)
-d, --hamming-distance arg Number of allowed mismatches in the baits.
(default: 40)
-s, --sequences arg Path to the fasta file of the input sequences.
(default: "")
-b, --baits arg Path to the fasta file of the baits. (default:
"")
-c, --cover-marks arg Path to the file of the cover marks created by
the greedy algorithm. (default: "")
-o, --out arg Output filename (fasta) (default: "")
--fm-index-out arg The algorithm is based on FM-index, which we
build at the start. Building the index can take a
lot of time and memory. Use this option to save
the FM-index to disk so that you can later run
the algorithm with different parameters re-using
the same FM-index. (optional). (default: "")
-f, --fm-index arg Path to a previously saved FM-index on disk
(--fm-index-out). This option loads the FM index
from disk instead of building it again. (default:
"")
-t, --n-threads arg Maximum number of parallel threads. The program
is not very well optimized for parallel
processing, so don't expect much of a speedup here.
(default: 1)
-g, --seed-len arg Seed and extend g-mer seed length (default: 20)
-v, --verbose Print debug output
-h, --help Print instructions.
If you want to reduce the number of baits, you may want to keep only some random subset of them. For example, to keep a subset of size 1000, run the following:
./bin/syotti subset -i input.fna -o output.fna -n 1000
The full set of input parameters is below:
Take a random subset of sequences in a fasta file.
Usage:
syotti subset [OPTION...]
-i, --input arg Input fasta file. (default: "")
-o, --output arg Output fasta file. (default: "")
-n arg Size of the subset. (default: -1)
-h, --help Print instructions.
To compute various statistics on any bait set baits.fna, run for example the following:
./bin/syotti examine -d 40 -s testcases/coli3.fna -f coli3.fmi -b baits.fna -o ./my_examine_output_path
This writes the following output files:
File name | Explanation |
---|---|
./my_examine_output_path-coverage.txt | A text file with a number of lines equal to the number of the reference sequences. Line i has a space-separated list of integers such that the j-th integer is the number of baits that cover position j in the i-th reference sequence |
./my_examine_output_path-cover-marks.txt | The same cover marks file as generated by the greedy algorithm. |
./my_examine_output_path-cover-fractions.txt | Same as the cover fractions file output by the greedy algorithm. |
./my_examine_output_path-gaps.txt | A text file consisting of one line of space-separated integers representing the lengths of the runs of zeroes in the cover marks file. If the cover has no gaps the line is empty. |
The full set of input parameters is listed below
Computes various statistics on a given bait set.
Usage:
syotti examine [OPTION...]
-d, --hamming-distance arg Number of allowed mismatches in the baits.
(default: 40)
-s, --sequences arg Path to the fasta file of the input sequences.
(default: "")
-b, --baits arg Path to the fasta file of the baits (default:
"")
-o, --out-prefix arg Filename prefix for the output files. (default:
"")
--fm-index-out arg The algorithm is based on FM-index, which we
build at the start. Building the index can take a
lot of time and memory. Use this option to save
the FM-index to disk so that you can later run
the algorithm with different parameters re-using
the same FM-index. (optional). (default: "")
-f, --fm-index arg Path to a previously saved FM-index on disk
(--fm-index-out). This option loads the FM index
from disk instead of building it again. (default:
"")
-t, --n-threads arg Maximum number of parallel threads. The program
is not very well optimized for parallel
processing, so don't expect much of a speedup here.
(default: 1)
-g, --seed-len arg The length of the seeds in the FM-index
seed-and-extend approximate string search subroutine. A
lower value will find more matches, but will be
slower. (default: 20)
-v, --verbose Print more debug output
-h, --help Print instructions.
cd googletest
mkdir build
cd build
cmake ..
make
cd ../..
make tests
The test executables will be written to ./bin