DNA barcoding is a molecular phylogenetic method that uses a standard DNA sequence to identify species. By comparing sequences of specific regions to an existing reference database, samples can be identified with respect to species, genus, family or higher taxonomic rank.
Compared to morphological identification, DNA barcoding has these advantages:
- DNA sequences can offer many more characters for identification;
- requires only a small sample (mg level);
- samples can be of any form as long as they contain DNA; and
- can identify mixed samples.
However, it also has limitations:
- requires a high-quality reference database;
- existing DNA barcodes show poor performance at the species level; and
- different DNA barcodes may conflict with each other in specific taxonomic groups.
To date, DNA barcoding has been used widely in:
- identifying species for research, food safety, customs inspections, criminal detection, forensic analyses, quality control of medicine, and so forth;
- identifying mixed samples (soil, water, air, intestinal contents, and so on) for research purposes, environmental surveys, medical analyses, and so forth;
- species classification, for determining relationships of species, delimiting cryptic species and validating morphological identification; and
- species descriptions, in which barcodes can be supplied as supplementary information for specimen vouchers.
BarcodeFinder can discover novel DNA barcodes with universal primers automatically. It does three things:
-
Collects data
It can retrieve data automatically from the NCBI Genbank, using restrictions that the user provides, such as gene name, taxonomy, sequence name and organelle. Also, it can integrate sequences or alignments that users provide.
-
Pre-processes data
BarcodeFinder utilises annotation information in data to divide sequences into fragments (gene, spacer, miscellaneous features) because the data collected from Genbank may not be "uniform". For instance, it is possible to find a gene's upstream and downstream sequences in one record, but only the gene sequence in another record. The situation is worse for intergenic spacers due to various annotation styles which may cause trouble in the analysis to follow.
Given that one gene or spacer for each species may be sequenced several times, by default, BarcodeFinder removes redundant sequences, leaving only one record for each species. This behaviour can be changed as desired. Then, MAFFT is called for alignment. Each sequence's direction is adjusted, and all sequences are reordered.
-
Analyse
Firstly, BarcodeFinder evaluates the variance of each alignment by calculating Pi, the Shannon Index, the observed resolution, the tree resolution and the average terminal branch length. If the result is lower than the given threshold, i.e., it does not have sufficient resolution, then this alignment will be skipped.
Next, a sliding-window scan will be performed for those alignments that pass the test. The high-variance region (variance "hotspot") is picked, and its upstream/downstream regions are used to find primers.
The consensus sequences of those conserved regions are generated, and with the help of Primer3, candidate primers are selected from them. After BLAST validation, suitable primers are combined to form several primer pairs. Given the limit on the PCR product's length, only pairs that yield a product of the desired length are kept. Note that gaps are removed to calculate the real length instead of the alignment length. The resolution of the sub-alignment is then recalculated to remove false-positive primer pairs.
Finally, primer pairs are reordered by score to make it easy for the user to find the "best" primer pairs.
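The sliding-window scan described above can be illustrated with a minimal sketch. It is not BarcodeFinder's actual implementation: the function names, the toy alignment and the window and step sizes are all hypothetical, and only the observed resolution (unique sequences divided by total sequences) is used as the variance measure.

```python
def observed_resolution(seqs):
    """Number of unique sequences divided by number of sequences (0 to 1)."""
    return len(set(seqs)) / len(seqs)


def sliding_window(alignment, window=4, step=2):
    """Yield (start, resolution) for each window along the alignment."""
    length = len(alignment[0])
    for start in range(0, length - window + 1, step):
        fragments = [seq[start:start + window] for seq in alignment]
        yield start, observed_resolution(fragments)


# Toy alignment (hypothetical data): conserved flanks, variable middle.
alignment = [
    "ACGTAAAAACGT",
    "ACGTACCAACGT",
    "ACGTAGGAACGT",
    "ACGTATTAACGT",
]
scan = list(sliding_window(alignment))
# The first window with the highest resolution is treated as the hotspot.
hotspot = max(scan, key=lambda item: item[1])
```

In this toy case the conserved flanks score low and the windows that overlap the variable middle score high, which is the pattern used to place primers on both sides of a hotspot.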
BarcodeFinder could be used to:
- Collect data from Genbank. Full support for Genbank's query syntax and an optimised download process make it easy to use.
- Convert gb files to fasta. The software makes good use of the annotations in gb files to generate well-organised fasta files. In particular, the extraction of complete taxonomic ranks can be extremely useful for phylogenetic researchers.
- Clean data. Various strategies are offered to remove redundant sequences. Several filters are also provided to pick out abnormal sequences.
- Evaluate sequence polymorphism. Several methods are supported to calculate the variance of a whole alignment and to mark high-variance regions. Ambiguous bases and gaps are handled. Phylogenetic methods are used to provide robust results.
- Design universal primers. Abundant options, a smart algorithm and strict validation result in reliable primers.
- Discover novel DNA barcodes for specific taxa. The automatic and efficient process can significantly reduce the work researchers need to find new barcodes.
BarcodeFinder requires few computational resources. A normal PC/laptop is good enough. For a huge analysis that covers a large taxonomic group, a better computer may save time.
- Python3 (3.5 or above)
- BLAST+
- IQ-TREE
- MAFFT
- Biopython
- coloredlogs
- matplotlib
- numpy
- primer3-py
The data retrieval function requires an Internet connection. Please ensure a stable network and reasonable Internet traffic charge for downloading large-sized data sets.
We assume that users have already installed Python3 (3.5 or above). Windows users should use Python 3.6 or 3.7 if the dependent packages fail to install.
Firstly, install BarcodeFinder.
The easiest way to do so is to use pip. Make sure pip is not out of date (18.0 or newer), then
# As administrator
pip3 install BarcodeFinder
# Normal user
pip3 install BarcodeFinder --user
For some versions of Python (e.g. Python 3.7 or above), pip may ask the user to provide a compiler. If a user does not have a compiler (especially Windows users), we recommend this website. Users can download the compiled wheel file and use pip to install it:
# As administrator
pip3 install wheel_file_name
# Normal user
pip3 install wheel_file_name --user
Secondly, users need to install the dependent software. BarcodeFinder has an assistant function to install dependent software automatically if it cannot find the software, i.e., users can skip this step if they wish. However, it is highly recommended that the official installation procedure be followed for ease of management and a clean working directory.
To avoid the compiler requirement when installing with pip, please try to use Python 3.6.
For Linux users with root privileges, just use the package manager:
# Ubuntu and Debian
sudo apt install mafft ncbi-blast+ iqtree
# Fedora (1)
sudo dnf install mafft ncbi-blast+ iqtree
# Fedora (2)
sudo yum install mafft ncbi-blast+ iqtree
# ArchLinux
sudo pacman -S mafft ncbi-blast+ iqtree
# FreeBSD
sudo pkg install mafft ncbi-blast+ iqtree
For MacOS users with root privileges, install brew if it has not been installed previously:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
If any errors occur, install Xcode from the App Store and retry.
Then:
brew install blast mafft brewsci/science/iqtree
If using Windows or lacking root privileges, users should follow these instructions:
-
BLAST+
-
MAFFT
Pick the package that matches the operating system:
- "All-in-one version": download and unzip. Then follow the steps in the BLAST+ installation manual to set the PATH.
- "Portable package": download and unzip. Then follow the instructions of BLAST+ to set the PATH for MAFFT.
-
IQ-TREE
Download the installer according to OS. Unzip and add the path of subfolder bin to PATH.
BarcodeFinder is a command line program. Once a user opens the command line (Windows) or terminal (Linux and MacOS), just type the command:
# Windows
python -m BarcodeFinder [input] -[options] -out [out_folder]
# Linux and MacOS
python3 -m BarcodeFinder [input] -[options] -out [out_folder]
- Download all rbcL sequences of plants (Viridiplantae) and pre-process them:
# Windows
python -m BarcodeFinder -gene rbcL -taxon Viridiplantae -stop 1 -out rbcL_all_plant
# Linux and macOS
python3 -m BarcodeFinder -gene rbcL -taxon Viridiplantae -stop 1 -out rbcL_all_plant
- Download all ITS sequences of Rosa. Pre-process them and keep redundant sequences:
# Windows
python -m BarcodeFinder -query internal transcribed spacer -taxon Rosa -stop 1 -out Rosa_its -uniq no
# Linux and macOS
python3 -m BarcodeFinder -query internal transcribed spacer -taxon Rosa -stop 1 -out Rosa_its -uniq no
- Download all Poaceae chloroplast genome sequences in the RefSeq database, plus one's own data. Then pre-process them and evaluate the variance (skip primer design):
# Windows
python -m BarcodeFinder -og cp -refseq -taxon Poaceae -out Poaceae_cpg -fasta my_data.fasta -stop 2
# Linux and macOS
python3 -m BarcodeFinder -og cp -refseq -taxon Poaceae -out Poaceae_cpg -fasta my_data.fasta -stop 2
- Download sequences of Zea mays, set length between 100 bp and 3000 bp, plus one's aligned data, and then run a full analysis:
# Windows
python -m BarcodeFinder -taxon "Zea mays" -min_len 100 -max_len 3000 -out Zea_mays -aln my_data.aln
# Linux and macOS
python3 -m BarcodeFinder -taxon "Zea mays" -min_len 100 -max_len 3000 -out Zea_mays -aln my_data.aln
- Download all Oryza chloroplast genomes (not only in RefSeq database), keep the longest sequence for each species and run a full analysis:
# Windows
python -m BarcodeFinder -taxon Oryza -og cp -min_len 50000 -max_len 500000 -uniq longest -out Oryza_cp
# Linux and macOS
python3 -m BarcodeFinder -taxon Oryza -og cp -min_len 50000 -max_len 500000 -uniq longest -out Oryza_cp
BarcodeFinder accepts:
- Genbank queries. Users can use "-query" or combine with other filters;
- unaligned fasta files. Each file is considered one locus when evaluating the variance;
- alignments (fasta format); and
- Genbank format files.
Note that ambiguous bases are allowed in sequence. If users want to use "*" or "?" to represent a series of files, they should make sure to use quotation marks. For example, "*.fasta" (include quotation marks) means all fasta files in the folder.
BarcodeFinder uses a uniform sequence ID for all fasta files that it generates.
SeqName|Kingdom|Phylum|Class|Order|Family|Genus|Species|Accession|SpecimenID|Type
# example
rbcL|Poales|Poaceae|Oryza|longistaminata|MF998442|TAN:GB60B-2014|
The order of the fields is fixed. The fields are separated by vertical bars ("|"). The space character (" ") is not allowed and is replaced by an underscore ("_"). Due to missing data, some fields may be empty.
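For downstream scripts, an ID in this format can be split back into named fields. The following is a minimal sketch, not part of BarcodeFinder: the `parse_seq_id` helper and the example ID are hypothetical, and the sketch assumes that all eleven fields are present (empty fields included).

```python
from collections import namedtuple

# Field order as given in the format line above.
FIELDS = ("SeqName", "Kingdom", "Phylum", "Class", "Order", "Family",
          "Genus", "Species", "Accession", "SpecimenID", "Type")
SeqID = namedtuple("SeqID", FIELDS)


def parse_seq_id(seq_id):
    """Split a sequence ID on "|" and return the named fields."""
    parts = seq_id.split("|")
    if len(parts) != len(FIELDS):
        raise ValueError("expected {} fields, got {}".format(
            len(FIELDS), len(parts)))
    return SeqID(*parts)


# Hypothetical ID with empty Class and SpecimenID fields.
example = ("rbcL|Viridiplantae|Streptophyta||Poales|Poaceae|"
           "Oryza|longistaminata|MF998442||gene")
parsed = parse_seq_id(example)
```

Empty fields come back as empty strings, so a check such as `if parsed.Class:` is enough to detect missing data.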
-
SeqName
SeqName refers to the name of a sequence. Usually it is the gene name. For an intergenic spacer, an underscore ("_") is used to connect the two genes' names, e.g., "geneA_geneB".
If a valid sequence name cannot be found in the annotations of the Genbank file, BarcodeFinder will use "Unknown" instead.
For chloroplast genes, if "-rename" option is set, the program will try to use regular expressions to fix potential errors in gene names.
-
Kingdom
The kingdom (Fungi, Viridiplantae, Metazoa) of a species. For convenience, a superkingdom (Bacteria, Archaea, Eukaryota, Viruses, Viroids) may be used if the kingdom information for a sequence is missing.
-
Phylum
The phylum of the species.
-
Class
The class of the species.
Because the class of some plant species is left empty in Genbank (for instance, basal angiosperms), BarcodeFinder will guess the class of the species.
Given the taxonomy information in Genbank file:
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; basal Magnoliophyta; Amborellales; Amborellaceae; Amborella.
BarcodeFinder will use "basal Magnoliophyta" as the class because this expression is located before the order ("Amborellales").
-
Order
The order name of the species.
-
Family
The family name of the species.
-
Genus
The genus name of the species, i.e., the first part of the scientific name.
-
Species
The specific epithet of the species, i.e., the second part of the scientific name of the species. It may contain the subspecies' name.
-
Accession
The Genbank Accession number for the sequence. It does not contain the record's version.
-
SpecimenID
The ID of the specimen of the sequence. Usually this value is empty.
-
Type
The type of the sequence. Could be "gene", "spacer", "intron" or another type.
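The class-guessing rule described under "Class" (take the lineage entry located just before the order) can be sketched as follows. This is only an illustration under a loud assumption: the order is recognised by the "-ales" suffix that plant orders carry, which may not be how BarcodeFinder detects it. The `guess_class` name is hypothetical.

```python
def guess_class(lineage):
    """Return the lineage entry right before the first order-like name.

    Assumption: an order is recognised by the plant suffix "ales".
    Returns an empty string if no such entry is found.
    """
    for index, name in enumerate(lineage):
        if index > 0 and name.endswith("ales"):
            return lineage[index - 1]
    return ""


# The Amborella lineage quoted in the text above.
lineage = [
    "Eukaryota", "Viridiplantae", "Streptophyta", "Embryophyta",
    "Tracheophyta", "Spermatophyta", "Magnoliophyta",
    "basal Magnoliophyta", "Amborellales", "Amborellaceae", "Amborella",
]
guessed = guess_class(lineage)
```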
All results will be put in the output folder. If the user does not set the output path via "-out", BarcodeFinder will create a folder labelled "Result".
-
a.gb
The raw Genbank file. The a comes from the query's keyword.
-
a.plus
The raw Genbank file plus extended annotations for spacers and introns.
-
a.fasta
The converted fasta file of the ".gb" file.
-
b.primer.csv
The list of primer pairs in CSV (comma-separated values text) format. The b is the name of the locus/fragment (usually a gene or spacer).
Its title:
Locus,Score,Samples,AvgProductLength,StdEV,MinProductLength,MaxProductLength,Coverage,Resolution,TreeValue,AvgTerminalBranchLen,Entropy,LeftSeq,LeftTm,LeftAvgBitscore,LeftAvgMismatch,RightSeq,RightTm,RightAvgBitscore,RightAvgMismatch,DeltaTm,AlnStart,AlnEnd,AvgSeqStart,AvgSeqEnd
-
Locus
The name of the locus/fragment.
-
Score
The score of this pair of primers. Usually the higher, the better.
-
Samples
The number of sequences which were used to find this pair of primers.
-
AvgProductLength
The average length of the DNA fragment amplified by this pair of primers.
-
StdEV
The standard deviation of the product length. A higher number means that the primer pair may amplify DNA fragments of different lengths.
-
MinProductLength
The minimum length of an amplified fragment.
-
MaxProductLength
The maximum length of an amplified fragment. Note that all of these fields are calculated using given sequences.
-
Coverage
The coverage of this pair of primers over the sequences it used. Calculated with the BLAST result. High coverage means that the pair is much more "universal".
-
Resolution
The observed resolution of the sub-alignment sliced by the primer pair, which is equal to the number of unique sequences divided by the number of total sequences. The value is between 0 and 1.
-
TreeValue
The tree resolution of the sub-alignment, which is equal to the number of internal nodes on a phylogenetic tree (constructed from the alignment) divided by number of terminal nodes. The value is between 0 and 1.
-
AvgTerminalBranchLen
The average of the terminal branch's length.
-
Entropy
The Shannon equitability index of the sub-alignment. The value is between 0 and 1.
-
LeftSeq
Sequence of the forward primer. The direction is 5' to 3'.
-
LeftTm
The melting temperature of the forward primer. The unit is degrees Celsius (°C).
-
LeftAvgBitscore
The average raw bitscore of the forward primer, which is calculated by BLAST.
-
LeftAvgMismatch
The average number of mismatched bases of the forward primer, as counted by BLAST.
-
RightSeq
Sequence of reverse primer. The direction is 5' to 3'.
-
RightTm
The melting temperature of the reverse primer. The unit is degrees Celsius (°C).
-
RightAvgBitscore
The average raw bitscore of the reverse primer, which is calculated by BLAST.
-
RightAvgMismatch
The average number of mismatched bases of the reverse primer, as counted by BLAST.
-
DeltaTm
The difference in the melting temperatures of the forward and reverse primers. A pair of primers with a high DeltaTm may result in failure during the PCR experiment.
-
AlnStart
The location of the beginning of the forward primer (5', leftmost of primer pairs) in the entire alignment.
-
AlnEnd
The location of the end of the reverse primer (5', rightmost of primer pairs) in the entire alignment.
-
AvgSeqStart
The average beginning of the forward primer in the original sequences. ONLY USED FOR DEBUG.
-
AvgSeqEnd
The average end of the reverse primer in the original sequences. ONLY USED FOR DEBUG.
The primer pairs are sorted by Score. Since the score may not fully satisfy the user's specific considerations, it is suggested that primer pairs be chosen manually if the first primer pair fails during the PCR experiment.
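When choosing primer pairs manually, the CSV file can be filtered with a few lines of Python. The sketch below uses only the standard library and a small in-memory table with made-up values. The thresholds and the `pick_pairs` helper are hypothetical, and Coverage is assumed to be stored as a fraction between 0 and 1.

```python
import csv
import io

# Made-up rows that use a subset of the columns listed above.
sample = io.StringIO(
    "Locus,Score,Coverage,Resolution,DeltaTm\n"
    "rbcL,0.82,0.95,0.71,1.2\n"
    "matK,0.90,0.60,0.88,0.8\n"
    "atpB_rbcL,0.75,0.91,0.80,6.5\n"
)


def pick_pairs(handle, min_coverage=0.8, max_delta_tm=5.0):
    """Keep pairs with enough coverage and a small Tm difference."""
    kept = []
    for row in csv.DictReader(handle):
        if (float(row["Coverage"]) >= min_coverage
                and float(row["DeltaTm"]) <= max_delta_tm):
            kept.append(row)
    # Highest score first, as in the original file.
    return sorted(kept, key=lambda row: float(row["Score"]), reverse=True)


picked = pick_pairs(sample)
```

For a real result file, replace `sample` with an open file handle, for example `open("b.primer.csv", newline="")`.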
-
-
b.primer.fastq
The fastq format file of a primer pair's sequences. It contains two sequences, and the direction is 5' to 3'. The first is the forward primer, and the second is the reverse primer. The quality of each base is equal to its proportion in the corresponding column of the alignment. Note that the sequence may contain ambiguous bases if they were not disabled.
-
b.pdf
The PDF format of the figure containing the sliding-window scan result of the alignment.
-
b.png
The PNG format of the figure containing the sliding-window scan result of the alignment.
-
b.variance.tsv
The CSV format of the sliding-window scan result. "Index" means the location of the base in the alignment. Note that the value DOES NOT mean the variance of the base column; instead, it refers to the variance of the fragment starting from this column.
-
Log.txt
The log file. Contains all the information printed on the screen.
-
Options.json
The JSON file stores all options that the user inputs.
-
Loci.csv
The summary of all loci/fragments, which only contains the variance information for each fragment. The only new field, GapRatio, means the ratio of the gap ("-") in the alignment. A higher value means that the sequences may have too many insertions/deletions or the alignment is not reliable.
-
by_name
The folder contains "undivided" sequences and intermediate results. Actually, they are "roughly divided" sequences. The original Genbank file is first divided into different fasta files if its records have different contents. Usually, one Genbank record contains several annotated regions (multiple genes, for example). If two records contain the same series of annotations (in the same order), they are put into the same fasta file. Each file contains the intact sequence from the related Genbank record.
-
by_gene
The folder contains divided sequences and intermediate results. After the dividing step that produces by_name, BarcodeFinder divides each cluster of Genbank records into several fasta files so that each file contains only one region (one locus, one gene, one spacer or one misc_feature) of the annotation.
For instance, a record in a "rbcL.gb" file may also contain the atpB gene's sequence. The "rbcL.fasta" file does not contain any upstream/downstream sequences (except for ".expand" files), and "atpB_rbcL.fasta" does not have even one base of the atpB or rbcL gene, just the spacer (assuming that the annotation is precise).
Users can skip this dividing step by setting "-no_divide" to use the whole sequence for analysis. Note that doing so DOES NOT skip the first dividing step.
These two folders can usually be ignored. However, a user may utilise one of these intermediate results (especially for those who only use BarcodeFinder to collect data from Genbank):
-
b.fasta
The raw fasta file converted directly from the Genbank file containing only sequences of one locus/fragment.
-
b.expand
To design primers, BarcodeFinder extends a sequence upstream and downstream. Users can use "-expand 0" to skip the expansion. The files generated by the following steps all have ".expand" in their filenames.
-
b.uniq
Non-redundant sequences.
-
b.uniq.aln
The alignment of the fasta file.
-
b.uniq.candidate.fasta
The candidate primers. This file may contain thousands of records. We do not recommend paying attention to it.
-
b.uniq.candidate.fastq
Again, the candidate primers. This time, the file has quality information, which equals each base's proportion in its column of the alignment.
-
b.uniq.consensus.fastq
The fastq format of the consensus sequence of the alignment. Note that it contains alignment gaps ("-"). Although this may be the most useful file in the folder, it is NOT RECOMMENDED that it be used directly because the consensus-generating algorithm is optimised for primer design. Hence, the consensus sequence may be different from the "real" consensus.
-
-
-h
Prints help messages for the program. It is highly recommended to use this option to see the list of options and their default values.
-
-aln filename
Alignment files that the user provides. The filename can consist of one file or a series of files. One can use "?" and "*" to represent one or any characters. Be sure to use quotation marks. For example, "a*.alignment" means any file starting with the letter "a" and ending with ".alignment".
It only supports the fasta format. Ambiguous bases and gaps ("-") are supported.
-
-fasta filename
User-provided unaligned fasta files. Also supports "*" and "?". If the user wants to use "-uniq" function, the sequences should be renamed. See the format for the sequence ID above.
-
-gb filename
User-provided Genbank file or files.
-
-stop value
Stops BarcodeFinder at a specific step. BarcodeFinder provides an all-in-one solution to find novel DNA barcodes. However, some users may only want to use one module. The value could be
- 1 Only collect data and do pre-processing (download, divide, remove redundant, rename, expand); or
- 2 Do step 1, and then analyse the variance. Do not design primers.
-
-out value
The output folder's name. All results will be put into the output folder. If the user does not set an output path via "-out", BarcodeFinder will create a folder named "Result".
BarcodeFinder does not overwrite the existing folder with the same name.
It is HIGHLY RECOMMENDED to use only letters, numbers and underscores ("_") in the folder name to avoid mysterious errors caused by other Unicode characters.
-
-allow_mosaic_spacer
If one gene is nested within another gene, normally there is no spacer between them.
However, some users want the fragments between the two genes' beginnings and ends. This option serves this specific purpose. It is not recommended for normal usage.
-
-allow_repeat
If a gene is repeated downstream, this option allows the repeated region to be extracted; otherwise, any repeated region is omitted.
The default value is False.
-
-allow_invert_repeat
If two genes are repeated downstream in inverted order, this option allows the spacer between them to be extracted; otherwise, the spacer is omitted.
For instance, geneA-geneB is located in one inverted-repeat (IR) region of a chloroplast genome, and geneB-geneA is located in the other IR region. This option will extract the sequences in the two different directions as two unique spacers.
The default value is False.
-
-email address
BarcodeFinder uses Biopython to handle the communication between the user and the NCBI Genbank database. The database requires the user to provide an email address so that NCBI can contact the user in case of abnormal situations. By design, the default address is empty.
However, for the convenience of the user, BarcodeFinder will use "[email protected]" if the user does not provide an email address.
-
-exclude option
Use this option to add a negative condition to the query. For instance, "-exclude Zea[organism]" (do not include quotation marks) will add " NOT (Zea[organism])" to the query.
This option can be useful for excluding a specific taxon.
-taxon Zea -exclude "Zea mays"[organism]
This will query all records in the genus Zea, while records of Zea mays will be excluded.
For more complex exclusions, please consider using "Advanced search" on the Genbank website.
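How the exclusion ends up in the query string can be sketched as below. Only the " NOT (...)" suffix comes from the description above. The `build_query` helper is hypothetical, and mapping "-taxon" to an "[organism]" term is an assumption made for illustration.

```python
def build_query(taxon=None, exclude=None):
    """Compose a Genbank query string from a taxon and an exclusion."""
    query = ""
    if taxon:
        # Quote multi-word names, as NCBI's query grammar requires.
        name = '"{}"'.format(taxon) if " " in taxon else taxon
        # Assumption: the taxon is searched in the [organism] field.
        query = "{}[organism]".format(name)
    if exclude:
        # Documented behaviour: append " NOT (<exclude>)" to the query.
        query += " NOT ({})".format(exclude)
    return query


query = build_query("Zea", '"Zea mays"[organism]')
```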
-
-gene name
The gene's name which the user wants to query in Genbank. If the user wants to use logical expressions like "OR", "AND", "NOT", s/he should use "-query" instead. If there is space in the gene's name, make sure to use quotation marks.
Note that "ITS" is not a gene name--it is "internal transcribed spacer".
Sometimes "-gene" options may bring in unwanted sequences. For example, if a user queries "rbcL[gene]" in Genbank, spacers containing rbcL or rbcL's upstream/downstream gene may be found, such as "atpB_rbcL spacer" or atpB.
-
-group value
Restricts the query to a group of species at the superkingdom or kingdom level. The value can be
- animals
- plants
- fungi
- protists
- bacteria
- archaea
- viruses
It is reported that the "group" filter may return abnormal records, for instance, plant records when the group is "animals" and the organelle is "chloroplast". Furthermore, it may match a great number of records in Genbank. Hence, we strongly recommend using "-taxon" instead.
The default value is empty.
-
-min_len value
The minimum length of the records downloaded from Genbank. The default value is 100 (bp). The number must be an integer.
-
-max_len value
The maximum length of the records downloaded from Genbank. The default value is 10000 (bp). The number must be an integer.
-
-molecular type
The molecular type, which could be DNA or RNA. The default type is empty.
-
-og type (or -organelle type)
Adds "organelle[filter]" to a query to limit results to a given organelle type only.
The type could be
- mitochondrion, or mt
- plastid, or pl
- chloroplast, or cp
Usually, users only want organelle genomes instead of fragments. One solution for leaving out fragments is to set "-min_len" and "-max_len" to use a length filter to obtain genomes. Another simple solution is to use RefSeq only by adding the "-refseq" option.
For instance,
# all chloroplast sequences of Poaceae (not only in RefSeq)
-taxon Poaceae -og chloroplast -min_len 50000 -max_len 300000
# all chloroplast sequences of Poaceae (only in RefSeq)
-taxon Poaceae -og chloroplast -refseq
Make sure not to make any typo (e.g., chlorplast, or mitochondria).
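A small validation step can catch such typos before a query is sent. The sketch below is hypothetical and not taken from BarcodeFinder: the `organelle_filter` helper maps the short forms listed above to full names and rejects anything else. The exact filter string that BarcodeFinder sends to Genbank is an assumption.

```python
# Short forms and full names as listed for "-og".
ALIASES = {"mt": "mitochondrion", "pl": "plastid", "cp": "chloroplast"}


def organelle_filter(value):
    """Return a "<organelle>[filter]" term or raise on an unknown type."""
    name = value.lower()
    name = ALIASES.get(name, name)
    if name not in ALIASES.values():
        raise ValueError("unknown organelle type: {}".format(value))
    # Assumption: the filter term has the form "<organelle>[filter]".
    return "{}[filter]".format(name)


term = organelle_filter("cp")
```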
-
-query string
The query string provided by the user. It behaves in the same manner as the query the user typed into the Search Box in NCBI Genbank's webpage.
Make sure to follow NCBI's grammar for queries. It can contain several words. Remember to add quotation marks if an item contains more than one word, for instance, "Homo sapiens"[organism].
Do not add quotation marks at the beginning and end of the query string. For instance, "cbs[gene] AND "Homo sapiens"[organism]" may return empty results.
-
-refseq
Ask BarcodeFinder to only query sequences in the RefSeq database. RefSeq is considered to be of higher quality than the other sequences in Genbank.
If the user sets this option, "-min_len" and "-max_len" will be removed to set no limit on the sequence length. If the user really wants to set a length limit when using "-refseq", s/he can put this filter into the query string:
# Get all Poaceae sequences in RefSeq, sequences should be 1000 to 10000 bp
-query refseq[filter] -taxon Poaceae -min_len 1000 -max_len 10000
Usually, this option will be combined with "-og" to obtain organelle genomes.
Note that this option is of Boolean type. It IS NOT followed with a value.
-
-seq_n value
Downloads only part of the records. The value should be an integer.
The default value is None, i.e., download all records.
-
-taxon taxonomy
The taxonomy name. It could be any taxonomic rank from kingdom (same as "-group") to species, as long as the user inputs the correct name (the scientific name of the species or taxonomic group in Latin, NOT ENGLISH). It will restrict the query to the targeted taxonomic unit. Make sure to use quotation marks if the name has more than one word.
-
-expand value
The expand length for going upstream/downstream. If set, BarcodeFinder will expand the sequence to its upstream/downstream after the dividing step to find primer candidates. Set the number to 0 to skip.
The default value is 0 if users set "-stop" to 1 or 2, i.e., users do not want to run the primer-design process.
If users run the whole process but forget to set "-expand", BarcodeFinder will automatically set "-expand" to 200. However, users can force the program not to expand the sequence by setting it to 0.
-
-max_name_len value
The maximum length of a feature name. Some feature names in Genbank annotations are too long, and usually they do not belong to the target sequence the user wants. By setting this option, BarcodeFinder will truncate an annotation's feature name if it is too long. By default, the value is 50.
-
-max_seq_len value
The maximum length of a sequence for one annotation. Some annotations' sequences are too long (for instance, one gene has two exons, and its intron is longer than 10 Kb). This option will skip those long sequences. By default, the value is 20000 (bp).
Note that this option is different from "-max_len". This option limits the length of one annotation's sequence, while "-max_len" limits the length of the whole sequence of one Genbank record.
For an organelle genome's analysis, if the user sets the "-no_divide" option, this option will be ignored.
-
-no_divide
If set, it will analyse the whole sequence instead of the divided fragments. By default, BarcodeFinder divides one Genbank record into several fragments according to its annotation.
Note that this option is of Boolean type. It IS NOT followed with a value.
-
-rename
If set, the program will try to rename genes. For instance, "rbcl" will be renamed to "rbcL", and "tRNA UAC" will be renamed to "trnVuac", which consists of "trn", the amino acid's letter and transcribed codon. This may be helpful if the annotation has nonstandard uppercase/lowercase or naming format so it can merge the same sequences to one file for the same locus having variant names.
If using Windows, consider using this option to avoid contradictory filenames.
It is also of Boolean type. The default is not to rename.
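The two renaming examples above can be reproduced with a short sketch. It is not BarcodeFinder's own code: the regular expressions and the `rename_gene` helper are hypothetical, the standard genetic code is assumed, and the lowercase-prefix rule for names such as "rbcl" is a simplification.

```python
import re

# Standard genetic code, codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
TRNA = re.compile(r"^tRNA[\s_-]*([ACGTU]{3})$", re.IGNORECASE)
SHORT_GENE = re.compile(r"^[A-Za-z]{3}[A-Za-z0-9]{1,2}$")


def rename_gene(name):
    """Normalise a gene name; return it unchanged if no rule applies."""
    name = name.strip()
    match = TRNA.match(name)
    if match:
        anticodon = match.group(1).upper().replace("U", "T")
        # The codon is the reverse complement of the anticodon.
        codon = anticodon.translate(COMPLEMENT)[::-1]
        amino = CODON_TABLE[codon]
        return "trn" + amino + anticodon.replace("T", "U").lower()
    if SHORT_GENE.match(name):
        # Assumed rule: three lowercase letters, then an uppercase suffix.
        return name[:3].lower() + name[3:].upper()
    return name


renamed = [rename_gene("tRNA UAC"), rename_gene("rbcl")]
```

The anticodon UAC pairs with the codon GUA, which encodes valine (V), hence "trnVuac".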
-
-uniq method
The method used to remove redundant sequences. BarcodeFinder will remove redundant sequences to ensure only one sequence per species by default. A user can change its behaviour by setting different methods.
-
longest
Keep the longest sequence for one species. The program will compare the sequence's length from the same species' same locus.
-
random
The program will randomly pick one sequence for one species' one locus if there is more than one sequence.
-
first
According to the records' order in the original Genbank file, only the first sequence of the same species' same locus will be kept. Others will be ignored directly. This is the default option due to performance considerations.
-
no
Skip this step. All sequences will be kept.
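The four methods can be summarised in a short sketch. It only illustrates the behaviour described above and is not BarcodeFinder's implementation. The `remove_redundant` helper and the tuple layout (species, locus, sequence) are hypothetical.

```python
import random


def remove_redundant(records, method="first", seed=None):
    """Keep one record per (species, locus) according to the method.

    records: list of (species, locus, sequence) tuples in file order.
    """
    if method == "no":
        return list(records)
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault((record[0], record[1]), []).append(record)
    kept = []
    for group in groups.values():
        if method == "first":
            kept.append(group[0])
        elif method == "longest":
            kept.append(max(group, key=lambda rec: len(rec[2])))
        elif method == "random":
            kept.append(rng.choice(group))
        else:
            raise ValueError("unknown method: {}".format(method))
    return kept


# Made-up records for illustration.
records = [
    ("Oryza sativa", "rbcL", "ACGT"),
    ("Oryza sativa", "rbcL", "ACGTACGT"),
    ("Zea mays", "rbcL", "ACG"),
]
first = remove_redundant(records, "first")
longest = remove_redundant(records, "longest")
```

The "first" method needs no comparison between sequences, which is consistent with it being the default for performance reasons.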
-
-
-fast
If set, BarcodeFinder will skip the calculations for the "tree resolution" and "average terminal branch length" to reduce running time.
Although the program has been optimized greatly, the phylogenetic tree's inference can be time consuming if there are too many species (for instance, 10,000). If the user wants to analyse organelle genomes and the species are too numerous, the user can set this option to reduce time. The "tree resolution" and "average terminal branch length" will both become 0 in the results file.
-
-step value
The step length for the sliding-window scan. The default value is 50. If the input dataset is too large, an extremely small value (such as 1 or 2) may require too much time, especially when the "-fast" option is not used.
-
-a value
The maximum number of ambiguous bases allowed in one primer. The default value is 4.
-
-c value
The minimum coverage of the base and primer. The default value is 0.6 (60%). It is used to remove primer candidates if its coverage among all sequences is smaller than the threshold. The coverage of primers is calculated by BLAST.
Also, it is used to generate a consensus sequence. For one column, if the proportion of one type of base (A, T, C, G) is smaller than the threshold, the program will try to use an ambiguous base that represents two types of bases, then three, then four ("N").
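The consensus rule for one alignment column can be sketched as follows. This is one reading of the description above, not BarcodeFinder's own code: bases are added from most to least frequent until their combined proportion reaches the threshold, and the proportion is assumed to be calculated over non-gap bases only. The `consensus_base` name is hypothetical.

```python
from collections import Counter

# IUPAC codes keyed by the sorted set of bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def consensus_base(column, threshold=0.6):
    """Return the consensus symbol for one alignment column."""
    counts = Counter(base for base in column.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return "-"  # column holds only gaps or unknown symbols
    chosen = []
    covered = 0
    for base, count in counts.most_common():
        chosen.append(base)
        covered += count
        if covered / total >= threshold:
            break
    return IUPAC["".join(sorted(chosen))]


symbols = [consensus_base("AAAAAAAAGT"), consensus_base("AAAAAGGGGT")]
```

In the second column, A alone covers only 50% of the bases, so A and G together are reported as "R".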
-
-m value
The maximum number of mismatched bases in a primer. This option is used to remove primer candidates if the BLAST results show too many mismatches. The default value is 4.
-
-pmin value
The minimum length of the primer. The default value is 18.
-
-pmax value
The maximum length of the primer. The default value is 24.
-
-r value
The minimum observed resolution of the fragments or primer pairs. The default value is 0.5. It is used to skip conserved fragments (alignment or sub-alignment defined by a pair of primers).
BarcodeFinder uses the observed resolution instead of others for several reasons:
-
speed
The calculation of the observed resolution is very fast.
-
accuracy
Due to the existence of possible alignment errors, the observed resolution may be higher than the resolutions obtained via other evaluation methods. Hence, it is used as a lower bound. That is to say, the program considers that a fragment with a low observed resolution may not have a satisfactory tree resolution either.
By setting it to 0, BarcodeFinder can skip this filtration step. Meanwhile, the running time may be extremely long.
-
-
-t value
Only keeps value pairs of primers for each highly variant region. The default value is 1, i.e., only keep the best primer pair. To choose the best pairs of primers, the Score each pair received is used. To keep more pairs, set "-t" to more than 1.
-
-tmin value
The minimum product length (including primers). The default value is 300 (bp). Note that this limits the PCR product's length instead of the sub-alignment's length.
-
-tmax value
The maximum product length (including primers). The default value is 500 (bp). Note that it limits the length of the PCR product given by the primer pair instead of the alignment.
The "-tmin" and "-tmax" options are used to screen primer candidates. BarcodeFinder uses BLAST results to locate the primers on each template sequence and calculates the average length of the products. The same locus may have different lengths in different species, and the alignment is stretched by the gaps added during aligning, so please consider leaving some margin when setting these two options.
For instance, if a user wants the amplified length to be smaller than 800 and greater than 500, s/he could consider setting "-tmin" to 550 and "-tmax" to 750.
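The difference between the alignment length and the real product length can be seen in a small sketch. It is only an illustration: BarcodeFinder locates primers with BLAST, whereas this sketch simply slices a toy alignment and strips the gaps. The helper names and the data are hypothetical.

```python
from statistics import mean


def product_lengths(alignment, aln_start, aln_end):
    """Gap-free length of each sequence between two alignment columns."""
    return [len(seq[aln_start:aln_end].replace("-", ""))
            for seq in alignment]


def within_limits(alignment, aln_start, aln_end, tmin=300, tmax=500):
    """Check whether the average product length lies in [tmin, tmax]."""
    average = mean(product_lengths(alignment, aln_start, aln_end))
    return tmin <= average <= tmax


# Toy alignment: 10 columns, but real lengths differ per sequence.
alignment = [
    "ACGT--ACGT",
    "ACGTTTACGT",
    "AC------GT",
]
lengths = product_lengths(alignment, 0, 10)
```

Here the sub-alignment is 10 columns wide, while the real products are 8, 10 and 4 bases long, which is why some margin for "-tmin" and "-tmax" is helpful.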
For a taxon that is not very large and includes few fragments, BarcodeFinder can finish the task in minutes. For a large taxon (such as the Asteraceae family or the whole order Poales) and multiple fragments (such as the chloroplast genomes), the time to complete may be one hour or more on a PC or laptop.
BarcodeFinder requires little memory (usually less than 0.5 GB, although for a large taxon BLAST may require more) and few CPUs (one core is enough). It can run very well on a normal PC. Multiple CPU cores may be helpful for the alignment and tree construction steps.
For Windows users, MAFFT may be very slow due to anti-virus software. Please consider following this instruction to install Ubuntu on Windows to obtain better results.