Home

Documentation v. 20170920

University of Vigo, Spain, http://darwin.uvigo.es

1. About the reference selector

This tool was developed for simulations of targeted-sequencing experiments under a known species/gene tree distribution. It extracts the reference sequences that would have been used as targets in the probe design.

2. Assumptions

We assume a SimPhy - NGSphy simulation pipeline scenario; that is, the input follows SimPhy's hierarchical folder structure and sequence labeling.
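As a rough illustration of what this assumption implies, the sketch below walks a SimPhy-style project folder and collects the FASTA files of each replicate. The function name `list_replicate_fastas` is hypothetical, and two layout details are assumptions rather than facts stated here: that replicate folders are the sub-directories with all-digit names, and that per-locus files are named `<input_prefix>_<locus>.fasta`.

```python
from pathlib import Path


def list_replicate_fastas(simphy_path, input_prefix):
    """Map each replicate folder name to the sorted list of its FASTA files.

    Hypothetical helper; not part of refselector itself.
    """
    root = Path(simphy_path)
    # Assumption: replicate folders are the sub-directories with all-digit names.
    replicate_dirs = sorted(
        (p for p in root.iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )
    result = {}
    for rep_dir in replicate_dirs:
        # Assumption: per-locus files are named <input_prefix>_<locus>.fasta
        fastas = sorted(rep_dir.glob(f"{input_prefix}_*.fasta"))
        if fastas:
            result[rep_dir.name] = fastas
    return result
```

Folders without matching FASTA files are skipped, so the returned mapping only lists replicates that could actually contribute reference sequences.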

3. Input

- [SimPhy](https://github.com/adamallo/simphy) folder path
- prefix of the existing [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files
- prefix for the output files
- method indicating how to obtain the reference sequences
- (optional) file with the description of the sequences that will be used as reference
- (optional) length of the run of N's used to separate the sequences when they are concatenated

4. Output

- The output is a directory of FASTA files.
- There is one FASTA file per replicate generated for the current SimPhy project.
- Each file contains all the selected loci, either concatenated or as a multiple alignment file.
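The two output layouts can be sketched as follows. This is a minimal illustration, not the program's actual code: the function name `format_references` and the record label `reference` used for the concatenated sequence are hypothetical.

```python
def format_references(references, n_size=None, concat_label="reference"):
    """Render selected loci as FASTA text.

    references: list of (description, sequence) tuples, one per locus.
    n_size: None -> one FASTA record per locus;
            int  -> a single record with loci joined by n_size N's.
    concat_label is a hypothetical record name, not taken from refselector.
    """
    if n_size is None:
        # Multi-record layout: every selected locus keeps its own header.
        return "".join(f">{desc}\n{seq}\n" for desc, seq in references)
    # Concatenated layout: loci separated by a run of N's.
    spacer = "N" * n_size
    joined = spacer.join(seq for _desc, seq in references)
    return f">{concat_label}\n{joined}\n"
```

For example, `format_references([("a", "ACGT"), ("b", "TT")], n_size=3)` yields a single record whose sequence is `ACGTNNNTT`.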

5. Installation

# 1. Clone repository
git clone https://github.com/merlyescalona/refselector.git
# 2. Move to folder
cd refselector
# 3. Install
python setup.py install --user

6. Usage

The SimPhy/NGSphy reference selector does not have a Graphical User Interface (GUI) and works on the Linux/Mac command line in a non-interactive fashion.

usage: refselector  -s <path>   -ip <input_prefix>
                    -op <output_prefix>     -o <output_path>
                    -m <method_code>    [ -n <N_seq_size> ]
                    [ -sdf <sequence_descriptions_file_path> ]
                    [-l <log_level>] [-v] [-h]

6.1. Required parameters

-s <path>,--simphy-path <path>:
- description: Path of the SimPhy folder.
- type: string (path)
-ip <input_prefix>,--input-prefix <input_prefix>:
- description: Prefix of the FASTA filenames.
- type: string
-p <ploidy>,--ploidy <ploidy>:
- description: Ploidy of the dataset.
- type: number (integer)
- values: [1,2] (default: 1)
-op <output_prefix>,--output-prefix <output_prefix>:
- description: Prefix for the output filename.
- type: string
-o <output_path>,--output <output_path>:
- description: Path where output will be written.
- type: string (path)
-m <method_code>,--method <method_code>:
- description: Specified method to obtain the reference loci used for the design of probes.
- type: number (int) in the closed interval [0,4].
- values:
  - [0] Uses the outgroup sequence as the reference locus (default).
  - [1] Extracts a specific sequence per locus. Requires the parameter -sdf/--seq-desc-file.
  - [2] Selects a random sequence from any of the ingroups.
  - [3] Randomly selects a species and generates a consensus sequence from the sequences belonging to that species.
  - [4] Generates a consensus sequence from all the sequences involved.

NOTE: The higher the method number, the longer it will take to generate the reference loci.
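Methods 3 and 4 build a consensus sequence, but this page does not state which consensus rule is applied. As one plausible reading, the sketch below computes a per-column majority-rule consensus over aligned sequences of equal length. The function name `majority_consensus`, the tie-breaking rule (the residue seen first in the column wins), and the treatment of gaps as ordinary characters are all assumptions.

```python
from collections import Counter


def majority_consensus(sequences):
    """Majority-rule consensus of aligned, equal-length sequences.

    Hypothetical sketch; refselector may use a different rule.
    """
    if not sequences:
        raise ValueError("no sequences given")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*sequences):
        # most_common(1) returns the most frequent residue; on ties,
        # the residue encountered first in the column is returned.
        residue, _count = Counter(column).most_common(1)[0]
        consensus.append(residue)
    return "".join(consensus)
```

For method 3 the input would be the sequences of one randomly chosen species; for method 4 it would be every sequence of the locus.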

6.2. Optional parameters

-n <N_seq_size>, --nsize <N_seq_size> :
- description: Number of N's inserted between the selected reference sequences. If this parameter is not set, the output file for each replicate is a multiple sequence alignment file; otherwise, it is a single-sequence file consisting of the selected reference sequences concatenated and separated by as many N's as this parameter specifies.
- type: number (int), N_seq_size >= 0.
-sdf <sequence_descriptions_file_path>, --seq-desc-file <sequence_descriptions_file_path>:
- description: Required when method = 1 is selected. Tab-separated file that identifies which sequence is selected per locus and per replicate.
- type: string (path)
- format:

replicate_ID    locus_ID    sequence_description_locus

Example:

1    1   1_0_0 # Replicate 1, locus 1, sequence 1_0_0
1    2   2_0_0 # Replicate 1, locus 2, sequence 2_0_0
2    1   1_0_1 # Replicate 2, locus 1, sequence 1_0_1
2    2   1_0_3 # Replicate 2, locus 2, sequence 1_0_3
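A file in this format can be read with a few lines of Python. The sketch below is illustrative only: the function name `parse_seq_desc` is hypothetical, and it assumes that anything after a `#` is an annotation to be ignored, as in the example rows. Fields are split on any whitespace, which also accepts tab-separated input.

```python
def parse_seq_desc(text):
    """Parse sequence-description rows into {(replicate, locus): description}.

    Hypothetical helper; assumes '#' starts a trailing comment.
    """
    selection = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        # Drop the trailing comment (assumption) and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        replicate, locus, description = fields
        selection[(int(replicate), int(locus))] = description
    return selection
```

With the example rows, looking up `(2, 1)` in the returned dictionary gives `1_0_1`.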

-l <log_level>, --log <log_level>:
- description: Level of logging shown on standard output. The entire log is stored in a separate file when the level is DEBUG.
- type: enumeration
- values:
  - DEBUG: shows very detailed information of the program's process.
  - INFO (default): shows only information about the state of the program.
  - WARNING: shows only system warnings.
  - ERROR: shows only execution errors.

6.3. Information parameters

-v, --version: Show program's version number and exit.
-h, --help: Show help message and exit.

