Merly Escalona edited this page Oct 17, 2017 · 7 revisions

Documentation v. 20170920

© 2017 Merly Escalona ([email protected])

University of Vigo, Spain, http://darwin.uvigo.es

1. About the reference selector

The reference selector was developed for simulations of targeted-sequencing experiments under a known species/gene tree distribution. It extracts the reference sequences that would have been used as targets in the probe design.

2. Assumptions

  • The program assumes a SimPhy - NGSphy simulation pipeline scenario; that is, the input follows SimPhy's hierarchical folder structure and sequence labeling.

3. Input

- [SimPhy](https://github.com/adamallo/simphy) folder path
- prefix of the existing [FASTA](https://en.wikipedia.org/wiki/FASTA_format) files
- prefix for the output files
- method indicating how to obtain the reference sequences
- (optional) file with the description of the sequences that will be used as reference
- (optional) length of the N sequence that will be used to separate the sequences when concatenated

4. Output

  • The output will be a directory of FASTA files
  • There will be as many FASTA files as replicates generated for the current SimPhy project
  • Each file will contain all the selected loci, either concatenated or as a multiple alignment file
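
The two output layouts can be illustrated with a minimal Python sketch. The function `format_references`, the `label` argument and the example records are hypothetical and not part of refselector; the sketch only mirrors the behaviour described in this documentation, assuming that supplying an N length (see -n/--nsize) is what selects the concatenated layout.

```python
def format_references(records, nsize=None, label="reference"):
    """Return FASTA text for one replicate (illustrative only).

    records: list of (description, sequence) tuples, one per selected locus.
    nsize:   None     -> one FASTA entry per locus (multiple-sequence file);
             int >= 0 -> a single entry with the loci joined by `nsize` N's.
    """
    if nsize is None:
        # One entry per selected locus
        return "".join(f">{desc}\n{seq}\n" for desc, seq in records)
    if nsize < 0:
        raise ValueError("nsize must be >= 0")
    spacer = "N" * nsize
    joined = spacer.join(seq for _, seq in records)
    return f">{label}\n{joined}\n"


loci = [("1_0_0", "ACGT"), ("2_0_0", "GGCC")]
print(format_references(loci))            # two FASTA entries
print(format_references(loci, nsize=3))   # one entry, loci separated by NNN
```

With `nsize=0` the loci are concatenated directly, without any separator.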

5. Installation

# 1. Clone repository
git clone https://github.com/merlyescalona/refselector.git
# 2. Move to folder
cd refselector
# 3. Install
python setup.py install --user

6. Usage

The SimPhy/NGSphy reference selector does not have a Graphical User Interface (GUI) and works on the Linux/Mac command line in a non-interactive fashion.

usage: refselector  -s <path>   -ip <input_prefix>
                    -op <output_prefix>     -o <output_path>
                    -m <method_code>    [ -n <N_seq_size> ]
                    [ -sdf <sequence_descriptions_file_path> ]
                    [-l <log_level>] [-v] [-h]

6.1. Required parameters

  • -s <path>,--simphy-path <path>:
    • description: Path of the SimPhy folder.
    • type: string (path)
  • -ip <input_prefix>,--input-prefix <input_prefix>:
    • description: Prefix of the FASTA filenames.
    • type: string
  • -p <ploidy>,--ploidy <ploidy>:
    • description: Ploidy of the dataset.
    • type: number (integer)
    • values: [1,2] (default: 1)
  • -op <output_prefix>,--output-prefix <output_prefix>:
    • description: Prefix for the output filename.
    • type: string
  • -o <output_path>,--output <output_path>:
    • description: Path where output will be written.
    • type: string (path)
  • -m <method_code>,--method <method_code>:
    • description: Method used to obtain the reference loci for the design of probes.
    • type: number (int) in the closed interval [0,4].
    • values:
      • [0] Considers the outgroup sequence as the reference loci (default).
      • [1] Extracts a specific sequence per locus. Needs parameter -sdf/--seq-desc-file.
      • [2] Selects a random sequence from any of the ingroups.
      • [3] Randomly selects a species and generates a consensus sequence from the sequences belonging to that species.
      • [4] Generates a consensus sequence from all the sequences involved.

NOTE: The higher the method number, the longer it will take to generate the reference loci.
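
Methods 3 and 4 both build a consensus sequence. This documentation does not state how the consensus is computed, so the following is only a sketch that assumes a simple per-column majority rule over aligned sequences of equal length; `majority_consensus` is a hypothetical helper, not a refselector function.

```python
from collections import Counter


def majority_consensus(sequences):
    """Per-column majority-rule consensus of equal-length aligned sequences.

    Assumption for this sketch: ties are resolved in favour of the base
    that appears first in the column.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    for column in zip(*sequences):
        # most_common(1) yields the most frequent character in this column
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


aligned = ["ACGT", "ACGA", "TCGA"]
print(majority_consensus(aligned))
```

Under this assumption, method 3 would pass only the sequences of the randomly chosen species, while method 4 would pass every sequence of the locus.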

6.2. Optional parameters

  • -n <N_seq_size>, --nsize <N_seq_size> :
    • description: Number of N's inserted to separate the selected reference sequences. If the parameter is not set, the output file per replicate will be a multiple sequence alignment file; otherwise, the output will be a single-sequence file per replicate, consisting of the selected reference sequences concatenated and separated by as many N's as set for this parameter.
    • type: number (int) where x >= 0.
  • -sdf <sequence_descriptions_file_path>, --seq-desc-file <sequence_descriptions_file_path>
    • description: When method = 1 has been selected, a tab-separated file is required to identify which sequence will be selected per locus per replicate.
    • type: string (path)
    • format:
replicate_ID    locus_ID    sequence_description_locus
  • Example:
1    1   1_0_0 # Replicate 1, locus 1, sequence 1_0_0
1    2   2_0_0 # Replicate 1, locus 2, sequence 2_0_0
2    1   1_0_1 # Replicate 2, locus 1, sequence 1_0_1
2    2   1_0_3 # Replicate 2, locus 2, sequence 1_0_3
  • -l <log_level>, --log <log_level>
    • description: Level of log detail shown on standard output. The entire log is stored in a separate file when the level is DEBUG.
    • type: enumeration
    • values:
      • DEBUG: shows very detailed information of the program's process.
      • INFO (default): shows only information about the state of the program.
      • WARNING: shows only system warnings.
      • ERROR: shows only execution errors.
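
The file passed with -sdf/--seq-desc-file can be checked before a run with a few lines of Python. This is a sketch under assumptions: `read_seq_desc` is a hypothetical helper, columns are taken to be separated by tabs or runs of whitespace, and trailing `#` annotations like those in the example are stripped; whether refselector itself accepts such annotations is not stated here.

```python
import io


def read_seq_desc(handle):
    """Map (replicate_ID, locus_ID) -> sequence description."""
    selection = {}
    for lineno, raw in enumerate(handle, start=1):
        line = raw.split("#", 1)[0].strip()  # drop annotation, if any
        if not line:
            continue  # skip blank or comment-only lines
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(
                f"line {lineno}: expected 3 columns, got {len(fields)}"
            )
        replicate, locus, description = fields
        selection[(int(replicate), int(locus))] = description
    return selection


# Same rows as the example in the -sdf description
example = io.StringIO(
    "1\t1\t1_0_0\n"
    "1\t2\t2_0_0\n"
    "2\t1\t1_0_1\n"
    "2\t2\t1_0_3\n"
)
table = read_seq_desc(example)
print(table[(2, 2)])
```

Keying the table by replicate and locus makes it easy to spot a locus that has no sequence assigned before starting the selection.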

6.3. Information parameters

  • -v, --version: Show program's version number and exit.
  • -h, --help: Show help message and exit.