Skip to content

Preparing reference data sets

Will Haese-Hill edited this page Mar 28, 2024 · 8 revisions

Preparing reference data for Companion

This document will briefly outline the steps necessary to set up a reference database for the Companion annotation pipeline.

Why do I need a reference database?

Companion uses pre-compiled information about the reference organisms to drive certain reference-based steps, for example:

  • contig-chromosome alignment to order and orientate contigs to form pseudochromosomes,
  • protein-DNA alignment for improving ab initio gene finding accuracy and pseudogene annotation,
  • transfer of highly conserved gene models from a single reference organism,
  • transfer of functional annotation from orthologous features in a single reference organism to the newly annotated genome, and
  • quick identification of core genes, missing core genes and singletons w.r.t. a set of reference organisms.

This means that for each reference organism, the pipeline requires gene structures, genome sequence, protein sequences for each coding gene, and an annotation of which sequences in the reference are chromosomes (as they could be intermingled with unassembled contigs in the input sequence file).

Compiling a reference database here means collecting all reference information needed, such as sequence, annotation (structural and functional), trained gene models for AUGUSTUS, as well as orthologous groups across the reference species to compare to members in the new genome. The next sections will explain what data is needed.

Required input files

For each reference organism, the input data are needed:

  • a descriptive name of the organism
  • a short abbreviation for the organism
  • the genome sequence in a single file
  • the annotated protein sequence in a single file
  • a structural gene annotation in GFF3 format (see below for details)
  • functional GO annotation in GAF 1.0 format, on the gene level
  • a pattern matching chromosome headers, describing how to extract chromosome numbers from them
  • AUGUSTUS models, trained on reference genes

Expected sequence format

Sequences must be in FASTA format, with one-word headers without special characters such as ?<";|\{}[]+=- etc. The width of the individual lines does not matter (i.e. whole sequences on one line are fine, as are sequences separated over many lines).

Expected GFF3 structure

The features in the input GFF3 file should have the following minimal type structure (e.g. should be connected like this in via Parent attributes):

gene
└───mRNA
    └───CDS

pseudogene
└─── pseudogenic_transcript
    └───pseudogenic_exon

polypeptide (derives_from mRNA)

gene
└───rRNA

gene
└───tRNA

gene
└───ncRNA

The file must be sorted and strictly valid GFF3! Make sure your file is validated, or when in doubt clean it up with GenomeTools:

gt gff3 -sort -retainids -tidy input.gff3 > input.clean.gff3

Then, continue working with the cleaned file input.clean.gff3. The command above will also provide error messages if the file is broken in such a way that it cannot be automatically fixed.

Please note that the polypeptide features should all have a 'product' attribute of the following form (example): product=term%3Dser%2Fthr protein phosphatase family protein, i.e. the product description for that polypeptide, prefixed by 'term=' and GFF3-escaped.

The exports from GeneDB (ftp://ftp.sanger.ac.uk/pub/project/pathogens/gff3) all follow this format so they should be usable as reference input out of the box. For flat file tables from EuPathDB ('Gene Information Tables', e.g. from http://amoebadb.org/common/downloads/Current_Release/EhistolyticaHM1IMSS/txt), we also provide a Python library and script to convert them, EuPathTables.

Alternatively, we provide Python scripts to gather all required input data in the correct format for given organism(s) using the VEuPathDB web service (https://github.com/sii-companion/eupathws).

Structure of a reference database

A reference database is technically just a directory subtree with a predefined structure. Hence it is not a real 'database' in the strongest sense, but filesystem based. It is usually identified by the topmost directory in this subtree, in which each organism in turn has its own subdirectory. For instance, for a reference directory references with three reference organisms LmjF, LbrM and Tb927 you would have the following tree:

references
├───references-in.json
├───references.json
├───_all
|    └─── all_orthomcl.out
├───LmjF
|    └─── ...
├───LbrM
|    └─── ...
└───Tb927
     └─── ...

You might have noticed the files references-in.json and references.json. These contain metadata about the reference species and are read by the pipeline to access the data in the reference database. While this structure might sound complicated, it in fact isn't. The next sections will give a short walkthrough of this process.

There should also be a file all_annotated_proteins.fasta in the parent directory of references. This should consist of all annotated proteins relevant to the species contained within the references directory, such as a relevant OrthoDB clade (https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/), and/or all annotated proteins of the child species concatenated.

Software requirements

The following software is required to create a new reference directory:

Importing reference data

To create a new reference set, the steps explained here should be sufficient. First, create a new directory for the set:

$ mkdir reference
$ cd reference

In this directory, create a new file references-in.json with the following contents (example):

{
 "species" : {
    "LmjF" : {
      "gff" : "../genomes/L_major.gff3",
      "genome" : "../genomes/L_major.fasta",
      "proteins" : "../proteins/L_major_Proteins.fasta",
      "gaf" : "../genomes/L_major.gaf",
      "name" : "Leishmania major Friedlin",
      "augustus_model" : "../../data/augustus/species/leishmania_major_sampled/",
      "chromosome_pattern" : "LmjF.(%d+)"
    },
    "LbrM" : {
      "gff" : "../genomes/L_braziliensis.gff3",
      "genome" : "../genomes/L_braziliensis.fasta",
      "proteins" : "../proteins/L_braziliensis_Proteins.fasta",
      "gaf" : "../genomes/L_braziliensis.gaf",
      "name" : "Leishmania braziliensis",
      "augustus_model": "../../data/augustus/species/leishmania_braziliensis_sampled/",
      "chromosome_pattern" : "LbrM.([0-9.]+)$"
    },
    "Tb927" : {
      "gff" : "../genomes/T_brucei_927.gff3",
      "genome" : "../genomes/T_brucei_927.fasta",
      "proteins" : "../proteins/T_brucei_927_Proteins.fasta",
      "gaf" : "../genomes/T_brucei_927.gaf",
      "name" : "Trypanosoma brucei TREU927",
      "augustus_model": "../../data/augustus/species/trypanosoma_brucei_927_sampled/",
      "chromosome_pattern" : "Tb927_(%d+)_v%d"
    },
  },
  "groups" : {
    "Leishmania" : ["LmjF", "LbrM"],
    "Trypanosoma" : ["Tb927"]
  }
}

See the JSON format documentation to learn more about the syntax of this format. Each new reference species must have an entry in the species hash, with its short abbreviation as the key. The chromosome_pattern fields must contain a Lua pattern which has to match all the chromosome header in the input sequence set, and also capture a numeric string to be interpreted as the chromosome number. This is needed later for pseudochromosome contiguation using this species as a reference. The groups hash assigns each of the reference species to a more general group, e.g. a genus. This is useful in Companion's downstream analysis steps to determine the relevant orthologous genes for tree drawing and family comparison.

The script update_references.lua in Companion's bin directory takes care of actually collecting the files referenced by the references-in.json, preprocessing them and finally copying them to their expected positions in the directory tree. Just execute the script while in the directory containing the references-in.json.

$ update_references.lua 
Importing LbrM...
Importing LmjF...
Importing Tb927...
$

The sequence, annotations and models have now been imported from the locations specified in the references-in.json file. If everything went smoothly, there should now be a references.json file in the current working directory, alongside the referenced-in.json. Check whether that file exists, is not empty, and looks lika a proper JSON file.

To use the reference set for annotation only, this is enough to get you going! Just use the directory containing the references.json file as the value of the ref_dir parameter in your Companion job parameter file.

Preprocessing orthologous groups

For some of the downstream analyses to work, you will also need a set of precomputed orthologs. These are just the cluster output of OrthoMCL 1.4. So to generate them just do something alike to the following (from your reference root directory):

mkdir _all
cd _all
find .. -name 'proteins.fasta' | xargs cat > all_prots.fasta
find .. -name 'ggline.gg' | xargs cat > all_gg.txt
formatdb -i all_prots.fasta
blastall -p blastp -m 8 -e 1e-5 -d all_prots.fasta -i all_prots.fasta > b.out 
orthomcl.pl --mode 3 --blast_file b.out --gg_file all_gg.txt

Then, move the all_orthomcl.out file from the created orthomclXXXX directory (where XXXXX is a number) to the _all directory. That's it!

Training gene predictor models

To do de novo prediction with a reference genome as the template, AUGUSTUS must be trained on reference genes.

AUGUSTUS

This HOWTO follows the general training approach introduced by: http://www.molecularevolution.org/molevolfiles/exercises/augustus/training.html

The basic set of commands is:

new_species.pl --species=my_species_code
gff2gbSmallDNA.pl reference.gff3 reference.fasta  1000 sampled.gb
etraining --species=my_species_code sampled.gb

This will create the necessary files in the corresponding species subdirectory in the AUGUSTUS_CONFIG_PATH directory that can be referenced in the references-in.json file. Please note that it might improve the performance of the model if the optimize_augustus.pl script is run on the resulting models. Please check the training HOWTO linked above for more details on this step.

Reference update pipeline

To automate the building of a reference dataset for a given VEuPathDB domain, we provide the reference update pipeline (https://github.com/sii-companion/reference-update). This will automatically go through each of the steps outlined in this guide using a Nextflow pipeline configured with a set of parameters params_default.config. Here the user can specify the domain (e.g. https://plasmodb.org/plasmo) to extract and format all references from, or gather the raw input files themselves and place them in a directory (from_local parameter in the config file) to allow the pipeline to construct the dataset from this selection only.