-
Notifications
You must be signed in to change notification settings - Fork 52
3a. Running DRAM
Once DRAM is set up, you are ready to annotate some MAGs. The following command will generate your full annotation:
DRAM.py annotate -i 'my_bins/*.fa' -o annotation
my_bins
should be replaced with the path to a directory which contains all of your bins you would like to annotate. If you only need to annotate a single genome (or an entire assembly) a direct path to a nucleotide fasta should be provided. Using 20 processors, DRAM.py takes about 17 hours to annotate ~80 MAGs of medium quality or higher from a mouse gut metagenome.
In the output annotation
folder, there will be various files. genes.faa
and genes.fna
are fasta files with all genes called by prodigal with additional header information gained from the annotation as nucleotide and amino acid records respectively. genes.gff
is a GFF3 with the same annotation information as well as gene locations. scaffolds.fna
is a collection of all scaffolds/contigs given as input to DRAM.py annotate
with added bin information in the headers. annotations.tsv
is the most important output of the annotation. This includes all annotation information about every gene from all MAGs. Each line is a different gene and each column contains annotation information. trnas.tsv
contains a summary of the tRNAs and rrnas.tsv
contains a summary of the rRNAs found in each MAG and.
- Input files: Key input quality files
- Input fasta: fasta file or a string with wildcards (e.g. ‘MAGs/*.fa’) that leads to multiple fastas
- Bin taxonomy: a tab separated table with the first column being the names of the bins used as input to DRAM and another column with the header 'classification'. The output of GTDB-tk is already in this format and can be used directly.
- Bin quality: a tab separated table with the first column being the names of the bins used as input to DRAM and additional columns with the headers 'Completeness' giving genome completeness information and 'Contamination' giving genome contamination information. The output of checkM is already in this format and can be used directly.
- Output directory: Folder to be created that will store output files
- Technical Parameters: These parameters effect results and resource use
- Minimum contig size: Used for gene prediction, the defalt is 2500,
- Minimum bit score for MMSeqs2 searches: 60
- Minimum bit score for reverse best hit MMSeqs2 searches: 350
- Number of threads to use: 10
- Speed and Resources: These arguments make DRAM run faster but at a cost.
- Use Uniref: Drastically decreases run time and memory requirements if set to False: True
- Low Mem Mode, Skip annotating with uniref, use kofam over KEGG genes: False
- Skip Trnascan: False
- Troubleshooting: These additional arguments are popular for more specialized analyses.
- Keep Tmp Dir', action='store_true', default=False)
- Threads, number of processors to use: 10
- Specialization: These additional arguments are popular for more specialized analyses.
- Custom fasta Databases Locations: Used in blast style searches, see "Additional Databases" section.
- Custom fasta Databases Names: Used in blast style searches, see "Additional Databases" section.
- Custom HMM Databases Names: Used in HMM searches, see "Additional Databases" section.
- Custom HMM Databases Locations: Used in HMM searches, see "Additional Databases" section.
- Custom HMM Cutoff Locations: Used in HMM searches, see "Additional Databases" section.
- GTDB taxonomy,'Summary file from gtdbtk taxonomy assignment from bins, can be used multiple times'
- Less Popular: These arguments are rarely used, but you may need them in some cases
- Mode of prodigal:
- Translation table for prodigal.
- Tab separated file (.tsv) with all the annotations from Pfam, KEGG, UniProt, dbCAN, and MEROPS databases for all genes in all the input genomes
- GenBank files for each genome
- Single gene-finding format (.gff) file of all annotations across genomes
- Single fasta format file (.fasta) of each open reading frame nucleotide sequence and best ranked annotation (see Annotation grades section)
- Single fasta format file (.fasta) of each translated open reading frame amino acid sequence and best ranked annotation KEGG annotation
- Tab separated files (.tsv) with tRNAs and rRNAs
DRAM has the ability to load a config file at runtime, to be used for that run alone and not stored. This addresses two key issues. DRAM’s config file is kept in the same directory as the installed DRAM package and this can be a problem for many users depending on how their system is set up and who has the ability to install and update packages. Additionally, some users need to use more than one DRAM database at one time which is impossible to do elegantly by changing the config for all users on a system
You can build a config file in the standard manner and export it DRAM-setup.py export_config > my_config.json
, or use an empty config file available here on GitHub and fill in the paths yourself. A more elegant solution will be provided in DRAM2.
Assuming you have created a config file in json format named my_conf.json you can run DRAM annotation with it in 2 ways.
DRAM_CONFIG_LOCATION=”/full/path/my_config.json”
# OR
DRAM_CONFIG_LOCATION="$(realpath ./my_config.json)"
Then run DRAM annotation normally. This has the downside of having a hidden effect on all DRAM runs after the environment variable is set so be shure to use unset
to remove it when you are done.
Simply run DRAM annotate with the addition of the --config_loc
argument
DRAM.py annotate --config_loc ./my_config.json ….
You can now specify additional databases to be used with DRAM! Both custom HMMs and fastas can be specified for DRAM to annotate against.
Specify the location of a fasta file to search against with the --custom_fasta_loc
argument, and give it a name with the --custom_db_name
argument. The fasta will be annotated against with a blast stile search using mmseqs2, and the results will appear in the annotations.tsv under the column <name>_id
where <name>
is the string passed to the --custom_fasta_loc
argument.
Similarly, you can specify the location of a custom HMM database to be searched using the --custom_hmm_loc
argument. The name of the HMM database can be specified with the --custom_hmm_name
argument, and the results of the mmseqs2 hmm search will be found in the annotations.tsv under the column <name>_id
, where <name>
is specified by --custom_hmm_name
and the hit will be the id from the hmm. You can also pass a custom cutoffs file with the --custom_hmm_cutoffs_loc
argument.
The hmm cutoffs file must be a tab separated volume (tsv) with 3 columns:
- An index column with no duplicates, where each entry is the ID of a HMM profile in the HMM file. This must be the first column, but needs no header.
-
score_type
a string being 'domain' or 'full' to indicate which score computed by mmseqs should be compared to the value in the 'threshold' column. You can use '-' if there should be no evaluation and the search hits should all be included. -
threshold
A column holding threshold values, scores below this value will be removed, which score is controlled by score 'type'. - 'definition' an optional extra definition that will appear in the column
<name>_hits
It is important to note that one hmm file can have as many HMM profiles as you desire, but if you provide a custom cutoffs file, you need to provide a row for each hmm.
All arguments can be specified multiple times for multiple files for as many files as you want. The name arguments must match the order of the location arguments. The custom cutoffs locations for hmms should also be in matching order to their hmm location arguments. If you have more hmms than cutoff files, the cutoffs will be applied to the first hmm listed. The example below should provide clarity on the subject.
We have created several new integrated databases that will be available in DRAM2 when released. However, with the custom database feature, you can use one new metabolism for methylation now! In order to do this, you will need to use both a custom search database and a custom distillation sheet (learn more here) and a custom search db.
You must use DRAM 1.4.0 or later for this but, provided you have a compatible version setup you can use this new metabolism with commands like those below.
To Annotate with methyl, do something
wget https://raw.githubusercontent.com/WrightonLabCSU/DRAM/master/data/methylotrophy/methylotrophy.faa
DRAM.py annotate -i '/some/path/*.fasta' -o dram_output --threads 30 --custom_db_name methyl --custom_fasta_loc methylotrophy.faa
To Distill with methyl:
wget https://raw.githubusercontent.com/WrightonLabCSU/DRAM/master/data/methylotrophy/methylotrophy_distillate.tsv
DRAM.py distill -i dram_output/annotations.tsv -o dram_output/distillate --custom_distillate methylotrophy_distillate.tsv
If all the instructions were followed then a new sheet will appear in your metabolism summary XLSX file named Methyl, with full distillate information.
Suppose you have 2 FASTAs fasta_1.fa
and fasta_2.fa
you want to name them A_seqs and B_seqs respectively. You also have 2 hmms profils_cutoffs.hmm
and profiles_bitscor.hmm
, the first hmm you want to use custom cutoffs in a tsv file named cutoffs.tsv
and the second you want to use bit score, you want to name them hmm_A, and hmm_B. The command that you would need to use is then.
# Assume all files are in the working directory.
DRAM.py annotate \
-i <input path or regex> \
-o <output path> \
--threads 20 \
--custom_fasta_loc ./fasta_1.fa \
--custom_db_name A_seqs \
--custom_fasta_loc ./fasta_2.fa \
--custom_db_name B_seqs \
--custom_hmm_name hmm_A \
--custom_hmm_loc ./profils_cutoffs.hmm\
--custom_hmm_cutoffs_loc ./cutoffs.tsv \
--custom_hmm_name hmm_B \
--custom_hmm_loc ./profiles_bitscor.hmm # this hmm will use bitscores
Again, the order of the arguments determines the interpretation.
You can annotate already called genes in DRAM, if necessary, and get most of the same output, the gff output will not be generated for example but annotations will be made and distillation will be possible. You may want to check that the names of your genes are to your liking in the FAA file as those names will be carried over and may be changed by the software that you generated the file with, including Dram. You should also note that this command works on only one faa at a time. As of the 1.3 release custom HMMs are not supported though custom FASTAs are. The commands are otherwise the same. See the example below.
Suppose you have a FASTA of Amino Acid Sequences fasta.faa
. The command that you would need to use is then.
# Assume all files are in the working directory.
DRAM.py annotate-genes \
-i <input path to the genes faa, NO REGEX> \
-o <output path> \
--threads 20 \
--custom_fasta_loc ./fasta_1.fa \
--custom_db_name A_seqs \
--custom_fasta_loc ./fasta_2.fa \
--custom_db_name B_seqs \
After your annotation is finished, you can summarize these annotations with the following command:
DRAM.py distill -i annotation/annotations.tsv -o genome_summaries --trna_path annotation/trnas.tsv --rrna_path annotation/rrnas.tsv
This command will generate three files. The first is called genome_summary.xlsx
which contains a summary of metabolisms present in each genome. It gives gene by gene information across various metabolisms for every genome in your dataset.
The genome_statistics.tsv
file contains all measures required by the MIMAG about each fasta used as input. Finally, the liquor.html is an interactive html that allows users to hover over each box to see what genes prompted the box color (Example here) and was manually curated to consider alternate genes for pathways and single processes. This heat map allows the user to quickly profile ecosystem relevant processes across hundreds of genomes.
- Input files
- Annotations: annotations.tsv file generated during the annotate step
- Output directory: Directory to create and write outputs
- rRNA path: rrnas.tsv file generated during the annotation step (optional)
- tRNA path: trnas.tsv file generated during the annotation step (optional)
- Default Parameters
- Grouping: fasta column from annotations.tsv file
- Distillate
- Tab separated file (.tsv) with genome statistics for all input genomes, including all statistics required by recently defined MIMAG standards (Bowers, et al.)
- Tab separated file (.tsv) with metabolism summary of all input genomes, which gives gene counts of functional and structural genes across a wide variety of metabolisms
- Liquor
- HTML (.html) file containing an interactive heatmap showing coverage of pathways, the coverage of electron transport chain components, and the presence of selected metabolic functions
- Tab separated file (.tsv) with corresponding genes for the html heatmap.
- --custom_distillate', help="Custom distillate form to add your own modules")
DRAM now has the ability to use custom distillation modules, for users to specify their own genes of interest.
These custom modules can be passed to the DRAM distill command with the argument --custom_distillate <path to distillate>
.
If the custom distillate sheet is properly specified, then a new module will be created in the distillate, and you will be able to see the results in both the product and metabolism summary.
The custom distillate sheet should be a tab separated volume(tsv) file with all the columns specified below. This sheet is an extension of the genome_summary_form.tsv that comes with dram and which is one of several key data files that characterizes the distillate. In the future, all these files may be customizable. If this is a route that you want to go, you should look at the (genome_summary_form.tsv)[https://github.com/WrightonLabCSU/DRAM/blob/master/data/genome_summary_form.tsv] to understand the format. The columns of this form are simply:
* gene_id: The KO ids of the genes you are interested in
* gene_description: Descriptions of the geans
* module: The name of your module that you are adding
* sheet: The name you would like on the excel sheet in which your results appear. If you use an existing sheet name like MISC you may have a hard time finding your results amongst the default results on that sheet.
* header: The header that will appear in the dram metabolism summary
* subheader: The sub header that will appear in the metabolism summary
After you have completed your annotation and distillation, you may want to further analyze genes of interest by making trees or functional modeling. To pull the genes you can use DRAM.py strainer. For example, if you want to pull all pmoa/amoa genes based on KEGG annotations you can make a tree:
DRAM.py strainer --identifiers K10944 -i annotations.tsv -f genes.faa -o amoa_pmoa_genes.faa
Or you might want to blast a few specific genes:
DRAM.py strainer --genes bin.2_scaffold_2_3 bin.4_scaffold_12_42 -i annotations.tsv -f genes.dna -on my_genes.fna
Or maybe you only want to see genes that are involved in glycolysis or the TCA cycle that are from bins from the Roseburia genus:
DRAM.py strainer -i hmp_bins/annotations.tsv -f hmp_bins/genes.fna -o genes.roseburia.glycoloysis_tca.fna --taxonomy g__Roseburia --categories glycolysis TCA
- Input files:
- Annotations: annotations.tsv file generated during the annotate step
- Input fasta: genes fasta file (.faa or .fna) to be filtered
- Output fasta: location to save filtered fasta file
- Default Parameters:
- Fastas: None
- Scaffolds: None
- Genes: None
- Identifiers: None