The Pathogen Annotation and Submission pipeline facilitates the running of several Python scripts, which validate metadata (QC), annotate assembled genomes, and submit to NCBI. Current implementation was tested using MPOX but future testing will seek to made the pipeline pathogen-agnostic.
- Overview
- Table of Contents
- Pipeline Summary
- Setup
- Quickstart
- Running the Pipeline
- Profile Options & Input Files
- Entrypoints
- Outputs
- Parameters
- Helpful Links
- Get in Touch
- Acknowledgements
The validation workflow checks if metadata conforms to NCBI standards and matches the input fasta file. The script also splits a multi-sample xlsx file into a separate .tsv file for each individual.
Currently, consists of two annotation options:
- (1) Liftoff
- The liftoff workflow annotates input fasta-formatted genomes and produces accompanying gff and genbank tbl files. The input includes the reference genome fasta, reference gff and your multi-sample fasta and metadata in .xlsx format. The Liftoff workflow was brought over and integrated from the Liftoff tool, responsible for accurately mapping annotations for assembled genomes.
- (2) VADR
- The VADR workflow annotates input fasta-formatted genomes and generates gff / tbl files. The inputs into this workflow are your multi-sample fasta, metadata in .xlsx format, and reference information for the pathogen genome which is included within this repository (found here). VADR is an existing package that was integrated into the pipeline and you can find more information about this tool at the following link: VADR Git Repo.
Submission workflow generates the necessary files for Genbank submission, generates a BioSample ID, then optionally uploads Fastq files via FTP to SRA. This workflow was adapted from SeqSender public database submission pipeline.
Before cloning, check if the following applies to you:
- CDC user with access to the Monkeypox group on Gitlab
- Require access to available submission config files
Then, follow the cloning instructions outlined here: cdc_configs_access
Otherwise, proceed with cloning the repo to your local machine:
git clone https://github.com/CDCgov/tostadas.git
Based on whether or not you are a CDC user and running on Scicomp servers, the setup steps will differ.
If you are not running the pipeline on CDC HPC, then perform steps directly below: (Non-CDC Setup), else if you are running on Scicomp servers then proceed to the setup steps under Scicomp Setup
The following steps are for running the pipeline in a non-CDC environment.
If you want to create the full-conda environment needed to run the pipeline outside of Nextflow (enables you to run individual python scripts), then proceed with steps 1-5 below.
If you simply want to run the pipeline using Nextflow only (this will be most users), then you would simply create an empty conda environment (skip step 3 but perform steps 1-2 and steps 4-5):
conda create --name tostadas
Nextflow will handle environment creation and you would only need to install the nextflow package locally vs the entire environment.
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge
export PATH="$HOME/mambaforge/bin:$PATH"
(3) Now you can create the conda environment and install the dependencies set in your environment.yml:
mamba env create -n tostadas -f environment.yml
(4) After the environment is created activate the environment. Always make sure to activate the environment with each new session.
source activate tostadas
** NOTE: You can check which environment is active by running the following conda command: conda env list
. The active environment will be denoted with an asterisk *
You need the Nextflow package to actually run the pipeline and have two options for installing it:
(5a) Using Mamba and the Bioconda Channel:
mamba install -c bioconda nextflow
(5b) Externally to mamba environment following the instructions here: Nextflow Install
The following steps are for running the pipeline on Scicomp at the CDC.
If you want to create the full-conda environment needed to run the pipeline outside of Nextflow (enables you to run individual python scripts), then proceed with the steps listed below here.
If you simply want to run the pipeline using Nextflow only (this will be most users), then you would simply initialize the nextflow module (skip all steps below):
ml nextflow
ml miniconda3
mamba env create -n tostadas -f environment.yml
conda activate tostadas
You need the Nextflow package to actually run the pipeline and can install it in the following manner:
Using mamba and the bioconda channel:
mamba install -c bioconda nextflow
The configs are set-up to run the default params with the test option
Expected Output:
nextflow version 22.10.0.5826
This is the default directory set in the nextflow.config file to allow for running the nextflow pipeline with the provided test input files.
(3) Change the submission_config
parameter within test_params.config
to the location of your personal submission config file.
** NOTE: You must have your personal submission configuration file set up before running the default parameters for the pipeline and/or if you plan on using sample submission at all. More information on setting this up can be found here: More Information on Submission
(4) Run the following nextflow command to execute the scripts with default parameters and with local run environment:
nextflow run main.nf -profile test,conda
The outputs of the pipeline will appear in the "nf_test_results" folder within the project directory (update this in the standard params set for a different output path).
** NOTE: Running the pipeline with default parameters (test) will trigger a wait time equal to # of samples * 180 seconds. This default parameter can be overridden by running the following command instead:
nextflow run main.nf -profile test,conda --submission_wait_time <place integer value here in seconds>
More information on the submission_wait_time
parameter can be found under Submission Parameters
The typical command to run the pipeline based on your custom parameters defined/saved in the standard_params.config (more information about profiles and parameter sets below) and created conda environment is as follows:
nextflow run main.nf -profile standard,conda
OR with the parameters specified in the .json/.yaml files with the following command:
nextflow run main.nf -profile standard,conda --<param name> <param value>
Other options for the run environment include docker
and singularity
. These options can be used simply by replacing the second profile option:
nextflow run main.nf -profile standard,<docker or singularity>
Either one of the above commands will launch the nextflow pipeline and show the progress of the subworkflow:process and checks looking similar to below depending on the entrypoint specified.
N E X T F L O W ~ version 22.10.0
Launching `main.nf` [festering_spence] DSL2 - revision: 3441f714f2
executor > local (7)
[e5/9dbcbc] process > VALIDATE_PARAMS [100%] 1 of 1 Γ’Εβ
[53/a833be] process > CLEANUP_FILES [100%] 1 of 1 Γ’Εβ
[e4/a50c97] process > with_submission:METADATA_VALIDATION (1) [100%] 1 of 1 Γ’Εβ
[81/badd3b] process > with_submission:LIFTOFF (1) [100%] 1 of 1 Γ’Εβ
[d7/16d16a] process > with_submission:RUN_SUBMISSION:SUBMISSION (1) [100%] 1 of 1 Γ’Εβ
[3c/8c7ba4] process > with_submission:RUN_SUBMISSION:GET_WAIT_TIME (1) [100%] 1 of 1 Γ’Εβ
[13/85f6f3] process > with_submission:RUN_SUBMISSION:WAIT (1) [ 0%] 0 of 1
[- ] process > with_submission:RUN_SUBMISSION:UPDATE_SUBMISSION -
USING CONDA
** NOTE: The default wait time between initial submission and updating the submitted samples is three minutes or 180 seconds per sample. To override this default calculation, you can modify the submission_wait_time parameter within your config or through the command line (in terms of seconds):
nextflow run main.nf -profile <param set>,<env> --submission_wait_time 360
Outputs will be generated in the nf_test_results folder (if running the test parameter set) unless otherwise specified in your standard_params.config file as output_dir param.
This section walks through the available parameters to customize your workflow.
Input files | File type | Description |
---|---|---|
fasta | .fasta | Multi-sample fasta file with your input sequences |
metadata | .xlsx | Multi-sample metadata matching metadata spreadsheets provided in input_files |
ref_fasta | .fasta | Reference genome to use for the liftoff_submission branch of the pipeline |
ref_gff | .gff | Reference GFF3 file to use for the liftoff_submission branch of the pipeline |
Input files | File type | Description |
---|---|---|
fasta | .fasta | Multi-sample fasta file with your input sequences |
metadata | .xlsx | Multi-sample metadata matching metadata spreadsheets provided in input_files |
ref_fasta | .fasta | Reference genome to use for the liftoff_submission branch of the pipeline |
ref_gff | .gff | Reference GFF3 file to use for the liftoff_submission branch of the pipeline |
submission_config | .yaml | configuration file for submitting to NCBI, sample versions can be found in repo |
The standard_params.config file found within the conf directory is where parameters can be adjusted based on preference for running the pipeline. First you will want to ensure the file paths are correctly set for the params listed above depending on your preference for submitting your results.
- Adjust your file inputs within standard_params.config ensuring accurate file paths for the inputs listed above.
- The params can be changed within the standard_params.config or you can change the standard.yml/standard.json file inside the params directory and pass it in with:
-params-file <standard_params.yml or standard_params.json>
- Note: DO NOT EDIT the main.nf file or other paths in the nextflow.config unless familiar with editing nextflow workflows
Within the nextflow pipeline the -profile
option is required as an input. The profile options with the pipeline include test and standard. These two options can be seen listed in the nextflow.config file. The test params should remain the same for testing purposes, but the standard profile can be changed to fit user preferences. Also within the nextflow pipeline there is the use of varying run environments as the second profile input. Nextflow expects at least one option for both of these configurations to be passed in: -profile <test/standard>,<conda/docker/singularity>
Now that your file paths are set within your standard.yml or standard.json or standard_params.config file, you will want to define whether to run the full pipeline with submission or without submission. This is defined within the standard_params.config file underneath the subworkflow section as run_submission run_submission = true/false
- Apart from this main bifurcation, there exists entrypoints that you can use to access specific processes. More information is listed in the table below.
The submission piece of the pipeline uses the processes that are directly integrated from SeqSender public database submission pipeline. It has been developed to allow the user to create a config file to select which databases they would like to upload to and allows for any possible metadata fields by using a YAML to pair the database's metadata fields which your personal metadata field columns. The requirements for this portion of the pipeline to run are listed below.
(A) Create Appropriate Accounts as needed for the SeqSender public database submission pipeline integrated into TOSTADAS:
- NCBI: If uploading to NCBI, an account is required along with a center account approved for submitting via FTP. Contact the following for account creation:[email protected].
- GISAID: A GISAID account is required for submission to GISAID, you can register for an account at https://www.gisaid.org/. Test submissions are first required before a final submission can be made. When your first test submission is complete contact GISAID at [email protected] to recieve a personal CID. GISAID support is not yet implemented but it may be added in the future.
(B) Config File Set-up:
- The template for the submission config file can be found in bin/default_config_files within the repo. This is where you can edit the various parameters you want to include in your submission.
Table of entrypoints available for the nextflow pipeline:
Workflow | Description |
---|---|
only_validate_params | Validates parameters utilizing the validate params process within the utility sub-workflow |
only_cleanup_files | Cleans-up files utilizing the clean-up process within the utility sub-workflow |
only_validation | Runs the metadata validation process only |
only_liftoff | Runs the liftoff annotation process only |
only_vadr | Runs the VADR annotation process only |
only_submission | Runs submission sub-workflow only. Requires specific inputs mentioned here: Required Files for Submission Entrypoint |
only_initial_submission | Runs the initial submission process but not follow-up within the submission sub-workflow. Requires specific inputs mentioned here: Required Files for Submission Entrypoint |
only_update_submission | Updates NCBI submissions. Requires specific inputs mentioned here: Required Files for Submission Entrypoint |
- Documentation for using entrypoints with NF can be found at Nextflow_Entrypoints under section 5.
The following command can be used to specify entrypoints for the workflow:
nextflow run main.nf -profile <param set>,<env> -entry <insert option from table above>
If you are using the only_submission
or only_initial_submission
entrypoint, you must define the paths for the following parameters:
submission_only_meta
: path to the directory containing validated metadata files (one .tsv per sample)submission_only_fasta
: path to the directory containing split fasta files (one .fasta per sample)submission_only_gff
: path to the directory containing the cleaned and reformatted GFF files (one .gff per sample)
It is preferred to use $projectDir
to prefix the defined paths, which is a built-in variable for encoding the path to the main.nf file. Then, you can simply append it with the path relative to main.nf.
For example, if your files are located in a directory named test_files immediately below the level where main.nf is, with separate directories for fasta, gff, and metadata files (called fasta, gff, and meta) inside it, then you would use the following:
submission_only_meta = $projectDir/test_files/meta
submission_only_fasta = $projectDir/test_files/fasta
submission_only_gff = $projectDir/test_files/gff
You do have the option to use either relative paths (from where you are running the pipeline) or absolute paths, but this may introduce issues when running it on certain cloud/HPC environments.
If you are using only_update_submission
entrypoint, you must define the following parameter:
processed_samples
: path to the directory containing outputs from initial submission
The following section walks through the outputs from the pipeline.
The workflow will generate outputs in the following order:
- Validation
- Responsible for QC of metadata
- Aligns sample metadata .xlsx to sample .fasta
- Formats metadata into .tsv format
- Annotation
- Extracts features from .gff
- Aligns features
- Annotates sample genomes outputting .gff
- Submission
- Formats for database submission
- This section runs twice, with the second run occuring after a wait time to allow for all samples to be uploaded to NCBI. Entrypoint
only_update_submission
can be run as many times as necessary until all files are fully uploaded.
The outputs are recorded in the directory specified within the nextflow.config file and will contain the following:
- validation_outputs (**name configurable with val_output_dir)
- name of metadata sample file
- errors
- tsv_per_sample
- name of metadata sample file
- liftoff_outputs (**name configurable with final_liftoff_output_dir)
- name of metadata sample file
- errors
- fasta
- liftoff
- tbl
- name of metadata sample file
- vadr_outputs (**name configurable with vadr_output_dir)
- name of metadata sample file
- errors
- fasta
- gffs
- tbl
- name of metadata sample file
- submission_outputs (**name and path configurable with submission_output_dir)
- name of annotation results (Liftoff or VADR, etc.)
- individual_sample_batch_info
- biosample_sra
- genbank
- accessions.csv
- terminal_outputs
- commands_used
- individual_sample_batch_info
- name of annotation results (Liftoff or VADR, etc.)
The pipeline outputs inlcude:
- metadata.tsv files for each sample
- separate fasta files for each sample
- separate gff files for each sample
- separate tbl files containing feature information for each sample
- submission log file
- This output is found in the submission_outputs file in your specified output_directory
- If the file can not be found you can run the only_update_submission entrypoint for the pipeline
Default parameters are given in the nextflow.config file. This table lists the parameters that can be changed to a value, path or true/false. When changing these parameters pay attention to the required inputs and make sure that paths line-up and values are within range. To change a parameter you may change with a flag after the nextflow command or change them within your standard_params.config or standard.yaml file.
- Please note the correct formatting and the default calculation of submission_wait_time at the bottom of the params table.
Param | Description | Input Required |
---|---|---|
--fasta_path | Path to fasta file | Yes (path as string) |
--ref_fasta_path | Reference Sequence file path | Yes (path as string) |
--meta_path | Meta-data file path for samples | Yes (path as string) |
--ref_gff_path | Reference gff file path for annotation | Yes (path as string) |
--env_yml | Path to environment.yml file | Yes (path as string) |
Param | Description | Input Required |
---|---|---|
--scicomp | Flag for whether running on Scicomp or not | Yes (true/false as bool) |
--docker_container | Name of the Docker container | Yes, if running with docker profile (name as string) |
--docker_container_vadr | Name of the Docker container to run VADR annotation | Yes, if running with docker profile (name as string) |
Param | Description | Input Required |
---|---|---|
--run_submission | Toggle for running submission | Yes (true/false as bool) |
--run_liftoff | Toggle for running liftoff annotation | Yes (true/false as bool) |
--run_vadr | Toggle for running vadr annotation | Yes (true/false as bool) |
--cleanup | Toggle for running cleanup subworkflows | Yes (true/false as bool) |
Param | Description | Input Required |
---|---|---|
--clear_nextflow_log | Clears nextflow work log | Yes (true/false as bool) |
--clear_nextflow_dir | Clears nextflow working directory | Yes (true/false as bool) |
--clear_work_dir | Param to clear work directory created during workflow | Yes (true/false as bool) |
--clear_conda_env | Clears conda environment | Yes (true/false as bool) |
--clear_nf_results | Remove results from nextflow outputs | Yes (true/false as bool) |
Param | Description | Input Required |
---|---|---|
--output_dir | File path to submit outputs from pipeline | Yes (path as string) |
--overwrite_output | Toggle to overwriting output files in directory | Yes (true/false as bool) |
Param | Description | Input Required |
---|---|---|
--val_output_dir | File path for outputs specific to validate sub-workflow | Yes (folder name as string) |
--val_date_format_flag | Flag to change date output | Yes (-s, -o, or -v as string) |
--val_keep_pi | Flag to keep personal identifying info, if provided otherwise it will return an error | Yes (true/false as bool) |
Param | Description | Input Required |
---|---|---|
--final_liftoff_output_dir | File path to liftoff specific sub-workflow outputs | Yes (folder name as string) |
--lift_print_version_exit | Print version and exit the program | Yes (true/false as bool) |
--lift_print_help_exit | Print help and exit the program | Yes (true/false as bool) |
--lift_parallel_processes | # of parallel processes to use for liftoff | Yes (integer) |
--lift_delete_temp_files | Deletes the temporary files after finishing transfer | Yes (true/false as string) |
--lift_child_feature_align_threshold | Only if its child features usually exons/CDS align with sequence identity Γ’β°Β₯S | designate a feature mapped |
--lift_unmapped_feature_file_name | Name of unmapped features file name | Yes (path as string) |
--lift_copy_threshold | Minimum sequence identity in exons/CDS for which a gene is considered a copy; must be greater than -s; default is 1.0 | Yes (float) |
--lift_distance_scaling_factor | Distance scaling factor; by default D =2.0 | Yes (float) |
--lift_flank | Amount of flanking sequence to align as a fraction of gene length | Yes (float between [0.0-1.0]) |
--lift_overlap | Maximum fraction of overlap allowed by 2 features | Yes (float between [0.0-1.0]) |
--lift_mismatch | Mismatch penalty in exons when finding best mapping; by default M=2 | Yes (integer) |
--lift_gap_open | Gap open penalty in exons when finding best mapping; by default GO=2 | Yes (integer) |
--lift_gap_extend | Gap extend penalty in exons when finding best mapping; by default GE=1 | Yes (integer) |
--lift_infer_transcripts | Use if annotation file only includes exon/CDS features and does not include transcripts/mRNA | Yes (True/False as string) |
--lift_copies | Look for extra gene copies in the target genome | Yes (True/False as string) |
--lift_minimap_path | Path to minimap if you did not use conda or pip | Yes (N/A or path as string) |
--lift_feature_database_name | Name of the feature database, if none, then will use ref gff path to construct one | Yes (N/A or name as string) |
Param | Description | Input Required |
---|---|---|
--vadr_output_dir | File path to vadr specific sub-workflow outputs | Yes (folder name as string) |
--vadr_models_dir | File path to models for MPXV used by VADR annotation | Yes (folder name as string) |
Param | Description | Input Required |
---|---|---|
--submission_output_dir | Either name or relative/absolute path for the outputs from submission | Yes (name or path as string) |
--submission_prod_or_test | Whether to submit samples for test or actual production | Yes (prod or test as string) |
--submission_only_meta | Full path directly to the dirs containing validate metadata files | Yes (path as string) |
--submission_only_gff | Full path directly to the directory with reformatted GFFs | Yes (path as string) |
--submission_only_fasta | Full path directly to the directory with split fastas for each sample | Yes (path as string) |
--submission_config | Configuration file for submission to public repos | Yes (path as string) |
--submission_wait_time | Calculated based on sample number (3 * 60 secs * sample_num) | integer (seconds) |
--batch_name | Name of the batch to prefix samples with during submission | Yes (name as string) |
--send_submission_email | Toggle email notification on/off | Yes (true/false as bool) |
--req_col_config | Path to the required_columns.yaml file | Yes (path as string) |
--processed_samples | Path to the directory containing processed samples for update only submission entrypoint (containing <batch_name>.<sample_name> dirs) | Yes (path as string) |
** Important note about send_submission_email
: An email is only triggered if Genbank is being submitted to AND table2asn is the genbank_submission_type. As for the recipient, this must be specified within your submission config file under 'general' as 'notif_email_recipient'
π Anaconda Install: https://docs.anaconda.com/anaconda/install/
π Nextflow Documentation: https://www.nextflow.io/docs/latest/getstarted.html
π SeqSender Documentation: https://github.com/CDCgov/seqsender
π Liftoff Documentation: https://github.com/agshumate/Liftoff
π VADR Documentation: https://github.com/ncbi/vadr.git
π table2asn Documentation: https://github.com/svn2github/NCBI_toolkit/blob/master/src/app/table2asn/table2asn.cpp
If you have any ideas for ways to improve our existing codebase, feel free to open an Issue Request (found here: Open New Issue)
Following the link above, there are four options for issue templates and your selection will depend on (1) if you are a user vs maintainer/collaborator and (2) if the request pertains to a bug vs feature enhancement. Please select the template that accurately reflects your situation.
Once the appropriate template has been selected, you must fill/answer all fields/questions specified. The information provided will be valuable in getting more information about the issue and any necessary context surrounding it.
Once all information has been provided, you may now submit it!
Please allow for some turnaround time for us to review the issue and potentially start addressing it. If this is an urgent request and have not heard from us nor see any progress being made after quite some time (longer than a week), feel free to start a discussion (found here: Start New Discussion) mentioning the following:
- Issue Number
- Date Submitted
- General Background on Bug/Feature
- Reason for Urgency
And we will get back to you as soon as possible.
Michael Desch | Ethan Hetrick | Nick Johnson | Kristen Knipe | Shatavia Morrison
Yuanyuan Wang | Michael Weigand | Dhwani Batra | Jason Caravas | Ankush Gupta
Kyle O'Connell | Yesh Kulasekarapandian | Cole Tindall | Lynsey Kovar | Hunter Seabolt
Crystal Gigante | Christina Hutson | Brent Jenkins | Yu Li | Ana Litvintseva
Matt Mauldin | Dakota Howard | Ben Rambo-Martin | James Heuser | Justin Lee | Mili Sheth