diff --git a/docs/getintouch.md b/docs/getintouch.md
index 574a982e..fa685155 100644
--- a/docs/getintouch.md
+++ b/docs/getintouch.md
@@ -14,8 +14,8 @@ Once all information has been provided, you may now submit it!
 Please allow for some turnaround time for us to review the issue and potentially start addressing it. If this is an urgent request and have not heard from us nor see any progress being made after quite some time (longer than a week), feel free to start a discussion (found here: Start New Discussion) mentioning the following:
-Issue Number
-Date Submitted
-General Background on Bug/Feature
-Reason for Urgency
+* Issue Number
+* Date Submitted
+* General Background on Bug/Feature
+* Reason for Urgency
 And we will get back to you as soon as possible.
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
index ca5827df..3def6759 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -3,35 +3,35 @@
 ## Overview
 TOSTADAS is designed to fulfill common sequence submission use cases. The tool runs three sub-processes:
-Metadata Validation – This workflow checks if metadata conforms to NCBI standards and matches the input .fasta file(s)
-Gene Annotation – This workflow runs gene annotation on fasta-formatted genomes using one of three annotation methods: RepeatMasker and Liftoff, VADR or BAKTA
-Submission – This workflow generates the necessary files and information for submission to NCBI and optionally and optionally submit to NCBI.
+* Metadata Validation – This workflow checks if metadata conforms to NCBI standards and matches the input .fasta file(s)
+* Gene Annotation – This workflow runs gene annotation on fasta-formatted genomes using one of three annotation methods: RepeatMasker and Liftoff, VADR or BAKTA
+* Submission – This workflow generates the necessary files and information for submission to NCBI and optionally and optionally submit to NCBI.
 TOSTADAS is flexible, allowing you to choose which portions of the pipeline to run and which to skip. For example, you can submit .fastq files and metadata without performing gene annotation. The current distribution has been tested with Pox virus sequences as well as some bacteria. Ongoing development aims to make the pipeline pathogen agnostic.
 ## Pipeline Summary
-### Metadata Validation
+Metadata Validation
 The validation workflow checks that user provided metadata conforms to NCBI standards and matches the input data file(s). To allow for easy multi-sample submission, TOSTADAS can split a multi-sample Excel (.xlsx) file into separate tab delimited files (.tsv) for each individual sample.
-TOSTADAS can accept custom metadata fields specific to a users' pathogen, sample type, or workflow. Additionally, TOSTADAS offers powerful validation tools for user- created fields, allowing users to specify which samples to apply rules to, replace empty values with user specified replacements, rename existing fields and other operations. These features can be enabled with the validate_custom_fields parameter. Custom fields can be specified using the custom_fields_file parameter.
+TOSTADAS can accept custom metadata fields specific to a users' pathogen, sample type, or workflow. Additionally, TOSTADAS offers powerful validation tools for user- created fields, allowing users to specify which samples to apply rules to, replace empty values with user specified replacements, rename existing fields and other operations. These features can be enabled with the `validate_custom_fields` parameter. Custom fields can be specified using the `custom_fields_file` parameter.
 A full guide to using custom metadata fields can be found here: [Custom Metadata Guide](https://github.com/CDCgov/tostadas/blob/457242fb15973f69cb3578367317a8b5e7c619f7/docs/custom_metadata_guide.md)
-### Gene Annotation
+## Gene Annotation
 TOSTADAS offers three optional annotation options:
-#### RepeatMasker and Liftoff
+### RepeatMasker and Liftoff
-The RepeatMasker and Liftoff workflow annotates fasta-formatted sequences based upon a provided reference and annotation file. This workflow was optimized for variola genome annotation and may require modification for other pathogens. This workflow runs RepeatMasker to annotate repeat motifs, followed by Liftoff to annotate functional regions. These results are combined into a single feature file (.gff3). The Liftoff annotation workflow requires a reference genome (.fasta), reference feature .gff, single sample .fasta files, and metadata in Excel .xlsx format. Be sure to specify the correct database in the params for this option.
-[RepeatMasker and Liftoff Example](Link)
+The RepeatMasker and Liftoff workflow annotates fasta-formatted sequences based upon a provided reference and annotation file. This workflow was optimized for variola genome annotation and may require modification for other pathogens. This workflow runs [RepeatMasker](https://www.repeatmasker.org/) to annotate repeat motifs, followed by [Liftoff](https://github.com/agshumate/Liftoff) to annotate functional regions. These results are combined into a single feature file (.gff3). The Liftoff annotation workflow requires a reference genome (.fasta), reference feature .gff, single sample .fasta files, and metadata in Excel .xlsx format. Be sure to specify the correct database in the params for this option.
+[RepeatMasker and Liftoff Example] (Link)
-#### VADR
+### VADR
 The VADR workflow annotates fasta-formatted viral genomes using RefSeq annotation from a set of homologous reference models.
 This workflow requires single sample fasta files, metadata in .xlsx format, and reference information for the pathogen genome. TOSTADAS comes packaged with support for [monkeypox (mpxv) annotation] (https://github.com/CDCgov/tostadas/tree/master/vadr_files/mpxv-models). You can find information on other supported pathogens at the [VADR GitHub Repository] (https://github.com/ncbi/vadr).
 [VADR Example] (Link)
-#### Bakta
+### Bakta
 The Bakta workflow annotates fasta-formatted bacterial genomes & plasmids using the Bakta software. This workflow requires single sample .fasta files, metadata in .xlsx format, and optional reference database for annotation (found here).
 [BAKTA Example] (Link)
@@ -39,4 +39,4 @@ The Bakta workflow annotates fasta-formatted bacterial genomes & plasmids using
 All annotation workflows produce a general feature format file (.gff3) and NCBI feature table (tbl) compatible with NCBI submission requirements.
 ## Submission
-The TOSTADAS Submission workflow generates the necessary files for Genbank submission, a BioSample ID, then optionally uploads Fastq files via FTP to SRA. This workflow was adapted from SeqSender public database submission pipeline.
+The TOSTADAS Submission workflow generates the necessary files for Genbank submission, a BioSample ID, then optionally uploads Fastq files via FTP to SRA. This workflow was adapted from SeqSender public database submission pipeline.
\ No newline at end of file
diff --git a/docs/installation.md b/docs/installation.md
index 68750093..79f4ee15 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -3,16 +3,16 @@
 ## Environment Setup
 ❗ Note: If you are a CDC user, please follow the set-up instructions found on Page X - CDC User Guide
-(1) Clone the repository to your local machine:
+### (1) Clone the repository to your local machine:
 * `git clone https://github.com/CDCgov/tostadas.git`
 ❗ Note: If you have mamba or nextflow installed in your local environment, you may skip steps 2, 3 (mamba installation) and 6 (nextflow installation) accordingly.
-(2) Install Mamba:
+### (2) Install Mamba:
 * `curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh`
 * `bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge`
-(3) Add mamba to PATH:
+### (3) Add mamba to PATH:
 * `export PATH="$HOME/mambaforge/bin:$PATH"`
-(4) Create the conda environment:
+### (4) Create the conda environment:
 If you want to create the full-conda environment needed to run the pipeline outside of Nextflow (this enables you to run individual python scripts), then proceed with step 4a.
 If you want to run the pipeline using nextflow only (this will be most users), proceed with step 4b. Nextflow will handle environment creation and you would only need to install the nextflow package locally vs the entire environment.
@@ -24,25 +24,25 @@ If you want to run the pipeline using nextflow only (this will be most users), p
 (4b) Create an empty conda environment:
 * `conda create --name tostadas`
-(5) Activate the environment.
+### (5) Activate the environment.
 * `conda activate tostadas`
 Verify which environment is active by running the following conda command:
 * `conda env list` .
 The active environment will be denoted with an asterisk *
-(6) Install Nextflow using Use Mamba and the Bioconda Channel:
+### (6) Install Nextflow using Use Mamba and the Bioconda Channel:
 * `mamba install -c bioconda nextflow`
 ❗ Optionally, you may install nextflow without mamba by following the instructions found in the Nextflow Installation Documentaion Page:
 ## Nextflow Install
-(7) Ensure Nextflow was installed successfully by running nextflow -v
+### (7) Ensure Nextflow was installed successfully by running nextflow -v
 Expected Output:
 * `nextflow version `
 The exact version of Nextflow returned will differ from installation to installation. It is important that the command execute successfully, and a version number is returned.
-(8) Run one of the following nextflow commands to execute the scripts with default parameters and the local run environment:
-### For Virus Reads
+### (8) Run one of the following nextflow commands to execute the scripts with default parameters and the local run environment:
+#### For Virus Reads
 * `nextflow run main.nf -profile test, --virus`
-### For Bacterial Reads
+#### For Bacterial Reads
 * `nextflow run main.nf -profile test, --bacteria`
-The outputs of the pipeline will appear in the test_output folder within the project directory. You can specify an output directory in the config file or by supplying a path to the --output_dir flag in your nextflow run command.
\ No newline at end of file
+The outputs of the pipeline will appear in the test_output folder within the project directory. You can specify an output directory in the config file or by supplying a path to the `--output_dir` flag in your nextflow run command.
\ No newline at end of file