diff --git a/docs/getintouch.md b/docs/getintouch.md deleted file mode 100644 index 283593ad..00000000 --- a/docs/getintouch.md +++ /dev/null @@ -1,21 +0,0 @@ -# Get in Touch - -If you have any ideas for ways to improve our existing codebase, feel free to open an Issue Request (found here: [Open New Issue](https://github.com/CDCgov/tostadas/issues/new/choose)) - -## Steps to Open Issue Request: -### (1) Select Appropriate Template -Following the link above, there are four options for issue templates. Your selection will depend on (1) whether you are a user or a maintainer/collaborator and (2) whether the request pertains to a bug or a feature enhancement. Please select the template that accurately reflects your situation. - -### (2) Fill Out Necessary Information -Once the appropriate template has been selected, you must complete all fields and answer all questions specified. The information provided helps us understand the issue and any necessary context surrounding it. - -### (3) Submit the Issue -Once all information has been provided, you may now submit it! - -Please allow for some turnaround time for us to review the issue and potentially start addressing it. If this is an urgent request and you have not heard from us or seen any progress after more than a week, feel free to start a discussion (found here: Start New Discussion) mentioning the following: - -* Issue Number -* Date Submitted -* General Background on Bug/Feature -* Reason for Urgency -And we will get back to you as soon as possible. \ No newline at end of file diff --git a/docs/index.md b/docs/index.md deleted file mode 100644 index 6677ee39..00000000 --- a/docs/index.md +++ /dev/null @@ -1,42 +0,0 @@ -# Home - -## Overview -TOSTADAS is designed to fulfill common sequence submission use cases. 
The tool runs three sub-processes: - -* Metadata Validation – This workflow checks if metadata conforms to NCBI standards and matches the input .fasta file(s) -* Gene Annotation – This workflow runs gene annotation on fasta-formatted genomes using one of three annotation methods: RepeatMasker and Liftoff, VADR or BAKTA -* Submission – This workflow generates the necessary files and information for submission to NCBI and optionally submits them to NCBI. -TOSTADAS is flexible, allowing you to choose which portions of the pipeline to run and which to skip. For example, you can submit .fastq files and metadata without performing gene annotation. - -The current distribution has been tested with Pox virus sequences as well as some bacteria. Ongoing development aims to make the pipeline pathogen agnostic. - -## Pipeline Summary -### Metadata Validation -The validation workflow checks that user-provided metadata conforms to NCBI standards and matches the input data file(s). To allow for easy multi-sample submission, TOSTADAS can split a multi-sample Excel (.xlsx) file into separate tab-delimited files (.tsv) for each individual sample. - -TOSTADAS can accept custom metadata fields specific to a user's pathogen, sample type, or workflow. Additionally, TOSTADAS offers validation tools for user-created fields, allowing users to specify which samples to apply rules to, replace empty values with user-specified replacements, rename existing fields, and perform other operations. These features can be enabled with the `validate_custom_fields` parameter. Custom fields can be specified using the `custom_fields_file` parameter. - -A full guide to using custom metadata fields can be found here: [Custom Metadata Guide](https://github.com/CDCgov/tostadas/blob/457242fb15973f69cb3578367317a8b5e7c619f7/docs/custom_metadata_guide.md) - -## Gene Annotation -TOSTADAS offers three annotation options: - -### 1. 
RepeatMasker and Liftoff - -* The RepeatMasker and Liftoff workflow annotates fasta-formatted sequences based upon a provided reference and annotation file. This workflow was optimized for variola genome annotation and may require modification for other pathogens. This workflow runs [RepeatMasker](https://www.repeatmasker.org/) to annotate repeat motifs, followed by [Liftoff](https://github.com/agshumate/Liftoff) to annotate functional regions. These results are combined into a single feature file (.gff3). The Liftoff annotation workflow requires a reference genome (.fasta), reference feature .gff, single sample .fasta files, and metadata in Excel .xlsx format. Be sure to specify the correct database in the params for this option. -[RepeatMasker and Liftoff Example] (Link) - -### 2. VADR - -* The VADR workflow annotates fasta-formatted viral genomes using RefSeq annotation from a set of homologous reference models. This workflow requires single sample fasta files, metadata in .xlsx format, and reference information for the pathogen genome. TOSTADAS comes packaged with support for [monkeypox (mpxv) annotation] (https://github.com/CDCgov/tostadas/tree/master/vadr_files/mpxv-models). You can find information on other supported pathogens at the [VADR GitHub Repository] (https://github.com/ncbi/vadr). -[VADR Example] (Link) - -### 3. Bakta - -* The Bakta workflow annotates fasta-formatted bacterial genomes & plasmids using the [Bakta](https://github.com/CDCgov/tostadas/tree/master#gene-annotation) software. This workflow requires single sample .fasta files, metadata in .xlsx format, and optional reference database for annotation (found [here](https://zenodo.org/records/7669534)). -[BAKTA Example] (Link) - -All annotation workflows produce a general feature format file (.gff3) and NCBI feature table (tbl) compatible with NCBI submission requirements. 
- -## Submission -The TOSTADAS Submission workflow generates the necessary files for Genbank submission, a BioSample ID, then optionally uploads Fastq files via FTP to SRA. This workflow was adapted from [SeqSender](https://github.com/CDCgov/seqsender) public database submission pipeline. \ No newline at end of file diff --git a/docs/installation.md b/docs/installation.md deleted file mode 100644 index badd600a..00000000 --- a/docs/installation.md +++ /dev/null @@ -1,46 +0,0 @@ -# Installation - -## Environment Setup -❗ Note: If you are a CDC user, please follow the set-up instructions found on Page X - [CDC User Guide](https://github.com/CDCgov/tostadas/wiki/) - -### (1) Clone the repository to your local machine: -* `git clone https://github.com/CDCgov/tostadas.git` -❗ Note: If you have mamba or nextflow installed in your local environment, you may skip steps 2, 3 (mamba installation) and 6 (nextflow installation) accordingly. - -### (2) Install Mamba: -* `curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh` -* `bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge` -### (3) Add mamba to PATH: -* `export PATH="$HOME/mambaforge/bin:$PATH"` -### (4) Create the conda environment: -If you want to create the full-conda environment needed to run the pipeline outside of Nextflow (this enables you to run individual python scripts), then proceed with step 4a. - -If you want to run the pipeline using nextflow only (this will be most users), proceed with step 4b. Nextflow will handle environment creation and you would only need to install the nextflow package locally vs the entire environment. - - (4a) Create the conda environment and install the dependencies set in your environment.yml: - -* `cd tostadas` -* `mamba env create -n tostadas -f environment.yml` - (4b) Create an empty conda environment: - -* `conda create --name tostadas` -### (5) Activate the environment. 
-* `conda activate tostadas` -Verify which environment is active by running the following conda command: * `conda env list`. The active environment will be denoted with an asterisk (*). - -### (6) Install Nextflow using Mamba and the Bioconda Channel: -* `mamba install -c bioconda nextflow` -❗ Optionally, you may install nextflow without mamba by following the instructions found in the Nextflow Installation Documentation Page: [Nextflow Install](https://www.nextflow.io/docs/latest/getstarted.html) - -### (7) Ensure Nextflow was installed successfully by running `nextflow -v` -Expected Output: - -* `nextflow version ` -The exact version of Nextflow returned will differ from installation to installation. What matters is that the command executes successfully and returns a version number. - -### (8) Run one of the following nextflow commands to execute the scripts with default parameters and the local run environment: -#### For Virus Reads -* `nextflow run main.nf -profile test, --virus` -#### For Bacterial Reads -* `nextflow run main.nf -profile test, --bacteria` -The outputs of the pipeline will appear in the test_output folder within the project directory. You can specify an output directory in the config file or by supplying a path to the `--output_dir` flag in your `nextflow run` command. 
\ No newline at end of file diff --git a/docs/links.md b/docs/links.md deleted file mode 100644 index 28f6a8f5..00000000 --- a/docs/links.md +++ /dev/null @@ -1,18 +0,0 @@ -# Helpful Links - -## Helpful Links for Resources and Software Integrated with TOSTADAS: -🔗 Anaconda Install: (https://docs.anaconda.com/anaconda/install/) - -🔗 Nextflow Documentation: (https://www.nextflow.io/docs/latest/getstarted.html) - -🔗 SeqSender Documentation: (https://github.com/CDCgov/seqsender) - -🔗 Liftoff Documentation: (https://github.com/agshumate/Liftoff) - -🔗 VADR Documentation: (https://github.com/ncbi/vadr.git) - -🔗 Bakta Documentation: (https://github.com/oschwengers/bakta) - -🔗 table2asn Documentation: (https://github.com/svn2github/NCBI_toolkit/blob/master/src/app/table2asn/table2asn.cpp) - -🔗 RepeatMasker Documentation: (https://www.repeatmasker.org/) \ No newline at end of file diff --git a/docs/outputs.md b/docs/outputs.md deleted file mode 100644 index 67d724d5..00000000 --- a/docs/outputs.md +++ /dev/null @@ -1,63 +0,0 @@ -# Outputs - -The following section walks through the outputs from the pipeline. - -## 6.1 Pipeline Overview: -The workflow will generate outputs in the following order: - -* Validation - * Responsible for QC of metadata - * Aligns sample metadata .xlsx to sample .fasta - * Formats metadata into .tsv format -* Annotation - * Extracts features from .gff - * Aligns features - * Annotates sample genomes, outputting .gff -* Submission - * Formats for database submission - This section runs twice, with the second run occurring after a wait time to allow for all samples to be uploaded to NCBI. Entrypoint `only_update_submission` can be run as many times as necessary until all files are fully uploaded. 
- -## 6.2 Output Directory Formatting: -The outputs are recorded in the directory specified within the nextflow.config file and will contain the following: - -* validation_outputs (name configurable with `val_output_dir`) - * name of metadata sample file - * errors - * fasta - * tsv_per_sample -* liftoff_outputs (name configurable with `final_liftoff_output_dir`) - * name of metadata sample file - * errors - * fasta - * liftoff - * tbl -* vadr_outputs (name configurable with `vadr_output_dir`) - * name of metadata sample file - * errors - * fasta - * gffs - * tbl -* bakta_outputs (name configurable with `bakta_output_dir`) - * name of metadata sample file - * fasta - * gff - * tbl -* submission_outputs (name and path configurable with `submission_output_dir`) - * name of annotation results (Liftoff or VADR, etc.) - * individual_sample_batch_info - * biosample_sra - * genbank - * accessions.csv - * terminal_outputs - * commands_used - -## 6.3 Understanding Pipeline Outputs: -The pipeline outputs include: - -* metadata.tsv files for each sample -* separate fasta files for each sample -* separate gff files for each sample -* separate tbl files containing feature information for each sample -* submission log file - * This output is found in the `submission_outputs` file in your specified `output_directory`. - * If the file can not be found you can run the `only_update_submission` entrypoint for the pipeline. diff --git a/docs/parameters.md b/docs/parameters.md deleted file mode 100644 index f65575ed..00000000 --- a/docs/parameters.md +++ /dev/null @@ -1,136 +0,0 @@ -# Parameters - -Default parameters are given in the nextflow.config file. This table lists the parameters that can be changed to a value, path or true/false. When changing these parameters pay attention to the required inputs and make sure that paths line-up and values are within range. 
To change a parameter you may change with a flag after the nextflow command or change them within your standard_params.config or standard.yaml file. - - * Please note the correct formatting and the default calculation of submission_wait_time at the bottom of the params table. -## 7.1 Input Files - - -|Param |Description |Input Required| -|------------|--------------------------------------------------------|-------------| -|--fasta_path| Path to directory containing single sample fasta files| Yes (path as string)| -|--ref_fasta_path| Reference Sequence file path| Yes (path as string)| -|--meta_path| Meta-data file path for samples| Yes (path as string) -|--ref_gff_path| Reference gff file path for annotation| Yes (path as string)| -|--bakta_db_path| Path to Bakta reference database| Yes (path as string)| -|--env_yml| Path to environment.yml file| Yes (path as string)| - -## 7.2 Run Environment -|Param |Description |Input Required| -|-------|---------------|--------------| -|--scicomp |Flag for whether running on Scicomp or not| Yes (true/false as bool)| -|--docker_container| Name of the Docker container| Yes, if running with docker profile (name as string)| -|--docker_container_vadr| Name of the Docker container to run VADR annotation| Yes, if running with docker profile (name as string)| -|--docker_container_bakta| Name of the Docker container to run Bakta annotation| Yes, if running with docker profile (name as string)| - -## 7.3 General Subworkflow -|Param |Description |Input Required| -|-------|---------------|--------------| -|--run_submission |Toggle for running submission |Yes (true/false as bool)| -|--run_liftoff |Toggle for running liftoff annotation |Yes (true/false as bool)| -|--run_vadr |Toggle for running vadr annotation |Yes (true/false as bool)| -|--run_bakta |Toggle for running Bakta annotation |Yes (true/false as bool)| -|--cleanup |Toggle for running cleanup subworkflows |Yes (true/false as bool)| - -## 7.4 Cleanup Subworkflow - -|Param 
|Description |Input Required| -|-------|---------------|--------------| -|--clear_nextflow_log |Clears nextflow work log |Yes (true/false as bool)| -|--clear_nextflow_dir |Clears nextflow working directory |Yes (true/false as bool)| -|--clear_work_dir |Param to clear work directory created during workflow |Yes (true/false as bool)| -|--clear_conda_env |Clears conda environment |Yes (true/false as bool)| -|--clear_nf_results |Remove results from nextflow outputs |Yes (true/false as bool)| - -## 7.5 General Output - -|Param |Description |Input Required| -|-------|---------------|--------------| -|--output_dir |File path to submit outputs from pipeline |Yes (path as string)| -|--overwrite_output |Toggle to overwriting output files in directory |Yes (true/false as bool)| - -## 7.6 Metadata Validation - -|Param |Description |Input Required| -|-------|---------------|--------------| -|--val_output_dir |File path for outputs specific to validate sub-workflow |Yes (folder name as string)| -|--val_date_format_flag |Flag to change date output |Yes (-s, -o, or -v as string)| -|--val_keep_pi |Flag to keep personal identifying info, if provided otherwise it will return an error |Yes (true/false as bool)| -|--validate_custom_fields |Toggle checks/transformations for custom metadata fields on/off |No (true/false as bool)| -|--custom_fields_file |Path to the JSON file containing custom metadata fields and their information |No (path as string)| - -## 7.7 Liftoff - -|Param |Description |Input Required| -|-------|---------------|--------------| -|--final_liftoff_output_dir |File path to liftoff specific sub-workflow outputs |Yes (folder name as string)| -|--lift_print_version_exit |Print version and exit the program |Yes (true/false as bool)| -|--lift_print_help_exit |Print help and exit the program |Yes (true/false as bool)| -|--lift_parallel_processes |Number of parallel processes to use for liftoff |Yes (integer)| -|--lift_delete_temp_files |Deletes the temporary files after 
finishing transfer |Yes (true/false as string)| -|--lift_child_feature_align_threshold |Only if its child features usually exons/CDS align with sequence identity || -|--lift_unmapped_feature_file_name |Name of unmapped features file name |Yes (path as string)| -|--lift_copy_threshold |Minimum sequence identity in exons/CDS for which a gene is considered a copy; must be greater than -s; default is 1.0 |Yes (float)| -|--lift_distance_scaling_factor |Distance scaling factor; by default D =2.0 |Yes (float)| -|--lift_flank |Amount of flanking sequence to align as a fraction of gene length |Yes (float between [0.0-1.0])| -|--lift_overlap |Maximum fraction of overlap allowed by 2 features |Yes (float between [0.0-1.0])| -|--lift_mismatch |Mismatch penalty in exons when finding best mapping; by default M=2 |Yes (integer)| -|--lift_gap_open |Gap open penalty in exons when finding best mapping; by default GO=2 |Yes (integer)| -|--lift_gap_extend |Gap extend penalty in exons when finding best mapping; by default GE=1 |Yes (integer)| -|--lift_infer_transcripts |Use if annotation file only includes exon/CDS features and does not include transcripts/mRNA |Yes (True/False as string)| -|--lift_copies |Look for extra gene copies in the target genome |Yes (True/False as string)| -|--lift_minimap_path |Path to minimap if you did not use conda or pip |Yes (N/A or path as string)| -|--lift_feature_database_name |Name of the feature database, if none, then will use ref gff path to construct one |Yes (N/A or name as string)| -|--repeat_lib |Path to repeatmasker library |No, default is mpox| - -## 7.8 VADR - -|Param |Description |Input Required| -|-------|---------------|--------------| -|--vadr_output_dir |File path to vadr specific sub-workflow outputs |Yes (folder name as string)| -|--vadr_models_dir |File path to models for MPXV used by VADR annotation |Yes (folder name as string)| - -## 7.9 Bakta - -|Param |Description |Input Required| -|-------|---------------|--------------| 
-|--bakta_db_path |Path to Bakta database if user is supplying database |No (path to database)| -|--download_bakta_db |Option to download Bakta database |Yes (true/false)| -|--bakta_db_type |Bakta database type (light or full) |Yes (string)| -|--bakta_output_dir |File path to bakta specific sub-workflow outputs |Yes (folder name as string)| -|--bakta_min_contig_length |Minimum contig size |Yes (integer)| -|--bakta_threads |Number of threads to use while running annotation |Yes (integer)| -|--bakta_genus |Organism genus name |Yes (N/A or name as string)| -|--bakta_species |Organism species name |Yes (N/A or name as string)| -|--bakta_strain |Organism strain name |Yes (N/A or name as string)| -|--bakta_plasmid |Name of plasmid |Yes (unnamed or name as string)| -|--bakta_locus |Locus prefix |Yes (contig or name as string)| -|--bakta_locus_tag |Locus tag prefix |Yes (autogenerated or name as string)| -|--bakta_translation_table |Translation table |Yes (integer)| -|--bakta_db_path |Path to Bakta reference database |Yes (path as string)| -|--bakta_db_type |Type of database to use for annotation |Yes (full/light as bool)| -|--download_bakta_db |Toggle to download Bakta database |Yes (true/false as bool)| -|--run_bakta |Toggle for running Bakta annotation |Yes (true/false as bool)| - -## 7.10 Sample Submission - -|Param |Description |Input Required| -|-------|---------------|--------------| -|--submission_output_dir |Either name or relative/absolute path for the outputs from submission |Yes (name or path as string)| -|--submission_prod_or_test |Whether to submit samples for test or actual production |Yes (prod or test as string)| -|--submission_config |Configuration file for submission to public repos |Yes (path as string)| -|--submission_wait_time |Calculated based on sample number (3 * 60 secs * sample_num) |integer (seconds)| -|--batch_name |Name of the batch to prefix samples with during submission |Yes (name as string)| -|--send_submission_email |Toggle email 
notification on/off |Yes (true/false as bool)| -|--req_col_config |Path to the required_columns.yaml file |Yes (path as string)| -|--processed_samples |Path to the directory containing processed samples for update only submission entrypoint (containing . dirs) |Yes (path as string)| - -❗ Important note about send_submission_email: An email is only triggered if Genbank is being submitted to AND table2asn is the genbank_submission_type. As for the recipient, this must be specified within your submission config file under 'general' as 'notif_email_recipient'* - -## 7.11 Entrypoint and User Provided Annotation - -|Param |Description |Input Required| -|-------|---------------|--------------| -|--final_split_metas_path |Full path directly to the directory containing validated metadata file(s) (1 per sample) |Yes (path as string)| -|--final_annotated_files_path |Full path directly to the directory containing annotation file(s) (1 per sample) |Yes (path as string)| -|-- final_split_fastas_path |Full path directly to the directory containing fasta(s) (1 per sample) |Yes (path as string)| \ No newline at end of file diff --git a/docs/profile.md b/docs/profile.md deleted file mode 100644 index 9d7f18c0..00000000 --- a/docs/profile.md +++ /dev/null @@ -1,94 +0,0 @@ -# Profile Options & Input Files - -This section walks through the available parameters to customize your workflow. 
- -## Input Files Required: -### (A) This table lists the required files to run metadata validation and annotation: - -|Input files |File type |Description| -|---------------|-----------|-----------| -|Running Annotation and Submission||| -|fasta |.fasta |Single sample fasta sequence file(s)| -|metadata |.xlsx |Multi-sample metadata matching metadata spreadsheets provided in input_files| -|Running Submission only without Annotation (Raw Files)||| -|fasta |.fasta |Single sample fasta sequence file(s)| -|metadata |.xlsx |Multi-sample metadata matching metadata spreadsheets provided in input_files| -| fasta | .fasta | Single sample fasta sequence file(s) | | metadata | .xlsx | Multi-sample metadata matching metadata spreadsheets provided in input_files | | ref_fasta | .fasta | Reference genome to use for the liftoff_submission branch of the pipeline | | ref_gff | .gff | Reference GFF3 file to use for the liftoff_submission branch of the pipeline | - -** Please note that the pipeline expects ONLY pre-split FASTA files, where each FASTA file contains only the sequence(s) associated with its corresponding sample. The name of each FASTA file corresponding to a particular sample must be placed within your metadata sheet under fasta_file_name. - -Here is an example of what this would look like. 
- -### (B) This table lists the required files to run with submission: - -|Input files |File type |Description| -|fasta |.fasta |Single sample fasta sequence file(s) sequences| -|metadata |.xlsx |Multi-sample metadata matching metadata spreadsheets provided in input_files| -|ref_fasta |.fasta |Reference genome to use for the liftoff_submission branch of the pipeline| -|ref_gff |.gff |Reference GFF3 file to use for the liftoff_submission branch of the pipeline| -|submission_config |.yaml |configuration file for submitting to NCBI, sample versions can be found in repo| - -## Customizing Parameters: -Parameters can be customized from the nextflow.config file or from the command line, through the use of flags. - -### Customizing the nextflow.config -The nextflow.config file is where parameters can be adjusted based on preference for running the pipeline. - -Adjust your file inputs within standard_params.config ensuring accurate file paths for the inputs listed above. The params can be changed within the standard_params.config or you can change the standard.yml/standard.json file inside the params directory and pass it in with: `-params-file ` - -❗ DO NOT EDIT the `main.nf` file unless familiar with editing nextflow workflows - -### Customizing Parameters from the Command Line -Parameters can be overridden during runtime by providing various flags to the nextflow command. - -Example: Modifying the path of the output directory - -`nextflow run main.nf -profile test,singularity --virus --output_dir /path/to/output/dir` -Certain parameters such as `-profile` and pathogen type (`--virus`) are required, while others like `--output_dir` can be specified optionally. The complete list of parameters and the types of input that they require can be found in the Parameters page. - -## Understanding Profiles and Environments: -Within the nextflow pipeline the `-profile` parameter is required as an input. The profile option can be specified as `test`. 
If test is not specified, parameters are read from the `nextflow.config` file. The test params should remain the same for testing purposes, but the standard profile can be changed to fit user preferences. The run environment is supplied as the second argument to the `profile` parameter. The options of `docker`, `singularity` or `conda` can be passed in. The conda environment is less stable than docker or singularity, so we recommend you choose docker or singularity when running the pipeline. - -## Running with Annotation and Submission: -By default, the pipeline will run the annotation and submission sub-workflows. You may specify which databases to submit to using the database flags `--genbank` or `--sra`. - -* `nextflow run main.nf -profile --virus --genbank --sra --submission_wait_time 5` -Params are listed here: https://github.com/CDCgov/tostadas/blob/dev/nextflow.config - -## Running Submission Only: -By default, the pipeline will run the annotation and submission sub-workflows. You may override this by using the `--submission` and `--annotation` flags. To run only submission, use the flag `--annotation false` in your nextflow command. - -❗ Note: you can only submit raw files to SRA, not to Genbank. - -* `nextflow run main.nf -profile , -- --annotation false --sra --submission_wait_time 5` -Now that your file paths are set within your standard.yml or standard.json or nextflow.config file, you will want to define whether to run the full pipeline with submission or without submission. This is defined within the standard_params.config file underneath the subworkflow section as `submission = true/false` - -### Q. Can we use standard.yml standard.json? - -Submission Pre-requisites: -The submission component of the pipeline is adapted from the SeqSender public database submission pipeline. 
It has been developed to allow the user to create a config file to select which databases they would like to upload to and allows for any possible metadata fields by using a YAML to pair the database's metadata fields with your personal metadata field columns. The requirements for this portion of the pipeline to run are listed below. - -#### (A) Create Appropriate Accounts as needed for the SeqSender public database submission pipeline integrated into TOSTADAS: - -NCBI: If uploading to NCBI archives such as BioSample/SRA/Genbank, you must complete the following steps: - -* Create a center account: Contact the following e-mail for account creation: sra@ncbi.nlm.nih.gov and provide the following information: - * Suggested center abbreviation (16 char max) - * Center name (full), center URL & mailing address (including country and postcode) - * Phone number (main phone for center or lab) - * Contact person (someone likely to remain at the location for an extended time) - * Contact email (ideally a service account monitored by several people) - * Whether you intend to submit via FTP or command line Aspera (ascp) - * Gain access to an upload directory: Following center account creation, a test area and a production area will be created. Deposit the XML file and related data files into a directory and follow the instructions SRA provides via email to indicate when files are ready to trigger the pipeline. - * GISAID: A GISAID account is required for submission to GISAID, you can register for an account at (https://www.gisaid.org/). Test submissions are first required before a final submission can be made. When your first test submission is complete contact GISAID at (hcov-19@gisaid.org) to receive a personal CID. GISAID support is not yet implemented but it may be added in the future. - -#### (B) Config File Set-up: - -The template for the submission config file can be found in `bin/default_config_files` within the repo. 
This is where you can edit the various parameters you want to include in your submission. Read more at the [SeqSender](https://cdcgov.github.io/seqsender/#id_3-config-file-creation) docs. -You can find more information on how to set up your own submission config and additional information on fields in the following guide: [Submission Config Guide](https://github.com/CDCgov/tostadas/blob/b904111d78262efb82589bdd72b0482f27770f87/docs/submission_config_guide.md). -❗ Pre-requisite to submit to GenBank: Copy the program [table2asn](https://www.ncbi.nlm.nih.gov/genbank/table2asn/) to your tostadas/bin directory by running the following lines of code: - -* `cd ./tostadas/bin/` -* `wget https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn/linux64.table2asn.gz` -* `gunzip linux64.table2asn.gz` -* `mv linux64.table2asn table2asn` \ No newline at end of file diff --git a/docs/quickstart.md b/docs/quickstart.md deleted file mode 100644 index 8615f9f5..00000000 --- a/docs/quickstart.md +++ /dev/null @@ -1,61 +0,0 @@ -# Quick Start - -## Steps -(1) Check that you are in the directory where the TOSTADAS repository was installed by running `pwd` -Expected Output: - -* `/path/to/working/directory/tostadas` -This is the default directory set in nextflow.config for the provided test input files. - -(2) Change the submission_config parameter within `test_params.config` or `nextflow.config` (if running with your own data) to the location of your personal submission config file. Note that we provide both a virus and a bacterial test config, depending on the use case. -❗ You must have your personal submission configuration file set up before running the pipeline with default parameters or using sample submission at all. More information on setting this up can be found here: More Information on Submission - -(3) Pipeline Execution Examples -We describe a few use-cases of the pipeline below. 
For more information on input parameters, refer to the documentation found in the following pages: - -## Profile Options and Input Files -### Parameters -❗ Note: For all use cases, the paths to the required files should be specified in the nextflow.config file or the params.yaml file. - -#### Use Case 1: Running Annotation and Submission -1. Annotate viral assemblies and submit to GenBank and SRA - -Required files: fasta files, metadata file - -* `nextflow run main.nf -profile --virus --genbank --sra --submission_wait_time 5` -##### Breakdown: - -`-profile:` -This parameter is required. Specify the profile and run-time environment (`singularity`, `docker` or `conda`). The conda implementation is less stable; `singularity` or `docker` is recommended. -`--virus:` -The pathogen type is specified as `virus` -`--sra:` -`sra` is specified as the database to submit to -`--submission_wait_time`: -This parameter is optional. Running the pipeline with default parameters will trigger a wait time equal to # of samples * 180 seconds. This default can be overridden by supplying an integer value to the `submission_wait_time` parameter. -`--genbank`: -`genbank` is specified as the database to submit to -❗ Pre-requisite to submit to GenBank: In order to submit to GenBank, the program table2asn must be executable in your local environment. Copy the program table2asn to your tostadas/bin directory by running the following lines of code: - -* `cd ./tostadas/bin/` -* `wget https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn/linux64.table2asn.gz` -* `gunzip linux64.table2asn.gz` -* `mv linux64.table2asn table2asn` - -2. 
Annotate bacterial assemblies and submit to GenBank and SRA - -Required files: fasta files, metadata file - -(A) Download BAKTA Database - -* `nextflow run main.nf -profile --bacteria --genbank --sra --submission_wait_time 5 --download_bakta_db --bakta_db_type ` -(B) Provide path to existing BAKTA Database - -* `nextflow run main.nf -profile --bacteria --genbank --sra --submission_wait_time 5 --bakta_db_path` - -Breakdown: - -`--bacteria`: -The pathogen type is specified as `bacteria` -`--genbank`: -`genbank` is specified as the database to submit to \ No newline at end of file diff --git a/docs/submission.md b/docs/submission.md deleted file mode 100644 index 324a526a..00000000 --- a/docs/submission.md +++ /dev/null @@ -1,38 +0,0 @@ -# Submission Guide - -## Toggling Submission: -Use the `--submission` flag to choose whether the full pipeline runs with or without submission. By default, the pipeline submits to GenBank and SRA. To submit only to SRA, specify `--genbank false --sra`. - -## Submission Pre-requisites: -Link to this guide and remove from previous section - -The submission component of the pipeline uses processes integrated directly from the SeqSender public database submission pipeline. It lets you create a config file that selects which databases to upload to, and it supports arbitrary metadata fields by using a YAML file to pair each database's metadata fields with the columns of your own metadata. The requirements for running this portion of the pipeline are listed below. 
- -### (A) Create Appropriate Accounts as needed for the SeqSender public database submission pipeline integrated into TOSTADAS: - -NCBI: If uploading to NCBI archives such as BioSample/SRA/GenBank, you must complete the following steps: - -* Create a center account: Email sra@ncbi.nlm.nih.gov to request account creation and provide the following information: - * Suggested center abbreviation (16 characters max) - * Center name (full), center URL & mailing address (including country and postcode) - * Phone number (main phone for center or lab) - * Contact person (someone likely to remain at the location for an extended time) - * Contact email (ideally a service account monitored by several people) - * Whether you intend to submit via FTP or command line Aspera (ascp) - * Gain access to an upload directory: Following center account creation, a test area and a production area will be created. Deposit the XML file and related data files into a directory and follow the instructions SRA provides via email to indicate when files are ready to trigger the pipeline. - * GISAID: A GISAID account is required for submission to GISAID; you can register for an account at https://www.gisaid.org/. A test submission is required before a final submission can be made. When your first test submission is complete, contact GISAID at hcov-19@gisaid.org to receive a personal CID. GISAID support is not yet implemented, but it may be added in the future. - -### (B) Config File Set-up: - -The template for the submission config file can be found in `bin/default_config_files` within the repo. This is where you can edit the various parameters you want to include in your submission. Read more at the SeqSender docs. -You can find more information on how to set up your own submission config and additional information on fields in the following guide: Submission Config Guide.
- -❗ Pre-requisite to submit to GenBank: Copy the program `table2asn` to your `tostadas/bin` directory by running the following lines of code: - -* `cd ./tostadas/bin/` -* `wget https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn/linux64.table2asn.gz` -* `gunzip linux64.table2asn.gz` -* `mv linux64.table2asn table2asn` - -## Required Files for Submission -* genbank