diff --git a/index.html b/index.html
index ecc8adc..f68a85f 100644
--- a/index.html
+++ b/index.html
@@ -136,7 +136,7 @@
About the pipeline
The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers, making installation trivial and results highly reproducible.
Like other workflow languages, it provides useful features such as -resume
to only rerun tasks that haven't already been completed (e.g., allowing editing of inputs/tasks and recovery from crashes without a full re-run).
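+For example, after fixing an input or adjusting a parameter, the previous run can be resumed from its cached results (a minimal illustration; the samplesheet name is a placeholder):
+nextflow run beiko-lab/ARETE \
+  -profile docker \
+  --input_sample_table samplesheet.csv \
+  --poppunk_model bgmm \
+  -resume
+Nextflow reuses the work directory of the earlier run and re-executes only the tasks whose inputs changed.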
The nf-core project provided the overall project template, pre-written software modules where available, and general best-practice recommendations.
-ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow orgnaization provides suitable entry and exit points for users who want to run only a portion of the full pipeline.
+ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow organization provides suitable entry and exit points for users who want to run only a portion of the full pipeline.
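+For example, a user who already has assembled genomes can enter the pipeline at the annotation subworkflow (a sketch; the input requirements for each entry point are described in the usage docs):
+nextflow run beiko-lab/ARETE \
+  --input_sample_table samplesheet.csv \
+  -entry annotation \
+  -profile docker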
Genome subsetting
The user can optionally subdivide their set of genomes into related lineages identified by PopPUNK (See documentation). PopPUNK quickly assigns genomes to 'lineages' based on core and accessory genome identity. If this option is selected, all genomes will still be annotated, but cross-genome comparisons (e.g., pan-genome inference and phylogenomics) will use only a single representative genome from each lineage. The user can run PopPUNK with a spread of different thresholds and decide how to proceed based on the number of lineages produced and their own specific knowledge of the genetic population structure of the taxon being analyzed.
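+A preliminary lineage-typing pass could look like the following (a sketch; the entry point name poppunk is an assumption here, check the usage docs for the exact entry names):
+nextflow run beiko-lab/ARETE \
+  --input_sample_table samplesheet.csv \
+  --poppunk_model dbscan \
+  -entry poppunk \
+  -profile docker
+The number of lineages reported can then inform the choice of threshold, or whether to enable subsetting at all, in the full run.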
Short-read processing and assembly
@@ -386,5 +386,5 @@ Citing ARETE
diff --git a/search/search_index.json b/search/search_index.json
index 65d1d1c..82e7d45 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Check out the full ARETE documentation for more information What is ARETE? ARETE (Antimicrobial Resistance: Emergence, Transmission, and Ecology) is a bioinformatics best-practice analysis pipeline for profiling the genomic repertoire and evolutionary dynamics of microorganisms with a particular focus on pathogens. We use ARETE to identify important genes (e.g., those that confer antimicrobial resistance or contribute to virulence) and mobile genetic elements such as plasmids and genomic islands, and infer important routes by which these are transmitted using evidence from recombination, cosegregation, coevolution, and phylogenetic trees comparisons. ARETE produces a range of useful outputs (see outputs ), including those generated by each tool integrated into the pipeline, as well as summaries across the entire dataset such as phylogenetic profiles. Outputs from ARETE can also be easily fed into packages such as Coeus and MicroReact for further analyses. Although ARETE was primarily developed with pathogens in mind, inference of pan-genomes, mobilomes, and phylogenomic histories can be performed for any set of microbial genomes, with the proviso that reference databases are much more complete for some taxonomic groups than others. In general, the tools in ARETE work best at the species and genus level of relatedness. A key design feature of ARETE is the versatility to find the right blend of software packages and parameter settings that best handle datasets of different sizes, introducing heuristics and swapping out tools as necessary. ARETE has been benchmarked on datasets from fewer than ten to over 10,000 genomes from a diversity of species and genera including Enterococcus faecium , Escherichia coli , Listeria , and Salmonella . Another key feature is enabling the user choice to run specific subsets of the pipeline; a user may already have assembled genomes, or they may not care about, say, recombination detection. There are also cases where it might be necessary to manually review the outputs from a particular step before moving on to the next one; ARETE makes this manual QC easy to do. Table of Contents About the pipeline Quick start A couple of examples Credits Contributing to ARETE Citing ARETE About the pipeline The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible. Like other workflow languages it provides useful features like -resume to only rerun tasks that haven't already been completed (e.g., allowing editing of inputs/tasks and recovery from crashes without a full re-run). The nf-core project provided overall project template, pre-written software modules when available, and general best-practice recommendations. ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow orgnaization provides suitable entry and exit points for users who want to run only a portion of the full pipeline. Genome subsetting The user can optionally subdivide their set of genomes into related lineages identified by PopPUNK ( See documentation ). PopPUNK quickly assignes genomes to 'lineages' based on core and accessory genome identity. 
If this option is selected, all genomes will still be annotated, but cross-genome comparisons (e.g., pan-genome inference and phylogenomics) will use only a single representative genome from each lineage. The user can run PopPUNK with a spread of different thresholds and decide how to proceed based on the number of lineages produced and their own specific knowledge of the genetic population structure of the taxon being analyzed. Short-read processing and assembly Raw Read QC ( FastQC ) Read Trimming ( fastp ) Trimmed Read QC ( FastQC ) Taxonomic Profiling ( kraken2 ) Unicycler ( unicycler ) QUAST QC ( quast ) CheckM QC ( checkm ) Annotation Genome annotation with Bakta ( bakta ) or Prokka ( prokka ) Feature prediction: AMR genes with the Resistance Gene Identifier ( RGI ) Plasmids with MOB-Suite ( mob_suite ) Genomic Islands with IslandPath ( IslandPath ) Phages with PhiSpy ( PhiSpy ) ( optionally ) Integrons with IntegronFinder Specialized databases: CAZY, VFDB, BacMet and ICEberg2 using DIAMOND homology search ( diamond ) Phylogenomics ( optionally ) Genome subsetting with PopPUNK ( See documentation ) Pan-genome inference using PPanGGOLiN ( PPanGGOLiN ) or Panaroo ( panaroo ) Reference and gene tree inference using FastTree ( fasttree ) or IQTree ( iqtree ) ( optionally ) SNP-sites ( SNPsites ) Recombination detection ( optionally ) Recombination detection is performed within lineages identified by PopPUNK ( poppunk ). Note that this application of PopPUNK is different from the subsetting described above. Genome alignment using SKA2 ( ska2 ) Recombination detection using Verticall ( verticall ) and/or Gubbins ( gubbins ) Coevolution ( optionally ) Identification of coordinated gain and loss of features using EvolCCM ( EvolCCM ) Lateral gene transfer ( optionally ) Phylogenetic inference of LGT using rSPR ( rSPR ) Gene order ( optionally ) Comparison of genomic neighbourhoods using the Gene Order Workflow ( Gene Order Workflow ) See our roadmap for a full list of future development targets. Quick Start Install nextflow Install Docker or Singularity . Also ensure you have a working curl installed (should be present on almost all systems). 2.1. Note: this workflow should also support Podman , Shifter or Charliecloud execution for full pipeline reproducibility. Configure mail on your system to send an email on workflow success/failure (without this you may get a small error at the end Failed to invoke workflow.onComplete event handler but this doesn't mean the workflow didn't finish successfully). Download the pipeline and test with a stub-run . The stub-run will ensure that the pipeline is able to download and use containers as well as execute in the proper logic. nextflow run beiko-lab/ARETE -profile test, -stub-run 3.1. Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment. 3.2. If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. 
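As a concrete instance of the stub test in step 3 (assuming Docker is the chosen container engine): nextflow run beiko-lab/ARETE -profile test,docker -stub-run 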
In case of input datasets larger than 100 samples, check our resource profiles documentation for optimal usage. Start running your own analysis (ideally using -profile docker or -profile singularity for stability)! nextflow run beiko-lab/ARETE \\ -profile \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm samplesheet.csv must be formatted sample,fastq_1,fastq_2 , with the first column being sample names and the other two corresponding to compressed FASTQ files. Note : If you get this error at the end Failed to invoke `workflow.onComplete` event handler it isn't a problem, it just means you don't have sendmail configured and it can't send an email report saying it finished correctly, i.e., it's not that the workflow failed. See usage docs for all of the available options when running the pipeline. See the parameter docs for a list of all parameters currently implemented in the pipeline and which ones are required. See the FAQ for a list of frequently asked questions and common issues. Testing To test the workflow on a minimal dataset you can use the test configuration (with either docker or singularity - replace docker below as appropriate): nextflow run beiko-lab/ARETE -profile test,docker To accelerate it you can download/cache the database files to a folder (e.g., test/db_cache ) and provide a database cache parameter. nextflow run beiko-lab/ARETE \\ -profile test,docker \\ --db_cache $PWD/test/db_cache \\ --bakta_db $PWD/baktadb/db-light We also provide a larger test dataset, under -profile test_full , for use in ARETE's annotation entry. This dataset comprises 8 bacterial genomes. As a note, this can take upwards of 20 minutes to complete on an average personal computer . Replace docker below as appropriate. nextflow run beiko-lab/ARETE -entry annotation -profile test_full,docker Examples The fine details of how to run ARETE are described in the command reference and documentation, but here are a couple of illustrative examples of how runs can be adjusted to accommodate genome sets of different sizes: Assembly, annotation, and pan-genome inference from a modestly sized dataset (50 or so genomes) from paired-end reads nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,vfdb,bacmet,islandpath,phispy,report' \\ --poppunk_model bgmm \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --annotation_tools - Select the annotation tools and modules to be executed (See the parameter documentation for defaults) --poppunk_model - Model to be used by PopPUNK -profile docker - Run tools in docker containers. Annotation to evolutionary dynamics on 300-ish genomes nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ --run_gubbins \\ -entry annotation \\ -profile medium,docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --run_gubbins - Run Gubbins as part of the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile medium,docker - Run tools in docker containers. For -profile medium , check our resource requirements documentation . 
Annotation to evolutionary dynamics on 10,000 genomes nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ -entry annotation \\ -profile large,docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. --enable_subsetting - Enable subsetting workflow based on genome similarity (See subsetting documentation ) -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile large,docker - Run tools in docker containers. For -profile large , check our resource requirements documentation . Annotation on a tiny dataset (4-12 genomes) on a personal computer While ARETE is primarily designed to run in HPC clusters, we have implemented a simple, bare-bones version that is able to run on most modern computers and laptops, with at most 6 CPU cores and a minimum of 8GB of memory. Keep in mind this will make it impossible to run most tools included in ARETE, but it should still provide a useful testing ground. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm \\ -entry annotation \\ -profile light,docker Note the addition of the light profile; this is the configuration for running on personal computers. Check out how to assign resource requests for even more customization. Run all ARETE subworkflows on a small dataset The command below will run all tools included in the annotation subworkflow and will enable the recombination, gene order, rSPR and evolCCM subworkflows. Be aware that the performance of the evolCCM and Gene Order subworkflows with large or very diverse datasets can be subpar. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,integronfinder,report' \\ --run_recombination \\ --run_evolccm \\ --run_rspr \\ --run_gene_order \\ --poppunk_model dbscan \\ -profile docker Credits The ARETE software was originally written and developed by Finlay Maguire and Alex Manuele , and is currently developed by Jo\u00e3o Cavalcante . Rob Beiko is the PI of the ARETE project. The project Co-PI is Fiona Brinkman. Other project leads include Andrew MacArthur, Cedric Chauve, Chris Whidden, Gary van Domselaar, John Nash, Rahat Zaheer, and Tim McAllister. Many students, postdocs, developers, and staff scientists have made invaluable contributions to the design and application of ARETE and its components, including Haley Sanderson, Kristen Gray, Julia Lewandowski, Chaoyue Liu, Kartik Kakadiya, Bryan Alcock, Amos Raphenya, Amjad Khan, Ryan Fink, Aniket Mane, Chandana Navanekere Rudrappa, Kyrylo Bessonov, James Robertson, Jee In Kim, and Nolan Woods. ARETE development has been supported by many sources, including Genome Canada, ResearchNS, Genome Atlantic, Genome British Columbia, The Canadian Institutes for Health Research, The Natural Sciences and Engineering Research Council of Canada, and Dalhousie University's Faculty of Computer Science. We have received tremendous support from federal agencies, most notably the Public Health Agency of Canada and Agriculture and Agri-Food Canada. Contributing to ARETE If you would like to contribute to ARETE, please see the contributing guidelines . 
Citing ARETE Please cite the tools used in your ARETE run: A comprehensive list can be found in the CITATIONS.md file. An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, and reused here under the MIT license . The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.","title":"Home"},{"location":"#what-is-arete","text":"ARETE (Antimicrobial Resistance: Emergence, Transmission, and Ecology) is a bioinformatics best-practice analysis pipeline for profiling the genomic repertoire and evolutionary dynamics of microorganisms with a particular focus on pathogens. We use ARETE to identify important genes (e.g., those that confer antimicrobial resistance or contribute to virulence) and mobile genetic elements such as plasmids and genomic islands, and infer important routes by which these are transmitted using evidence from recombination, cosegregation, coevolution, and phylogenetic tree comparisons. ARETE produces a range of useful outputs (see outputs ), including those generated by each tool integrated into the pipeline, as well as summaries across the entire dataset such as phylogenetic profiles. Outputs from ARETE can also be easily fed into packages such as Coeus and MicroReact for further analyses. Although ARETE was primarily developed with pathogens in mind, inference of pan-genomes, mobilomes, and phylogenomic histories can be performed for any set of microbial genomes, with the proviso that reference databases are much more complete for some taxonomic groups than others. In general, the tools in ARETE work best at the species and genus level of relatedness. A key design feature of ARETE is the versatility to find the right blend of software packages and parameter settings that best handle datasets of different sizes, introducing heuristics and swapping out tools as necessary. ARETE has been benchmarked on datasets from fewer than ten to over 10,000 genomes from a diversity of species and genera including Enterococcus faecium , Escherichia coli , Listeria , and Salmonella . Another key feature is enabling the user to run specific subsets of the pipeline; a user may already have assembled genomes, or they may not care about, say, recombination detection. 
There are also cases where it might be necessary to manually review the outputs from a particular step before moving on to the next one; ARETE makes this manual QC easy to do.","title":"What is ARETE?"},{"location":"#table-of-contents","text":"About the pipeline Quick start A couple of examples Credits Contributing to ARETE Citing ARETE","title":"Table of Contents"},{"location":"#about-the-pipeline","text":"The pipeline is built using Nextflow , a workflow tool that runs tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers, making installation trivial and results highly reproducible. Like other workflow languages, it provides useful features such as -resume to only rerun tasks that haven't already been completed (e.g., allowing editing of inputs/tasks and recovery from crashes without a full re-run). The nf-core project provided the overall project template, pre-written software modules where available, and general best-practice recommendations. ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow organization provides suitable entry and exit points for users who want to run only a portion of the full pipeline.","title":"About the pipeline"},{"location":"#genome-subsetting","text":"The user can optionally subdivide their set of genomes into related lineages identified by PopPUNK ( See documentation ). PopPUNK quickly assigns genomes to 'lineages' based on core and accessory genome identity. If this option is selected, all genomes will still be annotated, but cross-genome comparisons (e.g., pan-genome inference and phylogenomics) will use only a single representative genome from each lineage. The user can run PopPUNK with a spread of different thresholds and decide how to proceed based on the number of lineages produced and their own specific knowledge of the genetic population structure of the taxon being analyzed.","title":"Genome subsetting"},{"location":"#short-read-processing-and-assembly","text":"Raw Read QC ( FastQC ) Read Trimming ( fastp ) Trimmed Read QC ( FastQC ) Taxonomic Profiling ( kraken2 ) Unicycler ( unicycler ) QUAST QC ( quast ) CheckM QC ( checkm )","title":"Short-read processing and assembly"},{"location":"#annotation","text":"Genome annotation with Bakta ( bakta ) or Prokka ( prokka ) Feature prediction: AMR genes with the Resistance Gene Identifier ( RGI ) Plasmids with MOB-Suite ( mob_suite ) Genomic Islands with IslandPath ( IslandPath ) Phages with PhiSpy ( PhiSpy ) ( optionally ) Integrons with IntegronFinder Specialized databases: CAZY, VFDB, BacMet and ICEberg2 using DIAMOND homology search ( diamond )","title":"Annotation"},{"location":"#phylogenomics","text":"( optionally ) Genome subsetting with PopPUNK ( See documentation ) Pan-genome inference using PPanGGOLiN ( PPanGGOLiN ) or Panaroo ( panaroo ) Reference and gene tree inference using FastTree ( fasttree ) or IQTree ( iqtree ) ( optionally ) SNP-sites ( SNPsites )","title":"Phylogenomics"},{"location":"#recombination-detection-optionally","text":"Recombination detection is performed within lineages identified by PopPUNK ( poppunk ). Note that this application of PopPUNK is different from the subsetting described above. 
Genome alignment using SKA2 ( ska2 ) Recombination detection using Verticall ( verticall ) and/or Gubbins ( gubbins )","title":"Recombination detection (optionally)"},{"location":"#coevolution","text":"( optionally ) Identification of coordinated gain and loss of features using EvolCCM ( EvolCCM )","title":"Coevolution"},{"location":"#lateral-gene-transfer","text":"( optionally ) Phylogenetic inference of LGT using rSPR ( rSPR )","title":"Lateral gene transfer"},{"location":"#gene-order","text":"( optionally ) Comparison of genomic neighbourhoods using the Gene Order Workflow ( Gene Order Workflow ) See our roadmap for a full list of future development targets.","title":"Gene order"},{"location":"#quick-start","text":"Install nextflow Install Docker or Singularity . Also ensure you have a working curl installed (should be present on almost all systems). 2.1. Note: this workflow should also support Podman , Shifter or Charliecloud execution for full pipeline reproducibility. Configure mail on your system to send an email on workflow success/failure (without this you may get a small error at the end Failed to invoke workflow.onComplete event handler but this doesn't mean the workflow didn't finish successfully). Download the pipeline and test with a stub-run . The stub-run will ensure that the pipeline is able to download and use containers as well as execute in the proper logic. nextflow run beiko-lab/ARETE -profile test, -stub-run 3.1. Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment. 3.2. If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. In case of input datasets larger than 100 samples, check our resource profiles documentation for optimal usage. Start running your own analysis (ideally using -profile docker or -profile singularity for stability)! nextflow run beiko-lab/ARETE \\ -profile \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm samplesheet.csv must be formatted sample,fastq_1,fastq_2 , with the first column being sample names and the other two corresponding to compressed FASTQ files. Note : If you get this error at the end Failed to invoke `workflow.onComplete` event handler it isn't a problem, it just means you don't have sendmail configured and it can't send an email report saying it finished correctly, i.e., it's not that the workflow failed. See usage docs for all of the available options when running the pipeline. See the parameter docs for a list of all parameters currently implemented in the pipeline and which ones are required. 
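To illustrate the sample,fastq_1,fastq_2 format described above, a minimal two-sample samplesheet.csv might look like this (sample names and file paths are hypothetical):
sample,fastq_1,fastq_2
isolate_01,/data/reads/isolate_01_R1.fastq.gz,/data/reads/isolate_01_R2.fastq.gz
isolate_02,/data/reads/isolate_02_R1.fastq.gz,/data/reads/isolate_02_R2.fastq.gz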
See the FAQ for a list of frequently asked questions and common issues.","title":"Quick Start"},{"location":"#testing","text":"To test the workflow on a minimal dataset you can use the test configuration (with either docker or singularity - replace docker below as appropriate): nextflow run beiko-lab/ARETE -profile test,docker To accelerate it you can download/cache the database files to a folder (e.g., test/db_cache ) and provide a database cache parameter. nextflow run beiko-lab/ARETE \\ -profile test,docker \\ --db_cache $PWD/test/db_cache \\ --bakta_db $PWD/baktadb/db-light We also provide a larger test dataset, under -profile test_full , for use in ARETE's annotation entry. This dataset comprises 8 bacterial genomes. As a note, this can take upwards of 20 minutes to complete on an average personal computer . Replace docker below as appropriate. nextflow run beiko-lab/ARETE -entry annotation -profile test_full,docker","title":"Testing"},{"location":"#examples","text":"The fine details of how to run ARETE are described in the command reference and documentation, but here are a couple of illustrative examples of how runs can be adjusted to accommodate genome sets of different sizes:","title":"Examples"},{"location":"#assembly-annotation-and-pan-genome-inference-from-a-modestly-sized-dataset-50-or-so-genomes-from-paired-end-reads","text":"nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,vfdb,bacmet,islandpath,phispy,report' \\ --poppunk_model bgmm \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --annotation_tools - Select the annotation tools and modules to be executed (See the parameter documentation for defaults) --poppunk_model - Model to be used by PopPUNK -profile docker - Run tools in docker containers.","title":"Assembly, annotation, and pan-genome inference from a modestly sized dataset (50 or so genomes) from paired-end reads"},{"location":"#annotation-to-evolutionary-dynamics-on-300-ish-genomes","text":"nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ --run_gubbins \\ -entry annotation \\ -profile medium,docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --run_gubbins - Run Gubbins as part of the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile medium,docker - Run tools in docker containers. For -profile medium , check our resource requirements documentation .","title":"Annotation to evolutionary dynamics on 300-ish genomes"},{"location":"#annotation-to-evolutionary-dynamics-on-10000-genomes","text":"nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ -entry annotation \\ -profile large,docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. 
--enable_subsetting - Enable subsetting workflow based on genome similarity (See subsetting documentation ) -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile large,docker - Run tools in docker containers. For -profile large , check our resource requirements documentation .","title":"Annotation to evolutionary dynamics on 10,000 genomes"},{"location":"#annotation-on-a-tiny-dataset-4-12-genomes-in-a-personal-computer","text":"While ARETE is primarily designed to run in HPC clusters, we have implemented a simple, bare-bones version that is able to run on most modern computers and laptops, with at most 6 CPU cores and a minimum of 8GB of memory. Keep in mind this will make it impossible to run most tools included in ARETE, but it should still provide a useful testing ground. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm \\ -entry annotation \\ -profile light,docker Note the addition of the light profile; this is the configuration for running on personal computers. Check out how to assign resource requests for even more customization.","title":"Annotation on a tiny dataset (4-12 genomes) on a personal computer"},{"location":"#run-all-arete-subworkflows-in-a-small-dataset","text":"The command below will run all tools included in the annotation subworkflow and will enable the recombination, gene order, rSPR and evolCCM subworkflows. Be aware that the performance of the evolCCM and Gene Order subworkflows with large or very diverse datasets can be subpar. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,integronfinder,report' \\ --run_recombination \\ --run_evolccm \\ --run_rspr \\ --run_gene_order \\ --poppunk_model dbscan \\ -profile docker","title":"Run all ARETE subworkflows on a small dataset"},{"location":"#credits","text":"The ARETE software was originally written and developed by Finlay Maguire and Alex Manuele , and is currently developed by Jo\u00e3o Cavalcante . Rob Beiko is the PI of the ARETE project. The project Co-PI is Fiona Brinkman. Other project leads include Andrew MacArthur, Cedric Chauve, Chris Whidden, Gary van Domselaar, John Nash, Rahat Zaheer, and Tim McAllister. Many students, postdocs, developers, and staff scientists have made invaluable contributions to the design and application of ARETE and its components, including Haley Sanderson, Kristen Gray, Julia Lewandowski, Chaoyue Liu, Kartik Kakadiya, Bryan Alcock, Amos Raphenya, Amjad Khan, Ryan Fink, Aniket Mane, Chandana Navanekere Rudrappa, Kyrylo Bessonov, James Robertson, Jee In Kim, and Nolan Woods. ARETE development has been supported by many sources, including Genome Canada, ResearchNS, Genome Atlantic, Genome British Columbia, The Canadian Institutes for Health Research, The Natural Sciences and Engineering Research Council of Canada, and Dalhousie University's Faculty of Computer Science. We have received tremendous support from federal agencies, most notably the Public Health Agency of Canada and Agriculture and Agri-Food Canada.","title":"Credits"},{"location":"#contributing-to-arete","text":"If you would like to contribute to ARETE, please see the contributing guidelines .","title":"Contributing to ARETE"},{"location":"#citing-arete","text":"Please cite the tools used in your ARETE run: A comprehensive list can be found in the CITATIONS.md file. 
An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, and reused here under the MIT license . The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.","title":"Citing ARETE"},{"location":"CITATIONS/","text":"beiko-lab/ARETE: Citations An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. nf-core Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031. Nextflow Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311. Pipeline tools CheckM Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes . Genome Research, 25: 1043\u20131055. DIAMOND Buchfink B, Xie C, Huson DH Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 12, 59\u201360 (2015) FastQC FastP Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281. FastTree Morgan N. Price, Paramvir S. Dehal, Adam P. Arkin, FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix, Molecular Biology and Evolution, Volume 26, Issue 7, July 2009, Pages 1641\u20131650, https://doi.org/10.1093/molbev/msp077 IQ-TREE2 Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020 May 1;37(5):1530-1534. doi: 10.1093/molbev/msaa015. Erratum in: Mol Biol Evol. 2020 Aug 1;37(8):2461. PMID: 32011700; PMCID: PMC7182206. Kraken2 Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. doi: 10.1186/s13059-019-1891-0. MOB-SUITE Robertson, James, and John H E Nash. 
\u201cMOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.\u201d Microbial genomics vol. 4,8 (2018): e000206. doi:10.1099/mgen.0.000206 Robertson, James et al. \u201cUniversal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance.\u201d Microbial genomics vol. 6,10 (2020): mgen000435. doi:10.1099/mgen.0.000435 MultiQC Ewels P, Magnusson M, Lundin S, K\u00e4ller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. Bakta Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685 Prokka Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. doi: 10.1093/bioinformatics/btu153. Epub 2014 Mar 18. PMID: 24642063. QUAST Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806. RGI Alcock et al. 2020. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, Volume 48, Issue D1, Pages D517-525 [PMID 31665441] IntegronFinder N\u00e9ron, Bertrand, Eloi Littner, Matthieu Haudiquet, Amandine Perrin, Jean Cury, and Eduardo P.C. Rocha. 2022. IntegronFinder 2.0: Identification and Analysis of Integrons across Bacteria, with a Focus on Antibiotic Resistance in Klebsiella Microorganisms 10, no. 4: 700. https://doi.org/10.3390/microorganisms10040700 Panaroo Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4 PPanGGoLiN Gautreau G et al. (2020) PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16(3): e1007732. https://doi.org/10.1371/journal.pcbi.1007732 PopPUNK Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24. PMID: 30679308; PMCID: PMC6360808. SKA2 Harris SR. 2018. SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 453142 doi: https://doi.org/10.1101/453142 Gubbins Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. \"Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins\". doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014. Verticall SNP-sites Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016 Apr 29;2(4):e000056. doi: 10.1099/mgen.0.000056. PMID: 28348851; PMCID: PMC5320690. Unicycler Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017 Jun 8;13(6):e1005595. doi: 10.1371/journal.pcbi.1005595. 
PMID: 28594827; PMCID: PMC5481147. IslandPath Claire Bertelli, Fiona S L Brinkman, Improved genomic island predictions with IslandPath-DIMOB, Bioinformatics, Volume 34, Issue 13, 01 July 2018, Pages 2161\u20132167, https://doi.org/10.1093/bioinformatics/bty095 PhiSpy Sajia Akhter, Ramy K. Aziz, Robert A. Edwards; PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucl Acids Res 2012; 40 (16): e126. doi: 10.1093/nar/gks406 EvolCCM Chaoyue Liu and others, The Community Coevolution Model with Application to the Study of Evolutionary Relationships between Genes Based on Phylogenetic Profiles, Systematic Biology, Volume 72, Issue 3, May 2023, Pages 559\u2013574, https://doi.org/10.1093/sysbio/syac052 rSPR Christopher Whidden, Norbert Zeh, Robert G. Beiko, Supertrees Based on the Subtree Prune-and-Regraft Distance, Systematic Biology, Volume 63, Issue 4, July 2014, Pages 566\u2013581, https://doi.org/10.1093/sysbio/syu023 Software packaging/containerisation tools BioContainers da Veiga Leprevost F, Gr\u00fcning B, Aflitos SA, R\u00f6st HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. Docker Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 239, Article 2 (March 2014). Singularity Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.","title":"Citations"},{"location":"CITATIONS/#beiko-labarete-citations","text":"An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038.","title":"beiko-lab/ARETE: Citations"},{"location":"CITATIONS/#nf-core","text":"Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.","title":"nf-core"},{"location":"CITATIONS/#nextflow","text":"Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.","title":"Nextflow"},{"location":"CITATIONS/#pipeline-tools","text":"CheckM Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes . Genome Research, 25: 1043\u20131055. 
DIAMOND Buchfink B, Xie C, Huson DH Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 12, 59\u201360 (2015) FastQC FastP Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281. FastTree Morgan N. Price, Paramvir S. Dehal, Adam P. Arkin, FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix, Molecular Biology and Evolution, Volume 26, Issue 7, July 2009, Pages 1641\u20131650, https://doi.org/10.1093/molbev/msp077 IQ-TREE2 Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020 May 1;37(5):1530-1534. doi: 10.1093/molbev/msaa015. Erratum in: Mol Biol Evol. 2020 Aug 1;37(8):2461. PMID: 32011700; PMCID: PMC7182206. Kraken2 Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. doi: 10.1186/s13059-019-1891-0. MOB-SUITE Robertson, James, and John H E Nash. \u201cMOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.\u201d Microbial genomics vol. 4,8 (2018): e000206. doi:10.1099/mgen.0.000206 Robertson, James et al. \u201cUniversal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance.\u201d Microbial genomics vol. 6,10 (2020): mgen000435. doi:10.1099/mgen.0.000435 MultiQC Ewels P, Magnusson M, Lundin S, K\u00e4ller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. Bakta Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685 Prokka Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. doi: 10.1093/bioinformatics/btu153. Epub 2014 Mar 18. PMID: 24642063. QUAST Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806. RGI Alcock et al. 2020. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, Volume 48, Issue D1, Pages D517-525 [PMID 31665441] IntegronFinder N\u00e9ron, Bertrand, Eloi Littner, Matthieu Haudiquet, Amandine Perrin, Jean Cury, and Eduardo P.C. Rocha. 2022. IntegronFinder 2.0: Identification and Analysis of Integrons across Bacteria, with a Focus on Antibiotic Resistance in Klebsiella Microorganisms 10, no. 4: 700. https://doi.org/10.3390/microorganisms10040700 Panaroo Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4 PPanGGoLiN Gautreau G et al. (2020) PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16(3): e1007732. 
https://doi.org/10.1371/journal.pcbi.1007732 PopPUNK Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24. PMID: 30679308; PMCID: PMC6360808. SKA2 Harris SR. 2018. SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 453142 doi: https://doi.org/10.1101/453142 Gubbins Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. \"Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins\". doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014. Verticall SNP-sites Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016 Apr 29;2(4):e000056. doi: 10.1099/mgen.0.000056. PMID: 28348851; PMCID: PMC5320690. Unicycler Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017 Jun 8;13(6):e1005595. doi: 10.1371/journal.pcbi.1005595. PMID: 28594827; PMCID: PMC5481147. IslandPath Claire Bertelli, Fiona S L Brinkman, Improved genomic island predictions with IslandPath-DIMOB, Bioinformatics, Volume 34, Issue 13, 01 July 2018, Pages 2161\u20132167, https://doi.org/10.1093/bioinformatics/bty095 PhiSpy Sajia Akhter, Ramy K. Aziz, Robert A. Edwards; PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucl Acids Res 2012; 40 (16): e126. doi: 10.1093/nar/gks406 EvolCCM Chaoyue Liu and others, The Community Coevolution Model with Application to the Study of Evolutionary Relationships between Genes Based on Phylogenetic Profiles, Systematic Biology, Volume 72, Issue 3, May 2023, Pages 559\u2013574, https://doi.org/10.1093/sysbio/syac052 rSPR Christopher Whidden, Norbert Zeh, Robert G. Beiko, Supertrees Based on the Subtree Prune-and-Regraft Distance, Systematic Biology, Volume 63, Issue 4, July 2014, Pages 566\u2013581, https://doi.org/10.1093/sysbio/syu023","title":"Pipeline tools"},{"location":"CITATIONS/#software-packagingcontainerisation-tools","text":"BioContainers da Veiga Leprevost F, Gr\u00fcning B, Aflitos SA, R\u00f6st HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. Docker Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 239, Article 2 (March 2014). Singularity Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. 
PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.","title":"Software packaging/containerisation tools"},{"location":"ROADMAP/","text":"A list in no particular order of outstanding development features, both in-progress and planned: Integration of additional tools and scripts: Partner applications for analysis and visualization of phylogenetic distributions of genes and MGEs and gene-order clustering (For example, Coeus ).","title":"Roadmap"},{"location":"contributing/","text":"beiko-lab/ARETE: Contributing Guidelines Hey! Thank you for taking an interest in contributing to ARETE. We use GitHub for managing issues, contribution requests and everything else. So feel free to communicate with us using new issues and discussions, whatever best fits your idea for your contribution. Contribution Workflow The standard workflow for contributing to ARETE is as follows: Check first if there isn't already an issue for your feature request, bug, etc. If there isn't one, you should create a new issue or discussion for your planned contribution before starting work on it . Fork the beiko-lab/ARETE repository to your GitHub account Make the necessary changes or additions within your forked repository. You should probably create a new branch for your contribution, instead of committing directly to the master branch of your repository . In case any parameters were added or changed, use nf-core schema build to add them to the pipeline JSON schema and nf-core schema docs --output docs/params.md --force to update the respective documentation (requires nf-core tools >= 1.10). Optionally run the pipeline's unit tests locally using nf-test : nf-test test tests/subworkflows/local/* . Submit a Pull Request to our master branch and wait for your changes to be reviewed and merged by our maintainers. Our GitHub Actions workflows should perform a few pipeline tests automatically after receiving your pull request. Any errors they report should be examined, since they could point to underlying issues in your changes.","title":"Contributing"},{"location":"contributing/#beiko-labarete-contributing-guidelines","text":"Hey! Thank you for taking an interest in contributing to ARETE. We use GitHub for managing issues, contribution requests and everything else. So feel free to communicate with us using new issues and discussions, whatever best fits your idea for your contribution.","title":"beiko-lab/ARETE: Contributing Guidelines"},{"location":"contributing/#contribution-workflow","text":"The standard workflow for contributing to ARETE is as follows: Check first if there isn't already an issue for your feature request, bug, etc. If there isn't one, you should create a new issue or discussion for your planned contribution before starting work on it . Fork the beiko-lab/ARETE repository to your GitHub account Make the necessary changes or additions within your forked repository. You should probably create a new branch for your contribution, instead of committing directly to the master branch of your repository . In case any parameters were added or changed, use nf-core schema build to add them to the pipeline JSON schema and nf-core schema docs --output docs/params.md --force to update the respective documentation (requires nf-core tools >= 1.10). Optionally run the pipeline's unit tests locally using nf-test : nf-test test tests/subworkflows/local/* . Submit a Pull Request to our master branch and wait for your changes to be reviewed and merged by our maintainers. 
Our GitHub Actions workflows should perform a few pipeline tests automatically after receiving your pull request. Any errors they report should be examined, since they could point to underlying issues in your changes.","title":"Contribution Workflow"},{"location":"faq/","text":"Frequently Asked Questions How do I run ARETE in a Slurm HPC environment? Set a config file under ~/.nextflow/config to use the slurm executor: process { executor = 'slurm' pollInterval = '60 sec' submitRateLimit = '60/1min' queueSize = 100 // If an account is necessary: clusterOptions = '--account=' } See the Nextflow documentation for a description of these options. Now, when running ARETE, you'll need to set additional options if your compute nodes don't have network access - as is common for most Slurm clusters. The example below uses the default test data, i.e. the test profile, for demonstration purposes only. nextflow run beiko-lab/ARETE \\ --db_cache path/to/db_cache \\ --bakta_db path/to/baktadb \\ -profile test,singularity Apart from -profile singularity , which just makes ARETE use Singularity/Apptainer containers for running the tools, there are two additional parameters: --db_cache should be the location for the pre-downloaded databases used in the DIAMOND alignments (i.e. Bacmet, VFDB, ICEberg2 and CAZy FASTA files) and in the Kraken2 taxonomic read classification. Although these tools run by default, you can change the selection of annotation tools by changing --annotation_tools and skip Kraken2 by adding --skip_kraken . See the parameter documentation for a full list of parameters and their defaults. --bakta_db should be the location of the pre-downloaded Bakta database Alternatively, you can use Prokka for annotating your assemblies, since it doesn't require a downloaded database ( --use_prokka ). Do note that there could be memory-related issues when running Nextflow in SLURM environments. Can I use the ARETE outputs in MicroReact? Yes you can! In fact, ARETE provides many outputs that can be used in the MicroReact web app. Some of these files are: The PopPUNK lineages tree under poppunk_results/poppunk_visualizations/poppunk_visualizations.microreact . The reference tree built with FastTree under phylogenomics/reference_tree/core_gene_alignment.tre . The annotation feature profile annotation/feature_profile.tsv.gz . This file contains the annotation features in a presence/absence matrix format. Since MicroReact doesn't allow compressed files, just make sure to decompress it beforehand: gunzip feature_profile.tsv.gz Make sure to check our output documentation for a full list of outputs and the parameter documentation for a description of parameters to enable and disable these outputs. Why am I getting this 'docker: Permission denied' error? Although previous ARETE users have reported this issue, this is neither an issue with Nextflow nor with ARETE itself. This is most likely due to how Docker permissions are set up on your machine. If running on your own machine, take a look at this guide . If running on an HPC system, talk to your system administrator or consider running ARETE with Singularity . My server doesn't have that much memory! How do I change the resource requirements? 
Just write a file called nextflow.config in your working directory and add the following to it: process { withLabel:process_low { cpus = 6 memory = 8.GB time = 4.h } withLabel:process_medium { cpus = 12 memory = 36.GB time = 8.h } withLabel:process_high { cpus = 16 memory = 72.GB time = 20.h } withLabel:process_high_memory { memory = 200.GB } withName: MOB_RECON { cpus = 2 } } Feel free to change the values above as you wish and then add -c nextflow.config to your nextflow run beiko-lab/ARETE command. You can point to general process labels, like process_low , or you can point directly to process names, like MOB_RECON . Learn more at our usage documentation or the official nextflow documentation .","title":"FAQ"},{"location":"faq/#frequently-asked-questions","text":"","title":"Frequently Asked Questions"},{"location":"faq/#how-do-i-run-arete-in-a-slurm-hpc-environment","text":"Set a config file under ~/.nextflow/config to use the slurm executor: process { executor = 'slurm' pollInterval = '60 sec' submitRateLimit = '60/1min' queueSize = 100 // If an account is necessary: clusterOptions = '--account=' } See the Nextflow documentation for a description of these options. Now, when running ARETE, you'll need to set additional options if your compute nodes don't have network access - as is common for most Slurm clusters. The example below uses the default test data, i.e. the test profile, for demonstration purposes only. nextflow run beiko-lab/ARETE \\ --db_cache path/to/db_cache \\ --bakta_db path/to/baktadb \\ -profile test,singularity Apart from -profile singularity , which just makes ARETE use Singularity/Apptainer containers for running the tools, there are two additional parameters: --db_cache should be the location for the pre-downloaded databases used in the DIAMOND alignments (i.e. Bacmet, VFDB, ICEberg2 and CAZy FASTA files) and in the Kraken2 taxonomic read classification. Although these tools run by default, you can change the selection of annotation tools by changing --annotation_tools and skip Kraken2 by adding --skip_kraken . See the parameter documentation for a full list of parameters and their defaults. --bakta_db should be the location of the pre-downloaded Bakta database Alternatively, you can use Prokka for annotating your assemblies, since it doesn't require a downloaded database ( --use_prokka ). Do note that there could be memory-related issues when running Nextflow in SLURM environments.","title":"How do I run ARETE in a Slurm HPC environment?"},{"location":"faq/#can-i-use-the-arete-outputs-in-microreact","text":"Yes you can! In fact, ARETE provides many outputs that can be used in the MicroReact web app. Some of these files are: The PopPUNK lineages tree under poppunk_results/poppunk_visualizations/poppunk_visualizations.microreact . The reference tree built with FastTree under phylogenomics/reference_tree/core_gene_alignment.tre . The annotation feature profile annotation/feature_profile.tsv.gz . This file contains the annotation features in a presence/absence matrix format. 
{"location":"faq/#can-i-use-the-arete-outputs-in-microreact","text":"Yes you can! In fact, ARETE provides many outputs that can be used in the MicroReact web app. Some of these files are: The PopPUNK lineages tree under poppunk_results/poppunk_visualizations/poppunk_visualizations.microreact . The reference tree built with FastTree under phylogenomics/reference_tree/core_gene_alignment.tre . The annotation feature profile annotation/feature_profile.tsv.gz . This file contains the annotation features in a presence/absence matrix format. Since MicroReact doesn't allow compressed files, just make sure to decompress it beforehand: gunzip feature_profile.tsv.gz Make sure to check our output documentation for a full list of outputs and the parameter documentation for a description of parameters to enable and disable these outputs.","title":"Can I use the ARETE outputs in MicroReact?"},{"location":"faq/#why-am-i-getting-this-docker-permission-denied-error","text":"Although previous ARETE users have reported this issue, this is neither an issue with Nextflow nor with ARETE itself. This is most likely due to how Docker permissions are set up on your machine. If running on your own machine, take a look at this guide . If running on an HPC system, talk to your system administrator or consider running ARETE with Singularity .","title":"Why am I getting this 'docker: Permission denied' error?"},{"location":"faq/#my-server-doesnt-have-that-much-memory-how-do-i-change-the-resource-requirements","text":"Just write a file called nextflow.config in your working directory and add the following to it: process { withLabel:process_low { cpus = 6 memory = 8.GB time = 4.h } withLabel:process_medium { cpus = 12 memory = 36.GB time = 8.h } withLabel:process_high { cpus = 16 memory = 72.GB time = 20.h } withLabel:process_high_memory { memory = 200.GB } withName: MOB_RECON { cpus = 2 } } Feel free to change the values above as you wish and then add -c nextflow.config to your nextflow run beiko-lab/ARETE command. You can point to general process labels, like process_low , or you can point directly to process names, like MOB_RECON . Learn more at our usage documentation or the official Nextflow documentation .","title":"My server doesn't have that much memory! How do I change the resource requirements?"},{"location":"issues/","text":"Known issues in ARETE PopPUNK We have experienced issues with PopPUNK in ARETE runs, primarily related to the distances and clusters generated and how these affect both the subsampling and recombination subworkflows. Sometimes the distances generated are too small or the number of clusters changes between executions of the same dataset. While we can't solve the latter since it is a result of how PopPUNK itself works, the former can be mitigated by adjusting the subsampling thresholds with --core_similarity (99.9 by default) and --accessory_similarity (99 by default), or disabling subsampling altogether ( --enable_subsetting false ). A useful course of action is to first run your dataset through the PopPUNK entry , and then choose the appropriate parameters for your final pipeline run. The PopPUNK entry takes at most 2 hours to run, even on large datasets . Disabling PopPUNK in your execution is also simple to do with --skip_poppunk . rSPR rSPR, which you can enable with --run_rspr or with the rSPR entry , is known to be very slow , especially with larger datasets. The default of 3 days for rSPR runtimes should be enough for some runs, but for most larger datasets it won't be sufficient. In this case, if you do want to run rSPR, we suggest two possible routes: Increasing the default time allocation for the RSPR_EXACT processes. Check out how . Ignoring timeout errors altogether and finishing the pipeline execution with whatever finished running in these 3 days. This is the default course of action for ARETE . By choosing the second course of action, we ignore timeout errors generated with RSPR_EXACT and finish the execution of downstream processes, i.e. RSPR_HEATMAP , with whatever results we already have.
This process generates a heatmap of tree size and exact rSPR distance. While RSPR_HEATMAP should execute with the results that were generated up to the timeout, we have heard from users that this process can still fail to run, even when results from RSPR_EXACT were generated. This issue has only been reported with older versions of Nextflow; v23 onwards should work fine . While this issue is unfortunate, it shouldn't be a big problem: The only output given by RSPR_HEATMAP is the aforementioned heatmap, which can also be generated externally by using our rspr_heatmap.py script or your own downstream analysis.","title":"Known issues"},{"location":"issues/#known-issues-in-arete","text":"","title":"Known issues in ARETE"},{"location":"issues/#poppunk","text":"We have experienced issues with PopPUNK in ARETE runs, primarily related to the distances and clusters generated and how these affect both the subsampling and recombination subworkflows. Sometimes the distances generated are too small or the number of clusters changes between executions of the same dataset. While we can't solve the latter since it is a result of how PopPUNK itself works, the former can be mitigated by adjusting the subsampling thresholds with --core_similarity (99.9 by default) and --accessory_similarity (99 by default), or disabling subsampling altogether ( --enable_subsetting false ). A useful course of action is to first run your dataset through the PopPUNK entry , and then choose the appropriate parameters for your final pipeline run. The PopPUNK entry takes at most 2 hours to run, even on large datasets . Disabling PopPUNK in your execution is also simple to do with --skip_poppunk .","title":"PopPUNK"},{"location":"issues/#rspr","text":"rSPR, which you can enable with --run_rspr or with the rSPR entry , is known to be very slow , especially with larger datasets. The default of 3 days for rSPR runtimes should be enough for some runs, but for most larger datasets it won't be sufficient. In this case, if you do want to run rSPR, we suggest two possible routes: Increasing the default time allocation for the RSPR_EXACT processes. Check out how (a minimal config sketch also follows at the end of this page). Ignoring timeout errors altogether and finishing the pipeline execution with whatever finished running in these 3 days. This is the default course of action for ARETE . By choosing the second course of action, we ignore timeout errors generated with RSPR_EXACT and finish the execution of downstream processes, i.e. RSPR_HEATMAP , with whatever results we already have. This process generates a heatmap of tree size and exact rSPR distance. While RSPR_HEATMAP should execute with the results that were generated up to the timeout, we have heard from users that this process can still fail to run, even when results from RSPR_EXACT were generated. This issue has only been reported with older versions of Nextflow; v23 onwards should work fine . While this issue is unfortunate, it shouldn't be a big problem: The only output given by RSPR_HEATMAP is the aforementioned heatmap, which can also be generated externally by using our rspr_heatmap.py script or your own downstream analysis.","title":"rSPR"},
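For the first route, a minimal sketch of a config that raises the RSPR_EXACT time limit (pass it to your run with -c nextflow.config ; depending on how the process is nested in the workflow, a wildcard selector such as withName: '.*RSPR_EXACT' may be needed):

process {
    withName: 'RSPR_EXACT' {
        // Raise the time limit from the 3-day default; adjust to your dataset
        time = 7.d
    }
}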
{"location":"output/","text":"beiko-lab/ARETE: Output Introduction The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. Pipeline overview The pipeline is built using Nextflow and processes data using the following steps (steps in italics don't run by default): Short-read processing and assembly FastQC - Raw and trimmed read QC FastP - Read trimming Kraken2 - Taxonomic assignment Unicycler - Short read assembly Quast - Assembly quality assessment Annotation Bakta or Prokka - Gene detection and annotation MobRecon - Reconstruction and typing of plasmids RGI - Detection and annotation of AMR determinants IslandPath - Predicts genomic islands in bacterial and archaeal genomes. PhiSpy - Prediction of prophages from bacterial genomes IntegronFinder - Finds integrons in DNA sequences Diamond - Detection and annotation of genes using external databases. CAZy: Carbohydrate metabolism VFDB: Virulence factors BacMet: Metal resistance determinants ICEberg: Integrative and conjugative elements annotation_report.tsv.gz - A tabular file aggregating annotation data from all genomes feature_profile.tsv.gz - A presence/absence matrix of features in all genomes IslandPath, PhiSpy and IntegronFinder results are currently not added to the final annotation report. We aim to fix this issue in the future. PopPUNK Subworkflow PopPUNK - Genome clustering Dynamics EvolCCM - Community Coevolution rSPR - rooted subtree-prune-and-regraft distances Recombination Verticall - Conduct pairwise assembly comparisons between genomes in the same PopPUNK cluster SKA2 - Generate a whole-genome FASTA alignment for each genome within a cluster. Gubbins - Detection of recombination events within genomes of the same cluster. Gene Order Phylogenomics and Pangenomics Panaroo or PPanGGoLiN - Pangenome alignment FastTree or IQTree - Maximum likelihood core genome phylogenetic tree SNPsites - Extracts SNPs from a multi-FASTA alignment Pipeline information Report metrics generated during the workflow execution MultiQC - Aggregate report describing results and QC from the whole pipeline Assembly FastQC read_processing/*_fastqc/ *_fastqc.html : FastQC report containing quality metrics for your untrimmed raw fastq files. *_fastqc.zip : Zip archive containing the FastQC report, tab-delimited data file and plot images. NB: The FastQC plots in this directory are generated relative to the raw input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after adapter and quality trimming, please refer to the FastQC reports in the trimgalore/fastqc/ directory. FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages . NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality. fastp read_processing/fastp/ ${meta.id} : Trimmed files and trimming reports for each input sample. fastp is an all-in-one FASTQ preprocessor for read/adapter trimming and quality control. It is used in this pipeline for trimming adapter sequences and discarding low-quality reads. Kraken2 read_processing/kraken2/ *.kraken2.report.txt : Text file containing a per-sample breakdown of Kraken2 findings. See here for details. *.classified(_(1|2))?.fastq.gz : FASTQ file containing classified reads. If paired-end, one file per end. *.unclassified(_(1|2))?.fastq.gz : FASTQ file containing unclassified reads. If paired-end, one file per end. Kraken2 is a read classification software which will assign taxonomy to each read comprising a sample. These results may be analyzed as an indicator of contamination, as sketched below.
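A quick way to eyeball a sample's top species-level assignments for contamination screening - a sketch assuming the standard six-column Kraken2 report layout (percentage, clade reads, direct reads, rank code, taxid, name), the default results directory, and a hypothetical sample named sample1:

# keep species-level rows (rank code "S"), sort by percentage, show the top five
awk '$4 == "S" {print $1, $6, $7}' results/read_processing/kraken2/sample1.kraken2.report.txt | sort -k1,1nr | head -n 5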
Unicycler assembly/unicycler/ *.assembly.gfa *.scaffolds.fa *.unicycler.log Short/hybrid read assembler. For now, it only handles short reads in ARETE. Quast assembly/quast/ report.tsv : A tab-separated report compiling all QC metrics recorded over all genomes quast/ report.(html|tex|pdf|tsv|txt) : The Quast report in different file formats transposed_report.(tsv|txt) : Transpose of the Quast report quast.log : Log file of all Quast runs icarus_viewers/ contig_size_viewer.html basic_stats/ : Directory containing various summary plots generated by Quast. Annotation Bakta annotation/bakta/ ${sample_id}/ : Bakta results will be in one directory per genome. ${sample_id}.tsv : annotations as simple human readable TSV ${sample_id}.gff3 : annotations & sequences in GFF3 format ${sample_id}.gbff : annotations & sequences in (multi) GenBank format ${sample_id}.embl : annotations & sequences in (multi) EMBL format ${sample_id}.fna : replicon/contig DNA sequences as FASTA ${sample_id}.ffn : feature nucleotide sequences as FASTA ${sample_id}.faa : CDS/sORF amino acid sequences as FASTA ${sample_id}.hypotheticals.tsv : further information on hypothetical protein CDS as simple human readable tab-separated values ${sample_id}.hypotheticals.faa : hypothetical protein CDS amino acid sequences as FASTA ${sample_id}.txt : summary as TXT ${sample_id}.png : circular genome annotation plot as PNG ${sample_id}.svg : circular genome annotation plot as SVG Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs. Prokka annotation/prokka/ ${sample_id}/ : Prokka results will be in one directory per genome. ${sample_id}.err : Unacceptable annotations ${sample_id}.faa : Protein FASTA file of translated CDS sequences ${sample_id}.ffn : Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) ${sample_id}.fna : Nucleotide FASTA file of input contig sequences ${sample_id}.fsa : Nucleotide FASTA file of the input contig sequences, used by \"tbl2asn\" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. ${sample_id}.gff : This is the master annotation in GFF3 format, containing both sequences and annotations. ${sample_id}.gbk : This is a standard Genbank file derived from the master .gff. ${sample_id}.log : Contains all the output that Prokka produced during its run. This is a record of what settings were used, even if the --quiet option was enabled. ${sample_id}.sqn : An ASN1 format \"Sequin\" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. ${sample_id}.tbl : Feature Table file, used by \"tbl2asn\" to create the .sqn file. ${sample_id}.tsv : Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product ${sample_id}.txt : Statistics relating to the annotated features found. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files. RGI annotation/rgi/ ${sample_id}_rgi.txt : A TSV report containing all AMR predictions for a given genome. For more info see here . RGI predicts AMR determinants using the CARD ontology and various trained models. A small sketch for tallying these predictions across genomes is shown below.
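A minimal sketch for counting predicted AMR determinants per genome across the RGI reports, assuming the default results layout described above and a single header line per report:

for f in results/annotation/rgi/*_rgi.txt; do
  # one header line per report; the remaining lines are individual AMR predictions
  echo "$(basename "$f" _rgi.txt): $(( $(wc -l < "$f") - 1 )) predicted determinants"
done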
MobRecon annotation/mob_recon ${sample_id}_mob_recon/ : MobRecon results will be in one directory per genome. contig_report.txt - This file describes the assignment of the contig to chromosome or a particular plasmid grouping. mge.report.txt - Blast HSP of detected MGEs/repetitive elements with contextual information. chromosome.fasta - Fasta file of all contigs found to belong to the chromosome. plasmid_*.fasta - Each plasmid group is written to an individual fasta file which contains the assigned contigs. mobtyper_results - Aggregate MOB-typer report files for all identified plasmids. MobRecon reconstructs individual plasmid sequences from draft genome assemblies using the clustered plasmid reference databases. DIAMOND annotation/(vfdb|bacmet|cazy|iceberg2)/ ${sample_id}/${sample_id}_(VFDB|BACMET|CAZYDB|ICEberg2).txt : Blast6 formatted TSVs indicating BlastX results of the genes from each genome against the VFDB, BacMet, CAZy and ICEberg2 databases. (VFDB|BACMET|CAZYDB|ICEberg2).txt : Table with all hits to this database, with a column describing which genome the match originates from. Sorted and filtered by the match's coverage. Diamond is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. We use DIAMOND to predict the presence of virulence factors, heavy metal resistance determinants, carbohydrate-active enzymes, and integrative and conjugative elements using VFDB , BacMet , CAZy , and ICEberg2 respectively. IslandPath annotation/islandpath/ ${sample_id}/ : IslandPath results will be in one directory per genome. ${sample_id}.tsv : IslandPath results Dimob.log : IslandPath execution log IslandPath is a standalone software to predict genomic islands in bacterial and archaeal genomes based on the presence of dinucleotide biases and mobility genes. IntegronFinder Disabled by default. Enable by adding --run_integronfinder to your command. annotation/integron_finder/ Results_Integron_Finder_${sample_id}/ : IntegronFinder results will be in one directory per genome. Integron Finder is a bioinformatics tool to find integrons in bacterial genomes. PhiSpy annotation/phispy/ ${sample_id}/ : PhiSpy results will be in one directory per genome. See the PhiSpy documentation for an extensive description of the output. PhiSpy is a tool for identification of prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions. PopPUNK poppunk_results/ poppunk_db/ - Results from PopPUNK's create-db command poppunk_${poppunk_model}/ - Results from PopPUNK's fit-model command poppunk_visualizations/ - Results from the poppunk_visualise command PopPUNK is a tool for clustering genomes. Phylogenomics and Pangenomics Panaroo pangenomics/panaroo/results/ See the panaroo documentation for an extensive description of output provided. Panaroo is a Bacterial Pangenome Analysis Pipeline. PPanGGoLiN pangenomics/ppanggolin/ See the PPanGGoLiN documentation for an extensive description of output provided. PPanGGoLiN is a tool to build a partitioned pangenome graph from microbial genomes. FastTree phylogenomics/fasttree/ *.tre : Newick formatted maximum likelihood tree of core-genome alignment. FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences. IQTree phylogenomics/iqtree/ *.treefile : Newick formatted maximum likelihood tree of core-genome alignment.
IQTree is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. SNPsites phylogenomics/snpsites/ filtered_alignment.fas : Variant fasta file. constant.sites.txt : Text file containing counts of constant sites. SNPsites is a tool to rapidly extract SNPs from a multi-FASTA alignment. Dynamics EvolCCM dynamics/EvolCCM/ EvolCCM_*tsv EvolCCM_*pvals EvolCCM_*X2 EvolCCM_*tre EvolCCM is the R implementation for CCM (Community Coevolution Model) rSPR The outputs are approximate and exact Subtree Prune and Regraft (rSPR) distances between pairs of rooted phylogenetic trees. Each CSV file contains these distances and the tree sizes. The PNG files are heatmaps of these distances and their respective tree sizes. dynamics/rSPR/ approx - Approximate rSPR distances exact - Exact rSPR distances rSPR is a software package for calculating rooted subtree-prune-and-regraft distances and rooted agreement forests. Recombination Verticall dynamics/recombination/verticall/ verticall_cluster*.tsv - Verticall results for the genomes within this PopPUNK cluster. Verticall is a tool to help produce bacterial genome phylogenies which are not influenced by horizontally acquired sequences, such as those introduced by recombination. SKA2 dynamics/recombination/ska2/ cluster_*.aln - SKA2 results for the genomes within this PopPUNK cluster. SKA2 (Split Kmer Analysis) is a toolkit for prokaryotic (and any other small, haploid) DNA sequence analysis using split kmers. Gubbins dynamics/recombination/gubbins/ cluster_*/ - Gubbins results for the genomes within this PopPUNK cluster. Gubbins is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions. Gene Order gene-order/ extraction/ - AMR genes of interest and their neighborhoods extracted from the assemblies. diamond/ - Pairwise alignments between all input genomes. clustering/ - Similarity and distance matrices for each AMR gene clustered via UPGMA, MCL and DBSCAN to identify similarities between their neighborhoods across all genomes. Gene Order is a subworkflow for bacterial gene order analysis, with outputs easily explorable through its partner visualization application Coeus . Pipeline information pipeline_info/ Reports generated by Nextflow: execution_report.html , execution_timeline.html , execution_trace.txt and pipeline_dag.dot / pipeline_dag.svg . Reports generated by the pipeline: pipeline_report.html , pipeline_report.txt and software_versions.csv . Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv . Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. MultiQC multiqc/ multiqc_report.html : a standalone HTML file that can be viewed in your web browser. multiqc_data/ : directory containing parsed statistics from the different tools used in the pipeline. multiqc_plots/ : directory containing static images from the report in various formats. MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools, e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info .","title":"Output"},{"location":"output/#beiko-labarete-output","text":"","title":"beiko-lab/ARETE: Output"},{"location":"output/#introduction","text":"The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.","title":"Introduction"},{"location":"output/#pipeline-overview","text":"The pipeline is built using Nextflow and processes data using the following steps (steps in italics don't run by default): Short-read processing and assembly FastQC - Raw and trimmed read QC FastP - Read trimming Kraken2 - Taxonomic assignment Unicycler - Short read assembly Quast - Assembly quality assessment Annotation Bakta or Prokka - Gene detection and annotation MobRecon - Reconstruction and typing of plasmids RGI - Detection and annotation of AMR determinants IslandPath - Predicts genomic islands in bacterial and archaeal genomes. PhiSpy - Prediction of prophages from bacterial genomes IntegronFinder - Finds integrons in DNA sequences Diamond - Detection and annotation of genes using external databases. CAZy: Carbohydrate metabolism VFDB: Virulence factors BacMet: Metal resistance determinants ICEberg: Integrative and conjugative elements annotation_report.tsv.gz - A tabular file aggregating annotation data from all genomes feature_profile.tsv.gz - A presence/absence matrix of features in all genomes IslandPath, PhiSpy and IntegronFinder results are currently not added to the final annotation report. We aim to fix this issue in the future. PopPUNK Subworkflow PopPUNK - Genome clustering Dynamics EvolCCM - Community Coevolution rSPR - rooted subtree-prune-and-regraft distances Recombination Verticall - Conduct pairwise assembly comparisons between genomes in the same PopPUNK cluster SKA2 - Generate a whole-genome FASTA alignment for each genome within a cluster. Gubbins - Detection of recombination events within genomes of the same cluster. Gene Order Phylogenomics and Pangenomics Panaroo or PPanGGoLiN - Pangenome alignment FastTree or IQTree - Maximum likelihood core genome phylogenetic tree SNPsites - Extracts SNPs from a multi-FASTA alignment Pipeline information Report metrics generated during the workflow execution MultiQC - Aggregate report describing results and QC from the whole pipeline","title":"Pipeline overview"},{"location":"output/#assembly","text":"","title":"Assembly"},{"location":"output/#fastqc","text":"read_processing/*_fastqc/ *_fastqc.html : FastQC report containing quality metrics for your untrimmed raw fastq files. *_fastqc.zip : Zip archive containing the FastQC report, tab-delimited data file and plot images. NB: The FastQC plots in this directory are generated relative to the raw input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after adapter and quality trimming, please refer to the FastQC reports in the trimgalore/fastqc/ directory. FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages .
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality.","title":"FastQC"},{"location":"output/#fastp","text":"read_processing/fastp/ ${meta.id} : Trimmed files and trimming reports for each input sample. fastp is an all-in-one FASTQ preprocessor for read/adapter trimming and quality control. It is used in this pipeline for trimming adapter sequences and discarding low-quality reads.","title":"fastp"},{"location":"output/#kraken2","text":"read_processing/kraken2/ *.kraken2.report.txt : Text file containing a per-sample breakdown of Kraken2 findings. See here for details. *.classified(_(1|2))?.fastq.gz : FASTQ file containing classified reads. If paired-end, one file per end. *.unclassified(_(1|2))?.fastq.gz : FASTQ file containing unclassified reads. If paired-end, one file per end. Kraken2 is a read classification software which will assign taxonomy to each read comprising a sample. These results may be analyzed as an indicator of contamination.","title":"Kraken2"},{"location":"output/#unicycler","text":"assembly/unicycler/ *.assembly.gfa *.scaffolds.fa *.unicycler.log Short/hybrid read assembler. For now, it only handles short reads in ARETE.","title":"Unicycler"},{"location":"output/#quast","text":"assembly/quast/ report.tsv : A tab-separated report compiling all QC metrics recorded over all genomes quast/ report.(html|tex|pdf|tsv|txt) : The Quast report in different file formats transposed_report.(tsv|txt) : Transpose of the Quast report quast.log : Log file of all Quast runs icarus_viewers/ contig_size_viewer.html basic_stats/ : Directory containing various summary plots generated by Quast.","title":"Quast"},{"location":"output/#annotation","text":"","title":"Annotation"},{"location":"output/#bakta","text":"annotation/bakta/ ${sample_id}/ : Bakta results will be in one directory per genome. ${sample_id}.tsv : annotations as simple human readable TSV ${sample_id}.gff3 : annotations & sequences in GFF3 format ${sample_id}.gbff : annotations & sequences in (multi) GenBank format ${sample_id}.embl : annotations & sequences in (multi) EMBL format ${sample_id}.fna : replicon/contig DNA sequences as FASTA ${sample_id}.ffn : feature nucleotide sequences as FASTA ${sample_id}.faa : CDS/sORF amino acid sequences as FASTA ${sample_id}.hypotheticals.tsv : further information on hypothetical protein CDS as simple human readable tab-separated values ${sample_id}.hypotheticals.faa : hypothetical protein CDS amino acid sequences as FASTA ${sample_id}.txt : summary as TXT ${sample_id}.png : circular genome annotation plot as PNG ${sample_id}.svg : circular genome annotation plot as SVG Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs.","title":"Bakta"},{"location":"output/#prokka","text":"annotation/prokka/ ${sample_id}/ : Prokka results will be in one directory per genome. ${sample_id}.err : Unacceptable annotations ${sample_id}.faa : Protein FASTA file of translated CDS sequences ${sample_id}.ffn : Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) ${sample_id}.fna : Nucleotide FASTA file of input contig sequences ${sample_id}.fsa : Nucleotide FASTA file of the input contig sequences, used by \"tbl2asn\" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
${sample_id}.gff : This is the master annotation in GFF3 format, containing both sequences and annotations. ${sample_id}.gbk : This is a standard Genbank file derived from the master .gff. ${sample_id}.log : Contains all the output that Prokka produced during its run. This is a record of what settings were used, even if the --quiet option was enabled. ${sample_id}.sqn : An ASN1 format \"Sequin\" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. ${sample_id}.tbl : Feature Table file, used by \"tbl2asn\" to create the .sqn file. ${sample_id}.tsv : Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product ${sample_id}.txt : Statistics relating to the annotated features found. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.","title":"Prokka"},{"location":"output/#rgi","text":"annotation/rgi/ ${sample_id}_rgi.txt : A TSV report containing all AMR predictions for a given genome. For more info see here . RGI predicts AMR determinants using the CARD ontology and various trained models.","title":"RGI"},{"location":"output/#mobrecon","text":"annotation/mob_recon ${sample_id}_mob_recon/ : MobRecon results will be in one directory per genome. contig_report.txt - This file describes the assignment of the contig to chromosome or a particular plasmid grouping. mge.report.txt - Blast HSP of detected MGEs/repetitive elements with contextual information. chromosome.fasta - Fasta file of all contigs found to belong to the chromosome. plasmid_*.fasta - Each plasmid group is written to an individual fasta file which contains the assigned contigs. mobtyper_results - Aggregate MOB-typer report files for all identified plasmids. MobRecon reconstructs individual plasmid sequences from draft genome assemblies using the clustered plasmid reference databases.","title":"MobRecon"},{"location":"output/#diamond","text":"annotation/(vfdb|bacmet|cazy|iceberg2)/ ${sample_id}/${sample_id}_(VFDB|BACMET|CAZYDB|ICEberg2).txt : Blast6 formatted TSVs indicating BlastX results of the genes from each genome against the VFDB, BacMet, CAZy and ICEberg2 databases. (VFDB|BACMET|CAZYDB|ICEberg2).txt : Table with all hits to this database, with a column describing which genome the match originates from. Sorted and filtered by the match's coverage. Diamond is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. We use DIAMOND to predict the presence of virulence factors, heavy metal resistance determinants, carbohydrate-active enzymes, and integrative and conjugative elements using VFDB , BacMet , CAZy , and ICEberg2 respectively.","title":"DIAMOND"},{"location":"output/#islandpath","text":"annotation/islandpath/ ${sample_id}/ : IslandPath results will be in one directory per genome. ${sample_id}.tsv : IslandPath results Dimob.log : IslandPath execution log IslandPath is a standalone software to predict genomic islands in bacterial and archaeal genomes based on the presence of dinucleotide biases and mobility genes.","title":"IslandPath"},{"location":"output/#integronfinder","text":"Disabled by default. Enable by adding --run_integronfinder to your command. annotation/integron_finder/ Results_Integron_Finder_${sample_id}/ : IntegronFinder results will be in one directory per genome.
Integron Finder is a bioinformatics tool to find integrons in bacterial genomes.","title":"IntegronFinder"},{"location":"output/#phispy","text":"annotation/phispy/ ${sample_id}/ : PhiSpy results will be in one directory per genome. See the PhiSpy documentation for an extensive description of the output. PhiSpy is a tool for identification of prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions.","title":"PhiSpy"},{"location":"output/#poppunk","text":"poppunk_results/ poppunk_db/ - Results from PopPUNK's create-db command poppunk_${poppunk_model}/ - Results from PopPUNK's fit-model command poppunk_visualizations/ - Results from the poppunk_visualise command PopPUNK is a tool for clustering genomes.","title":"PopPUNK"},{"location":"output/#phylogenomics-and-pangenomics","text":"","title":"Phylogenomics and Pangenomics"},{"location":"output/#panaroo","text":"pangenomics/panaroo/results/ See the panaroo documentation for an extensive description of output provided. Panaroo is a Bacterial Pangenome Analysis Pipeline.","title":"Panaroo"},{"location":"output/#ppanggolin","text":"pangenomics/ppanggolin/ See the PPanGGoLiN documentation for an extensive description of output provided. PPanGGoLiN is a tool to build a partitioned pangenome graph from microbial genomes.","title":"PPanGGoLiN"},{"location":"output/#fasttree","text":"phylogenomics/fasttree/ *.tre : Newick formatted maximum likelihood tree of core-genome alignment. FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences.","title":"FastTree"},{"location":"output/#iqtree","text":"phylogenomics/iqtree/ *.treefile : Newick formatted maximum likelihood tree of core-genome alignment. IQTree is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood.","title":"IQTree"},{"location":"output/#snpsites","text":"phylogenomics/snpsites/ filtered_alignment.fas : Variant fasta file. constant.sites.txt : Text file containing counts of constant sites. SNPsites is a tool to rapidly extract SNPs from a multi-FASTA alignment.","title":"SNPsites"},{"location":"output/#dynamics","text":"","title":"Dynamics"},{"location":"output/#evolccm","text":"dynamics/EvolCCM/ EvolCCM_*tsv EvolCCM_*pvals EvolCCM_*X2 EvolCCM_*tre EvolCCM is the R implementation for CCM (Community Coevolution Model)","title":"EvolCCM"},{"location":"output/#rspr","text":"The outputs are approximate and exact Subtree Prune and Regraft (rSPR) distances between pairs of rooted phylogenetic trees. Each CSV file contains these distances and the tree sizes. The PNG files are heatmaps of these distances and their respective tree sizes. dynamics/rSPR/ approx - Approximate rSPR distances exact - Exact rSPR distances rSPR is a software package for calculating rooted subtree-prune-and-regraft distances and rooted agreement forests.","title":"rSPR"},{"location":"output/#recombination","text":"","title":"Recombination"},{"location":"output/#verticall","text":"dynamics/recombination/verticall/ verticall_cluster*.tsv - Verticall results for the genomes within this PopPUNK cluster. Verticall is a tool to help produce bacterial genome phylogenies which are not influenced by horizontally acquired sequences, such as those introduced by recombination.","title":"Verticall"},{"location":"output/#ska2","text":"dynamics/recombination/ska2/ cluster_*.aln - SKA2 results for the genomes within this PopPUNK cluster.
SKA2 (Split Kmer Analysis) is a toolkit for prokaryotic (and any other small, haploid) DNA sequence analysis using split kmers.","title":"SKA2"},{"location":"output/#gubbins","text":"dynamics/recombination/gubbins/ cluster_*/ - Gubbins results for the genomes within this PopPUNK cluster. Gubbins is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions.","title":"Gubbins"},{"location":"output/#gene-order","text":"gene-order/ extraction/ - AMR genes of interest and their neighborhoods extracted from the assemblies. diamond/ - Pairwise alignments between all input genomes. clustering/ - Similarity and distance matrices for each AMR gene clustered via UPGMA, MCL and DBSCAN to identify similarities between their neighborhoods across all genomes. Gene Order is a subworkflow for bacterial gene order analysis, with outputs easily explorable through its partner visualization application Coeus .","title":"Gene Order"},{"location":"output/#pipeline-information","text":"pipeline_info/ Reports generated by Nextflow: execution_report.html , execution_timeline.html , execution_trace.txt and pipeline_dag.dot / pipeline_dag.svg . Reports generated by the pipeline: pipeline_report.html , pipeline_report.txt and software_versions.csv . Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv . Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.","title":"Pipeline information"},{"location":"output/#multiqc","text":"multiqc/ multiqc_report.html : a standalone HTML file that can be viewed in your web browser. multiqc_data/ : directory containing parsed statistics from the different tools used in the pipeline. multiqc_plots/ : directory containing static images from the report in various formats. MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info .","title":"MultiQC"},{"location":"params/","text":"beiko-lab/ARETE pipeline parameters AMR/VF LGT-focused bacterial genomics workflow Input/output options Define where the pipeline should find input data and save output data. Parameter Description Type Default Required Hidden input_sample_table Path to comma-separated file containing information about the samples in the experiment. Help You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. string outdir Path to the output directory where the results will be saved. string ./results db_cache Directory where the databases are located string email Email address for completion summary. 
Help Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file ( ~/.nextflow/config ) then you don't need to specify this on the command line for every run. string multiqc_title MultiQC report title. Printed as page header, used for filename if not otherwise specified. string Reference genome options Reference and outgroup genome fasta files required for the workflow. Parameter Description Type Default Required Hidden reference_genome Path to FASTA reference genome file. string QC Parameter Description Type Default Required Hidden run_checkm Run CheckM QC software boolean apply_filtering Filter assemblies on QC results boolean skip_kraken Don't run Kraken2 taxonomic classification boolean min_n50 Minimum N50 for filtering integer 10000 min_contigs_1000_bp Minimum number of contigs with >1000bp integer 1 min_contig_length Minimum average contig length integer 1 Annotation Parameters for the annotation subworkflow Parameter Description Type Default Required Hidden annotation_tools Comma-separated list of annotation tools to run string mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,report bakta_db Path to the BAKTA database string use_prokka Use Prokka (not Bakta) for annotating assemblies boolean min_pident Minimum match identity percentage for filtering integer 60 min_qcover Minimum coverage of each match for filtering number 0.6 skip_profile_creation Skip annotation feature profile creation boolean feature_profile_columns Columns to include in the feature profile string mobsuite,rgi,cazy,vfdb,iceberg,bacmet Phylogenomics Parameters for the phylogenomics subworkflow Parameter Description Type Default Required Hidden skip_phylo Skip Pangenomics and Phylogenomics subworkflow boolean use_ppanggolin Use ppanggolin for calculating the pangenome boolean use_full_alignment Use full alignment boolean use_fasttree Use FastTree boolean True PopPUNK Parameters for the lineage subworkflow Parameter Description Type Default Required Hidden skip_poppunk Skip PopPUNK boolean poppunk_model Which PopPUNK model to use (bgmm, dbscan, refine, threshold or lineage) string run_poppunk_qc Whether to run the QC step for PopPUNK boolean enable_subsetting Enable subsetting workflow based on genome similarity boolean core_similarity Similarity threshold for core genomes number 99.99 accessory_similarity Similarity threshold for accessory genes number 99.0 Gene Order Parameters for the Gene Order Subworkflow Parameter Description Type Default Required Hidden run_gene_order Whether to run the Gene Order subworkflow boolean gene_order_percent_cutoff Cutoff percentage of genomes a gene should be present within to be included in extraction and subsequent analysis. Should be a float between 0 and 1 (e.g., 0.25 means only genes present in a minimum of 25% of genomes are kept). number 0.25 gene_order_label_cols If using annotation files predicting features, list of space separated column names to be added to the gene names string None num_neighbors Neighborhood size to extract. Should be an even number N, such that N/2 neighbors upstream and N/2 neighbors downstream will be analyzed. integer 10 inflation Inflation hyperparameter value for Markov Clustering Algorithm. integer 2 epsilon Epsilon hyperparameter value for DBSCAN clustering. number 0.5 minpts Minpts hyperparameter value for DBSCAN clustering.
integer 5 plot_clustering Create Clustering HTML Plots boolean Recombination Parameters for the recombination subworkflow Parameter Description Type Default Required Hidden run_recombination Run Recombination boolean run_verticall Run Verticall recombination tool boolean True run_gubbins Run Gubbins recombination tool boolean Dynamics Parameter Description Type Default Required Hidden run_evolccm Run the community coevolution model boolean run_rspr Run rSPR boolean min_rspr_distance Minimum rSPR distance used to define processing groups integer 10 min_branch_length Minimum rSPR branch length integer 0 max_support_threshold Maximum rSPR support threshold number 0.7 max_approx_rspr Maximum approximate rSPR distance for filtering integer -1 min_heatmap_approx_rspr Minimum approximate rSPR distance used to generate heatmap integer 0 max_heatmap_approx_rspr Maximum approximate rSPR distance used to generate heatmap integer -1 min_heatmap_exact_rspr Minimum exact rSPR distance used to generate heatmap integer 0 max_heatmap_exact_rspr Maximum exact rSPR distance used to generate heatmap integer -1 core_gene_tree Core (or reference) genome tree. Used in the rSPR and evolCCM entries. string concatenated_annotation TSV table of annotations for all genomes. Such as the ones generated by Bakta or Prokka in ARETE. string feature_profile Feature profile TSV (A presence-absence matrix). Used in the evolCCM entry. string Institutional config options Parameters used to describe centralised config profiles. These should not be edited. Parameter Description Type Default Required Hidden custom_config_version Git commit id for Institutional configs. string master True custom_config_base Base directory for Institutional configs. Help If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter. string https://raw.githubusercontent.com/nf-core/configs/master True hostnames Institutional configs hostname. string True config_profile_name Institutional config name. string True config_profile_description Institutional config description. string True config_profile_contact Institutional config contact information. string True config_profile_url Institutional config URL link. string True Max job request options Set the top limit for requested resources for any single job. Parameter Description Type Default Required Hidden max_cpus Maximum number of CPUs that can be requested for any single job. Help Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1 integer 16 True max_memory Maximum amount of memory that can be requested for any single job. Help Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB' string 128.GB True max_time Maximum amount of time that can be requested for any single job. Help Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h' string 240.h True Generic options Less common options for the pipeline, typically set in a config file. Parameter Description Type Default Required Hidden help Display help text. boolean True publish_dir_mode Method used to save pipeline results to output directory. 
Help The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details. string copy True email_on_fail Email address for completion summary, only when pipeline fails. Help An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully. string True plaintext_email Send plain-text email instead of HTML. boolean True max_multiqc_email_size File size limit when attaching MultiQC reports to summary emails. string 25.MB True monochrome_logs Do not use coloured log outputs. boolean True multiqc_config Custom config file to supply to MultiQC. string True tracedir Directory to keep pipeline Nextflow logs and reports. string ${params.outdir}/pipeline_info True validate_params Boolean whether to validate parameters against the schema at runtime boolean True True show_hidden_params Show all params when using --help Help By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help . Specifying this option will tell the pipeline to show all parameters. boolean True enable_conda Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter. boolean True singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. Help This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues. boolean True schema_ignore_params string genomes,modules multiqc_logo string True","title":"Parameters"},{"location":"params/#beiko-labarete-pipeline-parameters","text":"AMR/VF LGT-focused bacterial genomics workflow","title":"beiko-lab/ARETE pipeline parameters"},{"location":"params/#inputoutput-options","text":"Define where the pipeline should find input data and save output data. Parameter Description Type Default Required Hidden input_sample_table Path to comma-separated file containing information about the samples in the experiment. Help You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. string outdir Path to the output directory where the results will be saved. string ./results db_cache Directory where the databases are located string email Email address for completion summary. Help Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file ( ~/.nextflow/config ) then you don't need to specify this on the command line for every run. string multiqc_title MultiQC report title. Printed as page header, used for filename if not otherwise specified. string","title":"Input/output options"},{"location":"params/#reference-genome-options","text":"Reference and outgroup genome fasta files required for the workflow. Parameter Description Type Default Required Hidden reference_genome Path to FASTA reference genome file. 
string","title":"Reference genome options"},{"location":"params/#qc","text":"Parameter Description Type Default Required Hidden run_checkm Run CheckM QC software boolean apply_filtering Filter assemblies on QC results boolean skip_kraken Don't run Kraken2 taxonomic classification boolean min_n50 Minimum N50 for filtering integer 10000 min_contigs_1000_bp Minimum number of contigs with >1000bp integer 1 min_contig_length Minimum average contig length integer 1","title":"QC"},{"location":"params/#annotation","text":"Parameters for the annotation subworkflow Parameter Description Type Default Required Hidden annotation_tools Comma-separated list of annotation tools to run string mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,report bakta_db Path to the BAKTA database string use_prokka Use Prokka (not Bakta) for annotating assemblies boolean min_pident Minimum match identity percentage for filtering integer 60 min_qcover Minimum coverage of each match for filtering number 0.6 skip_profile_creation Skip annotation feature profile creation boolean feature_profile_columns Columns to include in the feature profile string mobsuite,rgi,cazy,vfdb,iceberg,bacmet","title":"Annotation"},{"location":"params/#phylogenomics","text":"Parameters for the phylogenomics subworkflow Parameter Description Type Default Required Hidden skip_phylo Skip Pangenomics and Phylogenomics subworkflow boolean use_ppanggolin Use ppanggolin for calculating the pangenome boolean use_full_alignment Use full alignment boolean use_fasttree Use FastTree boolean True","title":"Phylogenomics"},{"location":"params/#poppunk","text":"Parameters for the lineage subworkflow Parameter Description Type Default Required Hidden skip_poppunk Skip PopPunk boolean poppunk_model Which PopPunk model to use (bgmm, dbscan, refine, threshold or lineage) string run_poppunk_qc Whether to run the QC step for PopPunk boolean enable_subsetting Enable subsetting workflow based on genome similarity boolean core_similarity Similarity threshold for core genomes number 99.99 accessory_similarity Similarity threshold for accessory genes number 99.0","title":"PopPUNK"},{"location":"params/#gene-order","text":"Parameters for the Gene Order Subworkflow Parameter Description Type Default Required Hidden run_gene_order Whether to run the Gene Order subworkflow boolean gene_order_percent_cutoff Cutoff percentage of genomes a gene should be present within to be included in extraction and subsequent analysis. Should a float between 0 and 1 (e.g., 0.25 means only genes present in a minimum of 25% of genomes are kept). number 0.25 gene_order_label_cols If using annotation files predicting features, list of space separated column names to be added to the gene names string None num_neighbors Neighborhood size to extract. Should be an even number N, such that N/2 neighbors upstream and N/2 neighbors downstream will be analyzed. integer 10 inflation Inflation hyperparameter value for Markov Clustering Algorithm. integer 2 epsilon Epsilon hyperparameter value for DBSCAN clustering. number 0.5 minpts Minpts hyperparameter value for DBSCAN clustering. 
integer 5 plot_clustering Create Clustering HTML Plots boolean","title":"Gene Order"},{"location":"params/#recombination","text":"Parameters for the recombination subworkflow Parameter Description Type Default Required Hidden run_recombination Run Recombination boolean run_verticall Run Verticall recombination tool boolean True run_gubbins Run Gubbins recombination tool boolean","title":"Recombination"},{"location":"params/#dynamics","text":"Parameter Description Type Default Required Hidden run_evolccm Run the community coevolution model boolean run_rspr Run rSPR boolean min_rspr_distance Minimum rSPR distance used to define processing groups integer 10 min_branch_length Minimum rSPR branch length integer 0 max_support_threshold Maximum rSPR support threshold number 0.7 max_approx_rspr Maximum approximate rSPR distance for filtering integer -1 min_heatmap_approx_rspr Minimum approximate rSPR distance used to generate heatmap integer 0 max_heatmap_approx_rspr Maximum approximate rSPR distance used to generate heatmap integer -1 min_heatmap_exact_rspr Minimum exact rSPR distance used to generate heatmap integer 0 max_heatmap_exact_rspr Maximum exact rSPR distance used to generate heatmap integer -1 core_gene_tree Core (or reference) genome tree. Used in the rSPR and evolCCM entries. string concatenated_annotation TSV table of annotations for all genomes. Such as the ones generated by Bakta or Prokka in ARETE. string feature_profile Feature profile TSV (A presence-absence matrix). Used in the evolCCM entry. string","title":"Dynamics"},{"location":"params/#institutional-config-options","text":"Parameters used to describe centralised config profiles. These should not be edited. Parameter Description Type Default Required Hidden custom_config_version Git commit id for Institutional configs. string master True custom_config_base Base directory for Institutional configs. Help If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter. string https://raw.githubusercontent.com/nf-core/configs/master True hostnames Institutional configs hostname. string True config_profile_name Institutional config name. string True config_profile_description Institutional config description. string True config_profile_contact Institutional config contact information. string True config_profile_url Institutional config URL link. string True","title":"Institutional config options"},{"location":"params/#max-job-request-options","text":"Set the top limit for requested resources for any single job. Parameter Description Type Default Required Hidden max_cpus Maximum number of CPUs that can be requested for any single job. Help Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1 integer 16 True max_memory Maximum amount of memory that can be requested for any single job. Help Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB' string 128.GB True max_time Maximum amount of time that can be requested for any single job. Help Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. 
--max_time '2.h' string 240.h True","title":"Max job request options"},{"location":"params/#generic-options","text":"Less common options for the pipeline, typically set in a config file. Parameter Description Type Default Required Hidden help Display help text. boolean True publish_dir_mode Method used to save pipeline results to output directory. Help The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details. string copy True email_on_fail Email address for completion summary, only when pipeline fails. Help An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully. string True plaintext_email Send plain-text email instead of HTML. boolean True max_multiqc_email_size File size limit when attaching MultiQC reports to summary emails. string 25.MB True monochrome_logs Do not use coloured log outputs. boolean True multiqc_config Custom config file to supply to MultiQC. string True tracedir Directory to keep pipeline Nextflow logs and reports. string ${params.outdir}/pipeline_info True validate_params Boolean whether to validate parameters against the schema at runtime boolean True True show_hidden_params Show all params when using --help Help By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help . Specifying this option will tell the pipeline to show all parameters. boolean True enable_conda Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter. boolean True singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. Help This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues. boolean True schema_ignore_params string genomes,modules multiqc_logo string True","title":"Generic options"},{"location":"resource_profiles/","text":"ARETE and dataset size Currently ARETE has four distinct profiles that change the pipeline execution in some ways: the default profile (which we can call small ), the medium profile, the large profile and the light profile. These profiles were developed based on the size and diversity of the input dataset and change some parameter defaults based on tests we have performed on similar-sized datasets. If you want to first gauge the potential diversity of your dataset and have some input assemblies you can try the PopPUNK entry . One of the outputs will provide insight into how many clusters, or lineages, your dataset divides into. The sizes are: For the default or small profile, we expect datasets with 100 samples/assemblies or fewer. It runs on the default pipeline parameters, with no changes. For the medium profile, we expect datasets with >100 and <1000 samples. It increases the default resource requirements for most processes and also uses PPanGGoLiN for pangenome construction, instead of Panaroo . For the large profile, we expect datasets with >1000 samples. It also increases default resource requirements for some processes and uses PPanGGoLiN. Additionally, it enables PopPUNK subsampling , with default parameters . For the light profile, we expect datasets with at most 12 samples. This is a profile primarily designed to run on personal computers and it disables most ARETE processes.
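For reference, selecting one of these profiles is done through -profile , usually combined with a container engine - a sketch, assuming the profile names used on this page: nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ -profile medium,singularity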
This is a profile primarily designed to run on personal computers and it disables most ARETE processes.","title":"Dataset Size"},{"location":"resource_profiles/#arete-and-dataset-size","text":"Currently ARETE has four distinct profiles that change the pipeline execution in some ways: the default profile (which we can call small ), the medium profile, the large profile and the light profile. These profiles were developed based on the size and diversity of the input dataset and change some parameter defaults based on tests we have performed on similar-sized datasets. If you want to first gauge the potential diversity of your dataset and have some input assemblies you can try the PopPUNK entry . One of the outputs will provide insight into how many clusters, or lineages, your dataset divides into. The sizes are: For the default or small profile, we expect datasets with 100 samples/assemblies or fewer. It runs on the default pipeline parameters, with no changes. For the medium profile, we expect datasets with >100 and <1000 samples. It increases the default resource requirements for most processes and also uses PPanGGOLiN for pangenome construction, instead of Panaroo . For the large profile, we expect datasets with >1000 samples. It also increases default resource requirements for some processes and uses PPanGGOLiN. Additionally, it enables PopPUNK subsampling , with default parameters . For the light profile, we expect datasets with at most 12 samples. This is a profile primarily designed to run on personal computers and it disables most ARETE processes.","title":"ARETE and dataset size"},{"location":"subsampling/","text":"PopPUNK subsetting The subsampling subworkflow is executed if you want to reduce the number of genomes that get added to the phylogenomics subworkflow. By reducing the number of genomes, you can potentially reduce resource requirements for the pangenomics and phylogenomics tools. To enable this subworkflow, add --enable_subsetting when running beiko-lab/ARETE. This will subset genomes based on their core genome similarity and accessory genome similarity, as calculated via their PopPUNK distances. By default, the thresholds are --core_similarity 99.9 and --accessory_similarity 99 , but these can be changed by adding these parameters to your command. If any pair of genomes meets these similarity thresholds, only one genome from the pair will be included in the phylogenomics section. All of the removed genome IDs will be present under poppunk_results/removed_genomes.txt . By adding --enable_subsetting , you'll be adding two processes to the execution DAG: POPPUNK_EXTRACT_DISTANCES: This process will extract pair-wise distances between all genomes, returning a table under poppunk_results/distances/ . This table will be used to perform the subsetting. MAKE_HEATMAP: This process will create a heatmap showing different similarity thresholds and the number of genomes that'd be present in each of the possible subsets. It'll also be under poppunk_results/distances/ . Example command The command below will execute the 'annotation' ARETE entry with subsetting enabled, with a core similarity threshold of 99% and an accessory similarity of 95%.
nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --enable_subsetting \\ --core_similarity 99 \\ --accessory_similarity 95 \\ -profile docker \\ -entry annotation Be sure not to include --skip_poppunk in your command, because that will then disable all PopPUNK-related processes, including the subsetting subworkflow.","title":"Subsampling"},{"location":"subsampling/#poppunk-subsetting","text":"The subsampling subworkflow is executed if you want to reduce the number of genomes that get added to the phylogenomics subworkflow. By reducing the number of genomes, you can potentially reduce resource requirements for the pangenomics and phylogenomics tools. To enable this subworkflow, add --enable_subsetting when running beiko-lab/ARETE. This will subset genomes based on their core genome similarity and accessory genome similarity, as calculated via their PopPUNK distances. By default, the thresholds are --core_similarity 99.9 and --accessory_similarity 99 , but these can be changed by adding these parameters to your command. If any pair of genomes meets these similarity thresholds, only one genome from the pair will be included in the phylogenomics section. All of the removed genome IDs will be present under poppunk_results/removed_genomes.txt . By adding --enable_subsetting , you'll be adding two processes to the execution DAG: POPPUNK_EXTRACT_DISTANCES: This process will extract pair-wise distances between all genomes, returning a table under poppunk_results/distances/ . This table will be used to perform the subsetting. MAKE_HEATMAP: This process will create a heatmap showing different similarity thresholds and the number of genomes that'd be present in each of the possible subsets. It'll also be under poppunk_results/distances/ .","title":"PopPUNK subsetting"},{"location":"subsampling/#example-command","text":"The command below will execute the 'annotation' ARETE entry with subsetting enabled, with a core similarity threshold of 99% and an accessory similarity of 95%. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --enable_subsetting \\ --core_similarity 99 \\ --accessory_similarity 95 \\ -profile docker \\ -entry annotation Be sure not to include --skip_poppunk in your command, because that will then disable all PopPUNK-related processes, including the subsetting subworkflow.","title":"Example command"},{"location":"usage/","text":"beiko-lab/ARETE: Usage Introduction The ARETE pipeline is designed as an end-to-end workflow manager for genome assembly, annotation, and phylogenetic analysis, beginning with read data. However, in some cases a user may wish to stop the pipeline prior to annotation or use the annotation features of the workflow with pre-existing assemblies. Therefore, ARETE supports several different use cases: Run the full pipeline end-to-end. Input a set of reads and stop after assembly. Input a set of assemblies and perform QC. Input a set of assemblies and perform annotation and taxonomic analyses. Input a set of assemblies and perform genome clustering with PopPUNK. Input a set of assemblies and perform phylogenomic and pangenomic analysis. This document will describe how to perform each workflow. \"Running the pipeline\" will show some example commands on how to use these different entries to ARETE. Samplesheet input No matter your use case, you will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location.
For full runs and assembly, it has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. --input_sample_table '[path to samplesheet file]' Full workflow or assembly samplesheet The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below. A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where TREATMENT_REP3 has been sequenced twice. sample,fastq_1,fastq_2 CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fastq_1 Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". An example samplesheet has been provided with the pipeline. Annotation only samplesheet The ARETE pipeline allows users to provide pre-existing assemblies to make use of the annotation and reporting features of the workflow. Users may use the assembly_qc entry point to perform QC on the assemblies. Note that the QC workflow does not automatically filter low-quality assemblies; it simply generates QC reports! annotation , assembly_qc and poppunk workflows accept the same format of sample sheet. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fna_file_path Full path to fna file for assembly or genome. File must have .fna file extension. An example samplesheet has been provided with the pipeline. Phylogenomics and Pangenomics only samplesheet The ARETE pipeline allows users to provide pre-existing assemblies to make use of the phylogenomic and pangenomic features of the workflow. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. gff_file_path Full path to GFF file for assembly or genome. File must have .gff or .gff3 file extension. These files can be the ones generated by Prokka or Bakta in ARETE's annotation subworkflow. Reference Genome For full workflow or assembly, users may provide a path to a reference genome in fasta format for use in assembly evaluation. --reference_genome ref.fasta Running the pipeline The typical command for running the pipeline is as follows: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker This will launch the pipeline with the docker configuration profile. See below for more information about profiles.
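If a run is interrupted, it can usually be continued by re-running the same command with the Nextflow -resume flag, which reuses cached results for steps whose inputs are unchanged (a sketch of the same command; see the -resume section below): nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker -resume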
Note that the pipeline will create the following files in your working directory: work # Directory containing the nextflow working files results # Finished results (configurable, see below) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, eg. history of pipeline runs and old logs. As written above, the pipeline also allows users to execute only assembly or only annotation. Assembly Entry To execute assembly (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker Assembly QC Entry To execute QC on pre-existing assemblies (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly_qc --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker Annotation Entry To execute annotation of pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry annotation --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker PopPUNK Entry To execute genome clustering with PopPUNK on pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry poppunk --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker Phylogenomics and Pangenomics Entry To execute phylogenomic and pangenomic analysis on pre-existing assemblies: nextflow run beiko-lab/ARETE -entry phylogenomics --input_sample_table samplesheet.csv -profile docker rSPR Entry To execute the rSPR analysis on pre-existing trees: nextflow run beiko-lab/ARETE \\ -entry rspr \\ --input_sample_table samplesheet.csv \\ --core_gene_tree core_gene_alignment.tre \\ --concatenated_annotation BAKTA.txt \\ -profile docker The parameters are: --core_gene_tree - The reference tree, coming from a core genome alignment, like the one generated by panaroo in ARETE. --concatenated_annotation - The tabular annotation results (TSV) for all genomes, like the ones generated at the end of Prokka or Bakta in ARETE. Although useful, it is not required to run the rSPR entry. --input_sample_table - A samplesheet containing all individual gene trees in the following format: gene_tree,path CDS_0000,/path/to/CDS_0000.tre CDS_0001,/path/to/CDS_0001.tre CDS_0002,/path/to/CDS_0002.tre CDS_0003,/path/to/CDS_0003.tre CDS_0004,/path/to/CDS_0004.tre evolCCM Entry To execute the evolCCM analysis on a pre-existing reference tree and feature profile: nextflow run beiko-lab/ARETE \\ -entry evolccm \\ --core_gene_tree core_gene_alignment.tre \\ --feature_profile feature_profile.tsv.gz \\ -profile docker The parameters are: --core_gene_tree - The reference tree, coming from a core genome alignment, like the one generated by panaroo in ARETE. --feature_profile - A presence/absence TSV matrix of features in genomes. Genome names should match those in the core tree and should be contained in a 'genome_id' column, with all other columns representing features absent (0) or present (1) in each genome.
I.e.: genome_id plasmid_AA155 plasmid_AA161 ED010 0 0 ED017 0 1 ED040 0 0 ED073 0 1 ED075 1 1 ED082 0 1 ED142 0 1 ED178 0 1 ED180 0 0 Recombination Entry To execute the recombination analysis on pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE \\ -entry recombination \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ -profile docker Gene Order Entry To execute the Gene Order analysis on pre-existing assemblies and RGI annotations: nextflow run beiko-lab/ARETE \\ -entry gene_order \\ --input_sample_table gene_order_samplesheet.csv \\ -profile docker --input_sample_table - A samplesheet containing a fasta file, a GenBank file and an RGI output file for each assembly: sample,fna_file_path,gbk,rgi SAMD00052607,SAMD00052607.faa,SAMD00052607.gbk,SAMD00052607_rgi.txt SAMEA1466699,SAMEA1466699.faa,SAMEA1466699.gbk,SAMEA1466699_rgi.txt SAMEA1486355,SAMEA1486355.faa,SAMEA1486355.gbk,SAMEA1486355_rgi.txt Updating the pipeline When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, regularly update the cached version: nextflow pull beiko-lab/ARETE Reproducibility It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software is used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the ARETE releases page and find the latest version number - numeric only (eg. 1.3.1 ). Then specify this when running the pipeline with -r (one hyphen) - eg. -r 1.3.1 . This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. Core Nextflow arguments NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen). -profile Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud) - see below. We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation . Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles. If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH . This is not recommended.
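For example, a quick trial run using Docker containers can be launched by combining two profiles (a sketch; the same combination is used in the Testing section of the documentation): nextflow run beiko-lab/ARETE -profile test,docker The generic profiles bundled with the pipeline are: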
docker A generic configuration profile to be used with Docker singularity A generic configuration profile to be used with Singularity podman A generic configuration profile to be used with Podman shifter A generic configuration profile to be used with Shifter charliecloud A generic configuration profile to be used with Charliecloud test A profile with a complete configuration for automated testing Can run on personal computers with at least 6GB of RAM and 2 CPUs Includes links to test data so needs no other parameters -resume Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: -resume [run-name] . Use the nextflow log command to show previous run names. -c Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information. Custom resource requests Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of 143 (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three attempts then the pipeline is stopped. Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process UNICYCLER 32GB of memory, you could use the following config: process { withName: UNICYCLER { memory = 32.GB } } To find the exact name of a process whose compute resources you wish to modify, check the live-status of a nextflow run displayed on your terminal or check the nextflow error for a line like so: Error executing process > 'bwa' . In this case the name to specify in the custom config file is bwa . See the main Nextflow documentation for more information. Running in the background Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished. The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file. Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs). Nextflow memory requirements In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile ): NXF_OPTS='-Xms1g -Xmx4g' Sometimes LevelDB, which is used by Nextflow to track execution metadata, can lead to memory-related issues, often showing as a SIGBUS error. This tends to happen when running Nextflow in SLURM environments . In this case, setting NXF_OPTS=\"-Dleveldb.mmap=false\" in your ~/.bashrc or immediately before executing nextflow run usually solves the issue. ARETE's storage requirements ARETE generates a lot of intermediate files, a problem that is exacerbated if you are running on a dataset with more than 100 genomes.
Before running ARETE you should make sure you have at least 500 GB of free storage. After running ARETE and checking your results, you can remove the work/ directory in your working directory, which is where Nextflow stores its cache. Be aware that deleting work/ means your pipeline won't be able to use cached results with the -resume flag; every process will run from scratch.","title":"Usage"},{"location":"usage/#beiko-labarete-usage","text":"","title":"beiko-lab/ARETE: Usage"},{"location":"usage/#introduction","text":"The ARETE pipeline is designed as an end-to-end workflow manager for genome assembly, annotation, and phylogenetic analysis, beginning with read data. However, in some cases a user may wish to stop the pipeline prior to annotation or use the annotation features of the workflow with pre-existing assemblies. Therefore, ARETE supports several different use cases: Run the full pipeline end-to-end. Input a set of reads and stop after assembly. Input a set of assemblies and perform QC. Input a set of assemblies and perform annotation and taxonomic analyses. Input a set of assemblies and perform genome clustering with PopPUNK. Input a set of assemblies and perform phylogenomic and pangenomic analysis. This document will describe how to perform each workflow. \"Running the pipeline\" will show some example commands on how to use these different entries to ARETE.","title":"Introduction"},{"location":"usage/#samplesheet-input","text":"No matter your use case, you will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. For full runs and assembly, it has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. --input_sample_table '[path to samplesheet file]'","title":"Samplesheet input"},{"location":"usage/#full-workflow-or-assembly-samplesheet","text":"The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below. A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where TREATMENT_REP3 has been sequenced twice. sample,fastq_1,fastq_2 CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fastq_1 Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\".
An example samplesheet has been provided with the pipeline.","title":"Full workflow or assembly samplesheet"},{"location":"usage/#annotation-only-samplesheet","text":"The ARETE pipeline allows users to provide pre-existing assemblies to make use of the annotation and reporting features of the workflow. Users may use the assembly_qc entry point to perform QC on the assemblies. Note that the QC workflow does not automatically filter low-quality assemblies; it simply generates QC reports! annotation , assembly_qc and poppunk workflows accept the same format of sample sheet. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fna_file_path Full path to fna file for assembly or genome. File must have .fna file extension. An example samplesheet has been provided with the pipeline.","title":"Annotation only samplesheet"},{"location":"usage/#phylogenomics-and-pangenomics-only-samplesheet","text":"The ARETE pipeline allows users to provide pre-existing assemblies to make use of the phylogenomic and pangenomic features of the workflow. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. gff_file_path Full path to GFF file for assembly or genome. File must have .gff or .gff3 file extension. These files can be the ones generated by Prokka or Bakta in ARETE's annotation subworkflow.","title":"Phylogenomics and Pangenomics only samplesheet"},{"location":"usage/#reference-genome","text":"For full workflow or assembly, users may provide a path to a reference genome in fasta format for use in assembly evaluation. --reference_genome ref.fasta","title":"Reference Genome"},{"location":"usage/#running-the-pipeline","text":"The typical command for running the pipeline is as follows: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker This will launch the pipeline with the docker configuration profile. See below for more information about profiles. Note that the pipeline will create the following files in your working directory: work # Directory containing the nextflow working files results # Finished results (configurable, see below) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, eg. history of pipeline runs and old logs.
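The location of the results directory can be changed with the --outdir parameter (a sketch; --outdir is assumed here from the nf-core convention reflected in the tracedir default ${params.outdir}/pipeline_info in the parameter docs): nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm --outdir my_results -profile docker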
As written above, the pipeline also allows users to execute only assembly or only annotation.","title":"Running the pipeline"},{"location":"usage/#assembly-entry","text":"To execute assembly (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker","title":"Assembly Entry"},{"location":"usage/#assembly-qc-entry","text":"To execute QC on pre-existing assemblies (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly_qc --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker","title":"Assembly QC Entry"},{"location":"usage/#annotation-entry","text":"To execute annotation of pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry annotation --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker","title":"Annotation Entry"},{"location":"usage/#poppunk-entry","text":"To execute genome clustering with PopPUNK on pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry poppunk --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker","title":"PopPUNK Entry"},{"location":"usage/#phylogenomics-and-pangenomics-entry","text":"To execute phylogenomic and pangenomic analysis on pre-existing assemblies: nextflow run beiko-lab/ARETE -entry phylogenomics --input_sample_table samplesheet.csv -profile docker","title":"Phylogenomics and Pangenomics Entry"},{"location":"usage/#rspr-entry","text":"To execute the rSPR analysis on pre-existing trees: nextflow run beiko-lab/ARETE \\ -entry rspr \\ --input_sample_table samplesheet.csv \\ --core_gene_tree core_gene_alignment.tre \\ --concatenated_annotation BAKTA.txt \\ -profile docker The parameters are: --core_gene_tree - The reference tree, coming from a core genome alignment, like the one generated by panaroo in ARETE. --concatenated_annotation - The tabular annotation results (TSV) for all genomes, like the ones generated at the end of Prokka or Bakta in ARETE. Although useful, it is not required to run the rSPR entry. --input_sample_table - A samplesheet containing all individual gene trees in the following format: gene_tree,path CDS_0000,/path/to/CDS_0000.tre CDS_0001,/path/to/CDS_0001.tre CDS_0002,/path/to/CDS_0002.tre CDS_0003,/path/to/CDS_0003.tre CDS_0004,/path/to/CDS_0004.tre","title":"rSPR Entry"},{"location":"usage/#evolccm-entry","text":"To execute the evolCCM analysis on a pre-existing reference tree and feature profile: nextflow run beiko-lab/ARETE \\ -entry evolccm \\ --core_gene_tree core_gene_alignment.tre \\ --feature_profile feature_profile.tsv.gz \\ -profile docker The parameters are: --core_gene_tree - The reference tree, coming from a core genome alignment, like the one generated by panaroo in ARETE. --feature_profile - A presence/absence TSV matrix of features in genomes. Genome names should match those in the core tree and should be contained in a 'genome_id' column, with all other columns representing features absent (0) or present (1) in each genome.
I.e.: genome_id plasmid_AA155 plasmid_AA161 ED010 0 0 ED017 0 1 ED040 0 0 ED073 0 1 ED075 1 1 ED082 0 1 ED142 0 1 ED178 0 1 ED180 0 0","title":"evolCCM Entry"},{"location":"usage/#recombination-entry","text":"To execute the recombination analysis on pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE \\ -entry recombination \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ -profile docker","title":"Recombination Entry"},{"location":"usage/#gene-order-entry","text":"To execute the Gene Order analysis on pre-existing assemblies and RGI annotations: nextflow run beiko-lab/ARETE \\ -entry gene_order \\ --input_sample_table gene_order_samplesheet.csv \\ -profile docker --input_sample_table - A samplesheet containing a fasta file, a GenBank file and an RGI output file for each assembly: sample,fna_file_path,gbk,rgi SAMD00052607,SAMD00052607.faa,SAMD00052607.gbk,SAMD00052607_rgi.txt SAMEA1466699,SAMEA1466699.faa,SAMEA1466699.gbk,SAMEA1466699_rgi.txt SAMEA1486355,SAMEA1486355.faa,SAMEA1486355.gbk,SAMEA1486355_rgi.txt","title":"Gene Order Entry"},{"location":"usage/#updating-the-pipeline","text":"When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, regularly update the cached version: nextflow pull beiko-lab/ARETE","title":"Updating the pipeline"},{"location":"usage/#reproducibility","text":"It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software is used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the ARETE releases page and find the latest version number - numeric only (eg. 1.3.1 ). Then specify this when running the pipeline with -r (one hyphen) - eg. -r 1.3.1 . This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.","title":"Reproducibility"},{"location":"usage/#core-nextflow-arguments","text":"NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).","title":"Core Nextflow arguments"},{"location":"usage/#-profile","text":"Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud) - see below. We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation . Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important!
They are loaded in sequence, so later profiles can overwrite earlier profiles. If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH . This is not recommended. docker A generic configuration profile to be used with Docker singularity A generic configuration profile to be used with Singularity podman A generic configuration profile to be used with Podman shifter A generic configuration profile to be used with Shifter charliecloud A generic configuration profile to be used with Charliecloud test A profile with a complete configuration for automated testing Can run on personal computers with at least 6GB of RAM and 2 CPUs Includes links to test data so needs no other parameters","title":"-profile"},{"location":"usage/#-resume","text":"Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: -resume [run-name] . Use the nextflow log command to show previous run names.","title":"-resume"},{"location":"usage/#-c","text":"Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.","title":"-c"},{"location":"usage/#custom-resource-requests","text":"Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of 143 (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three attempts then the pipeline is stopped. Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process UNICYCLER 32GB of memory, you could use the following config: process { withName: UNICYCLER { memory = 32.GB } } To find the exact name of a process whose compute resources you wish to modify, check the live-status of a nextflow run displayed on your terminal or check the nextflow error for a line like so: Error executing process > 'bwa' . In this case the name to specify in the custom config file is bwa . See the main Nextflow documentation for more information.","title":"Custom resource requests"},{"location":"usage/#running-in-the-background","text":"Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished. The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file. Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).","title":"Running in the background"},{"location":"usage/#nextflow-memory-requirements","text":"In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile ): NXF_OPTS='-Xms1g -Xmx4g' Sometimes LevelDB, which is used by Nextflow to track execution metadata, can lead to memory-related issues, often showing as a SIGBUS error. This tends to happen when running Nextflow in SLURM environments . In this case, setting NXF_OPTS=\"-Dleveldb.mmap=false\" in your ~/.bashrc or immediately before executing nextflow run usually solves the issue.","title":"Nextflow memory requirements"},{"location":"usage/#aretes-storage-requirements","text":"ARETE generates a lot of intermediate files, a problem that is exacerbated if you are running on a dataset with more than 100 genomes. Before running ARETE you should make sure you have at least 500 GB of free storage. After running ARETE and checking your results, you can remove the work/ directory in your working directory, which is where Nextflow stores its cache. Be aware that deleting work/ means your pipeline won't be able to use cached results with the -resume flag; every process will run from scratch.","title":"ARETE's storage requirements"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"Check out the full ARETE documentation for more information What is ARETE? ARETE (Antimicrobial Resistance: Emergence, Transmission, and Ecology) is a bioinformatics best-practice analysis pipeline for profiling the genomic repertoire and evolutionary dynamics of microorganisms with a particular focus on pathogens. We use ARETE to identify important genes (e.g., those that confer antimicrobial resistance or contribute to virulence) and mobile genetic elements such as plasmids and genomic islands, and infer important routes by which these are transmitted using evidence from recombination, cosegregation, coevolution, and phylogenetic tree comparisons. ARETE produces a range of useful outputs (see outputs ), including those generated by each tool integrated into the pipeline, as well as summaries across the entire dataset such as phylogenetic profiles. Outputs from ARETE can also be easily fed into packages such as Coeus and MicroReact for further analyses. Although ARETE was primarily developed with pathogens in mind, inference of pan-genomes, mobilomes, and phylogenomic histories can be performed for any set of microbial genomes, with the proviso that reference databases are much more complete for some taxonomic groups than others. In general, the tools in ARETE work best at the species and genus level of relatedness. A key design feature of ARETE is the versatility to find the right blend of software packages and parameter settings that best handle datasets of different sizes, introducing heuristics and swapping out tools as necessary. ARETE has been benchmarked on datasets from fewer than ten to over 10,000 genomes from a diversity of species and genera including Enterococcus faecium , Escherichia coli , Listeria , and Salmonella . Another key feature is enabling the user to run specific subsets of the pipeline; a user may already have assembled genomes, or they may not care about, say, recombination detection. There are also cases where it might be necessary to manually review the outputs from a particular step before moving on to the next one; ARETE makes this manual QC easy to do. Table of Contents About the pipeline Quick start A couple of examples Credits Contributing to ARETE Citing ARETE About the pipeline The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible. Like other workflow languages it provides useful features like -resume to only rerun tasks that haven't already been completed (e.g., allowing editing of inputs/tasks and recovery from crashes without a full re-run). The nf-core project provided the overall project template, pre-written software modules when available, and general best-practice recommendations. ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow organization provides suitable entry and exit points for users who want to run only a portion of the full pipeline. Genome subsetting The user can optionally subdivide their set of genomes into related lineages identified by PopPUNK ( See documentation ). PopPUNK quickly assigns genomes to 'lineages' based on core and accessory genome identity.
If this option is selected, all genomes will still be annotated, but cross-genome comparisons (e.g., pan-genome inference and phylogenomics) will use only a single representative genome from each lineage. The user can run PopPUNK with a spread of different thresholds and decide how to proceed based on the number of lineages produced and their own specific knowledge of the genetic population structure of the taxon being analyzed. Short-read processing and assembly Raw Read QC ( FastQC ) Read Trimming ( fastp ) Trimmed Read QC ( FastQC ) Taxonomic Profiling ( kraken2 ) Unicycler ( unicycler ) QUAST QC ( quast ) CheckM QC ( checkm ) Annotation Genome annotation with Bakta ( bakta ) or Prokka ( prokka ) Feature prediction: AMR genes with the Resistance Gene Identifier ( RGI ) Plasmids with MOB-Suite ( mob_suite ) Genomic Islands with IslandPath ( IslandPath ) Phages with PhiSpy ( PhiSpy ) ( optionally ) Integrons with IntegronFinder Specialized databases: CAZY, VFDB, BacMet and ICEberg2 using DIAMOND homology search ( diamond ) Phylogenomics ( optionally ) Genome subsetting with PopPUNK ( See documentation ) Pan-genome inference using PPanGGOLiN ( PPanGGOLiN ) or Panaroo ( panaroo ) Reference and gene tree inference using FastTree ( fasttree ) or IQTree ( iqtree ) ( optionally ) SNP-sites ( SNPsites ) Recombination detection ( optionally ) Recombination detection is performed within lineages identified by PopPUNK ( poppunk ). Note that this application of PopPUNK is different from the subsetting described above. Genome alignment using SKA2 ( ska2 ) Recombination detection using Verticall ( verticall ) and/or Gubbins ( gubbins ) Coevolution ( optionally ) Identification of coordinated gain and loss of features using EvolCCM ( EvolCCM ) Lateral gene transfer ( optionally ) Phylogenetic inference of LGT using rSPR ( rSPR ) Gene order ( optionally ) Comparison of genomic neighbourhoods using the Gene Order Workflow ( Gene Order Workflow ) See our roadmap for a full list of future development targets. Quick Start Install nextflow Install Docker or Singularity . Also ensure you have a working curl installed (should be present on almost all systems). 2.1. Note: this workflow should also support Podman , Shifter or Charliecloud execution for full pipeline reproducibility. Configure mail on your system to send an email on workflow success/failure (without this you may get a small error at the end Failed to invoke workflow.onComplete event handler but this doesn't mean the workflow didn't finish successfully). Download the pipeline and test with a stub-run . The stub-run will ensure that the pipeline is able to download and use containers as well as execute with the proper logic. nextflow run beiko-lab/ARETE -profile test,<docker/singularity> -stub-run 3.1. Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment. 3.2. If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead.
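For example (a sketch; combine the flag with your usual input options): nextflow run beiko-lab/ARETE -profile singularity --singularity_pull_docker_container --input_sample_table samplesheet.csv --poppunk_model bgmm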
In case of input datasets larger than 100 samples, check our resource profiles documentation for optimal usage. Start running your own analysis (ideally using -profile docker or -profile singularity for stability)! nextflow run beiko-lab/ARETE \\ -profile <docker/singularity> \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm samplesheet.csv must be formatted sample,fastq_1,fastq_2 , with the first column being sample names and the other two corresponding to compressed FASTQ files. Note : If you get this error at the end Failed to invoke `workflow.onComplete` event handler it isn't a problem, it just means you don't have a sendmail configured and it can't send an email report saying it finished correctly i.e., it's not that the workflow failed. See usage docs for all of the available options when running the pipeline. See the parameter docs for a list of all parameters currently implemented in the pipeline and which ones are required. See the FAQ for a list of frequently asked questions and common issues. Testing To test the workflow on a minimal dataset you can use the test configuration (with either docker or singularity - replace docker below as appropriate): nextflow run beiko-lab/ARETE -profile test,docker To accelerate it you can download/cache the database files to a folder (e.g., test/db_cache ) and provide a database cache parameter. nextflow run beiko-lab/ARETE \\ -profile test,docker \\ --db_cache $PWD/test/db_cache \\ --bakta_db $PWD/baktadb/db-light We also provide a larger test dataset, under -profile test_full , for use in ARETE's annotation entry. This dataset comprises 8 bacterial genomes. As a note, this can take upwards of 20 minutes to complete on an average personal computer . Replace docker below as appropriate. nextflow run beiko-lab/ARETE -entry annotation -profile test_full,docker Examples The fine details of how to run ARETE are described in the command reference and documentation, but here are a couple of illustrative examples of how runs can be adjusted to accommodate genome sets of different sizes: Assembly, annotation, and pan-genome inference from a modestly sized dataset (50 or so genomes) from paired-end reads nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,vfdb,bacmet,islandpath,phispy,report' \\ --poppunk_model bgmm \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --annotation_tools - Select the annotation tools and modules to be executed (See the parameter documentation for defaults) --poppunk_model - Model to be used by PopPUNK -profile docker - Run tools in docker containers. Annotation to evolutionary dynamics on 300-ish genomes nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ --run_gubbins \\ -entry annotation \\ -profile medium,docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --run_gubbins - Run Gubbins as part of the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile medium,docker - Run tools in docker containers. For -profile medium , check our resource requirements documentation .
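If a single process in a run like this still exceeds its resource limits, a custom config can be supplied with the core Nextflow -c option (a sketch; custom.config is a hypothetical file written as described in the Custom resource requests section of the usage docs): nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --poppunk_model dbscan --run_recombination -entry annotation -profile medium,docker -c custom.config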
Annotation to evolutionary dynamics on 10,000 genomes nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ -entry annotation \\ -profile large,docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. --enable_subsetting - Enable subsetting workflow based on genome similarity (See subsetting documentation ) -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile large,docker - Run tools in docker containers. For -profile large , check our resource requirements documentation . Annotation on a tiny dataset (4-12 genomes) on a personal computer While ARETE is primarily designed to run on HPC clusters, we have implemented a simple, bare-bones version that is able to run on most modern computers and laptops, with at most 6 CPU cores and a minimum of 8GB of memory. Keep in mind this will make it impossible to run most tools included in ARETE, but it should still provide a useful testing ground. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm \\ -entry annotation \\ -profile light,docker Note the addition of the light profile, which is the configuration for running on personal computers. Check out how to assign resource requests for even more customization. Run all ARETE subworkflows on a small dataset The command below will run all tools included in the annotation subworkflow and will enable the recombination, gene order, rSPR and evolCCM subworkflows. Be aware that the performance of the evolCCM and Gene Order subworkflows with large or very diverse datasets can be subpar. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,integronfinder,report' \\ --run_recombination \\ --run_evolccm \\ --run_rspr \\ --run_gene_order \\ --poppunk_model dbscan \\ -profile docker Credits The ARETE software was originally written and developed by Finlay Maguire and Alex Manuele , and is currently developed by Jo\u00e3o Cavalcante . Rob Beiko is the PI of the ARETE project. The project Co-PI is Fiona Brinkman. Other project leads include Andrew MacArthur, Cedric Chauve, Chris Whidden, Gary van Domselaar, John Nash, Rahat Zaheer, and Tim McAllister. Many students, postdocs, developers, and staff scientists have made invaluable contributions to the design and application of ARETE and its components, including Haley Sanderson, Kristen Gray, Julia Lewandowski, Chaoyue Liu, Kartik Kakadiya, Bryan Alcock, Amos Raphenya, Amjad Khan, Ryan Fink, Aniket Mane, Chandana Navanekere Rudrappa, Kyrylo Bessonov, James Robertson, Jee In Kim, and Nolan Woods. ARETE development has been supported by many sources, including Genome Canada, ResearchNS, Genome Atlantic, Genome British Columbia, The Canadian Institutes for Health Research, The Natural Sciences and Engineering Research Council of Canada, and Dalhousie University's Faculty of Computer Science. We have received tremendous support from federal agencies, most notably the Public Health Agency of Canada and Agriculture / Agri-Food Canada. Contributing to ARETE If you would like to contribute to ARETE, please see the contributing guidelines .
Citing ARETE Please cite the tools used in your ARETE run: A comprehensive list can be found in the CITATIONS.md file. An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, and reused here under the MIT license . The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.","title":"Home"},{"location":"#what-is-arete","text":"ARETE (Antimicrobial Resistance: Emergence, Transmission, and Ecology) is a bioinformatics best-practice analysis pipeline for profiling the genomic repertoire and evolutionary dynamics of microorganisms with a particular focus on pathogens. We use ARETE to identify important genes (e.g., those that confer antimicrobial resistance or contribute to virulence) and mobile genetic elements such as plasmids and genomic islands, and infer important routes by which these are transmitted using evidence from recombination, cosegregation, coevolution, and phylogenetic tree comparisons. ARETE produces a range of useful outputs (see outputs ), including those generated by each tool integrated into the pipeline, as well as summaries across the entire dataset such as phylogenetic profiles. Outputs from ARETE can also be easily fed into packages such as Coeus and MicroReact for further analyses. Although ARETE was primarily developed with pathogens in mind, inference of pan-genomes, mobilomes, and phylogenomic histories can be performed for any set of microbial genomes, with the proviso that reference databases are much more complete for some taxonomic groups than others. In general, the tools in ARETE work best at the species and genus level of relatedness. A key design feature of ARETE is the versatility to find the right blend of software packages and parameter settings that best handle datasets of different sizes, introducing heuristics and swapping out tools as necessary. ARETE has been benchmarked on datasets from fewer than ten to over 10,000 genomes from a diversity of species and genera including Enterococcus faecium , Escherichia coli , Listeria , and Salmonella . Another key feature is enabling the user to run specific subsets of the pipeline; a user may already have assembled genomes, or they may not care about, say, recombination detection.
There are also cases where it might be necessary to manually review the outputs from a particular step before moving on to the next one; ARETE makes this manual QC easy to do.","title":"What is ARETE?"},{"location":"#table-of-contents","text":"About the pipeline Quick start A couple of examples Credits Contributing to ARETE Citing ARETE","title":"Table of Contents"},{"location":"#about-the-pipeline","text":"The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible. Like other workflow languages it provides useful features like -resume to only rerun tasks that haven't already been completed (e.g., allowing editing of inputs/tasks and recovery from crashes without a full re-run). The nf-core project provided the overall project template, pre-written software modules when available, and general best-practice recommendations. ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow organization provides suitable entry and exit points for users who want to run only a portion of the full pipeline.","title":"About the pipeline"},{"location":"#genome-subsetting","text":"The user can optionally subdivide their set of genomes into related lineages identified by PopPUNK ( See documentation ). PopPUNK quickly assigns genomes to 'lineages' based on core and accessory genome identity. If this option is selected, all genomes will still be annotated, but cross-genome comparisons (e.g., pan-genome inference and phylogenomics) will use only a single representative genome from each lineage. The user can run PopPUNK with a spread of different thresholds and decide how to proceed based on the number of lineages produced and their own specific knowledge of the genetic population structure of the taxon being analyzed.","title":"Genome subsetting"},{"location":"#short-read-processing-and-assembly","text":"Raw Read QC ( FastQC ) Read Trimming ( fastp ) Trimmed Read QC ( FastQC ) Taxonomic Profiling ( kraken2 ) Unicycler ( unicycler ) QUAST QC ( quast ) CheckM QC ( checkm )","title":"Short-read processing and assembly"},{"location":"#annotation","text":"Genome annotation with Bakta ( bakta ) or Prokka ( prokka ) Feature prediction: AMR genes with the Resistance Gene Identifier ( RGI ) Plasmids with MOB-Suite ( mob_suite ) Genomic Islands with IslandPath ( IslandPath ) Phages with PhiSpy ( PhiSpy ) ( optionally ) Integrons with IntegronFinder Specialized databases: CAZY, VFDB, BacMet and ICEberg2 using DIAMOND homology search ( diamond )","title":"Annotation"},{"location":"#phylogenomics","text":"( optionally ) Genome subsetting with PopPUNK ( See documentation ) Pan-genome inference using PPanGGOLiN ( PPanGGOLiN ) or Panaroo ( panaroo ) Reference and gene tree inference using FastTree ( fasttree ) or IQTree ( iqtree ) ( optionally ) SNP-sites ( SNPsites )","title":"Phylogenomics"},{"location":"#recombination-detection-optionally","text":"Recombination detection is performed within lineages identified by PopPUNK ( poppunk ). Note that this application of PopPUNK is different from the subsetting described above.
Genome alignment using SKA2 ( ska2 ) Recombination detection using Verticall ( verticall ) and/or Gubbins ( gubbins )","title":"Recombination detection (optionally)"},{"location":"#coevolution","text":"( optionally ) Identification of coordinated gain and loss of features using EvolCCM ( EvolCCM )","title":"Coevolution"},{"location":"#lateral-gene-transfer","text":"( optionally ) Phylogenetic inference of LGT using rSPR ( rSPR )","title":"Lateral gene transfer"},{"location":"#gene-order","text":"( optionally ) Comparison of genomic neighbourhoods using the Gene Order Workflow ( Gene Order Workflow ) See our roadmap for a full list of future development targets.","title":"Gene order"},{"location":"#quick-start","text":"Install nextflow Install Docker or Singularity . Also ensure you have a working curl installed (should be present on almost all systems). 2.1. Note: this workflow should also support Podman , Shifter or Charliecloud execution for full pipeline reproducibility. Configure mail on your system to send an email on workflow success/failure (without this you may get a small error at the end Failed to invoke workflow.onComplete event handler but this doesn't mean the workflow didn't finish successfully). Download the pipeline and test with a stub-run . The stub-run will ensure that the pipeline is able to download and use containers as well as execute the proper logic. nextflow run beiko-lab/ARETE -profile test, -stub-run 3.1. Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment. 3.2. If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. In case of input datasets larger than 100 samples, check our resource profiles documentation for optimal usage. Start running your own analysis (ideally using -profile docker or -profile singularity for stability)! nextflow run beiko-lab/ARETE \ -profile \ --input_sample_table samplesheet.csv \ --poppunk_model bgmm samplesheet.csv must be formatted sample,fastq_1,fastq_2 , with the first column being sample names and the other two corresponding to compressed FASTQ files. Note : If you get this error at the end Failed to invoke `workflow.onComplete` event handler it isn't a problem; it just means you don't have sendmail configured and Nextflow can't send an email report saying the run finished correctly, i.e., it's not that the workflow failed. See usage docs for all of the available options when running the pipeline. See the parameter docs for a list of all parameters currently implemented in the pipeline and which ones are required.
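For reference, a minimal samplesheet.csv in the format described above might look like the sketch below (sample names and read paths are illustrative placeholders, not files shipped with ARETE):
sample,fastq_1,fastq_2
isolate_01,/data/reads/isolate_01_R1.fastq.gz,/data/reads/isolate_01_R2.fastq.gz
isolate_02,/data/reads/isolate_02_R1.fastq.gz,/data/reads/isolate_02_R2.fastq.gz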
See the FAQ for a list of frequently asked questions and common issues.","title":"Quick Start"},{"location":"#testing","text":"To test the workflow on a minimal dataset, you can use the test configuration (with either docker or singularity - replace docker below as appropriate): nextflow run beiko-lab/ARETE -profile test,docker To accelerate it you can download/cache the database files to a folder (e.g., test/db_cache ) and provide a database cache parameter. nextflow run beiko-lab/ARETE \ -profile test,docker \ --db_cache $PWD/test/db_cache \ --bakta_db $PWD/baktadb/db-light We also provide a larger test dataset, under -profile test_full , for use in ARETE's annotation entry. This dataset comprises 8 bacterial genomes. As a note, this can take upwards of 20 minutes to complete on an average personal computer . Replace docker below as appropriate. nextflow run beiko-lab/ARETE -entry annotation -profile test_full,docker","title":"Testing"},{"location":"#examples","text":"The fine details of how to run ARETE are described in the command reference and documentation, but here are a couple of illustrative examples of how runs can be adjusted to accommodate genome sets of different sizes:","title":"Examples"},{"location":"#assembly-annotation-and-pan-genome-inference-from-a-modestly-sized-dataset-50-or-so-genomes-from-paired-end-reads","text":"nextflow run beiko-lab/ARETE \ --input_sample_table samplesheet.csv \ --annotation_tools 'mobsuite,rgi,vfdb,bacmet,islandpath,phispy,report' \ --poppunk_model bgmm \ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --annotation_tools - Select the annotation tools and modules to be executed (See the parameter documentation for defaults) --poppunk_model - Model to be used by PopPUNK -profile docker - Run tools in docker containers.","title":"Assembly, annotation, and pan-genome inference from a modestly sized dataset (50 or so genomes) from paired-end reads"},{"location":"#annotation-to-evolutionary-dynamics-on-300-ish-genomes","text":"nextflow run beiko-lab/ARETE \ --input_sample_table samplesheet.csv \ --poppunk_model dbscan \ --run_recombination \ --run_gubbins \ -entry annotation \ -profile medium,docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --run_gubbins - Run Gubbins as part of the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. -entry annotation - Run the annotation subworkflow and further steps (See usage ). -profile medium,docker - Run tools in docker containers. For -profile medium , check our resource requirements documentation .","title":"Annotation to evolutionary dynamics on 300-ish genomes"},{"location":"#annotation-to-evolutionary-dynamics-on-10000-genomes","text":"nextflow run beiko-lab/ARETE \ --input_sample_table samplesheet.csv \ --poppunk_model dbscan \ --run_recombination \ -entry annotation \ -profile large,docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets.
--enable_subsetting - Enable subsetting workflow based on genome similarity (See subsetting documentation ) -entry annotation - Run the annotation subworkflow and further steps (See usage ). -profile large,docker - Run tools in docker containers. For -profile large , check our resource requirements documentation .","title":"Annotation to evolutionary dynamics on 10,000 genomes"},{"location":"#annotation-on-a-tiny-dataset-4-12-genomes-in-a-personal-computer","text":"While ARETE is primarily designed to run on HPC clusters, we have implemented a simple, bare-bones version that is able to run on most modern computers and laptops, requiring at most 6 CPU cores and a minimum of 8 GB of memory. Keep in mind this will make it impossible to run most tools included in ARETE, but it should still provide a useful testing ground. nextflow run beiko-lab/ARETE \ --input_sample_table samplesheet.csv \ --poppunk_model bgmm \ -entry annotation \ -profile light,docker Note the addition of the light profile, which is the configuration for running on personal computers. Check out how to assign resource requests for even more customization.","title":"Annotation on a tiny dataset (4-12 genomes) on a personal computer"},{"location":"#run-all-arete-subworkflows-in-a-small-dataset","text":"The command below will run all tools included in the annotation subworkflow and will enable the recombination, gene order, rSPR and EvolCCM subworkflows. Be aware that the performance of the EvolCCM and Gene Order subworkflows with large or very diverse datasets can be subpar. nextflow run beiko-lab/ARETE \ --input_sample_table samplesheet.csv \ --annotation_tools 'mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,integronfinder,report' \ --run_recombination \ --run_evolccm \ --run_rspr \ --run_gene_order \ --poppunk_model dbscan \ -profile docker","title":"Run all ARETE subworkflows on a small dataset"},{"location":"#credits","text":"The ARETE software was originally written and developed by Finlay Maguire and Alex Manuele , and is currently developed by Jo\u00e3o Cavalcante . Rob Beiko is the PI of the ARETE project. The project Co-PI is Fiona Brinkman. Other project leads include Andrew MacArthur, Cedric Chauve, Chris Whidden, Gary van Domselaar, John Nash, Rahat Zaheer, and Tim McAllister. Many students, postdocs, developers, and staff scientists have made invaluable contributions to the design and application of ARETE and its components, including Haley Sanderson, Kristen Gray, Julia Lewandowski, Chaoyue Liu, Kartik Kakadiya, Bryan Alcock, Amos Raphenya, Amjad Khan, Ryan Fink, Aniket Mane, Chandana Navanekere Rudrappa, Kyrylo Bessonov, James Robertson, Jee In Kim, and Nolan Woods. ARETE development has been supported by many sources, including Genome Canada, ResearchNS, Genome Atlantic, Genome British Columbia, the Canadian Institutes of Health Research, the Natural Sciences and Engineering Research Council of Canada, and Dalhousie University's Faculty of Computer Science. We have received tremendous support from federal agencies, most notably the Public Health Agency of Canada and Agriculture / Agri-Food Canada.","title":"Credits"},{"location":"#contributing-to-arete","text":"If you would like to contribute to ARETE, please see the contributing guidelines .","title":"Contributing to ARETE"},{"location":"#citing-arete","text":"Please cite the tools used in your ARETE run; a comprehensive list can be found in the CITATIONS.md file.
An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, and reused here under the MIT license . The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.","title":"Citing ARETE"},{"location":"CITATIONS/","text":"beiko-lab/ARETE: Citations An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. nf-core Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031. Nextflow Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311. Pipeline tools CheckM Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes . Genome Research, 25: 1043\u20131055.
\u201cMOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.\u201d Microbial genomics vol. 4,8 (2018): e000206. doi:10.1099/mgen.0.000206 Robertson, James et al. \u201cUniversal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance.\u201d Microbial genomics vol. 6,10 (2020): mgen000435. doi:10.1099/mgen.0.000435 MultiQC Ewels P, Magnusson M, Lundin S, K\u00e4ller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. Bakta Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685 Prokka Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. doi: 10.1093/bioinformatics/btu153. Epub 2014 Mar 18. PMID: 24642063. QUAST Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806. RGI Alcock et al. 2020. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, Volume 48, Issue D1, Pages D517-525 [PMID 31665441] IntegronFinder N\u00e9ron, Bertrand, Eloi Littner, Matthieu Haudiquet, Amandine Perrin, Jean Cury, and Eduardo P.C. Rocha. 2022. IntegronFinder 2.0: Identification and Analysis of Integrons across Bacteria, with a Focus on Antibiotic Resistance in Klebsiella Microorganisms 10, no. 4: 700. https://doi.org/10.3390/microorganisms10040700 Panaroo Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4 PPanGGoLiN Gautreau G et al. (2020) PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16(3): e1007732. https://doi.org/10.1371/journal.pcbi.1007732 PopPUNK Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24. PMID: 30679308; PMCID: PMC6360808. SKA2 Harris SR. 2018. SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 453142 doi: https://doi.org/10.1101/453142 Gubbins Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. \"Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins\". doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014. Verticall SNP-sites Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016 Apr 29;2(4):e000056. doi: 10.1099/mgen.0.000056. PMID: 28348851; PMCID: PMC5320690. Unicycler Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017 Jun 8;13(6):e1005595. doi: 10.1371/journal.pcbi.1005595. 
PMID: 28594827; PMCID: PMC5481147. IslandPath Claire Bertelli, Fiona S L Brinkman, Improved genomic island predictions with IslandPath-DIMOB, Bioinformatics, Volume 34, Issue 13, 01 July 2018, Pages 2161\u20132167, https://doi.org/10.1093/bioinformatics/bty095 PhiSpy Sajia Akhter, Ramy K. Aziz, Robert A. Edwards; PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucl Acids Res 2012; 40 (16): e126. doi: 10.1093/nar/gks406 EvolCCM Chaoyue Liu and others, The Community Coevolution Model with Application to the Study of Evolutionary Relationships between Genes Based on Phylogenetic Profiles, Systematic Biology, Volume 72, Issue 3, May 2023, Pages 559\u2013574, https://doi.org/10.1093/sysbio/syac052 rSPR Christopher Whidden, Norbert Zeh, Robert G. Beiko, Supertrees Based on the Subtree Prune-and-Regraft Distance, Systematic Biology, Volume 63, Issue 4, July 2014, Pages 566\u2013581, https://doi.org/10.1093/sysbio/syu023 Software packaging/containerisation tools BioContainers da Veiga Leprevost F, Gr\u00fcning B, Aflitos SA, R\u00f6st HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. Docker Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 239, Article 2 (March 2014). Singularity Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.","title":"Citations"},{"location":"CITATIONS/#beiko-labarete-citations","text":"An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038.","title":"beiko-lab/ARETE: Citations"},{"location":"CITATIONS/#nf-core","text":"Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.","title":"nf-core"},{"location":"CITATIONS/#nextflow","text":"Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.","title":"Nextflow"},{"location":"CITATIONS/#pipeline-tools","text":"CheckM Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes . Genome Research, 25: 1043\u20131055. 
DIAMOND Buchfink B, Xie C, Huson DH Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 12, 59\u201360 (2015) FastQC FastP Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281. FastTree Morgan N. Price, Paramvir S. Dehal, Adam P. Arkin, FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix, Molecular Biology and Evolution, Volume 26, Issue 7, July 2009, Pages 1641\u20131650, https://doi.org/10.1093/molbev/msp077 IQ-TREE2 Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020 May 1;37(5):1530-1534. doi: 10.1093/molbev/msaa015. Erratum in: Mol Biol Evol. 2020 Aug 1;37(8):2461. PMID: 32011700; PMCID: PMC7182206. Kraken2 Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. doi: 10.1186/s13059-019-1891-0. MOB-SUITE Robertson, James, and John H E Nash. \u201cMOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.\u201d Microbial genomics vol. 4,8 (2018): e000206. doi:10.1099/mgen.0.000206 Robertson, James et al. \u201cUniversal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance.\u201d Microbial genomics vol. 6,10 (2020): mgen000435. doi:10.1099/mgen.0.000435 MultiQC Ewels P, Magnusson M, Lundin S, K\u00e4ller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. Bakta Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685 Prokka Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. doi: 10.1093/bioinformatics/btu153. Epub 2014 Mar 18. PMID: 24642063. QUAST Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806. RGI Alcock et al. 2020. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, Volume 48, Issue D1, Pages D517-525 [PMID 31665441] IntegronFinder N\u00e9ron, Bertrand, Eloi Littner, Matthieu Haudiquet, Amandine Perrin, Jean Cury, and Eduardo P.C. Rocha. 2022. IntegronFinder 2.0: Identification and Analysis of Integrons across Bacteria, with a Focus on Antibiotic Resistance in Klebsiella Microorganisms 10, no. 4: 700. https://doi.org/10.3390/microorganisms10040700 Panaroo Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4 PPanGGoLiN Gautreau G et al. (2020) PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16(3): e1007732. 
https://doi.org/10.1371/journal.pcbi.1007732 PopPUNK Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24. PMID: 30679308; PMCID: PMC6360808. SKA2 Harris SR. 2018. SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 453142 doi: https://doi.org/10.1101/453142 Gubbins Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. \"Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins\". doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014. Verticall SNP-sites Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016 Apr 29;2(4):e000056. doi: 10.1099/mgen.0.000056. PMID: 28348851; PMCID: PMC5320690. Unicycler Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017 Jun 8;13(6):e1005595. doi: 10.1371/journal.pcbi.1005595. PMID: 28594827; PMCID: PMC5481147. IslandPath Claire Bertelli, Fiona S L Brinkman, Improved genomic island predictions with IslandPath-DIMOB, Bioinformatics, Volume 34, Issue 13, 01 July 2018, Pages 2161\u20132167, https://doi.org/10.1093/bioinformatics/bty095 PhiSpy Sajia Akhter, Ramy K. Aziz, Robert A. Edwards; PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucl Acids Res 2012; 40 (16): e126. doi: 10.1093/nar/gks406 EvolCCM Chaoyue Liu and others, The Community Coevolution Model with Application to the Study of Evolutionary Relationships between Genes Based on Phylogenetic Profiles, Systematic Biology, Volume 72, Issue 3, May 2023, Pages 559\u2013574, https://doi.org/10.1093/sysbio/syac052 rSPR Christopher Whidden, Norbert Zeh, Robert G. Beiko, Supertrees Based on the Subtree Prune-and-Regraft Distance, Systematic Biology, Volume 63, Issue 4, July 2014, Pages 566\u2013581, https://doi.org/10.1093/sysbio/syu023","title":"Pipeline tools"},{"location":"CITATIONS/#software-packagingcontainerisation-tools","text":"BioContainers da Veiga Leprevost F, Gr\u00fcning B, Aflitos SA, R\u00f6st HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. Docker Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 239, Article 2 (March 2014). Singularity Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. 
PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.","title":"Software packaging/containerisation tools"},{"location":"ROADMAP/","text":"A list in no particular order of outstanding development features, both in-progress and planned: Integration of additional tools and scripts: Partner applications for analysis and visualization of phylogenetic distributions of genes and MGEs and gene-order clustering (For example, Coeus ).","title":"Roadmap"},{"location":"contributing/","text":"beiko-lab/ARETE: Contributing Guidelines Hey! Thank you for taking an interest in contributing to ARETE. We use GitHub for managing issues, contribution requests and everything else. So feel free to communicate with us using new issues and discussions, whatever best fits your idea for your contribution. Contribution Workflow The standard workflow for contributing to ARETE is as follows: Check first if there isn't already an issue for your feature request, bug, etc. If there isn't one, you should create a new issue or discussion for your planned contribution before you start working on it . Fork the beiko-lab/ARETE repository to your GitHub account. Make the necessary changes or additions within your forked repository. You should probably create a new branch for your contribution, instead of committing directly to the master branch of your repository . In case any parameters were added or changed, use nf-core schema build to add them to the pipeline JSON schema and nf-core schema docs --output docs/params.md --force to update the respective documentation (requires nf-core tools >= 1.10). Optionally run the pipeline's unit tests locally using nf-test : nf-test test tests/subworkflows/local/* . Submit a Pull Request to our master branch and wait for your changes to be reviewed and merged by our maintainers. Our GitHub Actions workflows should perform a few pipeline tests automatically after receiving your pull request. Any errors on them should be examined, since they could point to underlying issues in your changes.","title":"Contributing"},{"location":"contributing/#beiko-labarete-contributing-guidelines","text":"Hey! Thank you for taking an interest in contributing to ARETE. We use GitHub for managing issues, contribution requests and everything else. So feel free to communicate with us using new issues and discussions, whatever best fits your idea for your contribution.","title":"beiko-lab/ARETE: Contributing Guidelines"},{"location":"contributing/#contribution-workflow","text":"The standard workflow for contributing to ARETE is as follows: Check first if there isn't already an issue for your feature request, bug, etc. If there isn't one, you should create a new issue or discussion for your planned contribution before you start working on it . Fork the beiko-lab/ARETE repository to your GitHub account. Make the necessary changes or additions within your forked repository. You should probably create a new branch for your contribution, instead of committing directly to the master branch of your repository . In case any parameters were added or changed, use nf-core schema build to add them to the pipeline JSON schema and nf-core schema docs --output docs/params.md --force to update the respective documentation (requires nf-core tools >= 1.10). Optionally run the pipeline's unit tests locally using nf-test : nf-test test tests/subworkflows/local/* . Submit a Pull Request to our master branch and wait for your changes to be reviewed and merged by our maintainers.
Our GitHub Actions workflows should perform a few pipeline tests automatically after receiving your pull request. Any errors on them should be examined, since they could point to underlying issues in your changes.","title":"Contribution Workflow"},{"location":"faq/","text":"Frequently Asked Questions How do I run ARETE in a Slurm HPC environment? Set a config file under ~/.nextflow/config to use the slurm executor: process { executor = 'slurm' pollInterval = '60 sec' submitRateLimit = '60/1min' queueSize = 100 // If an account is necessary: clusterOptions = '--account=' } See the Nextflow documentation for a description of these options. Now, when running ARETE, you'll need to set additional options if your compute nodes don't have network access - as is common for most Slurm clusters. The example below uses the default test data, i.e. the test profile, for demonstration purposes only. nextflow run beiko-lab/ARETE \ --db_cache path/to/db_cache \ --bakta_db path/to/baktadb \ -profile test,singularity Apart from -profile singularity , which just makes ARETE use Singularity/Apptainer containers for running the tools, there are two additional parameters: --db_cache should be the location for the pre-downloaded databases used in the DIAMOND alignments (i.e. Bacmet, VFDB, ICEberg2 and CAZy FASTA files) and in the Kraken2 taxonomic read classification. Although these tools run by default, you can change the selection of annotation tools by changing --annotation_tools and skip Kraken2 by adding --skip_kraken . See the parameter documentation for a full list of parameters and their defaults. --bakta_db should be the location of the pre-downloaded Bakta database. Alternatively, you can use Prokka for annotating your assemblies, since it doesn't require a downloaded database ( --use_prokka ). Do note that there could be memory-related issues when running Nextflow in SLURM environments. Can I use the ARETE outputs in MicroReact? Yes you can! In fact, ARETE provides many outputs that can be used in the MicroReact web app. Some of these files are: The PopPUNK lineages tree under poppunk_results/poppunk_visualizations/poppunk_visualizations.microreact . The reference tree built with FastTree under phylogenomics/reference_tree/core_gene_alignment.tre . The annotation feature profile annotation/feature_profile.tsv.gz . This file contains the annotation features in a presence/absence matrix format. Since MicroReact doesn't allow compressed files, just make sure to decompress it beforehand: gunzip feature_profile.tsv.gz Make sure to check our output documentation for a full list of outputs and the parameter documentation for a description of parameters to enable and disable these outputs. Why am I getting this 'docker: Permission denied' error? Although previous ARETE users have reported this issue, this is neither an issue with Nextflow nor with ARETE itself. This is most likely due to how Docker permissions are set up on your machine. If running on your own machine, take a look at this guide . If running on an HPC system, talk to your system administrator or consider running ARETE with Singularity . My server doesn't have that much memory! How do I change the resource requirements?
Just write a file called nextflow.config in your working directory and add the following to it: process { withLabel:process_low { cpus = 6 memory = 8.GB time = 4.h } withLabel:process_medium { cpus = 12 memory = 36.GB time = 8.h } withLabel:process_high { cpus = 16 memory = 72.GB time = 20.h } withLabel:process_high_memory { memory = 200.GB } withName: MOB_RECON { cpus = 2 } } Feel free to change the values above as you wish and then add -c nextflow.config to your nextflow run beiko-lab/ARETE command. You can point to general process labels, like process_low , or you can point directly to process names, like MOB_RECON . Learn more at our usage documentation or the official nextflow documentation .","title":"FAQ"},{"location":"faq/#frequently-asked-questions","text":"","title":"Frequently Asked Questions"},{"location":"faq/#how-do-i-run-arete-in-a-slurm-hpc-environment","text":"Set a config file under ~/.nextflow/config to use the slurm executor: process { executor = 'slurm' pollInterval = '60 sec' submitRateLimit = '60/1min' queueSize = 100 // If an account is necessary: clusterOptions = '--account=' } See the Nextflow documentation for a description of these options. Now, when running ARETE, you'll need to set additional options if your compute nodes don't have network access - as is common for most Slurm clusters. The example below uses the default test data, i.e. the test profile, for demonstration purposes only. nextflow run beiko-lab/ARETE \\ --db_cache path/to/db_cache \\ --bakta_db path/to/baktadb \\ -profile test,singularity Apart from -profile singularity , which just makes ARETE use Singularity/Apptainer containers for running the tools, there are two additional parameters: --db_cache should be the location for the pre-downloaded databases used in the DIAMOND alignments (i.e. Bacmet, VFDB, ICEberg2 and CAZy FASTA files) and in the Kraken2 taxonomic read classification. Although these tools run by default, you can change the selection of annotation tools by changing --annotation_tools and skip Kraken2 by adding --skip_kraken . See the parameter documentation for a full list of parameters and their defaults. --bakta_db should be the location of the pre-downloaded Bakta database Alternatively, you can use Prokka for annotating your assemblies, since it doesn't require a downloaded database ( --use_prokka ). Do note that there could be memory-related issues when running Nextflow in SLURM environments.","title":"How do I run ARETE in a Slurm HPC environment?"},{"location":"faq/#can-i-use-the-arete-outputs-in-microreact","text":"Yes you can! In fact, ARETE provides many outputs that can be used in the MicroReact web app. Some of these files are: The PopPUNK lineages tree under poppunk_results/poppunk_visualizations/poppunk_visualizations.microreact . The reference tree built with FastTree under phylogenomics/reference_tree/core_gene_alignment.tre . The annotation feature profile annotation/feature_profile.tsv.gz . This file contains the annotation features in a presence/absence matrix format. 
Since MicroReact doesn't allow compressed files, just make sure to decompress it beforehand: gunzip feature_profile.tsv.gz Make sure to check our output documentation for a full list of outputs and the parameter documentation for a description of parameters to enable and disable these outputs.","title":"Can I use the ARETE outputs in MicroReact?"},{"location":"faq/#why-am-i-getting-this-docker-permission-denied-error","text":"Although previous ARETE users have reported this issue, this is neither an issue with Nextflow nor with ARETE itself. This is most likely due to how Docker permissions are set up on your machine. If running on your own machine, take a look at this guide . If running on an HPC system, talk to your system administrator or consider running ARETE with Singularity .","title":"Why am I getting this 'docker: Permission denied' error?"},{"location":"faq/#my-server-doesnt-have-that-much-memory-how-do-i-change-the-resource-requirements","text":"Just write a file called nextflow.config in your working directory and add the following to it: process { withLabel:process_low { cpus = 6 memory = 8.GB time = 4.h } withLabel:process_medium { cpus = 12 memory = 36.GB time = 8.h } withLabel:process_high { cpus = 16 memory = 72.GB time = 20.h } withLabel:process_high_memory { memory = 200.GB } withName: MOB_RECON { cpus = 2 } } Feel free to change the values above as you wish and then add -c nextflow.config to your nextflow run beiko-lab/ARETE command. You can point to general process labels, like process_low , or you can point directly to process names, like MOB_RECON . Learn more at our usage documentation or the official nextflow documentation .","title":"My server doesn't have that much memory! How do I change the resource requirements?"},{"location":"issues/","text":"Known issues in ARETE PopPUNK We have experienced issues with PopPUNK in ARETE runs, primarily related to the distances and clusters generated and how these affect both the subsampling and recombination subworkflows. Sometimes the distances generated are too small or the number of clusters changes between executions of the same dataset. While we can't solve the latter, since it is a result of how PopPUNK itself is defined, the former can be mitigated by adjusting the subsampling thresholds with --core_similarity (99.9 by default) and --accessory_similarity (99 by default), or disabling subsampling altogether ( --enable_subsetting false ). A useful course of action is to first run your dataset through the PopPUNK entry , and then choose the appropriate parameters for your final pipeline run. The PopPUNK entry takes at most 2 hours to run, even on large datasets . Disabling PopPUNK in your execution is also simple to do with --skip_poppunk . rSPR rSPR, which you can enable with --run_rspr or with the rSPR entry , is known to be very slow , especially with larger datasets. The default of 3 days for rSPR runtimes should be enough for some runs, but for most larger datasets it won't be sufficient. In this case, if you do want to run rSPR, we suggest two possible routes: Increasing the default time allocation for the RSPR_EXACT processes. Check out how . Ignoring timeout errors altogether and finishing the pipeline execution with whatever finished running in these 3 days. This is the default course of action for ARETE . By choosing the second course of action, we ignore timeout errors generated with RSPR_EXACT and finish the execution of downstream processes, i.e. RSPR_HEATMAP , with whatever results we already have.
This process generates a heatmap of Tree size and Exact rSPR distance. While RSPR_HEATMAP should execute with the results that were generated up to the timeout, we have heard from users that this process can still fail to run, even when results from RSPR_EXACT were generated. This issue has only been reported with older versions of Nextflow; v23 onwards should work fine . While this issue is unfortunate, it shouldn't be a big problem: the only output given by RSPR_HEATMAP is the aforementioned heatmap, which can also be generated externally by using our rspr_heatmap.py script or your own downstream analysis.","title":"Known issues"},{"location":"issues/#known-issues-in-arete","text":"","title":"Known issues in ARETE"},{"location":"issues/#poppunk","text":"We have experienced issues with PopPUNK in ARETE runs, primarily related to the distances and clusters generated and how these affect both the subsampling and recombination subworkflows. Sometimes the distances generated are too small or the number of clusters changes between executions of the same dataset. While we can't solve the latter, since it is a result of how PopPUNK itself is defined, the former can be mitigated by adjusting the subsampling thresholds with --core_similarity (99.9 by default) and --accessory_similarity (99 by default), or disabling subsampling altogether ( --enable_subsetting false ). A useful course of action is to first run your dataset through the PopPUNK entry , and then choose the appropriate parameters for your final pipeline run. The PopPUNK entry takes at most 2 hours to run, even on large datasets . Disabling PopPUNK in your execution is also simple to do with --skip_poppunk .","title":"PopPUNK"},{"location":"issues/#rspr","text":"rSPR, which you can enable with --run_rspr or with the rSPR entry , is known to be very slow , especially with larger datasets. The default of 3 days for rSPR runtimes should be enough for some runs, but for most larger datasets it won't be sufficient. In this case, if you do want to run rSPR, we suggest two possible routes: Increasing the default time allocation for the RSPR_EXACT processes. Check out how . Ignoring timeout errors altogether and finishing the pipeline execution with whatever finished running in these 3 days. This is the default course of action for ARETE . By choosing the second course of action, we ignore timeout errors generated with RSPR_EXACT and finish the execution of downstream processes, i.e. RSPR_HEATMAP , with whatever results we already have. This process generates a heatmap of Tree size and Exact rSPR distance. While RSPR_HEATMAP should execute with the results that were generated up to the timeout, we have heard from users that this process can still fail to run, even when results from RSPR_EXACT were generated. This issue has only been reported with older versions of Nextflow; v23 onwards should work fine . While this issue is unfortunate, it shouldn't be a big problem: the only output given by RSPR_HEATMAP is the aforementioned heatmap, which can also be generated externally by using our rspr_heatmap.py script or your own downstream analysis.","title":"rSPR"},{"location":"output/","text":"beiko-lab/ARETE: Output Introduction The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
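To help orient readers before the per-directory details below, a full ARETE run might produce a top-level layout roughly like the following sketch (only directories described in this document are shown; the exact set depends on which subworkflows are enabled):
results/
  read_processing/   # FastQC, fastp and Kraken2
  assembly/          # Unicycler and QUAST
  annotation/        # Bakta or Prokka, RGI, MOB-Recon, DIAMOND, IslandPath, PhiSpy
  poppunk_results/   # PopPUNK clustering and visualizations
  pangenomics/       # Panaroo or PPanGGoLiN
  phylogenomics/     # FastTree or IQTree, SNP-sites
  dynamics/          # EvolCCM, rSPR, recombination
  gene-order/        # gene neighbourhood analysis
  pipeline_info/     # Nextflow execution reports
  multiqc/           # aggregate QC report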
Pipeline overview The pipeline is built using Nextflow and processes data using the following steps (steps in italics don't run by default): Short-read processing and assembly FastQC - Raw and trimmed read QC FastP - Read trimming Kraken2 - Taxonomic assignment Unicycler - Short read assembly Quast - Assembly quality score Annotation Bakta or Prokka - Gene detection and annotation MobRecon - Reconstruction and typing of plasmids RGI - Detection and annotation of AMR determinants IslandPath - Predicts genomic islands in bacterial and archaeal genomes. PhiSpy - Prediction of prophages from bacterial genomes IntegronFinder - Finds integrons in DNA sequences Diamond - Detection and annotation of genes using external databases. CAZy: Carbohydrate metabolism VFDB: Virulence factors BacMet: Metal resistance determinants ICEberg: Integrative and conjugative elements annotation_report.tsv.gz - A tabular file aggregating annotation data from all genomes feature_profile.tsv.gz - A presence/absence matrix of features in all genomes IslandPath, PhiSpy and IntegronFinder results are currently not added to the final annotation report. We seek to fix this issue in the future. PopPUNK Subworkflow PopPUNK - Genome clustering Dynamics EvolCCM - Community Coevolution rSPR - rooted subtree-prune-and-regraft distances Recombination Verticall - Conduct pairwise assembly comparisons between genomes in the same PopPUNK cluster SKA2 - Generate a whole-genome FASTA alignment for each genome within a cluster. Gubbins - Detection of recombination events within genomes of the same cluster. Gene Order Phylogenomics and Pangenomics Panaroo or PPanGGoLiN - Pangenome alignment FastTree or IQTree - Maximum likelihood core genome phylogenetic tree SNPsites - Extracts SNPs from a multi-FASTA alignment Pipeline information Report metrics generated during the workflow execution MultiQC - Aggregate report describing results and QC from the whole pipeline Assembly FastQC read_processing/*_fastqc/ *_fastqc.html : FastQC report containing quality metrics for your untrimmed raw fastq files. *_fastqc.zip : Zip archive containing the FastQC report, tab-delimited data file and plot images. NB: The FastQC plots in this directory are generated relative to the raw input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after adapter and quality trimming, please refer to the FastQC reports in the trimgalore/fastqc/ directory. FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages . NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality. fastp read_processing/fastp/ ${meta.id} : Trimmed files and trimming reports for each input sample. fastp is an all-in-one FASTQ preprocessor for read/adapter trimming and quality control. It is used in this pipeline for trimming adapter sequences and discarding low-quality reads. Kraken2 read_processing/kraken2/ *.kraken2.report.txt : Text file containing genome-wise information of Kraken2 findings. See here for details. *.classified(_(1|2))?.fastq.gz : Fasta file containing classified reads. If paired-end, one file per end. *.unclassified(_(1|2))?.fastq.gz : Fasta file containing unclassified reads. If paired-end, one file per end.
Kraken2 is a read classification tool which assigns taxonomy to each read comprising a sample. These results may be analyzed as an indicator of contamination. Unicycler assembly/unicycler/ *.assembly.gfa *.scaffolds.fa *.unicycler.log Short/hybrid read assembler. For now, it only handles short reads in ARETE. Quast assembly/quast/ report.tsv : A tab-separated report compiling all QC metrics recorded over all genomes quast/ report.(html|tex|pdf|tsv|txt) : The Quast report in different file formats transposed_report.(tsv|txt) : Transpose of the Quast report quast.log : Log file of all Quast runs icarus_viewers/ contig_size_viewer.html basic_stats/ : Directory containing various summary plots generated by Quast. Annotation Bakta annotation/bakta/ ${sample_id}/ : Bakta results will be in one directory per genome. ${sample_id}.tsv : annotations as simple human-readable TSV ${sample_id}.gff3 : annotations & sequences in GFF3 format ${sample_id}.gbff : annotations & sequences in (multi) GenBank format ${sample_id}.embl : annotations & sequences in (multi) EMBL format ${sample_id}.fna : replicon/contig DNA sequences as FASTA ${sample_id}.ffn : feature nucleotide sequences as FASTA ${sample_id}.faa : CDS/sORF amino acid sequences as FASTA ${sample_id}.hypotheticals.tsv : further information on hypothetical protein CDS as simple human-readable tab-separated values ${sample_id}.hypotheticals.faa : hypothetical protein CDS amino acid sequences as FASTA ${sample_id}.txt : summary as TXT ${sample_id}.png : circular genome annotation plot as PNG ${sample_id}.svg : circular genome annotation plot as SVG Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs. Prokka annotation/prokka/ ${sample_id}/ : Prokka results will be in one directory per genome. ${sample_id}.err : Unacceptable annotations ${sample_id}.faa : Protein FASTA file of translated CDS sequences ${sample_id}.ffn : Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) ${sample_id}.fna : Nucleotide FASTA file of input contig sequences ${sample_id}.fsa : Nucleotide FASTA file of the input contig sequences, used by \"tbl2asn\" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. ${sample_id}.gff : This is the master annotation in GFF3 format, containing both sequences and annotations. ${sample_id}.gbk : This is a standard Genbank file derived from the master .gff. ${sample_id}.log : Contains all the output that Prokka produced during its run. This is a record of what settings were used, even if the --quiet option was enabled. ${sample_id}.sqn : An ASN1 format \"Sequin\" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. ${sample_id}.tbl : Feature Table file, used by \"tbl2asn\" to create the .sqn file. ${sample_id}.tsv : Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product ${sample_id}.txt : Statistics relating to the annotated features found. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files. RGI annotation/rgi/ ${sample_id}_rgi.txt : A TSV report containing all AMR predictions for a given genome. For more info see here . RGI predicts AMR determinants using the CARD ontology and various trained models.
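As a quick, illustrative way to survey these predictions across a run, the per-genome TSVs can be tallied with standard shell tools (a minimal sketch assuming the default annotation/rgi/ layout described above; this is not a script shipped with ARETE):
for f in annotation/rgi/*_rgi.txt; do
  # each file has one header line; the remaining lines are individual AMR predictions
  echo "$f: $(($(wc -l < "$f") - 1)) predictions"
done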
MobRecon annotation/mob_recon ${sample_id}_mob_recon/ : MobRecon results will be in one directory per genome. contig_report.txt - This file describes the assignment of each contig to the chromosome or a particular plasmid grouping. mge.report.txt - Blast HSPs of detected MGEs/repetitive elements with contextual information. chromosome.fasta - Fasta file of all contigs found to belong to the chromosome. plasmid_*.fasta - Each plasmid group is written to an individual fasta file which contains the assigned contigs. mobtyper_results - Aggregate MOB-typer report files for all identified plasmids. MobRecon reconstructs individual plasmid sequences from draft genome assemblies using the clustered plasmid reference databases. DIAMOND annotation/(vfdb|bacmet|cazy|iceberg2)/ ${sample_id}/${sample_id}_(VFDB|BACMET|CAZYDB|ICEberg2).txt : Blast6 formatted TSVs indicating BlastX results of the genes from each genome against the VFDB, BacMet, CAZy, and ICEberg2 databases. (VFDB|BACMET|CAZYDB|ICEberg2).txt : Table with all hits to this database, with a column describing which genome the match originates from. Sorted and filtered by the match's coverage. Diamond is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. We use DIAMOND to predict the presence of virulence factors, heavy metal resistance determinants, carbohydrate-active enzymes, and integrative and conjugative elements using VFDB , BacMet , CAZy , and ICEberg2 , respectively. IslandPath annotation/islandpath/ ${sample_id}/ : IslandPath results will be in one directory per genome. ${sample_id}.tsv : IslandPath results Dimob.log : IslandPath execution log IslandPath is standalone software to predict genomic islands in bacterial and archaeal genomes based on the presence of dinucleotide biases and mobility genes. IntegronFinder Disabled by default. Enable by adding --run_integronfinder to your command. annotation/integron_finder/ Results_Integron_Finder_${sample_id}/ : IntegronFinder results will be in one directory per genome. Integron Finder is a bioinformatics tool to find integrons in bacterial genomes. PhiSpy annotation/phispy/ ${sample_id}/ : PhiSpy results will be in one directory per genome. See the PhiSpy documentation for an extensive description of the output. PhiSpy is a tool for identification of prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome, it will use several approaches to identify the most likely prophage regions. PopPUNK poppunk_results/ poppunk_db/ - Results from PopPUNK's create-db command poppunk_${poppunk_model}/ - Results from PopPUNK's fit-model command poppunk_visualizations/ - Results from the poppunk_visualise command PopPUNK is a tool for clustering genomes. Phylogenomics and Pangenomics Panaroo pangenomics/panaroo/results/ See the panaroo documentation for an extensive description of output provided. Panaroo is a Bacterial Pangenome Analysis Pipeline. PPanGGoLiN pangenomics/ppanggolin/ See the PPanGGoLiN documentation for an extensive description of output provided. PPanGGoLiN is a tool to build a partitioned pangenome graph from microbial genomes. FastTree phylogenomics/fasttree/ *.tre : Newick formatted maximum likelihood tree of core-genome alignment. FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences. IQTree phylogenomics/iqtree/ *.treefile : Newick formatted maximum likelihood tree of core-genome alignment.
IQTree is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. SNPsites phylogenomics/snpsites/ filtered_alignment.fas : Variant fasta file. constant.sites.txt : Text file containing counts of constant sites. SNPsites is a tool to rapidly extract SNPs from a multi-FASTA alignment. Dynamics EvolCCM dynamics/EvolCCM/ EvolCCM_*tsv EvolCCM_*pvals EvolCCM_*X2 EvolCCM_*tre EvolCCM is the R implementation for CCM (Community Coevolution Model) rSPR The outputs are approximate and exact Subtree Prune and Regraft (rSPR) distances between pairs of rooted phylogenetic trees. Each CSV file contains these distances and the tree sizes. The PNG files are heatmaps of these distances and their respective tree sizes. dynamics/rSPR/ approx - Approximate rSPR distances exact - Exact rSPR distances rSPR is a software package for calculating rooted subtree-prune-and-regraft distances and rooted agreement forests. Recombination Verticall dynamics/recombination/verticall/ verticall_cluster*.tsv - Verticall results for the genomes within this PopPUNK cluster. Verticall is a tool to help produce bacterial genome phylogenies which are not influenced by horizontally acquired sequences such as recombination. SKA2 dynamics/recombination/ska2/ cluster_*.aln - SKA2 results for the genomes within this PopPUNK cluster. SKA2 (Split Kmer Analysis) is a toolkit for prokaryotic (and any other small, haploid) DNA sequence analysis using split kmers. Gubbins dynamics/recombination/gubbins/ cluster_*/ - Gubbins results for the genomes within this PopPUNK cluster. Gubbins is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions. Gene Order gene-order/ extraction/ - AMR genes of interest and their neighborhoods extracted from the assemblies. diamond/ - Pairwise alignments between all input genomes. clustering/ - Similarity and distance matrices for each AMR gene clustered via UPGMA, MCL and DBSCAN to identify similarities between their neighborhoods across all genomes. Gene Order is a subworkflow for bacterial gene order analysis, with outputs easily explorable through its partner visualization application Coeus . Pipeline information pipeline_info/ Reports generated by Nextflow: execution_report.html , execution_timeline.html , execution_trace.txt and pipeline_dag.dot / pipeline_dag.svg . Reports generated by the pipeline: pipeline_report.html , pipeline_report.txt and software_versions.csv . Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv . Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. MultiQC multiqc/ multiqc_report.html : a standalone HTML file that can be viewed in your web browser. multiqc_data/ : directory containing parsed statistics from the different tools used in the pipeline. multiqc_plots/ : directory containing static images from the report in various formats. MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. 
Results generated by MultiQC collate pipeline QC from supported tools, e.g., FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info .","title":"Output"},{"location":"output/#beiko-labarete-output","text":"","title":"beiko-lab/ARETE: Output"},{"location":"output/#introduction","text":"The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.","title":"Introduction"},{"location":"output/#pipeline-overview","text":"The pipeline is built using Nextflow and processes data using the following steps (steps in italics don't run by default): Short-read processing and assembly FastQC - Raw and trimmed read QC FastP - Read trimming Kraken2 - Taxonomic assignment Unicycler - Short read assembly Quast - Assembly quality assessment Annotation Bakta or Prokka - Gene detection and annotation MobRecon - Reconstruction and typing of plasmids RGI - Detection and annotation of AMR determinants IslandPath - Predicts genomic islands in bacterial and archaeal genomes. PhiSpy - Prediction of prophages from bacterial genomes IntegronFinder - Finds integrons in DNA sequences Diamond - Detection and annotation of genes using external databases. CAZy: Carbohydrate metabolism VFDB: Virulence factors BacMet: Metal resistance determinants ICEberg: Integrative and conjugative elements annotation_report.tsv.gz - A tabular file aggregating annotation data from all genomes feature_profile.tsv.gz - A presence-absence matrix of features in all genomes IslandPath, PhiSpy and IntegronFinder results are currently not added to the final annotation report. We seek to fix this issue in the future. PopPUNK Subworkflow PopPUNK - Genome clustering Dynamics EvolCCM - Community Coevolution rSPR - rooted subtree-prune-and-regraft distances Recombination Verticall - Conduct pairwise assembly comparisons between genomes in the same PopPUNK cluster SKA2 - Generate a whole-genome FASTA alignment for each genome within a cluster. Gubbins - Detection of recombination events within genomes of the same cluster. Gene Order Phylogenomics and Pangenomics Panaroo or PPanGGoLiN - Pangenome alignment FastTree or IQTree - Maximum likelihood core genome phylogenetic tree SNPsites - Extracts SNPs from a multi-FASTA alignment Pipeline information Report metrics generated during the workflow execution MultiQC - Aggregate report describing results and QC from the whole pipeline","title":"Pipeline overview"},{"location":"output/#assembly","text":"","title":"Assembly"},{"location":"output/#fastqc","text":"read_processing/*_fastqc/ *_fastqc.html : FastQC report containing quality metrics for your untrimmed raw fastq files. *_fastqc.zip : Zip archive containing the FastQC report, tab-delimited data file and plot images. NB: The FastQC plots in this directory are generated from the raw input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after adapter and quality trimming, refer to the FastQC reports generated from the trimmed reads. FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages .
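With many genomes it can be tedious to open each FastQC report individually. A minimal command-line sketch for spotting problem samples, assuming the default ARETE output layout shown above and the standard FastQC archive structure (each zip contains a summary.txt of PASS/WARN/FAIL flags per module):
for z in read_processing/*_fastqc/*_fastqc.zip; do echo $z; unzip -p $z '*/summary.txt' | grep FAIL; done
The MultiQC report described elsewhere in this document aggregates much of this information, so treat this loop only as a quick spot check.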
NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and regions of low quality.","title":"FastQC"},{"location":"output/#fastp","text":"read_processing/fastp/ ${meta.id} : Trimmed files and trimming reports for each input sample. fastp is an all-in-one FASTQ preprocessor for read/adapter trimming and quality control. It is used in this pipeline to trim adapter sequences and discard low-quality reads.","title":"fastp"},{"location":"output/#kraken2","text":"read_processing/kraken2/ *.kraken2.report.txt : Text file containing genome-wise information on Kraken2 findings. See here for details. *.classified(_(1|2))?.fastq.gz : FastQ file containing classified reads. If paired-end, one file per end. *.unclassified(_(1|2))?.fastq.gz : FastQ file containing unclassified reads. If paired-end, one file per end. Kraken2 is read classification software which will assign taxonomy to each read comprising a sample. These results may be analyzed as an indicator of contamination.","title":"Kraken2"},{"location":"output/#unicycler","text":"assembly/unicycler/ *.assembly.gfa *.scaffolds.fa *.unicycler.log Unicycler is a short/hybrid read assembler. For now it only handles short reads in ARETE.","title":"Unicycler"},{"location":"output/#quast","text":"assembly/quast/ report.tsv : A tab-separated report compiling all QC metrics recorded over all genomes quast/ report.(html|tex|pdf|tsv|txt) : The Quast report in different file formats transposed_report.(tsv|txt) : Transpose of the Quast report quast.log : Log file of all Quast runs icarus_viewers/ contig_size_viewer.html basic_stats/ : Directory containing various summary plots generated by Quast.","title":"Quast"},{"location":"output/#annotation","text":"","title":"Annotation"},{"location":"output/#bakta","text":"annotation/bakta/ ${sample_id}/ : Bakta results will be in one directory per genome. ${sample_id}.tsv : annotations as simple human-readable TSV ${sample_id}.gff3 : annotations & sequences in GFF3 format ${sample_id}.gbff : annotations & sequences in (multi) GenBank format ${sample_id}.embl : annotations & sequences in (multi) EMBL format ${sample_id}.fna : replicon/contig DNA sequences as FASTA ${sample_id}.ffn : feature nucleotide sequences as FASTA ${sample_id}.faa : CDS/sORF amino acid sequences as FASTA ${sample_id}.hypotheticals.tsv : further information on hypothetical protein CDS as simple human-readable tab-separated values ${sample_id}.hypotheticals.faa : hypothetical protein CDS amino acid sequences as FASTA ${sample_id}.txt : summary as TXT ${sample_id}.png : circular genome annotation plot as PNG ${sample_id}.svg : circular genome annotation plot as SVG Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs.","title":"Bakta"},{"location":"output/#prokka","text":"annotation/prokka/ ${sample_id}/ : Prokka results will be in one directory per genome. ${sample_id}.err : Unacceptable annotations ${sample_id}.faa : Protein FASTA file of translated CDS sequences ${sample_id}.ffn : Nucleotide FASTA file of all the predicted transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) ${sample_id}.fna : Nucleotide FASTA file of input contig sequences ${sample_id}.fsa : Nucleotide FASTA file of the input contig sequences, used by \"tbl2asn\" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
${sample_id}.gff : This is the master annotation in GFF3 format, containing both sequences and annotations. ${sample_id}.gbk : This is a standard Genbank file derived from the master .gff. ${sample_id}.log : Contains all the output that Prokka produced during its run. This is a record of what settings were used, even if the --quiet option was enabled. ${sample_id}.sqn : An ASN1 format \"Sequin\" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. ${sample_id}.tbl : Feature Table file, used by \"tbl2asn\" to create the .sqn file. ${sample_id}.tsv : Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product ${sample_id}.txt : Statistics relating to the annotated features found. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.","title":"Prokka"},{"location":"output/#rgi","text":"annotation/rgi/ ${sample_id}_rgi.txt : A TSV report containing all AMR predictions for a given genome. For more info see here RGI predicts AMR determinants using the CARD ontology and various trained models.","title":"RGI"},{"location":"output/#mobrecon","text":"annotation/mob_recon ${sample_id}_mob_recon/ : MobRecon results will be in one directory per genome. contig_report.txt - This file describes the assignment of the contig to chromosome or a particular plasmid grouping. mge.report.txt - Blast HSP of detected MGEs/repetitive elements with contextual information. chromosome.fasta - Fasta file of all contigs found to belong to the chromosome. plasmid_*.fasta - Each plasmid group is written to an individual fasta file which contains the assigned contigs. mobtyper_results - Aggregate MOB-typer report files for all identified plasmids. MobRecon reconstructs individual plasmid sequences from draft genome assemblies using the clustered plasmid reference databases.","title":"MobRecon"},{"location":"output/#diamond","text":"annotation/(vfdb|bacmet|cazy|iceberg2)/ ${sample_id}/${sample_id}_(VFDB|BACMET|CAZYDB|ICEberg2).txt : Blast6 formatted TSVs indicating BlastX results of the genes from each genome against the VFDB, BacMet, CAZy, and ICEberg2 databases. (VFDB|BACMET|CAZYDB|ICEberg2).txt : Table with all hits to this database, with a column describing which genome the match originates from. Sorted and filtered by the match's coverage. Diamond is a sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. We use DIAMOND to predict the presence of virulence factors, heavy metal resistance determinants, carbohydrate-active enzymes, and integrative and conjugative elements using VFDB , BacMet , CAZy , and ICEberg2 respectively.","title":"DIAMOND"},{"location":"output/#islandpath","text":"annotation/islandpath/ ${sample_id}/ : IslandPath results will be in one directory per genome. ${sample_id}.tsv : IslandPath results Dimob.log : IslandPath execution log IslandPath is standalone software to predict genomic islands in bacterial and archaeal genomes based on the presence of dinucleotide biases and mobility genes.","title":"IslandPath"},{"location":"output/#integronfinder","text":"Disabled by default. Enable by adding --run_integronfinder to your command. annotation/integron_finder/ Results_Integron_Finder_${sample_id}/ : IntegronFinder results will be in one directory per genome.
Integron Finder is a bioinformatics tool to find integrons in bacterial genomes.","title":"IntegronFinder"},{"location":"output/#phispy","text":"annotation/phispy/ ${sample_id}/ : PhiSpy results will be in one directory per genome. See the PhiSpy documentation for an extensive description of the output. PhiSpy is a tool for identification of prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions.","title":"PhiSpy"},{"location":"output/#poppunk","text":"poppunk_results/ poppunk_db/ - Results from PopPUNK's create-db command poppunk_${poppunk_model}/ - Results from PopPUNK's fit-model command poppunk_visualizations/ - Results from the poppunk_visualise command PopPUNK is a tool for clustering genomes.","title":"PopPUNK"},{"location":"output/#phylogenomics-and-pangenomics","text":"","title":"Phylogenomics and Pangenomics"},{"location":"output/#panaroo","text":"pangenomics/panaroo/results/ See the panaroo documentation for an extensive description of output provided. Panaroo is a Bacterial Pangenome Analysis Pipeline.","title":"Panaroo"},{"location":"output/#ppanggolin","text":"pangenomics/ppanggolin/ See the PPanGGoLiN documentation for an extensive description of output provided. PPanGGoLiN is a tool to build a partitioned pangenome graph from microbial genomes","title":"PPanGGoLiN"},{"location":"output/#fasttree","text":"phylogenomics/fasttree/ *.tre : Newick formatted maximum likelihood tree of core-genome alignment. FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences","title":"FastTree"},{"location":"output/#iqtree","text":"phylogenomics/iqtree/ *.treefile : Newick formatted maximum likelihood tree of core-genome alignment. IQTree is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood.","title":"IQTree"},{"location":"output/#snpsites","text":"phylogenomics/snpsites/ filtered_alignment.fas : Variant fasta file. constant.sites.txt : Text file containing counts of constant sites. SNPsites is a tool to rapidly extract SNPs from a multi-FASTA alignment.","title":"SNPsites"},{"location":"output/#dynamics","text":"","title":"Dynamics"},{"location":"output/#evolccm","text":"dynamics/EvolCCM/ EvolCCM_*tsv EvolCCM_*pvals EvolCCM_*X2 EvolCCM_*tre EvolCCM is the R implementation for CCM (Community Coevolution Model)","title":"EvolCCM"},{"location":"output/#rspr","text":"The outputs are approximate and exact Subtree Prune and Regraft (rSPR) distances between pairs of rooted phylogenetic trees. Each CSV file contains these distances and the tree sizes. The PNG files are heatmaps of these distances and their respective tree sizes. dynamics/rSPR/ approx - Approximate rSPR distances exact - Exact rSPR distances rSPR is a software package for calculating rooted subtree-prune-and-regraft distances and rooted agreement forests.","title":"rSPR"},{"location":"output/#recombination","text":"","title":"Recombination"},{"location":"output/#verticall","text":"dynamics/recombination/verticall/ verticall_cluster*.tsv - Verticall results for the genomes within this PopPUNK cluster. Verticall is a tool to help produce bacterial genome phylogenies which are not influenced by horizontally acquired sequences such as recombination.","title":"Verticall"},{"location":"output/#ska2","text":"dynamics/recombination/ska2/ cluster_*.aln - SKA2 results for the genomes within this PopPUNK cluster. 
SKA2 (Split Kmer Analysis) is a toolkit for prokaryotic (and any other small, haploid) DNA sequence analysis using split kmers.","title":"SKA2"},{"location":"output/#gubbins","text":"dynamics/recombination/gubbins/ cluster_*/ - Gubbins results for the genomes within this PopPUNK cluster. Gubbins is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions.","title":"Gubbins"},{"location":"output/#gene-order","text":"gene-order/ extraction/ - AMR genes of interest and their neighborhoods extracted from the assemblies. diamond/ - Pairwise alignments between all input genomes. clustering/ - Similarity and distance matrices for each AMR gene clustered via UPGMA, MCL and DBSCAN to identify similarities between their neighborhoods across all genomes. Gene Order is a subworkflow for bacterial gene order analysis, with outputs easily explorable through its partner visualization application Coeus .","title":"Gene Order"},{"location":"output/#pipeline-information","text":"pipeline_info/ Reports generated by Nextflow: execution_report.html , execution_timeline.html , execution_trace.txt and pipeline_dag.dot / pipeline_dag.svg . Reports generated by the pipeline: pipeline_report.html , pipeline_report.txt and software_versions.csv . Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv . Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.","title":"Pipeline information"},{"location":"output/#multiqc","text":"multiqc/ multiqc_report.html : a standalone HTML file that can be viewed in your web browser. multiqc_data/ : directory containing parsed statistics from the different tools used in the pipeline. multiqc_plots/ : directory containing static images from the report in various formats. MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info .","title":"MultiQC"},{"location":"params/","text":"beiko-lab/ARETE pipeline parameters AMR/VF LGT-focused bacterial genomics workflow Input/output options Define where the pipeline should find input data and save output data. Parameter Description Type Default Required Hidden input_sample_table Path to comma-separated file containing information about the samples in the experiment. Help You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. string outdir Path to the output directory where the results will be saved. string ./results db_cache Directory where the databases are located string email Email address for completion summary. 
Help Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file ( ~/.nextflow/config ) then you don't need to specify this on the command line for every run. string multiqc_title MultiQC report title. Printed as page header, used for filename if not otherwise specified. string Reference genome options Reference and outgroup genome fasta files required for the workflow. Parameter Description Type Default Required Hidden reference_genome Path to FASTA reference genome file. string QC Parameter Description Type Default Required Hidden run_checkm Run CheckM QC software boolean apply_filtering Filter assemblies on QC results boolean skip_kraken Don't run Kraken2 taxonomic classification boolean min_n50 Minimum N50 for filtering integer 10000 min_contigs_1000_bp Minimum number of contigs with >1000bp integer 1 min_contig_length Minimum average contig length integer 1 Annotation Parameters for the annotation subworkflow Parameter Description Type Default Required Hidden annotation_tools Comma-separated list of annotation tools to run string mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,report bakta_db Path to the BAKTA database string use_prokka Use Prokka (not Bakta) for annotating assemblies boolean min_pident Minimum match identity percentage for filtering integer 60 min_qcover Minimum coverage of each match for filtering number 0.6 skip_profile_creation Skip annotation feature profile creation boolean feature_profile_columns Columns to include in the feature profile string mobsuite,rgi,cazy,vfdb,iceberg,bacmet Phylogenomics Parameters for the phylogenomics subworkflow Parameter Description Type Default Required Hidden skip_phylo Skip Pangenomics and Phylogenomics subworkflow boolean use_ppanggolin Use ppanggolin for calculating the pangenome boolean use_full_alignment Use full alignment boolean use_fasttree Use FastTree boolean True PopPUNK Parameters for the lineage subworkflow Parameter Description Type Default Required Hidden skip_poppunk Skip PopPunk boolean poppunk_model Which PopPunk model to use (bgmm, dbscan, refine, threshold or lineage) string run_poppunk_qc Whether to run the QC step for PopPunk boolean enable_subsetting Enable subsetting workflow based on genome similarity boolean core_similarity Similarity threshold for core genomes number 99.99 accessory_similarity Similarity threshold for accessory genes number 99.0 Gene Order Parameters for the Gene Order Subworkflow Parameter Description Type Default Required Hidden run_gene_order Whether to run the Gene Order subworkflow boolean gene_order_percent_cutoff Cutoff percentage of genomes a gene should be present within to be included in extraction and subsequent analysis. Should a float between 0 and 1 (e.g., 0.25 means only genes present in a minimum of 25% of genomes are kept). number 0.25 gene_order_label_cols If using annotation files predicting features, list of space separated column names to be added to the gene names string None num_neighbors Neighborhood size to extract. Should be an even number N, such that N/2 neighbors upstream and N/2 neighbors downstream will be analyzed. integer 10 inflation Inflation hyperparameter value for Markov Clustering Algorithm. integer 2 epsilon Epsilon hyperparameter value for DBSCAN clustering. number 0.5 minpts Minpts hyperparameter value for DBSCAN clustering. 
integer 5 plot_clustering Create Clustering HTML Plots boolean Recombination Parameters for the recombination subworkflow Parameter Description Type Default Required Hidden run_recombination Run Recombination boolean run_verticall Run Verticall recombination tool boolean True run_gubbins Run Gubbins recombination tool boolean Dynamics Parameter Description Type Default Required Hidden run_evolccm Run the community coevolution model boolean run_rspr Run rSPR boolean min_rspr_distance Minimum rSPR distance used to define processing groups integer 10 min_branch_length Minimum rSPR branch length integer 0 max_support_threshold Maximum rSPR support threshold number 0.7 max_approx_rspr Maximum approximate rSPR distance for filtering integer -1 min_heatmap_approx_rspr Minimum approximate rSPR distance used to generate heatmap integer 0 max_heatmap_approx_rspr Maximum approximate rSPR distance used to generate heatmap integer -1 min_heatmap_exact_rspr Minimum exact rSPR distance used to generate heatmap integer 0 max_heatmap_exact_rspr Maximum exact rSPR distance used to generate heatmap integer -1 core_gene_tree Core (or reference) genome tree. Used in the rSPR and evolCCM entries. string concatenated_annotation TSV table of annotations for all genomes. Such as the ones generated by Bakta or Prokka in ARETE. string feature_profile Feature profile TSV (A presence-absence matrix). Used in the evolCCM entry. string Institutional config options Parameters used to describe centralised config profiles. These should not be edited. Parameter Description Type Default Required Hidden custom_config_version Git commit id for Institutional configs. string master True custom_config_base Base directory for Institutional configs. Help If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter. string https://raw.githubusercontent.com/nf-core/configs/master True hostnames Institutional configs hostname. string True config_profile_name Institutional config name. string True config_profile_description Institutional config description. string True config_profile_contact Institutional config contact information. string True config_profile_url Institutional config URL link. string True Max job request options Set the top limit for requested resources for any single job. Parameter Description Type Default Required Hidden max_cpus Maximum number of CPUs that can be requested for any single job. Help Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1 integer 16 True max_memory Maximum amount of memory that can be requested for any single job. Help Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB' string 128.GB True max_time Maximum amount of time that can be requested for any single job. Help Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h' string 240.h True Generic options Less common options for the pipeline, typically set in a config file. Parameter Description Type Default Required Hidden help Display help text. boolean True publish_dir_mode Method used to save pipeline results to output directory. 
Help The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details. string copy True email_on_fail Email address for completion summary, only when pipeline fails. Help An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully. string True plaintext_email Send plain-text email instead of HTML. boolean True max_multiqc_email_size File size limit when attaching MultiQC reports to summary emails. string 25.MB True monochrome_logs Do not use coloured log outputs. boolean True multiqc_config Custom config file to supply to MultiQC. string True tracedir Directory to keep pipeline Nextflow logs and reports. string ${params.outdir}/pipeline_info True validate_params Boolean whether to validate parameters against the schema at runtime boolean True True show_hidden_params Show all params when using --help Help By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help . Specifying this option will tell the pipeline to show all parameters. boolean True enable_conda Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter. boolean True singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. Help This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues. boolean True schema_ignore_params string genomes,modules multiqc_logo string True","title":"Parameters"},{"location":"params/#beiko-labarete-pipeline-parameters","text":"AMR/VF LGT-focused bacterial genomics workflow","title":"beiko-lab/ARETE pipeline parameters"},{"location":"params/#inputoutput-options","text":"Define where the pipeline should find input data and save output data. Parameter Description Type Default Required Hidden input_sample_table Path to comma-separated file containing information about the samples in the experiment. Help You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. string outdir Path to the output directory where the results will be saved. string ./results db_cache Directory where the databases are located string email Email address for completion summary. Help Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file ( ~/.nextflow/config ) then you don't need to specify this on the command line for every run. string multiqc_title MultiQC report title. Printed as page header, used for filename if not otherwise specified. string","title":"Input/output options"},{"location":"params/#reference-genome-options","text":"Reference and outgroup genome fasta files required for the workflow. Parameter Description Type Default Required Hidden reference_genome Path to FASTA reference genome file. 
string","title":"Reference genome options"},{"location":"params/#qc","text":"Parameter Description Type Default Required Hidden run_checkm Run CheckM QC software boolean apply_filtering Filter assemblies on QC results boolean skip_kraken Don't run Kraken2 taxonomic classification boolean min_n50 Minimum N50 for filtering integer 10000 min_contigs_1000_bp Minimum number of contigs with >1000bp integer 1 min_contig_length Minimum average contig length integer 1","title":"QC"},{"location":"params/#annotation","text":"Parameters for the annotation subworkflow Parameter Description Type Default Required Hidden annotation_tools Comma-separated list of annotation tools to run string mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,report bakta_db Path to the BAKTA database string use_prokka Use Prokka (not Bakta) for annotating assemblies boolean min_pident Minimum match identity percentage for filtering integer 60 min_qcover Minimum coverage of each match for filtering number 0.6 skip_profile_creation Skip annotation feature profile creation boolean feature_profile_columns Columns to include in the feature profile string mobsuite,rgi,cazy,vfdb,iceberg,bacmet","title":"Annotation"},{"location":"params/#phylogenomics","text":"Parameters for the phylogenomics subworkflow Parameter Description Type Default Required Hidden skip_phylo Skip Pangenomics and Phylogenomics subworkflow boolean use_ppanggolin Use ppanggolin for calculating the pangenome boolean use_full_alignment Use full alignment boolean use_fasttree Use FastTree boolean True","title":"Phylogenomics"},{"location":"params/#poppunk","text":"Parameters for the lineage subworkflow Parameter Description Type Default Required Hidden skip_poppunk Skip PopPunk boolean poppunk_model Which PopPunk model to use (bgmm, dbscan, refine, threshold or lineage) string run_poppunk_qc Whether to run the QC step for PopPunk boolean enable_subsetting Enable subsetting workflow based on genome similarity boolean core_similarity Similarity threshold for core genomes number 99.99 accessory_similarity Similarity threshold for accessory genes number 99.0","title":"PopPUNK"},{"location":"params/#gene-order","text":"Parameters for the Gene Order Subworkflow Parameter Description Type Default Required Hidden run_gene_order Whether to run the Gene Order subworkflow boolean gene_order_percent_cutoff Cutoff percentage of genomes a gene should be present within to be included in extraction and subsequent analysis. Should a float between 0 and 1 (e.g., 0.25 means only genes present in a minimum of 25% of genomes are kept). number 0.25 gene_order_label_cols If using annotation files predicting features, list of space separated column names to be added to the gene names string None num_neighbors Neighborhood size to extract. Should be an even number N, such that N/2 neighbors upstream and N/2 neighbors downstream will be analyzed. integer 10 inflation Inflation hyperparameter value for Markov Clustering Algorithm. integer 2 epsilon Epsilon hyperparameter value for DBSCAN clustering. number 0.5 minpts Minpts hyperparameter value for DBSCAN clustering. 
integer 5 plot_clustering Create Clustering HTML Plots boolean","title":"Gene Order"},{"location":"params/#recombination","text":"Parameters for the recombination subworkflow Parameter Description Type Default Required Hidden run_recombination Run Recombination boolean run_verticall Run Verticall recombination tool boolean True run_gubbins Run Gubbins recombination tool boolean","title":"Recombination"},{"location":"params/#dynamics","text":"Parameter Description Type Default Required Hidden run_evolccm Run the community coevolution model boolean run_rspr Run rSPR boolean min_rspr_distance Minimum rSPR distance used to define processing groups integer 10 min_branch_length Minimum rSPR branch length integer 0 max_support_threshold Maximum rSPR support threshold number 0.7 max_approx_rspr Maximum approximate rSPR distance for filtering integer -1 min_heatmap_approx_rspr Minimum approximate rSPR distance used to generate heatmap integer 0 max_heatmap_approx_rspr Maximum approximate rSPR distance used to generate heatmap integer -1 min_heatmap_exact_rspr Minimum exact rSPR distance used to generate heatmap integer 0 max_heatmap_exact_rspr Maximum exact rSPR distance used to generate heatmap integer -1 core_gene_tree Core (or reference) genome tree. Used in the rSPR and evolCCM entries. string concatenated_annotation TSV table of annotations for all genomes. Such as the ones generated by Bakta or Prokka in ARETE. string feature_profile Feature profile TSV (A presence-absence matrix). Used in the evolCCM entry. string","title":"Dynamics"},{"location":"params/#institutional-config-options","text":"Parameters used to describe centralised config profiles. These should not be edited. Parameter Description Type Default Required Hidden custom_config_version Git commit id for Institutional configs. string master True custom_config_base Base directory for Institutional configs. Help If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter. string https://raw.githubusercontent.com/nf-core/configs/master True hostnames Institutional configs hostname. string True config_profile_name Institutional config name. string True config_profile_description Institutional config description. string True config_profile_contact Institutional config contact information. string True config_profile_url Institutional config URL link. string True","title":"Institutional config options"},{"location":"params/#max-job-request-options","text":"Set the top limit for requested resources for any single job. Parameter Description Type Default Required Hidden max_cpus Maximum number of CPUs that can be requested for any single job. Help Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1 integer 16 True max_memory Maximum amount of memory that can be requested for any single job. Help Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB' string 128.GB True max_time Maximum amount of time that can be requested for any single job. Help Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. 
--max_time '2.h' string 240.h True","title":"Max job request options"},{"location":"params/#generic-options","text":"Less common options for the pipeline, typically set in a config file. Parameter Description Type Default Required Hidden help Display help text. boolean True publish_dir_mode Method used to save pipeline results to output directory. Help The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details. string copy True email_on_fail Email address for completion summary, only when pipeline fails. Help An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully. string True plaintext_email Send plain-text email instead of HTML. boolean True max_multiqc_email_size File size limit when attaching MultiQC reports to summary emails. string 25.MB True monochrome_logs Do not use coloured log outputs. boolean True multiqc_config Custom config file to supply to MultiQC. string True tracedir Directory to keep pipeline Nextflow logs and reports. string ${params.outdir}/pipeline_info True validate_params Boolean whether to validate parameters against the schema at runtime boolean True True show_hidden_params Show all params when using --help Help By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help . Specifying this option will tell the pipeline to show all parameters. boolean True enable_conda Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter. boolean True singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. Help This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues. boolean True schema_ignore_params string genomes,modules multiqc_logo string True","title":"Generic options"},{"location":"resource_profiles/","text":"ARETE and dataset size Currently ARETE has three distinct profiles that change the pipeline execution in some ways: The default profile (which we can call small ), the medium profile and the large profile. These three profiles were developed based on the size and diversity of the input dataset and change some parameter defaults based on tests we have performed on similar-sized datasets. If you want to first gauge the potential diversity of your dataset and have some input assemblies you can try the PopPUNK entry . One of the outputs will provide insight into how many clusters, or lineages, your dataset divides into. The sizes are: For the default or small profile, we expect datasets with 100 samples/assemblies or fewer. It runs on the default pipeline parameters, with no changes. For the medium profile, we expect datasets with >100 and <1000 samples. It increases the default resource requirements for most processes and also uses PPanGGoLiN for pangenome construction, instead of Panaroo . For the large profile, we expect datasets with >1000 samples. It also increases default resource requirements for some processes and uses PPanGGoLin. Additionally, it enables PopPUNK subsampling , with default parameters . For the light profile, we expect datasets with at most 12 samples. 
This is a profile primarily designed to run on personal computers and it disables most ARETE processes.","title":"Dataset Size"},{"location":"resource_profiles/#arete-and-dataset-size","text":"Currently ARETE has three distinct profiles that change the pipeline execution in some ways: The default profile (which we can call small ), the medium profile and the large profile. These three profiles were developed based on the size and diversity of the input dataset and change some parameter defaults based on tests we have performed on similar-sized datasets. If you want to first gauge the potential diversity of your dataset and have some input assemblies you can try the PopPUNK entry . One of the outputs will provide insight into how many clusters, or lineages, your dataset divides into. The sizes are: For the default or small profile, we expect datasets with 100 samples/assemblies or fewer. It runs on the default pipeline parameters, with no changes. For the medium profile, we expect datasets with >100 and <1000 samples. It increases the default resource requirements for most processes and also uses PPanGGoLiN for pangenome construction, instead of Panaroo . For the large profile, we expect datasets with >1000 samples. It also increases default resource requirements for some processes and uses PPanGGoLin. Additionally, it enables PopPUNK subsampling , with default parameters . For the light profile, we expect datasets with at most 12 samples. This is a profile primarily designed to run on personal computers and it disables most ARETE processes.","title":"ARETE and dataset size"},{"location":"subsampling/","text":"PopPUNK subsetting The subsampling subworkflow is executed if you want to reduce the number of genomes that get added to the phylogenomics subworkflow. By reducing the number of genomes, you can potentially reduce resource requirements for the pangenomics and phylogenomics tools. To enable this subworkflow, add --enable_subsetting when running beiko-lab/ARETE. This will subset genomes based on their core genome similarity and accessory genome similarity, as calculated via their PopPUNK distances. By default, the threshold is --core_similarity 99.9 and --accessory_similarity 99 . But these can be changed by adding these parameters to your execution. What happens then is if any pair of genomes is this similar, only one genome from this pair will be included in the phylogenomic section. All of the removed genome IDs will be present under poppunk_results/removed_genomes.txt . By adding --enable_subsetting , you'll be adding two processes to the execution DAG: POPPUNK_EXTRACT_DISTANCES: This process will extract pair-wise distances between all genomes, returning a table under poppunk_results/distances/ . This table will be used to perform the subsetting. MAKE_HEATMAP: This process will create a heatmap showing different similarity thresholds and the number of genomes that'd be present in each of the possible subsets. It'll also be under poppunk_results/distances/ . Example command The command below will execute the 'annotation' ARETE entry with subsetting enabled, with a core similarity threshold of 99% and an accessory similarity of 95%. 
nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --enable_subsetting \\ --core_similarity 99 \\ --accessory_similarity 95 \\ -profile docker \\ -entry annotation Be sure not to include --skip_poppunk in your command, because that will then disable all PopPUNK-related processes, including the subsetting subworkflow.","title":"Subsampling"},{"location":"subsampling/#poppunk-subsetting","text":"The subsampling subworkflow is executed if you want to reduce the number of genomes that get added to the phylogenomics subworkflow. By reducing the number of genomes, you can potentially reduce resource requirements for the pangenomics and phylogenomics tools. To enable this subworkflow, add --enable_subsetting when running beiko-lab/ARETE. This will subset genomes based on their core genome similarity and accessory genome similarity, as calculated via their PopPUNK distances. By default, the threshold is --core_similarity 99.9 and --accessory_similarity 99 . But these can be changed by adding these parameters to your execution. What happens then is if any pair of genomes is this similar, only one genome from this pair will be included in the phylogenomic section. All of the removed genome IDs will be present under poppunk_results/removed_genomes.txt . By adding --enable_subsetting , you'll be adding two processes to the execution DAG: POPPUNK_EXTRACT_DISTANCES: This process will extract pair-wise distances between all genomes, returning a table under poppunk_results/distances/ . This table will be used to perform the subsetting. MAKE_HEATMAP: This process will create a heatmap showing different similarity thresholds and the number of genomes that'd be present in each of the possible subsets. It'll also be under poppunk_results/distances/ .","title":"PopPUNK subsetting"},{"location":"subsampling/#example-command","text":"The command below will execute the 'annotation' ARETE entry with subsetting enabled, with a core similarity threshold of 99% and an accessory similarity of 95%. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --enable_subsetting \\ --core_similarity 99 \\ --accessory_similarity 95 \\ -profile docker \\ -entry annotation Be sure not to include --skip_poppunk in your command, because that will then disable all PopPUNK-related processes, including the subsetting subworkflow.","title":"Example command"},{"location":"usage/","text":"beiko-lab/ARETE: Usage Introduction The ARETE pipeline is designed as an end-to-end workflow manager for genome assembly, annotation, and phylogenetic analysis, beginning with read data. However, in some cases a user may wish to stop the pipeline prior to annotation or use the annotation features of the workflow with pre-existing assemblies. Therefore, ARETE supports several different use cases: Run the full pipeline end-to-end. Input a set of reads and stop after assembly. Input a set of assemblies and perform QC. Input a set of assemblies and perform annotation and taxonomic analyses. Input a set of assemblies and perform genome clustering with PopPUNK. Input a set of assemblies and perform phylogenomic and pangenomic analysis. This document will describe how to perform each workflow. \"Running the pipeline\" will show some example commands on how to use these different entries to ARETE. Samplesheet input No matter your use case, you will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location.
For full runs and assembly, it has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. --input_sample_table '[path to samplesheet file]' Full workflow or assembly samplesheet The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire; however, there is a strict requirement for the first 3 columns to match those defined in the table below. A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where TREATMENT_REP3 has been sequenced twice. sample,fastq_1,fastq_2 CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fastq_1 Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". An example samplesheet has been provided with the pipeline. Annotation only samplesheet The ARETE pipeline allows users to provide pre-existing assemblies to make use of the annotation and reporting features of the workflow. Users may use the assembly_qc entry point to perform QC on the assemblies. Note that the QC workflow does not automatically filter low-quality assemblies; it simply generates QC reports! annotation , assembly_qc and poppunk workflows accept the same format of sample sheet. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fna_file_path Full path to fna file for assembly or genome. File must have .fna file extension. An example samplesheet has been provided with the pipeline. Phylogenomics and Pangenomics only samplesheet The ARETE pipeline allows users to provide pre-existing assemblies to make use of the phylogenomic and pangenomic features of the workflow. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. gff_file_path Full path to GFF file for assembly or genome. File must have .gff or .gff3 file extension. These files can be the ones generated by Prokka or Bakta in ARETE's annotation subworkflow. Reference Genome For full workflow or assembly, users may provide a path to a reference genome in fasta format for use in assembly evaluation. --reference_genome ref.fasta Running the pipeline The typical command for running the pipeline is as follows: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker This will launch the pipeline with the docker configuration profile. See below for more information about profiles.
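The same launch command can be adapted to your environment. As a sketch only (the output path is a placeholder), to write results somewhere other than the default ./results directory and to use Singularity instead of Docker:
nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm --outdir /scratch/arete_results -profile singularity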
Note that the pipeline will create the following files in your working directory: work # Directory containing the nextflow working files results # Finished results (configurable, see below) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, eg. history of pipeline runs and old logs. As written above, the pipeline also allows users to execute only assembly or only annotation. Assembly Entry To execute assembly (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker Assembly QC Entry To execute QC on pre-existing assemblies (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly_qc --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker Annotation Entry To execute annotation of pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry annotation --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker PopPUNK Entry To execute genome clustering with PopPUNK on pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry poppunk --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker Phylogenomics and Pangenomics Entry To execute phylogenomic and pangenomic analysis on pre-existing assemblies: nextflow run beiko-lab/ARETE -entry phylogenomics --input_sample_table samplesheet.csv -profile docker rSPR Entry To execute the rSPR analysis on pre-existing trees: nextflow run beiko-lab/ARETE \\ -entry rspr \\ --input_sample_table samplesheet.csv \\ --core_gene_tree core_gene_alignment.tre \\ --concatenated_annotation BAKTA.txt \\ -profile docker The parameters being: --core_gene_tree - The reference tree, coming from a core genome alignment, like the one generated by panaroo in ARETE. --concatenated_annotation - The tabular annotation results (TSV) for all genomes, like the ones generated at the end of Prokka or Bakta in ARETE. Although useful, this parameter is not required to execute the rSPR entry. --input_sample_table - A samplesheet containing all individual gene trees in the following format: gene_tree,path CDS_0000,/path/to/CDS_0000.tre CDS_0001,/path/to/CDS_0001.tre CDS_0002,/path/to/CDS_0002.tre CDS_0003,/path/to/CDS_0003.tre CDS_0004,/path/to/CDS_0004.tre evolCCM Entry To execute the evolCCM analysis on a pre-existing reference tree and feature profile: nextflow run beiko-lab/ARETE \\ -entry evolccm \\ --core_gene_tree core_gene_alignment.tre \\ --feature_profile feature_profile.tsv.gz \\ -profile docker The parameters being: --core_gene_tree - The reference tree, coming from a core genome alignment, like the one generated by panaroo in ARETE. --feature_profile - A presence/absence TSV matrix of features in genomes. Genome names should be the same as in the core tree and should be contained in a 'genome_id' column, with all other columns representing features absent (0) or present (1) in each genome.
I.e.: genome_id plasmid_AA155 plasmid_AA161 ED010 0 0 ED017 0 1 ED040 0 0 ED073 0 1 ED075 1 1 ED082 0 1 ED142 0 1 ED178 0 1 ED180 0 0 Recombination Entry To execute the recombination analysis on pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE \\ -entry recombination \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ -profile docker Gene Order Entry To execute the Gene Order analysis on pre-existing assemblies and RGI annotations: nextflow run beiko-lab/ARETE \\ -entry gene_order \\ --input_sample_table gene_order_samplesheet.csv \\ -profile docker --input_sample_table - A samplesheet containing a fasta file, a genbank file and an RGI output file for each assembly: sample,fna_file_path,gbk,rgi SAMD00052607,SAMD00052607.faa,SAMD00052607.gbk,SAMD00052607_rgi.txt SAMEA1466699,SAMEA1466699.faa,SAMEA1466699.gbk,SAMEA1466699_rgi.txt SAMEA1486355,SAMEA1486355.faa,SAMEA1486355.gbk,SAMEA1486355_rgi.txt Updating the pipeline When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: nextflow pull beiko-lab/ARETE Reproducibility It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the ARETE releases page and find the latest version number - numeric only (eg. 1.3.1 ). Then specify this when running the pipeline with -r (one hyphen) - eg. -r 1.3.1 . This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. Core Nextflow arguments NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen). -profile Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud) - see below. We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation . Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles. If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH . This is not recommended. 
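Putting the version-pinning and profile advice together, a fully specified launch might look like the sketch below (the release number is illustrative; check the releases page for the current one):
nextflow run beiko-lab/ARETE -r 1.3.1 -profile docker --input_sample_table samplesheet.csv --poppunk_model bgmm
The generic profiles bundled with the pipeline are listed below.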
docker A generic configuration profile to be used with Docker singularity A generic configuration profile to be used with Singularity podman A generic configuration profile to be used with Podman shifter A generic configuration profile to be used with Shifter charliecloud A generic configuration profile to be used with Charliecloud test A profile with a complete configuration for automated testing Can run on personal computers with at least 6GB of RAM and 2 CPUs Includes links to test data so needs no other parameters -resume Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: -resume [run-name] . Use the nextflow log command to show previous run names. -c Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information. Custom resource requests Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of 143 (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three times then the pipeline is stopped. Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process UNICYCLER 32GB of memory, you could use the following config: process { withName: UNICYCLER { memory = 32.GB } } To find the exact name of a process for which you wish to modify the compute resources, check the live status of a nextflow run displayed on your terminal or check the nextflow error for a line like so: Error executing process > 'bwa' . In this case the name to specify in the custom config file is bwa . See the main Nextflow documentation for more information. Running in the background Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished. The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file. Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs). Nextflow memory requirements In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile ): NXF_OPTS='-Xms1g -Xmx4g' Sometimes LevelDB, which is used by Nextflow to track execution metadata, can lead to memory-related issues, often showing as a SIGBUS error. This tends to happen when running Nextflow in SLURM environments . In this case, setting NXF_OPTS=\"-Dleveldb.mmap=false\" in your ~/.bashrc or immediately before executing nextflow run usually solves the issue. ARETE's storage requirements ARETE generates a lot of intermediary files, which is even further exacerbated if you are running on a dataset with more than 100 genomes.
Before running ARETE you should make sure you have at least 500 GB of free storage. After running ARETE and checking your results, you can remove the work/ directory in your working directory, which is where Nextflow stores its cache. Be aware that once work/ is deleted, the pipeline can no longer reuse cached results via the -resume flag; every process will run from scratch.","title":"Usage"},{"location":"usage/#beiko-labarete-usage","text":"","title":"beiko-lab/ARETE: Usage"},{"location":"usage/#introduction","text":"The ARETE pipeline is designed as an end-to-end workflow manager for genome assembly, annotation, and phylogenetic analysis, beginning with read data. However, in some cases a user may wish to stop the pipeline prior to annotation or use the annotation features of the workflow with pre-existing assemblies. Therefore, ARETE supports several use cases: Run the full pipeline end-to-end. Input a set of reads and stop after assembly. Input a set of assemblies and perform QC. Input a set of assemblies and perform annotation and taxonomic analyses. Input a set of assemblies and perform genome clustering with PopPUNK. Input a set of assemblies and perform phylogenomic and pangenomic analysis. This document describes how to perform each workflow. \"Running the pipeline\" shows example commands for these different entry points to ARETE.","title":"Introduction"},{"location":"usage/#samplesheet-input","text":"No matter your use case, you will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. For full runs and assembly, it has to be a comma-separated file with 3 columns and a header row, as shown in the examples below. --input_sample_table '[path to samplesheet file]'","title":"Samplesheet input"},{"location":"usage/#full-workflow-or-assembly-samplesheet","text":"The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire; however, there is a strict requirement for the first 3 columns to match those defined in the table below. A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where TREATMENT_REP3 has been sequenced twice. sample,fastq_1,fastq_2 CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fastq_1 Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\".
An example samplesheet has been provided with the pipeline.","title":"Full workflow or assembly samplesheet"},{"location":"usage/#annotation-only-samplesheet","text":"The ARETE pipeline allows users to provide pre-existing assemblies to make use of the annotation and reporting features of the workflow. Users may use the assembly_qc entry point to perform QC on the assemblies. Note that the QC workflow does not automatically filter low-quality assemblies; it simply generates QC reports! The annotation , assembly_qc and poppunk workflows accept the same format of sample sheet. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fna_file_path Full path to the fna file for the assembly or genome. File must have the .fna file extension. An example samplesheet has been provided with the pipeline.","title":"Annotation only samplesheet"},{"location":"usage/#phylogenomics-and-pangenomics-only-samplesheet","text":"The ARETE pipeline allows users to provide pre-existing assemblies to make use of the phylogenomic and pangenomic features of the workflow. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. gff_file_path Full path to the GFF file for the assembly or genome. File must have the .gff or .gff3 file extension. These files can be the ones generated by Prokka or Bakta in ARETE's annotation subworkflow.","title":"Phylogenomics and Pangenomics only samplesheet"},{"location":"usage/#reference-genome","text":"For the full workflow or assembly, users may provide a path to a reference genome in FASTA format for use in assembly evaluation. --reference_genome ref.fasta","title":"Reference Genome"},{"location":"usage/#running-the-pipeline","text":"The typical command for running the pipeline is as follows: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker This will launch the pipeline with the docker configuration profile. See below for more information about profiles. Note that the pipeline will create the following files in your working directory: work # Directory containing the nextflow working files results # Finished results (configurable, see below) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, e.g. history of pipeline runs and old logs.
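If a run is interrupted partway, the same command can usually be relaunched from cached results by appending Nextflow's -resume flag (described under Core Nextflow arguments); as a sketch, assuming the typical command above, this should pick up where the previous run stopped: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker -resume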
As written above, the pipeline also allows users to execute only assembly or only annotation.","title":"Running the pipeline"},{"location":"usage/#assembly-entry","text":"To execute assembly (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker","title":"Assembly Entry"},{"location":"usage/#assembly-qc-entry","text":"To execute QC on pre-existing assemblies (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly_qc --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker","title":"Assembly QC Entry"},{"location":"usage/#annotation-entry","text":"To execute annotation of pre-existing assemblies (the PopPUNK model can be one of bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry annotation --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker","title":"Annotation Entry"},{"location":"usage/#poppunk-entry","text":"To execute genome clustering with PopPUNK on pre-existing assemblies (the PopPUNK model can be one of bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry poppunk --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker","title":"PopPUNK Entry"},{"location":"usage/#phylogenomics-and-pangenomics-entry","text":"To execute phylogenomic and pangenomic analysis on pre-existing assemblies: nextflow run beiko-lab/ARETE -entry phylogenomics --input_sample_table samplesheet.csv -profile docker","title":"Phylogenomics and Pangenomics Entry"},{"location":"usage/#rspr-entry","text":"To execute the rSPR analysis on pre-existing trees: nextflow run beiko-lab/ARETE \\ -entry rspr \\ --input_sample_table samplesheet.csv \\ --core_gene_tree core_gene_alignment.tre \\ --concatenated_annotation BAKTA.txt \\ -profile docker The parameters being: --core_gene_tree - The reference tree, coming from a core genome alignment, like the one generated by panaroo in ARETE. --concatenated_annotation - The tabular annotation results (TSV) for all genomes, like the ones generated at the end of Prokka or Bakta in ARETE. Although useful, this parameter is not required to execute the rSPR entry. --input_sample_table - A samplesheet containing all individual gene trees in the following format: gene_tree,path CDS_0000,/path/to/CDS_0000.tre CDS_0001,/path/to/CDS_0001.tre CDS_0002,/path/to/CDS_0002.tre CDS_0003,/path/to/CDS_0003.tre CDS_0004,/path/to/CDS_0004.tre","title":"rSPR Entry"},{"location":"usage/#evolccm-entry","text":"To execute the evolCCM analysis on a pre-existing reference tree and feature profile: nextflow run beiko-lab/ARETE \\ -entry evolccm \\ --core_gene_tree core_gene_alignment.tre \\ --feature_profile feature_profile.tsv.gz \\ -profile docker The parameters being: --core_gene_tree - The reference tree, coming from a core genome alignment, like the one generated by panaroo in ARETE. --feature_profile - A presence/absence TSV matrix of features in genomes. Genome names should match those in the core tree and should be contained in a 'genome_id' column, with all other columns representing features absent (0) or present (1) in each genome.
For example: genome_id plasmid_AA155 plasmid_AA161 ED010 0 0 ED017 0 1 ED040 0 0 ED073 0 1 ED075 1 1 ED082 0 1 ED142 0 1 ED178 0 1 ED180 0 0","title":"evolCCM Entry"},{"location":"usage/#recombination-entry","text":"To execute the recombination analysis on pre-existing assemblies (the PopPUNK model can be one of bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE \\ -entry recombination \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ -profile docker","title":"Recombination Entry"},{"location":"usage/#gene-order-entry","text":"To execute the Gene Order analysis on pre-existing assemblies and RGI annotations: nextflow run beiko-lab/ARETE \\ -entry gene_order \\ --input_sample_table gene_order_samplesheet.csv \\ -profile docker --input_sample_table - A samplesheet containing a FASTA file, a GenBank file and an RGI output file for each assembly: sample,fna_file_path,gbk,rgi SAMD00052607,SAMD00052607.faa,SAMD00052607.gbk,SAMD00052607_rgi.txt SAMEA1466699,SAMEA1466699.faa,SAMEA1466699.gbk,SAMEA1466699_rgi.txt SAMEA1486355,SAMEA1486355.faa,SAMEA1486355.gbk,SAMEA1486355_rgi.txt","title":"Gene Order Entry"},{"location":"usage/#updating-the-pipeline","text":"When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, regularly update the cached version: nextflow pull beiko-lab/ARETE","title":"Updating the pipeline"},{"location":"usage/#reproducibility","text":"It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the ARETE releases page and find the latest version number - numeric only (e.g. 1.3.1 ). Then specify this when running the pipeline with -r (one hyphen) - e.g. -r 1.3.1 . This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.","title":"Reproducibility"},{"location":"usage/#core-nextflow-arguments","text":"NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).","title":"Core Nextflow arguments"},{"location":"usage/#-profile","text":"Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud) - see below. We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation . Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important!
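For instance (an illustrative sketch, not taken verbatim from the ARETE docs), the test profile listed below includes links to test data, so it can be stacked with a container profile and run with no further parameters: nextflow run beiko-lab/ARETE -profile test,docker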
They are loaded in sequence, so later profiles can overwrite earlier profiles. If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH . This is not recommended. docker A generic configuration profile to be used with Docker singularity A generic configuration profile to be used with Singularity podman A generic configuration profile to be used with Podman shifter A generic configuration profile to be used with Shifter charliecloud A generic configuration profile to be used with Charliecloud test A profile with a complete configuration for automated testing Can run on personal computers with at least 6GB of RAM and 2 CPUs Includes links to test data so needs no other parameters","title":"-profile"},{"location":"usage/#-resume","text":"Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: -resume [run-name] . Use the nextflow log command to show previous run names.","title":"-resume"},{"location":"usage/#-c","text":"Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.","title":"-c"},{"location":"usage/#custom-resource-requests","text":"Each step in the pipeline has a default set of requirements for the number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of 143 (exceeded requested resources) it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after three attempts, the pipeline is stopped. Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process UNICYCLER 32GB of memory, you could use the following config: process { withName: UNICYCLER { memory = 32.GB } } To find the exact name of a process whose compute resources you wish to modify, check the live status of a Nextflow run displayed in your terminal, or check the Nextflow error for a line like: Error executing process > 'bwa' . In this case the name to specify in the custom config file is bwa . See the main Nextflow documentation for more information.","title":"Custom resource requests"},{"location":"usage/#running-in-the-background","text":"Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished. The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file. Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).","title":"Running in the background"},{"location":"usage/#nextflow-memory-requirements","text":"In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile ): NXF_OPTS='-Xms1g -Xmx4g' Sometimes LevelDB, which is used by Nextflow to track execution metadata, can lead to memory-related issues, often showing as a SIGBUS error. This tends to happen when running Nextflow in SLURM environments . In this case, setting NXF_OPTS=\"-Dleveldb.mmap=false\" in your ~/.bashrc or immediately before executing nextflow run usually solves the issue.","title":"Nextflow memory requirements"},{"location":"usage/#aretes-storage-requirements","text":"ARETE generates a lot of intermediate files, and this is exacerbated further when running on datasets of more than 100 genomes. Before running ARETE you should make sure you have at least 500 GB of free storage. After running ARETE and checking your results, you can remove the work/ directory in your working directory, which is where Nextflow stores its cache. Be aware that once work/ is deleted, the pipeline can no longer reuse cached results via the -resume flag; every process will run from scratch.","title":"ARETE's storage requirements"}]}
\ No newline at end of file