- Initialization: The script sets default values for parameters such as percent identity, segment length, and the number of threads used by later steps.
- Command Line Options: The script parses command line options that customize the pipeline: the input data, a run identifier, the multiple-chromosomes flag, the number of genomes, percent identity, partial order alignment parameters, segment length, and thread count.
- Parameter Validation: The script checks that the provided parameters are within valid ranges and supplies default values for some parameters that are not provided.
- Input Sample Handling: The script checks whether the input is a directory or a single file. If it is a directory, the files in it are combined into a single file.
- Sequence Partitioning: If the "multiple_chromosomes" flag is enabled, the script partitions the sequences. It calculates genetic distances between sequences, identifies communities, and analyzes each community separately.
- Running the Pipeline: The script sets the required environment variables and runs a Snakemake pipeline. The pipeline handles tasks such as sorting data, running PGGB (the core pangenome graph building tool used in this workflow), and generating reports.
- Output Generation: After the pipeline completes, the script creates an output directory and stores the results there.
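The parameter validation and input handling steps above can be sketched in shell as follows. This is a minimal illustration, not the code in run.sh: the function names in_range and prepare_input, the 1 to 100 range for percent identity, and the restriction to uncompressed .fa files are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Illustrative sketch only. Function names and ranges are assumptions,
# not taken from run.sh.

# Succeed if $1 is a non-negative integer in the closed range [$2, $3].
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Resolve the input sample to a single FASTA file and print its path.
#   $1: input path (directory or single file)
#   $2: path to write the combined file to when $1 is a directory
# Assumption: a directory holds uncompressed *.fa files only.
prepare_input() {
    if [ -d "$1" ]; then
        # Directory: concatenate every .fa file into one combined file.
        cat "$1"/*.fa > "$2" || return 1
        printf '%s\n' "$2"
    elif [ -f "$1" ]; then
        # Single file: use it as-is.
        printf '%s\n' "$1"
    else
        echo "error: $1 is neither a file nor a directory" >&2
        return 1
    fi
}
```

With these helpers, a call such as in_range 90 1 100 succeeds and in_range 150 1 100 fails, so a caller can stop with an error message before any pipeline step starts.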
It is recommended to set up a new conda environment for PANcake.
- Clone this repo into your working directory
    git clone https://www.github.com/KyranWissink/pancake
    cd pancake
- Set up the conda environment
    conda env create -f environment.yml
    conda activate pancake
- Run bash run.sh to display the available parameters
    PANcake Version: 0.8
    Usage:
        bash run.sh [options] -i <input>
    Description:
        Pipeline for pangenome graph creation using pggb
    Options:
        Mandatory:
            -i    --input-sample          input file(s) (fa | fa.gz | dir)
        Optional:
            -r    --runid                 name for the run. Will also name directories this.
            -mc   --multiple-chromosomes  Use this parameter if the sample contains multiple chromosomes.
            -n    --number-of-genomes     The number of genomes in the sample
            -p    --percent-identity      The lowest similarity between all sequences in percentages
            -poa  --poa-parameters        The partial order alignment parameters to use (asm5, asm10, asm20)
            -s    --segment-length        Segment length for mapping [default: 10k]
            -t    --threads               Number of threads to use [default: 16]
- View output at output/${runid}/
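Putting the options together, a run might be assembled as shown here. The input directory genomes/, the run name demo, and the numeric values are placeholders chosen for illustration; only the flags themselves come from the usage text. The snippet prints the command rather than executing it, so it can be reviewed first.

```shell
#!/usr/bin/env bash
# Placeholder values (assumptions): replace with your own data.
input="genomes/"   # directory of FASTA files
runid="demo"       # run name, also used for the output directory

# Flags are taken from the usage text; the values 4, 90, and 8 are examples.
cmd="bash run.sh -i $input -r $runid -mc -n 4 -p 90 -t 8"

# Print the command for review; run it from the repo root once it looks right.
echo "$cmd"

# Results are expected under output/${runid}/ after the pipeline finishes.
outdir="output/${runid}/"
echo "results: $outdir"
```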
Kyran Wissink
Student Bioinformatics and Biocomplexity
Utrecht University
github.com/KyranWissink