PANcake

Pangenome construction pipeline

Function

Initialization: The script sets initial values for parameters such as percent identity, segment length, and the number of threads used by later steps.
Command Line Options: The script parses command line options that customize the pipeline, including the input data, a run identifier, and the number of genomes.
Parameter Validation: The script checks that the supplied parameters fall within valid ranges and falls back to default values for some parameters that are not provided.
Input Sample Handling: The script checks whether the input is a directory or a single file. If it is a directory, the files in it are combined into a single file.
Sequence Partitioning: If the "multiple_chromosomes" flag is enabled, the script partitions the sequences. It calculates genetic distances between sequences, groups them into communities, and analyzes each community separately.
Running the Pipeline: The script sets the required environment variables and runs a Snakemake pipeline. The pipeline sorts the data, runs PGGB (the core pangenome graph building tool in this workflow), and generates reports.
Output Generation: After the pipeline completes, the script creates an output directory and stores the results there.
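
The parameter validation and input handling steps can be pictured with a short shell sketch. It is not taken from run.sh or functions.sh: the helper names is_valid_percent and prepare_input, and the accepted range of 1 to 100, are assumptions made for illustration only.

```shell
# Illustrative sketch only: these helpers are hypothetical and are not
# PANcake's own code.

# Succeed if $1 is a whole number from 1 to 100 (assumed valid range).
is_valid_percent() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 100 ]
}

# Print the path of the single file to hand to the pipeline.
# $1: input file or directory
# $2: path for the combined file (keep it outside the input directory)
prepare_input() {
    if [ -d "$1" ]; then
        : > "$2"                   # start with an empty combined file
        for f in "$1"/*; do
            # Append regular files only; subdirectories are skipped.
            if [ -f "$f" ]; then
                cat "$f" >> "$2"
            fi
        done
        printf '%s\n' "$2"
    elif [ -f "$1" ]; then
        printf '%s\n' "$1"         # a single file is used as-is
    else
        printf 'error: %s is not a file or directory\n' "$1" >&2
        return 1
    fi
}
```

Calling prepare_input genomes/ combined.fa would print combined.fa after concatenating the files in genomes/, while prepare_input sample.fa combined.fa would print sample.fa unchanged. The sketch does not handle a directory that mixes plain and gzipped files.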

Setup and quick start

It is recommended to set up a new conda environment for PANcake.

Clone this repo into your working directory

git clone https://github.com/KyranWissink/pancake
cd pancake

Set up the conda environment

conda env create -f environment.yml
conda activate pancake

Run bash run.sh to list the available parameters:

PANcake Version: 0.8

Usage:

        bash run.sh [options] -i <input> 

Description:

        Pipeline for pangenome graph creation using pggb

Options:

Mandatory:

        -i --input-sample               input file(s) (fa | fa.gz | dir)

Optional:

        -r --runid                      name for the run. Will also name directories this.

        -mc --multiple-chromosomes      Use this parameter if the sample contains multiple chromosomes.

        -n --number-of-genomes          The number of genomes in the sample

        -p --percent-identity           The lowest similarity between all sequences in percentages

        -poa --poa-parameters           The partial order alignment parameters to use (asm5, asm10, asm20)

        -s --segment-length             Segment length for mapping [default: 10k]

        -t --threads                    Number of threads to use [default: 16]

View output at output/${runid}/
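
As a sketch of how the options above fit together, the snippet below assembles an example invocation and prints it without launching the pipeline. The input path, run name, number of genomes, and percent identity are made-up values; only the flags themselves and the defaults for -s and -t come from the usage text.

```shell
# Hypothetical example values; replace them with your own.
input="genomes/"      # directory of FASTA files (assumed path)
runid="demo_run"      # assumed run name

# Build the argument list from the documented flags.
# -s 10k and -t 16 are the documented defaults; -n 8 and -p 90 are made up.
set -- -i "$input" -r "$runid" -n 8 -p 90 -s 10k -t 16

# Dry run: show the command and where the results are expected.
echo "bash run.sh $*"
echo "results expected in: output/${runid}/"

# Uncomment to launch the pipeline from inside the repository:
# bash run.sh "$@"
```

For a sample that contains multiple chromosomes, the usage text indicates that -mc would be added to the argument list.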

Author and affiliation

Kyran Wissink
Student Bioinformatics and Biocomplexity
Utrecht University
github.com/KyranWissink
[email protected]

