Updated Wiki and ReadMe #26

Merged 45 commits on Nov 17, 2024
54993d8
updated fevicon
sumit-walia Aug 13, 2024
edc6b51
Added installation steps in separate file
sumit-walia Oct 7, 2024
68cee2f
added tabs in docs
sumit-walia Oct 7, 2024
3c8ca45
improving aesthetics
sumit-walia Oct 7, 2024
3dd9cb5
minor change in install.md
sumit-walia Oct 7, 2024
bc4a8ee
added quickstart
sumit-walia Oct 7, 2024
cc5df26
added quickstart
sumit-walia Oct 7, 2024
a264847
added quickstart
sumit-walia Oct 7, 2024
1053f49
added construction page
sumit-walia Oct 7, 2024
e4fb18b
added utils page
sumit-walia Oct 7, 2024
00a1505
updated index page
sumit-walia Oct 7, 2024
79ee35d
updated logo
sumit-walia Oct 7, 2024
5a01631
updated fevicon
sumit-walia Oct 7, 2024
5ebe2b9
updated fevicon
sumit-walia Oct 7, 2024
bfe216e
updated fevicon
sumit-walia Oct 7, 2024
d7a71aa
updated workflows
sumit-walia Oct 7, 2024
ed18df0
updated workflows
sumit-walia Oct 7, 2024
7d44ab3
updated workflows
sumit-walia Oct 7, 2024
05462fc
updated workflows
sumit-walia Oct 7, 2024
2587028
updated workflows
sumit-walia Oct 7, 2024
47efcf7
updated workflows
sumit-walia Oct 7, 2024
d48b55f
updated workflows
sumit-walia Oct 7, 2024
2a5d862
updated workflows
sumit-walia Oct 7, 2024
4e9d021
updated workflows
sumit-walia Oct 7, 2024
1a7eaba
updated workflows
sumit-walia Oct 7, 2024
50dc8bf
updated workflows
sumit-walia Oct 7, 2024
873a357
updated README
sumit-walia Oct 7, 2024
81f171a
updated README
sumit-walia Oct 7, 2024
8e75e18
capnp support
sumit-walia Oct 7, 2024
b55be26
c++-10
sumit-walia Oct 7, 2024
ef49d10
c++-10
sumit-walia Oct 7, 2024
c94902f
fevicon
sumit-walia Oct 7, 2024
d3e0677
fevicon
sumit-walia Oct 7, 2024
1521ccb
updated install script
Oct 28, 2024
ae408ca
updated install script
Oct 28, 2024
7376e70
updated utils
Oct 28, 2024
7b40047
Merge branch 'base' of https://github.com/TurakhiaLab/panman into base
Nov 10, 2024
9d1e4ba
updated wiki
Nov 11, 2024
85631d5
updated wiki
sumit-walia Nov 11, 2024
247fc0e
updated readme
sumit-walia Nov 11, 2024
b01bf3f
construction methods
sumit-walia Nov 17, 2024
b857d49
construction methods
sumit-walia Nov 17, 2024
4acc474
updated navigation
sumit-walia Nov 17, 2024
537332a
updated navigation
sumit-walia Nov 17, 2024
d706bbd
updated navigation
sumit-walia Nov 17, 2024
14 changes: 11 additions & 3 deletions .github/workflows/cmake.yml
Original file line number Diff line number Diff line change
@@ -10,6 +10,7 @@ permissions:

jobs:
deploy:
name: Deploy Job
runs-on: ubuntu-latest
steps:
- name: Checkout Code
@@ -26,14 +27,21 @@ jobs:
key: ${{ github.ref }}
path: .cache

- name: build docs
- name: Install dependencies and build mkdocs
run: |
pip install mkdocs-material
pip install "mkdocs-material[imaging]"
mkdocs gh-deploy --force


- name: switch to gcc-10 on linux
run: |
sudo apt install gcc-10 g++-10
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 100 --slave /usr/bin/g++ g++ /usr/bin/g++-10 --slave /usr/bin/gcov gcov /usr/bin/gcov-10
sudo update-alternatives --set gcc /usr/bin/gcc-10

- name: install pre-reqs and build
run: |
sudo apt install -y git build-essential cmake wget curl zip unzip tar protobuf-compiler libboost-all-dev pkg-config
sudo apt install -y git build-essential cmake wget curl zip unzip tar protobuf-compiler libboost-all-dev pkg-config capnproto
chmod +x install/installationUbuntu.sh
sudo ./install/installationUbuntu.sh
- name: test
166 changes: 149 additions & 17 deletions README.md
@@ -1,50 +1,182 @@
[license-badge]: https://img.shields.io/badge/License-MIT-yellow.svg
[license-link]: https://github.com/TurakhiaLab/panman/blob/main/LICENSE
[![License][license-badge]][license-link]
[![DOI](https://img.shields.io/badge/DOI-10.1101/2024.07.02.601807-blue)](https://doi.org/10.1101/2024.07.02.601807)
[![DOI](https://img.shields.io/badge/DOI-https://zenodo.org/records/12630607-blue)](https://zenodo.org/records/12630607)
[<img src="https://img.shields.io/badge/Install with-DockerHub-informational.svg?logo=Docker">](https://hub.docker.com/r/swalia14/panman)
[<img src="https://img.shields.io/badge/Submitted to-bioRxiv-critical.svg?logo=LOGO">](https://doi.org/10.1101/2024.07.02.601807)
[<img src="https://img.shields.io/badge/Build with-CMake-green.svg?logo=snakemake">](https://cmake.org)


# Pangenome Mutation Annotated Network (PanMAN)
<div align="center">
<img src="docs/images/logo.svg"/>
</div>

## Table of Contents
- [Overview of PanMANs and <i>panmanUtils</i>](#overview)
- [Installation and Usage](#install) ([Documentation](https://turakhia.ucsd.edu/panman/))

- [Introduction](#intro) ([Wiki](https://turakhia.ucsd.edu/panman/))
- [PanMANs](#panman)
- [<i>panmanUtils</i>](#panmanUtils)
- [Installation](#install)
- [Using Installation Script](#script)
- [Using Docker Image](#image)
- [Using DockerFile](#file)
- [PanMAN Construction](#construct)
- [Using provided dataset](#pangraph)
- [Using custom dataset](#custom)
- [<i>panmanUtils</i> functionalities](#function)
- [Contribute](#contributions)
- [Citing PanMAN](#cite_panman)

## <a name="overview"></a> Overview of PanMAN and <i>panmanUtils</i> <br>
### What is a PanMAN?
## <a name="intro"></a> Introduction
Here we provide an overview of PanMAN and <i>panmanUtils</i>, along with installation methods and usage. For more information, please see our [Wiki](https://turakhia.ucsd.edu/panman/).
### <a name="panman"></a> What is a PanMAN?
PanMAN or Pangenome Mutation-Annotated Network is a novel data representation for pangenomes that provides massive leaps in both representative power and storage efficiency. Specifically, PanMANs are composed of mutation-annotated trees, called PanMATs, which, in addition to substitutions, also annotate inferred indels (Fig. 1b), and even structural mutations (Fig. 1a) on the different branches. Multiple PanMATs are connected in the form of a network using edges to generate a PanMAN (Fig. 1c). PanMAN's representative power is compared against existing pangenomic formats in Fig. 1d. PanMANs are the most compressible pangenomic format for the different microbial datasets (SARS-CoV-2, RSV, HIV, Mycobacterium tuberculosis, E. coli, and Klebsiella pneumoniae), providing 2.9 to 559-fold compression over standard pangenomic formats.
<div align="center">
<div><b>Figure 1: Overview of the PanMAN data structure</b></div>
<img src="docs/images/panman.svg" width="500"/>
</div>

### <i><b>panmanUtils</b></i>
### <a name="panmanUtils"></a> <i>panmanUtils</i>
<i>panmanUtils</i> includes multiple algorithms to construct PanMANs and to support various functionalities to modify and extract useful information from PanMANs (Fig. 2).

<!-- #### PanMAN construction

<div align="center">
<div><b>Figure 2: PanMAN construction pipeline using panmanUtils</b></div>
<img src="docs/images/construct.svg" width="500"/>
</div> -->

<div align="center">
<div><b>Figure 2: Overview of <i>panmanUtils</i>' functionalities</b></div>
<img src="docs/images/utility.svg" width="500"/>
</div>


## <a name="install"></a> Installation and Usage <br>
For information on panmanUtils installation and usage, please see our documentation page available [here](https://turakhia.ucsd.edu/panman/).
## <a name="install"></a> Installation
### <a name="script"></a> Using installation script (requires sudo access)

**Step 0:** Dependencies
```bash
Git
```

**Step 1:** Clone the repository
```bash
git clone https://github.com/TurakhiaLab/panman.git
cd panman
```
**Step 2:** Run the installation script
```bash
chmod +x install/installationUbuntu.sh
./install/installationUbuntu.sh
```
**Step 3:** Run panmanUtils
```bash
cd build
./panmanUtils --help
```
### <a name="image"></a> Using Docker Image

To use <i>panmanUtils</i> in a docker container, create a container from the docker image by following these steps:

**Step 0:** Dependencies
```bash
Docker
```
**Step 1:** Pull the PanMAN docker image from DockerHub
```bash
docker pull swalia14/panman:latest
```
**Step 2:** Build and run the docker container
```bash
docker run -it swalia14/panman:latest
```
**Step 3:** Run panmanUtils
```bash
# Inside the docker container
cd /home/panman/build
./panmanUtils --help
```

### <a name="file"></a> Using DockerFile
A docker container with preinstalled <i>panmanUtils</i> can also be built from the DockerFile by following these steps:

**Step 0:** Dependencies
```bash
Docker
Git
```
**Step 1:** Clone the repository
```bash
git clone https://github.com/TurakhiaLab/panman.git
cd panman
```
**Step 2:** Build a docker image
```bash
cd docker
docker build -t panman .
```
**Step 3:** Build and run docker container
```bash
docker run -it panman
```
**Step 4:** Run panmanUtils
```bash
# Inside the docker container
cd /home/panman/build
./panmanUtils --help
```

## <a name="construct"></a> PanMAN Construction
Once the package is installed, PanMANs can be constructed from a PanGraph (or GFA or MSA) and a tree topology (Newick format) using <i>panmanUtils</i>. Here we provide examples for constructing PanMANs from the provided PanGraph (JSON) dataset and from a custom dataset. For other construction methods, follow the instructions in the [wiki](https://turakhia.ucsd.edu/panman/).
### <a name="pangraph"></a> Building PanMAN from PanGraph

**Step 1:** Check if `sars_20.json` and `sars_20.nwk` files exist in `test` directory.

**Step 2:** Run <i>panmanUtils</i> with the following command to build a panman from PanGraph:

```bash
cd $PANMAN_HOME/build
./panmanUtils -P $PANMAN_HOME/test/sars_20.json -N $PANMAN_HOME/test/sars_20.nwk -O sars_20
```
The above command will run <i>panmanUtils</i> program and build `sars_20.panman` in `$PANMAN_HOME/build/panman` directory.
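
Step 1 above can be scripted before starting a long build. The following is a minimal sketch and not part of <i>panmanUtils</i>: `check_inputs` is a hypothetical helper, and the usage shown afterwards assumes `PANMAN_HOME` has been exported to point at the cloned repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of panmanUtils): verify that every input
# file exists and is non-empty before starting a build.
check_inputs() {
    local f missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            missing=1
        fi
    done
    # Returns 0 when all files are present, 1 otherwise.
    return "$missing"
}
```

For example, `check_inputs "$PANMAN_HOME/test/sars_20.json" "$PANMAN_HOME/test/sars_20.nwk"` returns a non-zero status and names the offending file if either input is absent, so it can be chained with `&&` in front of the build command.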

### <a name="custom"></a> Building PanMAN from a custom dataset
Alternatively, users can provide custom PanGraph (JSON) and tree topology (Newick format) files to build a panman, using the following command:

```bash
cd $PANMAN_HOME/build
./panmanUtils -P $PANMAN_HOME/test/example.json -N $PANMAN_HOME/test/example.nwk -O example
```
The above command will run <i>panmanUtils</i> program and build `example.panman` in `$PANMAN_HOME/build/panman` directory.

## <a name="function"></a> <i>panmanUtils</i> functionalities
<i>panmanUtils</i> provides various functionalities such as summary, extraction (raw sequences, MSA, VCF, GFA), sub-network pruning, and many more. Please refer to the [wiki](https://turakhia.ucsd.edu/panman/) for detailed information. Here we provide usage syntax and examples for summary and VCF extract.

#### Summary extract
The summary feature extracts node- and tree-level statistics of a PanMAN, summarizing its geometric and parsimony information.

* Usage Syntax
```bash
./panmanUtils -I <path to PanMAN file> --summary --output-file=<prefix of output file> (optional)
```
* Example
```bash
cd $PANMAN_HOME/build
./panmanUtils -I panman/sars_20.panman --summary --output-file=sars_20
```

#### Variant Call Format (VCF) extract
Extract variations of all sequences from any PanMAT in a PanMAN in the form of a VCF file with respect to <i>any</i> reference sequence (ref) in the PanMAT.

* Usage syntax
```bash
./panmanUtils -I <path to PanMAN file> --vcf --reference=ref --output-file=<prefix of output file> (optional)
```
* Example
```bash
cd $PANMAN_HOME/build
./panmanUtils -I panman/sars_20.panman --vcf --reference="Switzerland/SO-ETHZ-500145/2020|OU000199.2|2020-11-12" --output-file=sars_20
```


## <a name="contributions"></a> Contribute <br>
We welcome contributions from the community to enhance the capabilities of PanMAN and <i>panmanUtils</i>. If you encounter any issues or have suggestions for improvement, please open an issue on [PanMAN GitHub page](https://github.com/TurakhiaLab/panman). For general inquiries and support, reach out to our team.

## <a name="cite_panman"></a> Citing PanMAN <br>
If you use the PanMANs or <i>panmanUtils</i> in your research or publications, we kindly request that you cite the following paper:
* Sumit Walia, Harsh Motwani, Kyle Smith, Russell Corbett-Detig, Yatish Turakhia, "<i>Compressive Pangenomics Using Mutation-Annotated Networks</i>", bioRxiv 2024.07.02.601807; doi: [10.1101/2024.07.02.601807](https://doi.org/10.1101/2024.07.02.601807)

2 changes: 1 addition & 1 deletion docker/DockerFile
@@ -1,7 +1,7 @@
FROM ubuntu:20.04

RUN apt update
RUN apt install -y git build-essential cmake wget curl zip unzip tar protobuf-compiler libboost-all-dev pkg-config
RUN apt install -y git build-essential cmake wget curl zip unzip tar protobuf-compiler libboost-all-dev pkg-config capnproto

WORKDIR /HOME

65 changes: 65 additions & 0 deletions docs/construction.md
@@ -0,0 +1,65 @@
# PanMAN Construction

Here, we will learn to build PanMAN from various input formats.

**Step 0:** The steps below require <i>panmanUtils</i>. If it is not installed yet, refer to the [installation guide](install.md). To check that <i>panmanUtils</i> is properly installed, run the following commands; they should execute without error:
```bash
# enter the panman directory (assuming $PANMAN_HOME points to the panman repository directory)
cd $PANMAN_HOME
```
```bash
cd $PANMAN_HOME/build
./panmanUtils --help
```
### Building PanMAN from PanGraph

**Step 1:** Check if `sars_20.json` and `sars_20.nwk` files exist in `test` directory. Alternatively, users can provide custom PanGraph (JSON) and tree topology (Newick format) files to build a panman.

**Step 2:** Run <i>panmanUtils</i> with the following command to build a panman from PanGraph:

```bash
cd $PANMAN_HOME/build
./panmanUtils -P $PANMAN_HOME/test/sars_20.json -N $PANMAN_HOME/test/sars_20.nwk -O sars_20
```
The above command will run <i>panmanUtils</i> program and build `sars_20.panman` in `$PANMAN_HOME/build/panman` directory.

### Building PanMAN from GFA

**Step 1:** Check if `sars_20.gfa` and `sars_20.nwk` files exist in `test` directory. Alternatively, users can provide custom GFA and tree topology (Newick format) files to build a panman.

**Step 2:** Run <i>panmanUtils</i> with the following command to build a panman from GFA:

```bash
cd $PANMAN_HOME/build
./panmanUtils -G $PANMAN_HOME/test/sars_20.gfa -N $PANMAN_HOME/test/sars_20.nwk -O sars_20
```
The above command will run <i>panmanUtils</i> program and build `sars_20.panman` in `$PANMAN_HOME/build/panman` directory.

### Building PanMAN from MSA (FASTA format)

**Step 1:** Check if `sars_20.msa` and `sars_20.nwk` files exist in `test` directory. Alternatively, users can provide custom MSA (FASTA format) and tree topology (Newick format) files to build a panman.

**Step 2:** Run <i>panmanUtils</i> with the following command to build a panman from MSA:

```bash
cd $PANMAN_HOME/build
./panmanUtils -M $PANMAN_HOME/test/sars_20.msa -N $PANMAN_HOME/test/sars_20.nwk -O sars_20
```
The above command will run <i>panmanUtils</i> program and build `sars_20.panman` in `$PANMAN_HOME/build/panman` directory.
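
The three construction commands above differ only in the input flag (`-P` for PanGraph, `-G` for GFA, `-M` for MSA). A small wrapper can make that explicit. This is a sketch under stated assumptions: `panman_input_flag` and `panman_build_cmd` are hypothetical helper names, and the second function only prints the command rather than running it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an input format name to the panmanUtils flag
# used in the construction commands above.
panman_input_flag() {
    case "$1" in
        pangraph) echo "-P" ;;
        gfa)      echo "-G" ;;
        msa)      echo "-M" ;;
        *)        echo "unknown input format: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the build command.
# Arguments: format, input file, Newick tree file, output prefix.
panman_build_cmd() {
    local flag
    flag=$(panman_input_flag "$1") || return 1
    echo "./panmanUtils $flag $2 -N $3 -O $4"
}
```

Calling `panman_build_cmd pangraph test/sars_20.json test/sars_20.nwk sars_20` prints `./panmanUtils -P test/sars_20.json -N test/sars_20.nwk -O sars_20`, which can be inspected before it is executed from the `build` directory.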

### Building PanMAN from raw genome sequences
We provide scripts to construct <i>panmanUtils</i> inputs (PanGraph/GFA/MSA and Newick) from raw sequences (FASTA format), followed by building a panman.

!!! Note
    This script uses tools such as PanGraph, PGGB, MAFFT, and MashTree to build the input PanGraph, GFA, MSA, and tree topology files, respectively. The script is designed to be run in the docker container built from either the provided docker image or the DockerFile (instructions provided [here](install.md)).

**Step 1:** Check if the `sars_20.fa` file exists in `test` directory. Alternatively, users can provide custom raw sequences (FASTA format) to build a panman.

**Step 2:** Run the following command to construct a panman from raw sequences.

```bash
cd $PANMAN_HOME/scripts
chmod +x build_panman.sh
./build_panman.sh pangraph/gfa/msa
```

Binary file added docs/images/icon.png
Binary file added docs/images/interactive_mode.png
25 changes: 13 additions & 12 deletions docs/index.md
@@ -1,9 +1,9 @@
# Welcome to PanMAN Wiki
# <b>Welcome to PanMAN Wiki</b>
<div align="center">
<img src="images/logo.svg"/>
</div>

## <b>What are PanMANs?</b>
## What are PanMANs?
PanMAN or Pangenome Mutation-Annotated Network is a novel data representation for pangenomes that provides massive leaps in both representative power and storage efficiency. Specifically, PanMANs are composed of mutation-annotated trees, called PanMATs, which, in addition to substitutions, also annotate inferred indels (Fig. 2b), and even structural mutations (Fig. 2a) on the different branches. Multiple PanMATs are connected in the form of a network using edges to generate a PanMAN (Fig. 2c). PanMAN's representative power is compared against existing pangenomic formats in Fig. 1. PanMANs are the most compressible pangenomic format for the different microbial datasets (SARS-CoV-2, RSV, HIV, Mycobacterium tuberculosis, E. coli, and Klebsiella pneumoniae), providing 2.9 to 559-fold compression over standard pangenomic formats.

<div align="center">
@@ -18,27 +18,33 @@ PanMAN or Pangenome Mutation-Annotated Network is a novel data representation fo
</div>


### <b>PanMAN's Protocol Buffer file format</b>
## PanMAN's Protocol Buffer file format
PanMAN utilizes Google’s protocol buffer (protobuf, [https://protobuf.dev/](https://protobuf.dev/)), a binary serialization file format, to compactly store PanMAN's data structure in a file. Fig. 3 provides the .proto file defining PanMAN’s structure. At the top level, the file format of PanMANs encodes a list (declared as a repeated identifier in the .proto file) of PanMATs. Each PanMAT object stores the following data elements: (a) a unique identifier, (b) a phylogenetic tree stored as a string in Newick format, (c) a list of mutations on each branch ordered according to the pre-order traversal of the tree topology, (d) a block mapping object to record homologous segments identified as duplications and rearrangements, which are mapped against their common consensus sequence; the block-mapping object is also used to derive the pseudo-root, and (e) a gap list to store the position and length of gaps corresponding to each block's consensus sequence. Each mutation object encodes the node's block and nucleotide mutations that are inferred on the branches leading to that node. If a block mutation exists at a position described by the Block-ID field (int32), the block mutation field (bool) is set to 1, otherwise it is set to 0, and its type is stored as a substitution to and from a gap in the Block mutation type field (bool), encoded as 0 or 1, respectively. In PanMAN, each nucleotide mutation within a block inferred on a branch has four pieces of information, i.e., position (middle coordinate), gap position (last coordinate), mutation type, and mutated characters. To reduce redundancy in the file, consecutive mutations of the same type are packed together and stored as a mutation info (int32) field, where mutation type, mutation length, and mutated characters use 3, 5, and 24 bits, respectively. PanMAN stores each character using one-hot encoding; hence, one "Nucleotide Mutations" object can store up to 6 consecutive mutations of the same type.
PanMAN's file also stores the complex mutation object to encode the type of complex mutation and its metadata such as PanMATs' and nodes' identifiers, breakpoint coordinates, etc. The entire file is then compressed using XZ ([https://github.com/tukaani-project/xz](https://github.com/tukaani-project/xz)) to enhance storage efficiency.
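
The 3 + 5 + 24 bit split of the mutation info field can be illustrated with shell arithmetic. Holding up to 6 characters in 24 bits works out to 4 bits per one-hot character. The sketch below is illustrative only: the order of the fields within the 32-bit word, the one-hot codes (A=1, C=2, G=4, T=8), the placement of the first character in the lowest bits, and the function names are all assumptions, not taken from the .proto file.

```bash
#!/usr/bin/env bash
# Assumed one-hot codes, one bit set per nucleotide (illustrative only).
one_hot() {
    case "$1" in
        A) echo 1 ;;
        C) echo 2 ;;
        G) echo 4 ;;
        T) echo 8 ;;
        *) return 1 ;;
    esac
}

# pack_mutation_info TYPE CHARS
#   TYPE  : mutation type code, 0-7 (3 bits)
#   CHARS : 1 to 6 nucleotides sharing that mutation type
# Assumed layout, high to low bits: [type:3][length:5][characters:24].
pack_mutation_info() {
    local mtype="$1" rest="$2" len="${#2}" packed=0 i=0 c code
    [ "$mtype" -ge 0 ] && [ "$mtype" -le 7 ] || return 1
    [ "$len" -ge 1 ] && [ "$len" -le 6 ] || return 1
    while [ -n "$rest" ]; do
        c="${rest%"${rest#?}"}"      # first remaining character
        rest="${rest#?}"             # drop it from the input
        code=$(one_hot "$c") || return 1
        packed=$(( packed | (code << (4 * i)) ))
        i=$(( i + 1 ))
    done
    echo $(( (mtype << 29) | (len << 24) | packed ))
}
```

Under these assumptions, a seventh character has nowhere to go in the 24-bit character field, which is why the function rejects inputs longer than 6 and why longer runs need more than one object.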

<div align="center">
<img src="images/pb.svg" width="600" height="600"/><br>
<b>Figure 3: PanMAN's file format</b>
</div>

## <i><b>panmanUtils</b></i>
## <i>panmanUtils</i>
<i>panmanUtils</i> includes multiple algorithms to construct PanMANs and to support various functionalities to modify and extract useful information from PanMANs (Fig. 4).

<div align="center">
<img src="images/utility.svg" width="600" height="600"/><br>
<b>Figure 4: Overview of panmanUtils' functionalities</b>
</div>

### <b><i>panmanUtils</i> Video Tutorial</b>
## Video Tutorial
TBA

## <b>Contributions</b>
We welcome contributions from the community to enhance the capabilities of PanMAN and panmanUtils. If you encounter any issues or have suggestions for improvement, please open an issue on [PanMAN GitHub page](https://github.com/TurakhiaLab/panman). For general inquiries and support, reach out to our team.

## <b>Citing PanMAN</b>
If you use the PanMANs or panmanUtils in your research or publications, we kindly request that you cite the following paper:<br>
* Sumit Walia, Harsh Motwani, Kyle Smith, Russell Corbett-Detig, Yatish Turakhia, "<i>Compressive Pangenomics Using Mutation-Annotated Networks</i>", bioRxiv 2024.07.02.601807; doi: [10.1101/2024.07.02.601807](https://doi.org/10.1101/2024.07.02.601807)

### <b>Installation</b>
<!-- ### <b>Installation</b>
panmanUtils can be installed using two different options, as described below: <br>
1. Installation script <br>
2. Docker
@@ -255,11 +261,6 @@ $ ./panmanUtils -I <path to PanMAN file> --aa-translations --output-file=<prefix
```
```
$ ./panmanUtils -I ecoli_10.panman --aa-translations --output_file=ecoli_10
```
``` -->

## <b>Contributions</b>
We welcome contributions from the community to enhance the capabilities of PanMAN and panmanUtils. If you encounter any issues or have suggestions for improvement, please open an issue on [PanMAN GitHub page](https://github.com/TurakhiaLab/panman). For general inquiries and support, reach out to our team.

## <b>Citing PanMAN</b>
If you use the PanMANs or panmanUtils in your research or publications, we kindly request that you cite the following paper:<br>
* Sumit Walia, Harsh Motwani, Kyle Smith, Russell Corbett-Detig, Yatish Turakhia, "<i>Compressive Pangenomics Using Mutation-Annotated Networks</i>", bioRxiv 2024.07.02.601807; doi: [10.1101/2024.07.02.601807](https://doi.org/10.1101/2024.07.02.601807)