docs: Add EvolCCM release package documentation
jvfe committed Jun 12, 2024
1 parent 26998fd commit 30e59db
Showing 2 changed files with 143 additions and 18 deletions.
Binary file added assets/ParallelEvolCCM_supplement.tar.gz
Binary file not shown.
161 changes: 143 additions & 18 deletions docs/evolccm.md
@@ -30,6 +30,31 @@ ED178 0 1
ED180 0 0
```

## Using ParallelEvolCCM with ARETE

The ParallelEvolCCM tool is also available through the `evolccm` entry in ARETE,
which makes it possible to run the tool with Docker or Singularity.

To execute the ParallelEvolCCM tool with ARETE, run the following command:

```bash
nextflow run beiko-lab/ARETE \
-entry evolccm \
--core_gene_tree core_gene_alignment.tre \
--feature_profile feature_profile.tsv.gz \
-profile docker
```

The parameters are:

- `--core_gene_tree` - The reference tree, coming from a core genome alignment,
like the one generated by the `phylo` entry in ARETE.
- `--feature_profile` - A presence/absence TSV matrix of features
in genomes, like the one created in ARETE's `annotation` entry.
- `-profile` - The profile to use. In this case, `docker`.
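
Before launching a run, it can help to confirm that the feature profile really is a 0/1 matrix. The following is a minimal sketch, not part of ARETE or ParallelEvolCCM: the function names are hypothetical, and it assumes a header row of feature names followed by one row per genome with the genome ID in the first column.

```python
import csv
import gzip
import io


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def check_binary_profile(handle):
    """Return (n_genomes, n_features); raise ValueError on a bad row.

    Assumption: header row of feature names, then one row per genome
    with the genome ID in the first column and 0/1 in every other cell.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_features = len(header) - 1
    n_genomes = 0
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        genome, cells = row[0], row[1:]
        if len(cells) != n_features:
            raise ValueError(
                f"{genome}: expected {n_features} values, got {len(cells)}"
            )
        for cell in cells:
            if cell not in ("0", "1"):
                raise ValueError(f"{genome}: non-binary value {cell!r}")
        n_genomes += 1
    return n_genomes, n_features


if __name__ == "__main__":
    # Hypothetical two-genome, two-feature table for illustration.
    demo = "genome\tfeatureA\tfeatureB\nED178\t0\t1\nED180\t0\t0\n"
    print(check_binary_profile(io.StringIO(demo)))
```

For a real file, something like `with open_text("feature_profile.tsv.gz") as fh: check_binary_profile(fh)` would apply the same check.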

For more information, check the [full ARETE documentation](https://beiko-lab.github.io/arete/).

## Using ParallelEvolCCM by itself

The ParallelEvolCCM tool is a command line tool written in R.
@@ -42,8 +67,17 @@ wget https://raw.githubusercontent.com/beiko-lab/arete/master/bin/ParallelEvolCC
chmod +x ParallelEvolCCM.R
```

Then, ensure all EvolCCM dependencies are installed.
ParallelEvolCCM.R has several dependencies, which should be installed automatically
the first time you run the script. You may need to install missing Linux packages using
the following command:

```bash
sudo apt-get install libssl-dev libfontconfig1-dev libharfbuzz-dev \
libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev \
libopenblas-dev
```

If you prefer to install the dependencies manually, you can do so by running the following R commands:

```r
install.packages(c('ape', 'dplyr', 'phytools', 'foreach', 'doParallel', 'gplots', 'remotes'))
@@ -62,27 +96,118 @@ You can then run the tool like this:

Additional parameters are listed when you run `./ParallelEvolCCM.R` with no arguments.

## ParallelEvolCCM Release Package

We also provide test data that can be used to run EvolCCM along with some useful scripts
for downstream analyses. These can be found in a .tar.gz file, which you can download like this:

```bash
wget https://raw.githubusercontent.com/beiko-lab/arete/master/assets/ParallelEvolCCM_supplement.tar.gz
tar -xzf ParallelEvolCCM_supplement.tar.gz
```

### Structure

The tarball contains three subfolders:

- `Scripts/` - The R scripts used to generate feature and statistical histograms, and the
Python script that is used to build the GraphML files from PECCM output.

- `100Bifido/` - The results of the 100-genome analysis described in the paper.

- `1000Bifido/` - The results of the 1000-genome analysis described in the paper.

Each of the results folders contains the following subdirectories and files. X is a placeholder for the size (i.e., 100 or 1000):

- `SourceFiles/` - The source files.

- `Bifido_X.tre`: The Newick-formatted tree

- `Bifido_X_feature_profile`: The tab-separated feature file. **This is the input
file for `PECCM_BuildFeatureHistogram.R`, see usage below**

- `Results/` - The results produced by PECCM and the helper scripts.

- `EvolCCM_Bifido_X.tre`: The tree used by EvolCCM (with midpoint rooting
and multifurcating node resolution if necessary)

- `EvolCCM_Bifido_X_feature_profile.tsv`: A tab-separated file with the p-values
and statistics for all pairwise comparisons between features. This is the
input file for the scripts `PECCM_BuildStatHistogram.R`
and `PECCM_Build_GraphML.py`, see below

- `EvolCCM_Bifido_X_feature_profile.tsv.pvals` and
`EvolCCM_Bifido_X_feature_profile.tsv.X2`: Tab-separated matrices showing
the p-values and X2 scores for all features.

- `EvolCCM_Bifido_100.graphml`: GraphML-formatted file with connections between features.

- Four .jpg files: 'a' is the output feature profile, and 'b', 'c', and 'd' are the
feature and statistical distributions.
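
As a quick sanity check after extraction, the layout described above can be verified with a short script. This is only a sketch: `missing_entries` is a hypothetical helper, and the directory the tarball extracts into is an assumption, so the root path is passed in explicitly.

```python
import tempfile
from pathlib import Path

# Folder names taken from the structure described above.
TOP_LEVEL = ("Scripts", "100Bifido", "1000Bifido")
RESULT_SUBDIRS = ("SourceFiles", "Results")


def missing_entries(root):
    """Return the expected directories that are absent under root."""
    root = Path(root)
    missing = [name for name in TOP_LEVEL if not (root / name).is_dir()]
    for dataset in ("100Bifido", "1000Bifido"):
        for sub in RESULT_SUBDIRS:
            if not (root / dataset / sub).is_dir():
                missing.append(f"{dataset}/{sub}")
    return missing


if __name__ == "__main__":
    # Demonstration on an empty temporary directory: everything is missing.
    with tempfile.TemporaryDirectory() as tmp:
        for entry in missing_entries(tmp):
            print("missing:", entry)
```

Pointing `missing_entries` at the directory where you ran `tar -xzf` should return an empty list if the package was unpacked completely.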

### Additional Scripts

First, to recreate the results of the 100-genome dataset, run the command below
(specifying any reasonable number of cores):

```bash
ParallelEvolCCM.R --intree Bifido_100.tre \
--intable Bifido_100_feature_profile.tsv \
--min_abundance 0.05 --max_abundance 0.95 --cores 8
```

Four output files will be produced. All will be prefixed with 'EvolCCM' to distinguish
them from the input files.

- .tre file: The tree used by EvolCCM (with midpoint rooting and multifurcating
node resolution if necessary). This file will end with a ‘.tre’ extension.

- .tsv file: Statistics associated with the EvolCCM comparisons, with one line for
each pairwise comparison.

- .tsv.pvals file: A matrix showing the p-values from all-versus-all comparisons between features.

- .tsv.X2 file: A matrix showing the X2 values from all-versus-all comparisons between features.
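
The p-value matrix can be scanned for significant feature pairs with a few lines of Python. This sketch is not one of the packaged scripts: `significant_pairs` is a hypothetical helper, and it assumes a square tab-separated matrix with a header row, feature names in the first column, columns in the same order as rows, and non-numeric placeholders such as `NA` on the diagonal.

```python
import csv
import io


def _to_float(cell):
    """Convert a cell to float, returning None for placeholders like 'NA'."""
    try:
        return float(cell)
    except ValueError:
        return None


def significant_pairs(handle, alpha=0.05):
    """Return [(feature_a, feature_b, p)] for unique pairs with p < alpha.

    Assumption: square matrix whose column order matches its row order,
    so row names can label both axes and the header line can be skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    rows = [row for row in reader if row]
    names = [row[0] for row in rows]
    pairs = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            # Prefer the upper triangle, fall back to the lower one.
            p = _to_float(rows[i][j + 1])
            if p is None:
                p = _to_float(rows[j][i + 1])
            if p is not None and p < alpha:
                pairs.append((names[i], names[j], p))
    return pairs


if __name__ == "__main__":
    # Hypothetical 3x3 matrix for illustration only.
    demo = (
        "\tA\tB\tC\n"
        "A\tNA\t0.01\t0.5\n"
        "B\t0.01\tNA\t0.2\n"
        "C\t0.5\t0.2\tNA\n"
    )
    print(significant_pairs(io.StringIO(demo)))
```

The threshold `alpha=0.05` is an arbitrary default here; no multiple-testing correction is applied in this sketch.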

---

Next, `PECCM_BuildFeatureHistogram.R` can be used to generate the feature distribution
histogram. Usage is:

```bash
Rscript PECCM_BuildFeatureHistogram.R infile
```

Where ‘infile’ is the input feature table (for example, Bifido_100_feature_profile.tsv).
There are no other command-line options. A single .jpg file will be produced.

---

`PECCM_BuildStatHistogram.R` is used to generate the statistical summary histograms. Usage is:

```bash
Rscript PECCM_BuildStatHistogram.R infile
```

Where ‘infile’ is the input table of results (‘EvolCCM...tsv’).
Three .jpg files will be produced.

---

`PECCM_Build_GraphML.py` is used to generate a graph from the pairwise comparisons,
with optional p-value thresholding. You can also use the `--attribute_name_length`
option to truncate attribute names for visual purposes.

The optional `--type_underscore` argument treats the first part of each attribute name
(up to the first underscore) as its type: for example, ‘plasmid_ABC’ and ‘plasmid_def’
would both be treated as objects of type ‘plasmid’, with names ‘ABC’ and ‘def’, respectively.
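
The naming rule described above can be illustrated with a small function. This is a sketch of the described behaviour, not the code in `PECCM_Build_GraphML.py`: `split_attribute` is a hypothetical name, and applying the length limit to the name after the type has been split off is an assumption.

```python
def split_attribute(name, type_underscore=True, name_length=None):
    """Return (type, label) for an attribute name.

    With type_underscore, the text before the first underscore becomes
    the type and the remainder the label. name_length, if given,
    truncates the label (assumed to happen after the split).
    """
    if type_underscore and "_" in name:
        attr_type, label = name.split("_", 1)
    else:
        attr_type, label = None, name
    if name_length is not None:
        label = label[:name_length]
    return attr_type, label


if __name__ == "__main__":
    for example in ("plasmid_ABC", "plasmid_def"):
        print(example, "->", split_attribute(example, name_length=10))
```

Splitting on the first underscore only means later underscores stay part of the name.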

Usage:

```bash
python ../PECCM_Build_GraphML.py \
--attribute_name_length 10 \
--type_underscore EvolCCM_Bifido_100_feature_profile.tsv \
EvolCCM_Bifido_100.graphml
```

