docs: Add EvolCCM release package documentation

beiko-lab · Jun 12, 2024 · 30e59db
1 parent 26998fd
commit 30e59db

Showing 2 changed files with 143 additions and 18 deletions.
diff --git a/assets/ParallelEvolCCM_supplement.tar.gz b/assets/ParallelEvolCCM_supplement.tar.gz
diff --git a/docs/evolccm.md b/docs/evolccm.md
@@ -30,6 +30,31 @@ ED178	0	1
 ED180	0	0
 ```
 
+## Using ParallelEvolCCM with ARETE
+
+The ParallelEvolCCM tool is also available through the `evolccm` entry in ARETE,
+which makes it possible to run the tool with Docker or Singularity.
+
+To execute the ParallelEvolCCM tool with ARETE, run the following command:
+
+```bash
+nextflow run beiko-lab/ARETE \
+  -entry evolccm \
+  --core_gene_tree core_gene_alignment.tre \
+  --feature_profile feature_profile.tsv.gz \
+  -profile docker
+```
+
+The parameters are:
+
+- `--core_gene_tree` - The reference tree, coming from a core genome alignment,
+  like the one generated by the `phylo` entry in ARETE.
+- `--feature_profile` - A presence/absence TSV matrix of features
+  in genomes, like the one created in ARETE's `annotation` entry.
+- `-profile` - The profile to use. In this case, `docker`.
+
+For more information, check the [full ARETE documentation](https://beiko-lab.github.io/arete/).
+
 ## Using ParallelEvolCCM by itself
 
 The ParallelEvolCCM tool is a command line tool written in R.
@@ -42,8 +67,17 @@ wget https://raw.githubusercontent.com/beiko-lab/arete/master/bin/ParallelEvolCC
 chmod +x ParallelEvolCCM.R
 ```
 
-Then, ensure all EvolCCM dependencies are installed.
-You can install them by running the following command in your R console:
+ParallelEvolCCM.R has several dependencies, which should be installed automatically
+the first time you run the script. On Debian or Ubuntu, you may need to install missing
+system packages using the following command:
+
+```bash
+sudo apt-get install libssl-dev libfontconfig1-dev libharfbuzz-dev \
+libfribidi-dev libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev \
+libopenblas-dev
+```
+
+If you prefer to install the dependencies manually, you can do so by running the following R commands:
 
 ```r
 install.packages(c('ape', 'dplyr', 'phytools', 'foreach', 'doParallel', 'gplots', 'remotes'))
@@ -62,27 +96,118 @@ You can then run the tool like this:
 
 Additional parameters can be found by running `./ParallelEvolCCM.R` with no additional parameters.
 
-## Using ParallelEvolCCM with ARETE
+## ParallelEvolCCM Release Package
 
-The ParallelEvolCCM tool is also made available through the `evolccm` entry in ARETE.
-Making it possible to run the tool with Docker or Singularity.
+We also provide test data that can be used to run EvolCCM along with some useful scripts
+for downstream analyses. These can be found in a .tar.gz file, which you can download like this:
 
-To execute the ParallelEvolCCM tool with ARETE, run the following command:
+```bash
+wget https://raw.githubusercontent.com/beiko-lab/arete/master/assets/ParallelEvolCCM_supplement.tar.gz
+tar -xzf ParallelEvolCCM_supplement.tar.gz
+```
+
+### Structure
+
+The tarball contains three subfolders:
+
+- `Scripts/` - The R scripts used to generate feature and statistical histograms, and the
+Python script used to build the GraphML files from ParallelEvolCCM (PECCM) output.
+
+- `100Bifido/` - The results of the 100-genome analysis described in the paper.
+
+- `1000Bifido/` - The results of the 1000-genome analysis described in the paper.
+
+Each of the results folders contains the following subdirectories and files. X is a placeholder for the size (i.e., 100 or 1000):
+
+- `SourceFiles/` - The source files.
+
+    - `Bifido_X.tre`: The Newick-formatted tree
+
+    - `Bifido_X_feature_profile`: The tab-separated feature file. **This is the input
+file for `PECCM_BuildFeatureHistogram.R`; see usage below.**
+
+- `Results/` - The results produced by PECCM and the helper scripts.
+
+    - `EvolCCM_Bifido_X.tre`: The tree used by EvolCCM (with midpoint rooting
+and multifurcating node resolution if necessary)
+
+    - `EvolCCM_Bifido_X_feature_profile.tsv`: A tab-separated file with the p-values
+and statistics for all pairwise comparisons between features. This is the
+input file for the scripts `PECCM_BuildStatHistogram.R`
+and `PECCM_Build_GraphML.py`, see below
+
+    - `EvolCCM_Bifido_X_feature_profile.tsv.pvals` and
+`EvolCCM_Bifido_X_feature_profile.tsv.X2`: Tab-separated matrices showing
+the p-values and X2 scores for all pairwise comparisons between features.
+
+    - `EvolCCM_Bifido_100.graphml`: GraphML-formatted file with connections between features.
+
+    - Four .jpg files: 'a' is the output feature profile, and 'b', 'c', and 'd' are the
+feature and statistical distributions.
+
+### Additional Scripts
+
+First, to recreate the results for the 100-genome dataset, run the command below
+(specifying any reasonable number of cores):
 
 ```bash
-nextflow run beiko-lab/ARETE \
-  -entry evolccm \
-  --core_gene_tree core_gene_alignment.tre \
-  --feature_profile feature_profile.tsv.gz \
-  -profile docker
+ParallelEvolCCM.R --intree Bifido_100.tre \
+  --intable Bifido_100_feature_profile.tsv \
+  --min_abundance 0.05 --max_abundance 0.95 --cores 8
 ```
 
-The parameters being:
+Four output files will be produced. All will be prefixed with 'EvolCCM' to distinguish
+them from the input files.
 
-- `--core_gene_tree` - The reference tree, coming from a core genome alignment,
-  like the one generated by the `phylo` entry in ARETE.
-- `--feature_profile` - A presence/absence TSV matrix of features
-  in genomes, like the one created in ARETE's `annotation` entry.
-- `-profile` - The profile to use. In this case, `docker`.
+- .tre file: The tree used by EvolCCM (with midpoint rooting and multifurcating
+node resolution if necessary). This file will end with a ‘.tre’ extension.
+
+- .tsv file: Statistics associated with the EvolCCM comparisons, with one line for
+each pairwise comparison.
+
+- .tsv.pvals file: A matrix showing the p-values from all-versus-all comparisons between features.
+
+- .tsv.X2 file: A matrix showing the X2 values from all-versus-all comparisons between features.
+
+---
+
+Next, `PECCM_BuildFeatureHistogram.R` can be used to generate the feature distribution
+histogram. Usage is:
+
+```bash
+Rscript PECCM_BuildFeatureHistogram.R infile
+```
+
+Where ‘infile’ is the input feature table (for example, Bifido_100_feature_profile.tsv).
+There are no other command-line options. A single .jpg file will be produced.
+
+---
+
+`PECCM_BuildStatHistogram.R` is used to generate the statistical summary histograms. Usage is:
+
+```bash
+Rscript PECCM_BuildStatHistogram.R infile
+```
+
+Where ‘infile’ is the input table of results (‘EvolCCM...tsv’).
+Three .jpg files will be produced.
+
+---
+
+`PECCM_Build_GraphML.py` is used to generate a graph from the pairwise comparisons,
+with optional p-value thresholding. You can also use the `--attribute_name_length`
+option to truncate attribute names for visual purposes.
+
+The optional `--type_underscore` argument will treat the first part of each attribute name
+(up to the first underscore) as its type: for example, ‘plasmid_ABC’ and ‘plasmid_def’
+would both be treated as objects of type ‘plasmid’, with names ‘ABC’ and ‘def’, respectively.
+
+Usage:
+
+```bash
+python ../PECCM_Build_GraphML.py \
+  --attribute_name_length 10 \
+  --type_underscore EvolCCM_Bifido_100_feature_profile.tsv \
+  EvolCCM_Bifido_100.graphml
+```
 
-For more information, check the [full ARETE documentation](https://beiko-lab.github.io/arete/).
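
The `--type_underscore` and `--attribute_name_length` behaviour described for `PECCM_Build_GraphML.py` can be illustrated with a minimal sketch. This is not the script's actual code: the function name `split_type_and_name`, the handling of names that contain no underscore, and the choice to truncate after splitting are all assumptions made for illustration.

```python
def split_type_and_name(attribute, max_name_length=None):
    """Split an attribute name at its first underscore into (type, name).

    Hypothetical helper: mirrors the documented idea that 'plasmid_ABC'
    has type 'plasmid' and name 'ABC'. Not taken from PECCM_Build_GraphML.py.
    """
    if "_" in attribute:
        # Only the first underscore separates the type from the name.
        attr_type, name = attribute.split("_", 1)
    else:
        # Assumption: an attribute without an underscore has no type.
        attr_type, name = None, attribute
    if max_name_length is not None:
        # Assumption: truncation applies to the name after the type is removed.
        name = name[:max_name_length]
    return attr_type, name


if __name__ == "__main__":
    for raw in ("plasmid_ABC", "plasmid_def", "gene_with_many_parts"):
        print(raw, "->", split_type_and_name(raw, max_name_length=10))
```

Splitting only at the first underscore keeps any later underscores in the name itself, which matches the "up to the first underscore" wording in the documentation.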