(C) Chengze Shen
EMMA is an ensemble usage of MAFFT --add
(particularly, MAFFT
with the -linsi
option) on large datasets. On the MAFFT webpage, MAFFT-linsi --add
is accurate for adding sequences to an existing alignment but is only recommended for a few hundred of sequences. This project aims to scale MAFFT-linsi --add
to run on large datasets with hundreds of thousands of sequences with similar (and sometimes better) alignment accuracy.
- (NEW) Now will not exit after creating
main.config
for the first time (if directly runningemma.py [arguments]
without runningsetup.py
). In addition, now EMMA will detect if the binaries defined inmain.config
are runnable and notify users if some binaries have an error when executing. - Checkpoint system! Now you can resume from any point if a previous run was interrupted somehow (except for the HMMSearch step, currently in implementation).
- Now automatically detects input data type/molecule (
amino
,dna
, orrna
). - Now has a progress bar for all intermediate steps (for better progress tracking!).
Add checkpoint support.Still need HMMSearch step checkpoint system.Add more customizable configuration support as WITCH.Finish up the pipeline so it supports building an alignment from scratch and not relying on UPP output.
Given an input existing alignment
-
Construct a set of constraint sub-alignments from
$C$ : Decompose$C$ to sub-alignments using the UPP decomposition strategy but limit to sub-alignments with$|Q_i|$ sequences,$l\leq |Q_i|\leq u$ ($l,u$ are user-provided free parameters; default values are$l=10,u=25$ ). This step creates a set of subsets that can overlap. -
Define the set of sub-problems: Assign each query sequence
$q\in Q$ to the best-fitting sub-alignment from Step 1. The assignment is determined by first constructing HMMs on the sub-alignments and then selecting the HMM with the highest adjusted bitscore (see WITCH) for each$q$ . -
Run
MAFFT-linsi--add
on each sub-problem: For each sub-problem (i.e., a sub-alignment$C_i$ on set$S_i$ with assigned query sequences$Q_i$ ), construct an extended sub-alignment on$S_i\cup Q_i$ usingMAFFT-linsi--add
. -
Merge extended sub-alignments using transitivity: All extended sub-alignments are consistent with each other (see proof in the main paper) and can merge to the backbone alignment with transitivity (see SEPP/UPP). The merging produces the final alignment on
$S\cup Q$ .
- Currently accepted in WABI 2023.
- Currently published on Algorithms of Molecular Biology (https://doi.org/10.1186/s13015-023-00247-x).
EMMA was tested and benchmarked on the following systems:
- Red Hat Enterprise Linux Server release 7.9 (Maipo) with Python 3.9.16
- Ubuntu 22.04 LTS with Python 3.7.12
EMMA requires the usage of MAFFT
binaries. One is provided with the package (v7.490 2021/Oct/30), but the MAFFT
binaries in the user's $PATH
environment are prioritized. If you experience any difficulties running EMMA, please contact Chengze Shen ([email protected]).
python>=3.7
configparser>=5.0.0
dendropy>=4.5.2,<4.6.0
numpy>=1.15
psutil>=5.0
scipy>=1.1.0
tqdm>=4.0.0
# 1. Install via GitHub repo
git clone https://github.com/c5shen/EMMA.git
# 2. Install all requirements
cd EMMA
pip3 install -r requirements.txt
# 3. Use emma.py, -h to see allowed commandline parameters
python3 emma.py [-h]
I currently do not have a working macOS system at hand for software testing purposes, hence the compiled macOS binaries included in EMMA may not be executable right now (reported by some users).
- The most likely issue right now should be
FastTreeMP
(main.config
, defined under[Basic]
and[MAGUS]
asfasttreepath
). In the case this binary file is not working out, please download the source code from FastTree.c. Direct quote from the FastTree 2.1 webpage below. - After compilation, you can either put
FastTreeMP
under your$PATH
variable, or changefasttreepath
inmain.config
to point to your compiled FastTree executable. - The other possible culprit
mafft
, for which you can download or compile the executable from MAFFT for macOS official page. You can follow the same step above to change themafftpath
inmain.config
.
Quote for FastTree:
If you use a Mac or other platform not included above, download FastTree.c and run
gcc -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm
(
gcc
is installed on many Mac OS X and Unix machines. If you use a Mac, you may need to install it from Xcode.gcc
is also available for virtually every platform.) Note that FastTree will try to use SSE2/SSE3 instructions to speed up some inner loops. This will not work on many Windows or Mac machines. If FastTree will not run, then try compiling it with this command instead:
gcc -DNO_SSE -O3 -finline-functions -funroll-loops -Wall -o FastTree FastTree.c -lm
We have also heard that the -finline-functions option can cause an error. You can omit this option.
If you want to build the multi-threaded "FastTreeMP," use
gcc -DOPENMP -fopenmp -O3 -finline-functions -funroll-loops -Wall -o FastTreeMP FastTree.c -lm
Scripts of the following examples can be found in example/run.sh
. You can run each scenario with
./run.sh [i] # i can be 1, 2, or 3
python3 emma.py -b [input alignment] -e [input tree] -q [unaligned sequences] -d [output directory] -o est.aln.fasta
python3 emma.py -b [input alignment] -q [unaligned sequences] -d [output directory] -o est.aln.fasta
# > the "backbone sequences" will be selected from inputs and aligned with default MAGUS
# > a tree will be created for the backbone alignment using FastTree2
python3 emma.py -i [input sequences] -d [output directory] -o est.aln.fasta