Monotextor can be built from source. The steps below cover the whole installation, starting with fetching the code and its submodules.
# if you are cloning from scratch:
git clone --recurse-submodules https://github.com/zuny26/monotextor.git
# otherwise:
git submodule update --init --recursive
The following external tools need to be in the PATH before installing the project. If you are using an apt-like package manager, you can run the following commands to install all these dependencies:
# mandatory:
sudo apt install git time python3 python3-venv python3-pip golang-go build-essential cmake libboost-all-dev liblzma-dev curl pigz parallel
# optional, feel free to skip dependencies for components that you don't expect to use:
## wget crawler:
sudo apt install wget
## warc2text:
sudo apt install uchardet libuchardet-dev libzip-dev
## Heritrix, PDFExtract and boilerpipe:
sudo apt install openjdk-8-jdk
## PDFExtract:
## PDFExtract also requires protobuf installed for CLD3 (installation instructions below)
sudo apt install autoconf automake libtool ant maven poppler-utils apt-transport-https ca-certificates gnupg software-properties-common
If you are using an RPM-based system, use these instead:
# mandatory:
sudo dnf install git time python-devel python3-pip golang-go cmake pigz parallel boost-devel xz-devel uchardet zlib-devel gcc-c++
## Moses Perl tokenizer:
sudo dnf install perl-FindBin perl-Time-HiRes perl-Thread
## warc2text:
sudo dnf install uchardet-devel libzip-devel
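Package names and command names do not always match (for example, golang-go provides `go`), so it can help to confirm that the tools are actually on the PATH afterwards. The following is a minimal sketch; `missing_tools` is a hypothetical helper, not part of Monotextor, and the command list is only a sample of the mandatory tools.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Sample of commands provided by the mandatory packages listed above.
# No output means that all of them were found.
missing_tools git python3 pip3 go cmake curl pigz parallel
```

Any name that is printed still needs to be installed or added to the PATH.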
Compile and install Monotextor's C++ dependencies:
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..
# another prefix can be used, as long as its 'bin' is in PATH and its 'lib' is in LD_LIBRARY_PATH
make -j install
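If the chosen prefix is not yet on the search paths, it can be added for the current shell session. This is a sketch that assumes the `$HOME/.local` prefix used above; `PREFIX` is just an illustrative variable name.

```shell
# Prefix passed to -DCMAKE_INSTALL_PREFIX above; adjust if you chose another one.
PREFIX="$HOME/.local"

# Prepend the prefix's bin and lib directories for the current shell session.
export PATH="$PREFIX/bin:$PATH"
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

To make the change permanent, the same lines can be placed in your shell startup file.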
Optionally, you can skip compiling the dependencies that you do not expect to use:
cmake -DSKIP_MGIZA=ON -DCMAKE_INSTALL_PREFIX=$HOME/.local .. # MGIZA is used for dictionary generation
# other dependencies that can optionally be skipped:
# WARC2TEXT, PREVERTICAL2TEXT, DOCALIGN, BLEUALIGN, HUNALIGN, BIROAMER, KENLM
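To skip several components at once, the flags can be generated from a list of names. The sketch below assumes that every component listed above follows the same `-DSKIP_<NAME>=ON` pattern as `SKIP_MGIZA`; `skip_flags` is a hypothetical helper, and the components in the usage comment are chosen only for illustration.

```shell
# Turn a list of component names into -DSKIP_<NAME>=ON arguments.
# Assumption: every component follows the SKIP_MGIZA naming shown above.
skip_flags() {
    flags=""
    for name in "$@"; do
        flags="${flags:+$flags }-DSKIP_${name}=ON"
    done
    printf '%s\n' "$flags"
}

# Example, run from the build directory (left unquoted so the flags split into words):
#   cmake $(skip_flags MGIZA HUNALIGN) -DCMAKE_INSTALL_PREFIX="$HOME/.local" ..
```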
Additionally, Monotextor uses giashard to preprocess WARC files.
# build and place the necessary tools in $HOME/go/bin
go install github.com/paracrawl/giashard/...@latest
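The installed tools are only usable if the Go binary directory is on the PATH. As a sketch: `go install` writes to `$GOBIN` when it is set, and otherwise to the `bin` directory under `$GOPATH`, which defaults to `$HOME/go`. The helper `go_bin_dir` is hypothetical and assumes `GOPATH` holds a single directory.

```shell
# Directory where `go install` places binaries: $GOBIN if set,
# otherwise the bin directory under $GOPATH (default: $HOME/go).
# Assumes GOPATH, when set, contains a single directory.
go_bin_dir() {
    printf '%s\n' "${GOBIN:-${GOPATH:-$HOME/go}/bin}"
}

# Make the installed tools available in the current shell session.
export PATH="$(go_bin_dir):$PATH"
```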
Most of the scripts in Monotextor are written in Python 3, and Python 3.7 or later is required.
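Before creating the environment, it may be worth confirming that the interpreter meets the version requirement. A minimal sketch, where `python_ok` is a hypothetical helper:

```shell
# Exit status 0 if the given interpreter is Python 3.7 or newer.
python_ok() {
    "$1" -c 'import sys; sys.exit(0 if sys.version_info >= (3, 7) else 1)'
}

if python_ok python3; then
    echo "python3 meets the minimum version"
else
    echo "python3 is older than 3.7 or not installed" >&2
fi
```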
Some additional Python libraries are required. They can be installed automatically with pip. We recommend using a virtual environment to manage the Monotextor installation.
# create virtual environment & activate
python3 -m venv /path/to/virtual/environment
source /path/to/virtual/environment/bin/activate
# install dependencies in virtual environment
pip3 install --upgrade pip
# monotextor:
pip3 install .
# additional dependencies:
pip3 install ./monocleaner && pip3 install ./kenlm --install-option="--max_order 7"
pip3 install ./bifixer
If you don't want to install all the Python requirements listed in requirements.txt because you don't expect to run some of Monotextor's modules, you can comment out the corresponding *.txt entries in requirements.txt and rerun the Monotextor installation.
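Commenting out an entry can also be scripted. The sketch below assumes that requirements.txt references each per-module *.txt file on its own line; `comment_requirement` is a hypothetical helper, and `some-module.txt` is a placeholder, not a real file name from the repository.

```shell
# Prefix with '# ' every line of FILE that contains NAME and is not already commented.
comment_requirement() {
    file=$1
    name=$2
    awk -v name="$name" '
        index($0, name) && $0 !~ /^[ \t]*#/ { print "# " $0; next }
        { print }
    ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Example with a placeholder file name:
#   comment_requirement requirements.txt some-module.txt
#   pip3 install .
```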
Heritrix is the Internet Archive's web crawler. To use it in Monotextor, first download a Heritrix release and unzip it.
# download
wget https://repo1.maven.org/maven2/org/archive/heritrix/heritrix/3.4.0-20210923/heritrix-3.4.0-20210923-dist.zip
unzip heritrix-3.4.0-20210923-dist.zip
To use Heritrix, Java has to be installed and the JAVA_HOME environment variable must point to the Java installation. The HERITRIX_HOME environment variable must be set to the path where Heritrix was unzipped. Make sure that the heritrix binary is executable.
# configure
export JAVA_HOME=/path/to/jdk-install-dir
export HERITRIX_HOME=/path/to/heritrix-3.4.0-20210923-dist
chmod u+x $HERITRIX_HOME/bin/heritrix
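The configuration can be verified before launching anything. This is a sketch only: `check_heritrix_env` is a hypothetical helper, and it assumes the usual JDK layout in which the `java` executable lives under `$JAVA_HOME/bin`.

```shell
# Return 0 if JAVA_HOME and HERITRIX_HOME point to usable installations.
check_heritrix_env() {
    if [ -z "${JAVA_HOME:-}" ] || [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "JAVA_HOME does not point to a Java installation" >&2
        return 1
    fi
    if [ -z "${HERITRIX_HOME:-}" ] || [ ! -x "$HERITRIX_HOME/bin/heritrix" ]; then
        echo "HERITRIX_HOME/bin/heritrix is missing or not executable" >&2
        return 1
    fi
}

check_heritrix_env || echo "Heritrix is not ready to use" >&2
```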
Before running Monotextor with Heritrix, the Heritrix Web UI should be launched, specifying the username and the password. The URL will be https://localhost:8443, unless specified otherwise.
# run
$HERITRIX_HOME/bin/heritrix -a admin:admin
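Once launched, the Web UI can be probed from the command line to confirm that it answers before starting a crawl. The sketch below assumes that the UI serves its engine page at `/engine`, uses a self-signed certificate (hence `-k`), and negotiates the authentication scheme (hence `--anyauth`); `heritrix_ready` is a hypothetical helper.

```shell
# Return 0 if the Heritrix Web UI at URL ($1) answers with HTTP 200
# for the given user:password ($2). The /engine path is an assumption.
heritrix_ready() {
    code=$(curl -s -k --anyauth --location -u "$2" --max-time 5 \
        -o /dev/null -w '%{http_code}' "$1/engine") || return 1
    [ "$code" = "200" ]
}

# Example, using the default URL and the credentials from the run command above:
#   heritrix_ready https://localhost:8443 admin:admin && echo "Heritrix is up"
```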
The Heritrix Web UI settings (URL and username:password), along with the installation directory, should be passed to Monotextor via the heritrixUser, heritrixUrl and heritrixPath configuration parameters.
heritrixUser: "admin:admin"
heritrixUrl: "https://localhost:8443"
heritrixPath: "/path/to/heritrix-3.4.0-20210923-dist"
If you experience problems with these steps or want additional information please refer to this guide.
In the Docker image, Heritrix is located at /home/docker/heritrix-3.4.0-20210923-dist and is not running by default, so it must be launched manually before running a Monotextor crawl with Heritrix.
CLD3 (Compact Language Detector v3) is a language identification model that can optionally be used during preprocessing. It is also a requirement for PDFExtract and Linguacrawl. CLD3 needs protobuf to work; the installation instructions are the following:
# Install protobuf from official repository: https://github.com/protocolbuffers/protobuf/blob/master/src/README.md
# You may need to uninstall any other protobuf installation on your system (from apt or snap) to avoid compilation issues
sudo apt-get install autoconf automake libtool curl make g++ unzip
wget https://github.com/protocolbuffers/protobuf/releases/download/v3.18.1/protobuf-all-3.18.1.tar.gz
tar -zxvf protobuf-all-3.18.1.tar.gz
cd protobuf-3.18.1
./configure
make
make check
sudo make install
sudo ldconfig
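After the installation, the compiler found on the PATH should report the release that was just built. A minimal sketch, assuming that `protoc --version` prints a line of the form `libprotoc <version>`; `protoc_is` is a hypothetical helper.

```shell
# Succeed when `protoc --version` reports the expected release.
protoc_is() {
    [ "$(protoc --version 2>/dev/null)" = "libprotoc $1" ]
}

protoc_is 3.18.1 || echo "protoc 3.18.1 was not found on PATH" >&2
```

If a different version is reported, another protobuf installation is probably shadowing the new one.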
The libboost version shipped by your distribution's package manager may cause problems when compiling some of the submodules included in Monotextor. If this is the case, you can install Boost manually by running the following commands:
sudo apt-get remove libboost-all-dev
sudo apt-get autoremove
wget https://boostorg.jfrog.io/artifactory/main/release/1.77.0/source/boost_1_77_0.tar.gz
tar xvf boost_1_77_0.tar.gz
cd boost_1_77_0/
./bootstrap.sh
./b2 -j4 --layout=system install || echo FAILURE
cd ..
rm -rf boost_1_77_0*
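To confirm which Boost headers ended up installed, the version macro can be read from `boost/version.hpp`. This sketch assumes the headers were installed under `/usr/local/include`, the default location for `b2 install` when no prefix is given; `boost_lib_version` is a hypothetical helper.

```shell
# Print the BOOST_LIB_VERSION value (e.g. 1_77) from the headers in the given include directory.
boost_lib_version() {
    sed -n 's/^#define BOOST_LIB_VERSION "\(.*\)"$/\1/p' "$1/boost/version.hpp"
}

# Assumed default install location; adjust if you passed --prefix to b2.
if [ -f /usr/local/include/boost/version.hpp ]; then
    boost_lib_version /usr/local/include
fi
```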