- Download the setup script to the `work` directory: setup.sh
Supported on Ubuntu >= 16.04, CentOS >= 7, or Debian >= 9.
- Go into the `work` directory, then execute the setup script.
cd work
sudo bash setup.sh
- After setup finishes, `PDFExtract-2.0.jar` and `PDFExtract.json` will be in the `target` folder.
ls -l target/
- PDFExtract-2.0.jar
- PDFExtract.json
- You can now run `PDFExtract-2.0.jar` from the command line using the instructions in README.md.
- The PDFExtract project is now in the `/work/setup-tmp/pdf-extract` directory.
- The Java library for PDFExtract is in `/work/setup-tmp/pdf-extract/target/`.
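The outcome of the setup steps above can be verified with a small shell sketch. The function name `check_artifacts` is hypothetical and not part of PDFExtract; it only tests for the two file names listed above in a directory you pass in (defaulting to `target`).

```shell
# Hypothetical helper, not part of PDFExtract: reports whether the two
# artifacts produced by setup.sh exist in a directory (default: ./target).
check_artifacts() {
    artifact_dir="${1:-target}"
    any_missing=0
    for f in PDFExtract-2.0.jar PDFExtract.json; do
        if [ -f "$artifact_dir/$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
            any_missing=1
        fi
    done
    # Exit status 0 if both files exist, 1 otherwise.
    return "$any_missing"
}
```

After sourcing the function, `check_artifacts /work/setup-tmp/pdf-extract/target` prints one line per file and returns a non-zero status if either file is missing.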
If your distribution's packages for libprotobuf are not sufficient to compile cld3, please install it manually (you can follow the bitextor instructions: https://github.com/bitextor/bitextor#language-detector).
- Prerequisite: KenLM must be installed, or you may use the KenLM from Moses.
- Set the KenLM or Moses path in the pdf-extract config file, in the section `kenlm_path`
"sentence_join" : "/home/user/sentence-join/sentence-join.py"
"kenlm_path" : "/home/user/kenlm/bin"
- Download the models for the language pairs that you want to process from http://data.statmt.org/paracrawl/sentence-join/
- For each language, set the model path prefix (expected extensions: forward.binlm and backward.binlm) in the pdf-extract config file, in the section `sentencejoin_model`
"sentencejoin_model" : "/home/user/models/en/opus",
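
Before running PDFExtract, the paths from the config entries above can be sanity-checked with a sketch like the following. The function `check_sentence_join_paths` is hypothetical and not part of PDFExtract. It also assumes the model files are named `<prefix>.forward.binlm` and `<prefix>.backward.binlm`, joined to the prefix with a dot; adjust the check if the downloaded files are named differently.

```shell
# Hypothetical helper, not part of PDFExtract: checks the paths that the
# config entries point to.
#   $1 = value of "kenlm_path"          (a directory)
#   $2 = value of "sentence_join"       (path to sentence-join.py)
#   $3 = value of "sentencejoin_model"  (a model path prefix)
check_sentence_join_paths() {
    kenlm_dir="$1"
    join_script="$2"
    model_prefix="$3"
    problems=0

    [ -d "$kenlm_dir" ] || { echo "not a directory: $kenlm_dir"; problems=1; }
    [ -f "$join_script" ] || { echo "not a file: $join_script"; problems=1; }

    # Assumption: model files are <prefix>.forward.binlm and
    # <prefix>.backward.binlm (dot between prefix and extension).
    for ext in forward.binlm backward.binlm; do
        [ -f "$model_prefix.$ext" ] || { echo "missing model: $model_prefix.$ext"; problems=1; }
    done

    # Exit status 0 if every path checks out, 1 otherwise.
    return "$problems"
}
```

With the example values above, the call would be `check_sentence_join_paths /home/user/kenlm/bin /home/user/sentence-join/sentence-join.py /home/user/models/en/opus`. It prints nothing and returns 0 when all paths exist.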
All dependencies are included in the project folder.