This repo wraps several NLP libraries into a uniform pipeline object, making it easy to use them all in one project or to compare them with each other.
You can use this pipeline to do the following (a short raw-spaCy sketch of these operations follows the list):
- Tokenization
- Stemming
- Lemmatization
- POS tagging
- Morphological analysis
- Building embeddings on corpora
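For reference, here is roughly what most of these operations look like in raw spaCy, one of the wrapped libraries. This is a minimal sketch using the standard spaCy API, not this repo's pipeline interface; the model name is just an example:

```python
import spacy

# Requires the model to be installed first:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats were sleeping on the mats.")

# Tokenization, lemmatization, POS tagging, and morphological analysis
# are all attributes of each token in the processed doc.
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.morph)
# e.g. "cats cat NOUN Number=Plur"
```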
The supported libraries are Stanza, spaCy, Hugging Face Transformers, and Stanford CoreNLP (through Stanza).

To get started:
- Clone the repo: `git clone https://github.com/binbin83/nlp_pipeline.git`
- Create a virtual environment: `python3 -m venv path/to/venv/nlp_pipeline`
- Install the requirements (if you want to use a GPU, install `requirements_gpu.txt` instead): `pip install -r requirements.txt`
- Read the example notebooks in the `notebooks` folder
- Update the config file with your own paths and parameters
- Run the pipeline: `python3 main_nlp.py` or `python3 main_embeddings.py`
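The config file's actual format isn't documented in this section, so the sketch below is purely hypothetical: every field name is invented, and it only illustrates the kind of paths and parameters you will need to fill in. Check the config file shipped with the repo for the real structure.

```python
# Hypothetical config sketch -- all field names here are invented for
# illustration; open the repo's actual config file for the real ones.
config = {
    "corpus_path": "path/to/your/corpus.txt",  # input text to process
    "output_dir": "path/to/results/",          # where results are written
    "pipeline": "spacy",                       # e.g. stanza / spacy / huggingface / corenlp
    "use_gpu": False,                          # set True if you installed requirements_gpu.txt
}
```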
The pipelines:
- StanzaNlpPipeline
- SpacyNlpPipeline
- HuggingfaceNlpPipeline
- StanzaCoreNlpPipeline
have nearly the same structure and the same methods, and they all return results in the same format: a dictionary with the following keys:
- 'tokens': list of tokens
- 'lemmas': list of lemmas
- 'pos': list of pos tags
- 'morph': list of morphological analysis
- 'doc': the original doc object of the library
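As a sketch of how this looks in code: the import path, constructor arguments, and method name below are assumptions, not the documented API (see the notebooks for real examples), but the returned keys are the ones listed above:

```python
# Hypothetical usage -- the import path, constructor signature, and the
# name of the method that runs the pipeline are assumptions; see the
# notebooks in `notebooks/` for the actual API.
from nlp_pipeline import SpacyNlpPipeline

pipeline = SpacyNlpPipeline(model="en_core_web_sm")   # hypothetical signature
result = pipeline.process("The cats were sleeping.")  # hypothetical method name

# The returned keys are the documented ones:
for tok, lem, pos, morph in zip(
    result["tokens"], result["lemmas"], result["pos"], result["morph"]
):
    print(tok, lem, pos, morph)
```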
On an RTX A4000 GPU (8 GB), applying the NLP pipeline to a 10-million-word corpus took:
- ~70 minutes for Stanza (GPU)
- ~20 minutes for spaCy trf (GPU)
- ~14 minutes for spaCy lg (CPU: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16)
Embeddings can be built with the following models: Word2Vec, FastText, Doc2Vec, LDA, LSA, ELDA, and HDP.
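These model names match gensim's offerings, so the embedding pipeline is presumably built on gensim. For reference, here is a minimal standalone Word2Vec example in plain gensim (this bypasses the repo's pipeline; the corpus and hyperparameters are illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: gensim expects an iterable of tokenized sentences.
sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sleeps", "on", "the", "sofa"],
]

# Illustrative hyperparameters (gensim 4.x API).
model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # keep every token in this toy corpus
    workers=4,        # parallel training threads
)

vector = model.wv["cat"]                          # 100-d embedding for "cat"
similar = model.wv.most_similar("cat", topn=3)    # nearest neighbors by cosine similarity
```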
TODO:
- [ ] Add Hugging Face models to the embeddings pipeline, i.e. make it possible to fine-tune CamemBERT embeddings on the data
- [ ] Add the hops parser to the options: https://github.com/hopsparser/hopsparser
- [ ] Add unit tests