This repo wraps several NLP libraries into a uniform pipeline object, making it easy to use them all in one project or to compare them with each other.
You can use this pipeline to do the following (a short raw-spaCy sketch of these operations follows the list):
- Tokenization
- Stemming
- Lemmatization
- POS tagging
- Morphological analysis
- Building embeddings on corpora
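For reference, here is roughly what most of these operations look like in raw spaCy, one of the wrapped libraries. This is a minimal sketch using the standard spaCy API, not this repo's pipeline interface; the model name is just an example:

```python
import spacy

# Requires the model to be installed first:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The cats were sleeping on the mats.")

# Tokenization, lemmatization, POS tagging, and morphological analysis
# are all attributes of each token in the processed doc.
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.morph)
# e.g. "cats cat NOUN Number=Plur"
```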
The supported libraries are Stanza, spaCy, Hugging Face Transformers, and Stanford CoreNLP (through Stanza).

To get started:
- Clone the repo: `git clone https://github.com/binbin83/nlp_pipeline.git`
- Create a virtual environment: `python3 -m venv path/to/venv/nlp_pipeline`
- Install the requirements (if you want to use a GPU, install `requirements_gpu.txt` instead): `pip install -r requirements.txt`
- Read the example notebooks in the `notebooks` folder
- Update the config file with your own paths and parameters
- Run the pipeline: `python3 main_nlp.py` or `python3 main_embeddings.py`
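The config file's actual format isn't documented in this section, so the sketch below is purely hypothetical: every field name is invented, and it only illustrates the kind of paths and parameters you will need to fill in. Check the config file shipped with the repo for the real structure.

```python
# Hypothetical config sketch -- all field names here are invented for
# illustration; open the repo's actual config file for the real ones.
config = {
    "corpus_path": "path/to/your/corpus.txt",  # input text to process
    "output_dir": "path/to/results/",          # where results are written
    "pipeline": "spacy",                       # e.g. stanza / spacy / huggingface / corenlp
    "use_gpu": False,                          # set True if you installed requirements_gpu.txt
}
```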
The pipelines:
- StanzaNlpPipeline
- SpacyNlpPipeline
- HuggingfaceNlpPipeline
- StanzaCoreNlpPipeline
have nearly the same structure and the same methods, and they all return results in the same format: a dictionary with the following keys:
- 'tokens': list of tokens
- 'lemmas': list of lemmas
- 'pos': list of pos tags
- 'morph': list of morphological analysis
- 'doc': the original doc object of the library
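As a sketch of how this looks in code: the import path, constructor arguments, and method name below are assumptions, not the documented API (see the notebooks for real examples), but the returned keys are the ones listed above:

```python
# Hypothetical usage -- the import path, constructor signature, and the
# name of the method that runs the pipeline are assumptions; see the
# notebooks in `notebooks/` for the actual API.
from nlp_pipeline import SpacyNlpPipeline

pipeline = SpacyNlpPipeline(model="en_core_web_sm")   # hypothetical signature
result = pipeline.process("The cats were sleeping.")  # hypothetical method name

# The returned keys are the documented ones:
for tok, lem, pos, morph in zip(
    result["tokens"], result["lemmas"], result["pos"], result["morph"]
):
    print(tok, lem, pos, morph)
```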
On an RTX A4000 GPU (8 GB), applying the NLP pipeline to a 10-million-word corpus took:
- ~70 minutes for Stanza (GPU)
- ~20 minutes for spaCy trf (GPU)
- ~14 minutes for spaCy lg (CPU: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16)
Embeddings can be built with the following models: Word2Vec, FastText, Doc2Vec, LDA, LSA, ELDA, and HDP.
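These model names match gensim's offerings, so the embedding pipeline is presumably built on gensim. For reference, here is a minimal standalone Word2Vec example in plain gensim (this bypasses the repo's pipeline; the corpus and hyperparameters are illustrative):

```python
from gensim.models import Word2Vec

# Toy corpus: gensim expects an iterable of tokenized sentences.
sentences = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "sleeps", "on", "the", "sofa"],
]

# Illustrative hyperparameters (gensim 4.x API).
model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    min_count=1,      # keep every token in this toy corpus
    workers=4,        # parallel training threads
)

vector = model.wv["cat"]                          # 100-d embedding for "cat"
similar = model.wv.most_similar("cat", topn=3)    # nearest neighbors by cosine similarity
```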
TODO:
- [ ] Add Hugging Face models to the embeddings pipeline, i.e. make it possible to fine-tune CamemBERT embeddings on the data
- [ ] Add the hops parser to the options: https://github.com/hopsparser/hopsparser
- [ ] Add unit tests