Jan Fillies | 3pc GmbH
Service for topic modeling. Topic modeling is the process of learning, recognizing, and extracting topics across a collection of documents. This service provides several algorithms for testing topic modeling on data supplied by a crawler service that is also included in the project. It is part of the Qurator project and aims to make huge archives easier to organize by grouping their contents by topic; it also has use cases in the story editor.
- GloVe vocab. Download from: tbd. Put vocab.txt in topic-modeling\tm_service\resources
- Crawled data. Download from: tbd. Put the data in topic-modeling\tm_service\resources\crawled
- python -m spacy download de_core_news_md
- python -m spacy download en_core_web_md
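
A quick way to confirm the prerequisites above are in place is sketched below. It is not part of the service: `missing_prerequisites` and the `repo_parent` parameter are hypothetical names, and the assumption is that the path from the list is resolved relative to the directory containing `topic-modeling`. It relies on the fact that models fetched with `spacy download` are installed as importable Python packages. The crawled data is not checked because its file names are not specified here.

```python
from importlib.util import find_spec
from pathlib import Path

# File path and model names are taken from the requirements list above.
REQUIRED_FILES = [
    Path("topic-modeling") / "tm_service" / "resources" / "vocab.txt",
]
REQUIRED_MODELS = ["de_core_news_md", "en_core_web_md"]


def missing_prerequisites(repo_parent="."):
    """Return the required files and spaCy model packages that were not found.

    repo_parent is assumed to be the directory that contains `topic-modeling`.
    """
    missing = []
    for rel in REQUIRED_FILES:
        if not (Path(repo_parent) / rel).is_file():
            missing.append(rel.as_posix())
    for name in REQUIRED_MODELS:
        # A downloaded spaCy model is an installed package, so it has an import spec.
        if find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for item in missing_prerequisites():
        print("missing:", item)
```

Running the script from the directory that contains `topic-modeling` prints one line per missing item and nothing if everything is present.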
#####Kmeans: Standard clustering algorithm, but the number of clusters must be specified.
#####DBsearch: Advanced clustering algorithm. No cluster number is needed, but a distance and a minimum count per cluster are.
#####LSA: Super fast, but medium to poor results.
#####LDA: Slower, with good results.
#####LDA2VecModel: Slow, with better results, but no standard predict function.
#####HDP:
Good performance and average speed. The advantage is that the number of clusters is not needed, but it may find more clusters in the data than needed or expected.
LDA2vec cannot operate on a small data set and needs a lot of time on a medium-sized one. LDA underperformed LDA2vec on medium input sizes but can process a larger input in reasonable time. LDA2vec only works when the texts are longer, whereas LDA performs acceptably on short texts too.
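
To make the parameter difference between the two clustering entries concrete, here is a toy sketch in plain Python. It is not the service's implementation: it works on 2D points instead of document vectors, and it assumes that "DBsearch" refers to a density-based method in the style of DBSCAN. The function names and the sample points are illustrative only.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain K-means: the caller has to choose the number of clusters k."""
    centroids = [points[i] for i in range(k)]  # simple init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def density_clusters(points, eps, min_count):
    """DBSCAN-style clustering: no k, but a distance and a minimum count."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbours(i):
        # The point itself counts towards min_count.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_count:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_count:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Two tight groups plus one stray point at (5, 4).
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (5, 4)]

print(kmeans(points, k=2))                              # [0, 1, 0, 1, 0, 1, 0]
print(density_clusters(points, eps=1.5, min_count=3))   # [0, 1, 0, 1, 0, 1, -1]
```

With these sample points, K-means forces the stray point into one of the two requested clusters, while the density-based variant marks it as noise (`-1`) and discovers the number of clusters on its own.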
For closer insight into the algorithms, see the doc folder.
To see the full logs, make sure to:
Install the environment with:
conda env create -f environment.yml
Update the environment with:
conda env update -f environment.yml