Word vectors for Komi_Zyrian? #2
Comments
I have no idea about specific vectors, but maybe multilingual ones would be better? Some of those are surprisingly good at even very low-resource languages...
I don't know the current situation, unfortunately, but in case you develop any tools for Komi please keep us posted! There are quite large amounts of Komi texts online, but I don't know if anyone has published word vectors anywhere.
Hi @AngledLuffa, in the article archived at the Zenodo link below you will find vectors for Komi-Zyrian and some other Uralic languages: https://zenodo.org/records/7866456 The vectors have been lemmatized.
@rueter Thanks, that's hopefully helpful. One thing I wonder about is the lemmatization of the vocabulary. Stanza does POS tagging first before lemmatizing, so we can use the POS tags for the lemmas. Is there a way to expand the vocab to include the original forms of the words as well? As it stands, it seems about 32% of the treebank is covered by the word vectors, and hopefully including the forms that were lemmatized would increase that and get better results overall. (Although then I suppose there might not be much POS signal in the vectors.)
I suppose another reasonable option would be to lemmatize the text ourselves without POS, using UralicNLP as mentioned in the paper: https://github.com/mikahama/uralicNLP The coverage of the word vectors for the lemmas in this treebank is 64.5%, which is much more likely to be useful in my mind. Compare with the fasttext vectors for the language, which covered 57% but, as noted in the original post, were not actually any more accurate than randomly initializing our model.
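For anyone who wants to reproduce this kind of coverage comparison, here is a minimal sketch of how it could be computed from a CoNLL-U treebank. It is not how the 32% and 64.5% figures in this thread were obtained. The function names, the toy sentence, and the toy vocabulary are made up for illustration, and the choices to count per token (rather than per type) and to lowercase before lookup are assumptions.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def read_conllu_tokens(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (form, lemma) pairs from CoNLL-U lines.

    Skips comments, blank lines, multiword-token ranges (IDs like "1-2")
    and empty nodes (IDs like "1.1").
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            continue
        if "-" in cols[0] or "." in cols[0]:
            continue
        yield cols[1], cols[2]  # FORM is column 2, LEMMA is column 3


def coverage(items: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of items (token-level, lowercased) found in the vocabulary."""
    lowered: List[str] = [w.lower() for w in items]
    if not lowered:
        return 0.0
    return sum(w in vocab for w in lowered) / len(lowered)


# Toy data standing in for a real treebank and a lemmatized embedding vocab.
SAMPLE = [
    "# text = The cats ran",
    "1\tThe\tthe\tDET",
    "2\tcats\tcat\tNOUN",
    "3\tran\trun\tVERB",
    "",
]
TOY_VOCAB = {"the", "cat", "run"}  # hypothetical lemma-only vocabulary

pairs = list(read_conllu_tokens(SAMPLE))
form_cov = coverage((form for form, _ in pairs), TOY_VOCAB)
lemma_cov = coverage((lemma for _, lemma in pairs), TOY_VOCAB)
print(f"form coverage:  {form_cov:.1%}")
print(f"lemma coverage: {lemma_cov:.1%}")
```

With a lemma-only vocabulary the surface forms are covered far less often than the lemmas, which is the same pattern as the gap reported above. For a real run, the vocabulary would be read from the first column of the embedding file and the lines from the treebank's `.conllu` files.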
I spent a little time asking around, and the closest I found aside from Facebook's that was still available online was an embedding that needs lemmatization. This is kind of backwards for us, since we use POS to inform the lemmatization and POS needs the embeddings, but as a longer-term project we could connect a library such as UralicNLP to take the place of our lemmatizer and then use the lemmatized embeddings in the tagger. Some other embedding sets for Komi are reported in the literature, but unfortunately I was unable to download them and haven't heard back from the authors. Another thought is that I found this project: https://github.com/hangyav/anchor-embeddings/ In this project, they start from a word embedding from a related, higher-resource language, a corpus in the low-resource language (hopefully at least a few million tokens), and a dictionary between the two. Given the language family, perhaps the closest option for the "anchor" language would be Finnish? Is it possible to get a Komi-Finnish dictionary and a collection of Komi tokens suitable for such an embedding?
Do you have any suggestions for word vectors to use for Komi? I tried the fasttext Komi vectors with Stanza for POS and depparse, but it didn't make a huge difference. The tagger score went up by 1 F1, but the depparse score went down by 1 F1.