Word vectors for Komi_Zyrian? #2

Open
AngledLuffa opened this issue Nov 17, 2024 · 6 comments

Comments

@AngledLuffa

Do you have any suggestions for word vectors to use for Komi? I tried the fasttext Komi vectors with Stanza for POS and depparse, but it didn't make a huge difference. Tagger score went up 1 F1, but depparse score went down by 1 F1 instead.

@amir-zeldes

I have no idea about specific vectors, but maybe multilingual ones would be better? Some of those are surprisingly good at even very low resource languages...

@nikopartanen
Contributor

I don't know the current situation, unfortunately, but in case you develop any tools for Komi please keep us posted! There are quite large amounts of Komi texts online, but I don't know if anyone has published word vectors anywhere.

@rueter
Contributor

rueter commented Nov 19, 2024

Hi @AngledLuffa,
Take a look at this article: https://aclanthology.org/2023.resourceful-1.3

In the DOI-registered record below, you will find vectors for Komi-Zyrian and some other Uralic languages:

https://zenodo.org/records/7866456

The vectors have been lemmatized.

@AngledLuffa
Author

@rueter Thanks, that's hopefully helpful.

One thing I wonder about is the lemmatization of the vocabulary. Stanza does POS tagging first before lemmatizing, so we can use the POS tags for the lemmas. Is there a way to expand the vocab to include the original forms of the words as well? As it stands, it seems about 32% of the treebank is covered by the word vectors, and hopefully including the forms that were lemmatized would increase that and get better results overall. (Although then I suppose there might not be much POS signal in the vectors.)
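A minimal sketch of that expansion idea, assuming a form-to-lemma mapping is already available from some lemmatizer. The function name `expand_to_forms` and the toy entries are hypothetical and only for illustration; they do not come from the Zenodo vectors or from Stanza.

```python
from typing import Dict, List


def expand_to_forms(
    lemma_vectors: Dict[str, List[float]],
    form_to_lemma: Dict[str, str],
) -> Dict[str, List[float]]:
    """Return a vocabulary keyed by lemmas plus any surface forms whose lemma has a vector."""
    expanded = dict(lemma_vectors)
    for form, lemma in form_to_lemma.items():
        # Add a form only if its lemma is covered; never overwrite an existing entry.
        if lemma in lemma_vectors and form not in expanded:
            expanded[form] = lemma_vectors[lemma]
    return expanded


# Toy entries for illustration only (hypothetical, not taken from the real vectors).
lemma_vectors = {"керка": [0.1, 0.2], "мунны": [0.3, 0.4]}
form_to_lemma = {"керкаын": "керка", "муні": "мунны", "тані": "тан"}

expanded = expand_to_forms(lemma_vectors, form_to_lemma)
print(sorted(expanded))
```

Every form of a lemma ends up with the same vector here, which is exactly why the expanded vocabulary would carry little inflectional or POS signal.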

@AngledLuffa
Author

I suppose another reasonable option would be to lemmatize the text ourselves without POS, using UralicNLP as mentioned in the paper: https://github.com/mikahama/uralicNLP

The word vectors cover 64.5% of the lemmas in this treebank, which seems much more likely to be useful. Compare that with the fasttext vectors for the language, which covered 57% but, as noted above, were not actually any more accurate than randomly initializing our model.
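For reference, a sketch of how such coverage numbers could be computed from a CoNLL-U treebank. It assumes token-level coverage (each occurrence counts once) and exact string matching with no case folding; the helper names and the toy sentence are hypothetical, and the actual figures above may have been computed differently.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_conllu_tokens(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (FORM, LEMMA) for each syntactic word in CoNLL-U text."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank sentence separators and comment lines
        cols = line.split("\t")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if len(cols) < 3 or "-" in cols[0] or "." in cols[0]:
            continue
        yield cols[1], cols[2]


def coverage(tokens: Iterable[Tuple[str, str]], vocab: Set[str], use_lemma: bool) -> float:
    """Fraction of tokens whose form (or lemma) is present in the vocabulary."""
    total = hits = 0
    for form, lemma in tokens:
        key = lemma if use_lemma else form
        total += 1
        hits += key in vocab  # a bool adds as 0 or 1
    return hits / total if total else 0.0


def row(idx: str, form: str, lemma: str) -> str:
    """Build a 10-column CoNLL-U line with the unused columns left as '_'."""
    return "\t".join([idx, form, lemma] + ["_"] * 7)


# Toy sentence and vocabulary, for illustration only.
sample = ["# text = toy sentence", row("1", "керкаын", "керка"),
          row("2", "муні", "мунны"), row("3", "тані", "тан"), row("4", ".", "."), ""]
vocab = {"керка", "мунны", "."}

form_cov = coverage(read_conllu_tokens(sample), vocab, use_lemma=False)
lemma_cov = coverage(read_conllu_tokens(sample), vocab, use_lemma=True)
print(f"forms: {form_cov:.2%}, lemmas: {lemma_cov:.2%}")
```

Swapping in a type-level count (a set of distinct forms or lemmas) would give a different number, so it is worth stating which one is being reported.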

@AngledLuffa
Author

I spent a little time asking around. Aside from Facebook's vectors, the closest thing I found that was still available online was an embedding that needs lemmatization. This is kind of backwards for us, since we use POS to inform the lemmatization and the POS tagger needs the embeddings. As a longer-term project, though, we could connect a library such as UralicNLP to take the place of our lemmatizer and then use the lemmatized embeddings in the tagger.

Some other embedding sets for Komi are reported in the literature, but unfortunately I was unable to download them and haven't heard back from the authors.

Another thought is that I found this project:

https://github.com/hangyav/anchor-embeddings/

In this project, they start from three things: a word embedding for a related, higher-resourced language; a corpus in the low-resource language (LRL), hopefully at least a few million tokens; and a dictionary between the two. Given the language family, perhaps the closest option for the "anchor" language would be Finnish? Is it possible to get a Komi - Finnish dictionary and a collection of Komi tokens suitable for such an embedding?
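To make the ingredients concrete, here is a rough sketch of just the seeding step one might expect from such a setup: LRL words with a dictionary translation start from the anchor-language vector, and everything else starts random. This is an assumption about the general idea, not the anchor-embeddings project's actual code or algorithm, and the function name, dictionary entries, and vectors are all hypothetical toy values.

```python
import random
from typing import Dict, List, Set, Tuple


def seed_from_anchor(
    lrl_vocab: List[str],
    dictionary: Dict[str, List[str]],
    anchor_vectors: Dict[str, List[float]],
    dim: int,
    seed: int = 0,
) -> Tuple[Dict[str, List[float]], Set[str]]:
    """Initialise LRL vectors from anchor-language vectors where a translation exists."""
    rng = random.Random(seed)
    seeded: Dict[str, List[float]] = {}
    anchored: Set[str] = set()
    for word in lrl_vocab:
        # Keep only translations that actually have a vector in the anchor embedding.
        vecs = [anchor_vectors[t] for t in dictionary.get(word, []) if t in anchor_vectors]
        if vecs:
            # Average the vectors of all known translations, dimension by dimension.
            seeded[word] = [sum(col) / len(vecs) for col in zip(*vecs)]
            anchored.add(word)
        else:
            # No translation available: small random initialisation, to be trained later.
            seeded[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return seeded, anchored


# Toy Komi -> Finnish entries and 2-d vectors, for illustration only.
anchor_vectors = {"talo": [1.0, 0.0], "koti": [0.0, 1.0]}
dictionary = {"керка": ["talo", "koti"]}
seeded, anchored = seed_from_anchor(["керка", "тані"], dictionary, anchor_vectors, dim=2)
print(seeded["керка"], sorted(anchored))
```

The seeded vectors would still need to be trained on the Komi corpus; the sketch only shows where the dictionary and the anchor embedding enter.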
