Word vectors for Komi_Zyrian? #2

AngledLuffa · 2024-11-17T01:37:54Z

Do you have any suggestions for word vectors to use for Komi? I tried the fasttext Komi vectors with Stanza for POS and depparse, but they didn't make a huge difference. The tagger score went up by 1 F1, but the depparse score went down by 1 F1.

amir-zeldes · 2024-11-18T21:47:19Z

I have no idea about specific vectors, but maybe multilingual ones would be better? Some of those are surprisingly good at even very low resource languages...

nikopartanen · 2024-11-19T09:27:36Z

I don't know the current situation, unfortunately, but in case you develop any tools for Komi please keep us posted! There are quite large amounts of Komi texts online, but I don't know if anyone has published word vectors anywhere.

rueter · 2024-11-19T14:56:06Z

Hi @AngledLuffa,
Take a look at this article: https://aclanthology.org/2023.resourceful-1.3

In the DOI-registered record below, you will find vectors for Komi-Zyrian and some other Uralic languages:

https://zenodo.org/records/7866456

The vectors have been lemmatized.

AngledLuffa · 2024-11-19T18:03:39Z

@rueter Thanks, that's hopefully helpful.

One thing I wonder about is the lemmatization of the vocabulary. Stanza does POS tagging first before lemmatizing, so we can use the POS tags for the lemmas. Is there a way to expand the vocab to include the original forms of the words as well? As it stands, it seems about 32% of the treebank is covered by the word vectors, and hopefully including the forms that were lemmatized would increase that and get better results overall. (Although then I suppose there might not be much POS signal in the vectors.)
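A minimal sketch of the vocabulary expansion asked about above, assuming (form, lemma) pairs are available, for example from a treebank's FORM and LEMMA columns. The function name, the toy vectors, and the example words are all placeholders for illustration, not part of Stanza or the published vectors:

```python
def expand_vocab(vectors, form_lemma_pairs):
    """Return a copy of `vectors` in which each unseen surface form
    is mapped to the vector of its lemma, when the lemma has one."""
    expanded = dict(vectors)
    for form, lemma in form_lemma_pairs:
        if form not in expanded and lemma in vectors:
            expanded[form] = vectors[lemma]
    return expanded


# Toy lemma-keyed vectors (values are made up).
vectors = {"керка": [0.1, 0.2], "мунны": [0.3, 0.4]}

# Hypothetical (form, lemma) pairs; the last lemma has no vector.
pairs = [("керкаын", "керка"), ("муні", "мунны"), ("тані", "тані")]

expanded = expand_vocab(vectors, pairs)
```

This only helps for forms whose lemma is already known, and every inflected form of a lemma ends up with an identical vector, which is why there may be little POS signal left in the result.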

AngledLuffa · 2024-11-19T21:38:53Z

I suppose another reasonable option would be to lemmatize the text ourselves without POS, using UralicNLP as mentioned in the paper: https://github.com/mikahama/uralicNLP

The word vectors cover 64.5% of the lemmas in this treebank, which seems much more likely to be useful. Compare that with the fasttext vectors for the language, which covered 57% but, as noted above, were not actually any more accurate than randomly initializing our model.
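For reference, a rough sketch of how such coverage numbers can be computed. It counts the fraction of treebank tokens found in the embedding vocabulary, once by surface form and once by lemma. Whether the figures quoted in this thread are token-level or type-level is not stated, so token-level is an assumption here, and the data is a toy placeholder:

```python
def token_coverage(tokens, vocab):
    """Fraction of tokens (with repeats) that appear in `vocab`."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocab)
    return hits / len(tokens)


# Toy embedding vocabulary keyed by lemma (placeholder words).
vocab = {"керка", "мунны"}

# Parallel toy lists of surface forms and their lemmas.
forms = ["керкаын", "муні", "керка", "тані"]
lemmas = ["керка", "мунны", "керка", "тані"]

form_cov = token_coverage(forms, vocab)
lemma_cov = token_coverage(lemmas, vocab)
print(f"form coverage:  {form_cov:.1%}")
print(f"lemma coverage: {lemma_cov:.1%}")
```

With a lemmatized vocabulary, the lemma-based number is the one that reflects what the model would actually see if the input were lemmatized before embedding lookup.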

AngledLuffa · 2024-12-13T21:59:48Z

I spent a little time asking around, and the closest I found that was still available online, aside from Facebook's, was an embedding that needs lemmatization. This is kind of backwards for us, since we use POS to inform the lemmatization and POS needs the embeddings. As a longer-term project, though, we could connect a library such as UralicNLP to take the place of our lemmatizer and then use the lemmatized embeddings in the tagger.

Some other embedding sets for Komi are reported in the literature, but unfortunately I was unable to download them and haven't heard back from the authors.

Another thought is that I found this project:

https://github.com/hangyav/anchor-embeddings/

This project starts from three things: a word embedding for a related, higher-resourced language, a corpus in the low-resource language (LRL), hopefully at least a few million tokens, and a dictionary between the two. Given the language family, perhaps the closest option for the "anchor" language would be Finnish? Is it possible to get a Komi-Finnish dictionary and a collection of Komi tokens suitable for such an embedding?
