
German lemmatizer performance is bad? #1382

Open

Brentably opened this issue Apr 13, 2024 · 16 comments

@Brentably

Hello! I'm currently trying to use Stanza's German lemmatizer for a project I'm working on. As far as I can tell, this should be on par with the most accurate publicly available lemmatizers out there, if not the most accurate.

However, I'm really confused by the poor German performance. I get the following results when lemmatizing:

möchtest => möchtessen (should be mögen)
Willst => Willst (should be wollen)
sagst => sagst (should be sagen)
Sage => Sage (should be sagen)
aß => aß (should be essen)
Sprich => Sprich (should be sprechen)
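
For reference, here's roughly how I'm producing these, with the default German package (a minimal sketch, in case my setup is part of the problem):

import stanza

# Default German models (GSD); downloaded on first use.
stanza.download("de")
nlp = stanza.Pipeline("de", processors="tokenize,mwt,pos,lemma")

for form in ["möchtest", "Willst", "sagst", "Sage", "aß", "Sprich"]:
    word = nlp(form).sentences[0].words[0]
    print(f"{form} => {word.lemma}")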

These are all among the top ~50 verbs in German, and none of these inflections is rare, so I'm really confused by the performance. I did some digging and found that the HDT model should be more accurate, and it is, but the results are still unimpressive:

möchtest => möchtes (should be mögen)
Willst => Willst (should be wollen)
sagst => sagsen (should be sagen)
Sage => sagen (correct)
aß => assen (should be essen)
Sprich => sprechen (correct)

This gets 2/6 correct instead of 0/6, but of course that's still really poor.
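
For reference, this is how I'm selecting HDT (again a sketch; I believe the package argument is the right way to pick it):

import stanza

# Select the UD_German-HDT models instead of the default GSD ones.
stanza.download("de", package="hdt")
nlp_hdt = stanza.Pipeline("de", package="hdt")
print(nlp_hdt("möchtest").sentences[0].words[0].lemma)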

I recently found the website Cooljugator (https://cooljugator.com/de). You can look up a verb there, either conjugated or in the infinitive, and it seems to have near-perfect performance for all of these.

Can anyone explain or point me in the right direction?

I'm considering gathering a bunch of data and supplementing performance with my own lookup table right now, but I'd rather not spend the few days of effort that would require.
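
For what it's worth, here's the shape of the post-hoc fix I have in mind: check a hand-built table first and fall back to Stanza's prediction. OVERRIDES here is a toy example (a real table would probably also key on POS, to avoid noun/verb clashes like Sage), and nlp is the pipeline from above:

# Toy hand-built override table keyed on the lowercased surface form.
OVERRIDES = {
    "möchtest": "mögen",
    "willst": "wollen",
    "sagst": "sagen",
    "aß": "essen",
}

def lemma_with_overrides(word):
    # Prefer the table; otherwise keep the model's lemma.
    return OVERRIDES.get(word.text.lower(), word.lemma)

doc = nlp("Willst du das?")
print([lemma_with_overrides(w) for w in doc.sentences[0].words])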

Thanks!

@AngledLuffa (Collaborator)

The main issue is that the training data just doesn't have those verbs in it. If we had some kind of lexicon available with expected lemmas, we could include that, but AFAIK we don't. I can do some digging for that if you don't have suggestions.

One example which shows up in the training data with a different result is Sage. In each of the following sentences, the GSD training data has Sage -> Sage, since there Sage is the noun meaning "legend" or part of a name, not a form of the verb sagen:

# text = Der Sage nach wurden die Nelken 1270 vom Heer des französischen Königs Ludwig IX.
# text = Die Sage, deren historischer Gehalt nicht zu sichern ist, hat insofern ätiologische Funktion.
# text = In den 1920er Jahren hatte er Kontakt mit Cornelia Bentley Sage Quinton, die als erste Frau in den USA ein größeres Kunstmuseum leitete.

One thought which occurs to me is that the lemmatizer's model should perhaps take the assigned POS tag as an input, whereas it currently uses the POS only for the dictionary lookup. I wonder if that would help with lemmatizing unknown words.

@Brentably (Author)

> The main issue is that the training data just doesn't have those verbs in it. If we had some kind of lexicon available with expected lemmas, we could include that, but AFAIK we don't. I can do some digging for that if you don't have suggestions.

You mean some better lookup data? TBH I was just going to scrape some stuff, but I'd be happy to send it along.

Also, pardon my naiveté, but I'm just generally confused: isn't this state of the art for lemmatizers? Are the best lemmatizers all closed source and made in-house, or are there just not that many non-English lemmatizer-dependent applications? Is there another popular solution to this problem that I'm unaware of?

@AngledLuffa (Collaborator)

The performance was measured on the test portions of the datasets, so to the extent those are limited and don't cover some important cases, the test scores will reflect that limitation.

I don't know what the best German lemmatizer is, but I can take some time later this week, or chat with my PI, to figure out other sources of training data. I also think embedding the POS tags in the seq2seq model will likely help it decide whether to use a verb-style or a noun-style ending for unknown words in a language such as German.

@AngledLuffa (Collaborator)

Options for additional training data, from @manning. I think the two main choices are:

- https://github.com/Liebeck/IWNLP.Lemmatizer (uses Wikidict, probably good for the future)
- https://github.com/WZBSocialScienceCenter/germalemma (says it is unmaintained)

I also have high hopes that using the POS as an input embedding to the seq2seq will help, but @manning points out that there are a lot of irregulars in German which may or may not be helped by such an approach.
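
Schematically, the change is just to concatenate a learned tag embedding onto each character embedding before the encoder. A toy sketch with made-up sizes, not Stanza's actual architecture:

import torch
import torch.nn as nn

class CharEncoderWithPOS(nn.Module):
    """Toy encoder: each character embedding is concatenated with an
    embedding of the word's POS tag before going through the LSTM."""
    def __init__(self, n_chars, n_pos, char_dim=64, pos_dim=16, hidden=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.pos_emb = nn.Embedding(n_pos, pos_dim)
        self.lstm = nn.LSTM(char_dim + pos_dim, hidden, batch_first=True)

    def forward(self, char_ids, pos_id):
        # char_ids: (batch, seq_len); pos_id: (batch,)
        chars = self.char_emb(char_ids)          # (B, T, char_dim)
        pos = self.pos_emb(pos_id).unsqueeze(1)  # (B, 1, pos_dim)
        pos = pos.expand(-1, chars.size(1), -1)  # (B, T, pos_dim)
        return self.lstm(torch.cat([chars, pos], dim=-1))

enc = CharEncoderWithPOS(n_chars=100, n_pos=18)  # 17 UPOS tags + padding
out, _ = enc(torch.randint(0, 100, (2, 8)), torch.tensor([0, 1]))
print(out.shape)  # torch.Size([2, 8, 128])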

I don't expect to get to this in the next couple days, but perhaps next week or so I can start in on it

@Brentably (Author)

I scraped ~5000 words of data from a conjugation/declension website. It seems to be high quality.

@AngledLuffa (Collaborator)

That does sound like it could be a useful resource!

@Brentably (Author)

Sent you an email!

@AngledLuffa (Collaborator)

I started going through the lemma sheet you sent, thinking we could add that as a new lemmatizer model in the next version. (Which will hopefully be soon.)

One thing I came across in my investigation is a weirdness in the GSD lemmas for some words, but not all:

UniversalDependencies/UD_German-GSD#35

I also found some inconsistencies in the JSON you'd sent us. (Was that script in TypeScript?)

For example, early on, words that translate as "few" and "at least" are included under the same lemma:

{
    "word": "wenig",
    "pos": "adj",
    "versions": [
      "weniger",
      "wenigen",
      "wenigem",
      "wenige",
      "weniges",
      "wenig",
      "minder",
      "mindesten"
    ]
  },

wenig and mindesten translate differently on Google Translate, and mindesten is treated as its own lemma in GSD.

Also treated differently in GSD: welches -> welcher, not welch, and the POS is DET:

33      welches welcher DET     PRELS   Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel       37      obj     _       _

 {
    "word": "welch, -e, -er, -es",
    "pos": "pron",
    "versions": ["welch", "welche", "welcher", "welches", "welchen", "welchem"]
  },

There are some unusual POS values in the data you sent us:

These forms of Mann should have a POS of NOUN, but the POS given is "der":

{
    "word": "Mann",
    "pos": "der",
    "versions": ["Mann", "Mannes", "Manns", "Manne", "Männer", "Männern"]
  },

Also should be NOUN:

{
    "word": "Kind",
    "pos": "das",
    "versions": ["Kind", "Kindes", "Kinds", "Kinde", "Kinder", "Kindern"]
  },

Ambiguous POS values are hard for us to resolve in an automated fashion:

{
    "word": "kein",
    "pos": "pron/art",
    "versions": ["kein", "keines", "keine", "keinem", "keinen", "keiner"]
  },

Not sure what to do with:

  { "word": "nichts, nix", "pos": "pron", "versions": ["nichts", "nix"] },
  { "word": "nun, nu", "pos": "adv", "versions": ["nun", "nu"] },

Another example of a POS that isn't a UPOS:

  { "word": "Frage", "pos": "die", "versions": ["Frage", "Fragen"] },
  { "word": "Hand", "pos": "die", "versions": ["Hand", "Hände", "Händen"] },

If you can resolve these or suggest how to resolve them, we can include this in the lemmatizer. Certainly, adding a long list of verb, noun, and adjective conjugations and declensions would be quite useful for avoiding future German lemmatizer mistakes.
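
For concreteness, here is a sketch of how entries in the format above could be normalized into (form, UPOS, lemma) pairs. The mapping choices (gendered articles -> NOUN, "art" -> DET, splitting "pron/art" and comma-separated headwords) are guesses at the intent rather than a settled convention, and the filename is assumed:

import json

# Map the scraped POS values onto UPOS; the articles encode gender, hence NOUN.
POS_MAP = {
    "der": "NOUN", "die": "NOUN", "das": "NOUN",
    "adj": "ADJ", "adv": "ADV", "pron": "PRON",
    "art": "DET", "verb": "VERB",
}

def to_lemma_pairs(entry):
    # "nichts, nix" lists two headwords: take the first as the lemma.
    # (Lemma choice still needs review, e.g. GSD wants welcher, not welch.)
    lemma = entry["word"].split(",")[0].strip()
    # "pron/art" is ambiguous: emit one pair per candidate UPOS.
    tags = [POS_MAP.get(p.strip(), p.strip().upper()) for p in entry["pos"].split("/")]
    for form in entry["versions"]:
        for upos in tags:
            yield form, upos, lemma

with open("german_words_with_inflections.json", encoding="utf-8") as f:
    for entry in json.load(f):
        for form, upos, lemma in to_lemma_pairs(entry):
            print(f"{form}\t{upos}\t{lemma}")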

@Brentably (Author)

Yes, the script was in TypeScript.

Is it necessary to have the part of speech in the data? I have an improved list that I also validated with LLMs and cleaned up a decent amount, but I stopped including the part of speech.

Sent another email with the new list.

@Brentably (Author)

Also, the "der" and "das" on the POS represents the gender, which is why it's just not marked as NOUN, btw

@Brentably (Author)

Anyway, if we need to add the part of speech back, I suggest just running the data through Claude or o1 to generate the parts of speech, which I'm happy to do. LMK how I can help!

Thanks

@AngledLuffa (Collaborator)

I haven't totally forgotten this thread...

I found this repo, or rather @manning sent it to me:

https://github.com/gambolputty/german-nouns?tab=readme-ov-file

It gets its information from German Wiktionary and is pretty easy to convert into fake training data for our German lemmatizer. If you happen to know how to do the same thing for verbs, adjectives, etc., that would help a lot! Either way, I'll add the nouns to the default German lemmatizer and see if it helps.
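
The conversion could look roughly like this. Note the "nouns.csv" filename and the "lemma"/"flexion ..." column names are assumptions about that repo's layout and should be checked against the actual file:

import csv

# Assumed layout: a 'lemma' column plus a set of 'flexion {...}' columns
# holding the inflected forms.
pairs = set()
with open("nouns.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        lemma = row["lemma"]
        for col, form in row.items():
            if col.startswith("flexion") and form:
                pairs.add((form, "NOUN", lemma))

# Each (form, NOUN, lemma) triple becomes one fake training example.
for form, upos, lemma in sorted(pairs)[:5]:
    print(form, upos, lemma)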

There's also an improvement from UD 2.15, in which @dan-zeman updated a long list of German lemmas which had been written ambiguously. That fixed data should already be in the Stanza 1.10 lemmatizer for German.

@AngledLuffa (Collaborator)

This suggests to me that the equivalent verb data should be possible to reconstruct:

https://en.wiktionary.org/wiki/Category:German_verbs
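
The page titles in that category are retrievable with the standard MediaWiki API, roughly like so (following the API's continue tokens to page through):

import requests

# List pages in Category:German_verbs via the MediaWiki API.
API = "https://en.wiktionary.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Category:German_verbs",
    "cmlimit": "500",
    "format": "json",
}
titles = []
while True:
    data = requests.get(API, params=params).json()
    titles += [m["title"] for m in data["query"]["categorymembers"]]
    if "continue" not in data:
        break
    params.update(data["continue"])
print(len(titles), titles[:5])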

@Brentably (Author)

german_inflection_to_roots.json
german_words_with_inflections.json

These are what I've scraped and use for my app. They should work pretty well.

@AngledLuffa (Collaborator)

I made it so that the default German package now includes all of the nouns, verbs, adjectives, and adverbs found in German Wiktionary.

There are a couple of issues still outstanding.

@AngledLuffa (Collaborator)

Alright, I extracted more of Wiktionary by paying attention to the verb pages which only have "inflected" forms:

>>> pipe("möchtest")
[
  [
    {
      "id": 1,
      "text": "möchtest",
      "lemma": "mögen",
      "upos": "VERB",
      "xpos": "VVFIN",
      "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
      "start_char": 0,
      "end_char": 8,
      "misc": "SpaceAfter=No"
    }
  ]
]

There's still the ß/ss question, which I haven't thought about too much. If you have a strong opinion, LMK and I can implement that, too.
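
One cheap option, if the worry is Swiss-style "ss" spellings not matching dictionary keys spelled with "ß": normalize both sides to "ss" at lookup time, since that direction is lossless. A toy sketch:

def norm(s: str) -> str:
    # ß -> ss is unambiguous; the reverse direction is not.
    return s.lower().replace("ß", "ss")

# Dictionary keys stored pre-normalized.
lemma_dict = {("ass", "VERB"): "essen"}

def lookup(form, upos):
    return lemma_dict.get((norm(form), upos))

print(lookup("aß", "VERB"))  # -> essen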
