Lemmas with two options #35

Open
AngledLuffa opened this issue Aug 8, 2024 · 15 comments

Comments

@AngledLuffa

Came across a few words where the lemmas are apparently one of two options. This is a little inconvenient in terms of learning how to lemmatize German. Is there a way to unify these? For example, the ge- form of verbs is usually lemmatized without the ge, but for some of these examples it's allowing forms with or without ge to be the lemma.

# sent_id = train-s3771
# text = PDP - 11 - Rechner waren als Weiterentwicklung der PDP - 8 für die gleichen Einsatzzwecke gedacht und später in Gehäusen verfügbar, die nicht größer waren als die moderner PCs.
17      gedacht denken|gedenken VERB    VVPP    VerbForm=Part   0       root    _       _

# sent_id = train-s484
# text = Mir hat es bei Ihnen sehr gefallen.
7       gefallen        fallen|gefallen VERB    VVPP    VerbForm=Part   0       root    _       SpaceAfter=No

# sent_id = train-s495
# text = Das Dart - Spielen ist gesellig, das Bier schmeckt, man kommt mit den Gästen schnell ins Gespräch.
4       Spielen Spiel|Spielen   NOUN    NN      Case=Nom|Gender=Neut|Number=Sing        2       compound        _       _

# sent_id = train-s497
# text = Vom Wirt über Speisen und Preise.
5       Speisen Speise|Speisen  NOUN    NN      Case=Acc|Gender=Fem|Number=Plur 3       conj    _       _

# sent_id = train-s559
# text = Die Montage war tiptop und termingerecht.
2       Montage Montag|Montage  NOUN    NN      Case=Nom|Gender=Fem|Number=Sing 4       nsubj   _       _

... there are others aside from these

@amir-zeldes

These look like ambiguous strings which could have either lemma if context is ignored, but the lemma is actually unambiguous in context. For example, the word "Montage" in the last example is ambiguous between "Mondays" and "mounting/assembly", but it is definitely the latter in context (the former would also have to be plural and the FEATS show you it isn't), so "Montage" (mounting) is the correct lemma in context.

@AngledLuffa
Author

AngledLuffa commented Aug 9, 2024 via email

@amir-zeldes

Yes, these can all be disambiguated IMO, but only some of them are trivial or close to it. Spielen can only have the lemma Spiel if it's dative plural, so that's easy.

Gedacht is non-trivial, though it's probably 99% denken. Cases of the verb gedenken usually have the rare genitive case object, but that's not 100% guaranteed. In practice, it's probably fine to say "denken unless it has a genitive dependent"?

Speisen has the lemma Speise if it's plural; otherwise it's Speisen.

Gefallen is maybe the hardest here since both verbs are not uncommon. I would say if it has an auxiliary with the lemma sein it's probably fallen, otherwise gefallen.
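The rules of thumb above could be written down roughly as follows. This is only a sketch of the heuristics as stated in this comment, not a tested cleaning script: the token dictionaries (`id`, `lemma`, `feats`, `head`, `deprel`) and the function names are assumptions made for illustration, and the rules rely on the FEATS and dependency annotation being correct.

```python
def parse_feats(feats):
    """Turn a FEATS string such as 'Case=Dat|Number=Plur' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def dependents(token, sentence):
    """Return all tokens in the sentence whose HEAD is the given token."""
    return [t for t in sentence if t["head"] == token["id"]]


def disambiguate(token, sentence):
    """Pick one lemma for a known A|B lemma, or return the lemma unchanged."""
    lemma = token["lemma"]
    feats = parse_feats(token["feats"])
    deps = dependents(token, sentence)

    if lemma == "Spiel|Spielen":
        # "Spielen" belongs to "Spiel" only in the dative plural.
        is_dat_plur = feats.get("Case") == "Dat" and feats.get("Number") == "Plur"
        return "Spiel" if is_dat_plur else "Spielen"
    if lemma == "Speise|Speisen":
        return "Speise" if feats.get("Number") == "Plur" else "Speisen"
    if lemma == "denken|gedenken":
        # "denken unless it has a genitive dependent"
        has_gen = any(parse_feats(d["feats"]).get("Case") == "Gen" for d in deps)
        return "gedenken" if has_gen else "denken"
    if lemma == "fallen|gefallen":
        # An auxiliary with the lemma "sein" suggests "fallen".
        has_sein = any(
            d["deprel"].startswith("aux") and d["lemma"] == "sein" for d in deps
        )
        return "fallen" if has_sein else "gefallen"
    return lemma


# Token 7 is from train-s484 above; the annotation of "hat" is hypothetical.
sentence = [
    {"id": 2, "lemma": "haben", "feats": "_", "head": 7, "deprel": "aux"},
    {"id": 7, "lemma": "fallen|gefallen", "feats": "VerbForm=Part",
     "head": 0, "deprel": "root"},
]
print(disambiguate(sentence[1], sentence))  # gefallen
```

Since the gedacht and gefallen rules are only "probably" right, the output for those two would still be worth a manual check.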

@dan-zeman
Member

There are 246 such ambiguous lemma strings (in 791 instances). Ideally they should be disambiguated; but I'm afraid it means mostly manual work.
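A count of this kind can be reproduced with a short script. The following is a minimal sketch, not the script behind the numbers in this thread: it assumes the standard 10-column, tab-separated CoNLL-U layout and treats any LEMMA with two or more `|`-separated alternatives as ambiguous. The function name `count_ambiguous_lemmas` is made up for illustration.

```python
from collections import Counter


def count_ambiguous_lemmas(conllu_text):
    """Count LEMMA values of the form A|B in CoNLL-U text.

    Returns a Counter mapping each ambiguous lemma string to its
    number of instances.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank lines and comment lines such as "# sent_id = ..."
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        lemma = cols[2]
        parts = lemma.split("|")
        # Require at least two non-empty alternatives, so that a literal
        # "|" token is not mistaken for an ambiguous lemma.
        if len(parts) > 1 and all(parts):
            counts[lemma] += 1
    return counts


sample = "\n".join([
    "# sent_id = train-s484",
    "7\tgefallen\tfallen|gefallen\tVERB\tVVPP\tVerbForm=Part\t0\troot\t_\tSpaceAfter=No",
    "",
    "# sent_id = train-s559",
    "2\tMontage\tMontag|Montage\tNOUN\tNN\tCase=Nom|Gender=Fem|Number=Sing\t4\tnsubj\t_\t_",
])
counts = count_ambiguous_lemmas(sample)
print(len(counts), "lemma types,", sum(counts.values()), "instances")
```

Run over the treebank files, `len(counts)` would give the number of ambiguous lemma strings and `sum(counts.values())` the number of instances.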

dan-zeman added 2 commits that referenced this issue Aug 22, 2024
@AngledLuffa
Author

AngledLuffa commented Aug 22, 2024 via email

@dan-zeman
Member

Feel free to do cleaning on your end. In any case, you are training on the output of an old, pre-neural lemmatizer, you know that? (Although some of the data points have been checked manually, the dataset as a whole is still in the category "Lemmas: automatic".)

Picking the most likely one means you know what is the most likely one. In principle, you should answer that question 246 times, separately for each lemma string. I think in the end I will ignore the principle and try some heuristics that will target multiple lexemes at once. But I do not promise that the problem will disappear completely before the next release.

dan-zeman added 11 commits that referenced this issue Aug 23, 2024
@dan-zeman
Member

Down to 122 lemma types, 455 instances.

@AngledLuffa
Author

Thanks, the progress here is very helpful.

In terms of automated lemmas... presumably there was some effort made to make those accurate? The goal is to memorize the known lemmas and try to predict the right lemma for a previously unseen word, a situation which makes the A|B lemmas rather distressing for our users.

dan-zeman added a commit that referenced this issue Aug 23, 2024
@AngledLuffa
Author

Still a few of these in the dataset. For example, fallen|gefallen, bieten|gebieten, and fahren|führen among others in the dev set. Can I request some further help cleaning these up?

@amir-zeldes

Had a quick look:

  • fallen: only in train-s9617 and train-s12662 - the remaining 9 cases are gefallen
  • gebieten: only in train-s1325, train-s8570 and train-s13494, otherwise bieten
  • führen - in all cases, never fahren in GSD
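For anyone cleaning on their own end, the findings in this list could be applied as a small override table. This is a sketch under assumptions: the sentence IDs and lemma choices come from the list above, while the table layout and the name `resolve` are hypothetical, and any sentence not listed simply falls back to the default stated in the list.

```python
# Lemma pairs where a few listed sentences take the rarer reading.
# Format: ambiguous lemma -> (sentence IDs, lemma there, lemma elsewhere)
EXCEPTIONS = {
    "fallen|gefallen": ({"train-s9617", "train-s12662"}, "fallen", "gefallen"),
    "bieten|gebieten": (
        {"train-s1325", "train-s8570", "train-s13494"},
        "gebieten",
        "bieten",
    ),
}

# Lemma pairs that always resolve the same way in GSD.
ALWAYS = {"fahren|führen": "führen"}


def resolve(sent_id, lemma):
    """Return a single lemma for a known A|B lemma, else the lemma unchanged."""
    if lemma in ALWAYS:
        return ALWAYS[lemma]
    if lemma in EXCEPTIONS:
        sent_ids, rare, default = EXCEPTIONS[lemma]
        return rare if sent_id in sent_ids else default
    return lemma


print(resolve("train-s9617", "fallen|gefallen"))  # fallen
print(resolve("train-s484", "fallen|gefallen"))   # gefallen
```

Keying on the sentence ID alone assumes that no listed sentence contains two tokens with the same ambiguous lemma but different readings.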

@AngledLuffa
Author

Thank you! There are some similar ones such as entfahren|entführen and durchfahren|durchführen - do those generally follow the same rule?

@AngledLuffa
Author

Also, there were a couple in test which were already labeled fahren. Are these correct?

# sent_id = test-s839
# text = Religion und Machtkämpfe führen in San Juan Chamula zur Vertreibung
1       Religion        Religion        NOUN    NN      Case=Nom|Gender=Fem|Number=Sing 4       nsubj   _       _
2       und     und     CCONJ   KON     _       3       cc      _       _
3       Machtkämpfe     Machtkampf      NOUN    NN      Case=Nom|Gender=Masc|Number=Plur        1       conj    _       _
4       führen  fahren  VERB    VVFIN   Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   0       root    _       _

# sent_id = test-s785
# text = So fuhren auch diese mit ihren Traktoren auf und blockierten ihrerseits Eisenbahnstrecken -- vornehmlich im Rhonetal.
1       So      so      ADV     ADV     _       2       advmod  _       _
2       fuhren  fahren  VERB    VVFIN   Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin   0       root    _       _

@amir-zeldes

Yeah, all of those are *führen, not *fahren

@amir-zeldes

> Are these correct?

The first is wrong, should be führen. The second is correct.

@AngledLuffa
Author

Thank you! There are still some more, but this is good progress
