German stemmer possible improvements #161
Comments
Thanks for submitting this and sorry for taking an age to get to it. Some thoughts:
I think I need to look into this one more.
This looks good (or we could apply the change from #85 to remove both -erinnen and -erin).
Maybe it would be better to not stem morgenstern instead? The current conflation of morgenstern and morgen seems wrong really (morning and morningstar are related concepts but different enough that conflation seems unhelpful).
If system -> syst is the problematic case, maybe it would be better to prevent that happening instead? It's not conflating with another word like morgenstern, but I think it's good to consider if there's a better way to address this. I notice this appears to be due to -st before -em and the -stern case to be -st before -ern, but a simple restriction to only remove -em and -ern if not preceded by -st seems to affect cases we probably don't want to change. Are there any other
This seems good too.
#85 was particularly motivated by job listings and wanted to conflate e.g. "Verkäufer" and "Verkäuferin" by stemming them to the same stem. That makes sense there, but in other contexts it may be less helpful, and understemming is the safer option when unsure. On the "for #85" side, "Verkäuferin" and "Verkäufer" are arguably closer in meaning than "Verkäufer" is to other words we also currently stem to "verkauf", such as "verkaufen". Also, the modern trend seems to be away from gendered language (an example from English is that "actor" tends to be used regardless of gender and "actress" gets used less), and assuming similar trends in German (which from #153 I gather is the case), that also argues for the change from #85. I'm unsure what's best here, but having set down the points above I'm leaning towards the change from #85.
The -ers removal change only changes the stems for two words in the current
So I think this needs more investigation - if we can find more examples ending |
Re the "-em" removal to help words ending "system" vs "systems", I wonder if a better approach is to suppress the removal of "-em" when the word ends "-system" (or is "system"):
@OlgaGuselnikova That addresses all the "system" cases and should avoid the overstemming you mention. Were there any other cases this approach doesn't address?
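For illustration, the guard on "-em" removal for words ending "-system" could be sketched in Python roughly as follows. This is not the Snowball code; the function name `strip_em` is hypothetical, and the sketch models only this one suffix, without the region checks a real stemmer applies.

```python
def strip_em(word: str) -> str:
    """Remove a trailing -em unless the word is, or ends with, 'system'.

    Hypothetical helper for illustration only; not part of Snowball.
    """
    if word.endswith("system"):
        # Guard: leave 'system', 'betriebssystem', ... untouched
        return word
    if word.endswith("em"):
        return word[:-2]
    return word


for w in ("system", "betriebssystem", "gutem"):
    print(w, "->", strip_em(w))
```

With the guard in place, "system" keeps its full form, so it can share a stem with inflected forms such as "systems" once their endings are removed.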
Previously we would overstem words ending -system. This change means we now conflate e.g. "system" and "systemen". See #161
I've gone ahead and merged this change.
Looking more closely, I see one problematic case in our test vocabulary: "Keller" (cellar) and "Kellner" (waiter) are conflated by this change. It looks to me like this is because the latter originally meant something like "person who looks after a cellar" and the meaning has evolved, rather than this being a sign that this check is being added in the wrong place. It'd be nicer to avoid this and maybe there are other cases like this, but OTOH it improves many cases so making one worse might be acceptable.
If we handle this removal in step 1 then we avoid conflating
I'm running a script to collate a larger (and perhaps more modern) German wordlist from a de.wikipedia.org dump - then we can see how looks for a more comprehensive vocabulary list (and see if |
Testing on a larger list, there are more cases where if we do |
This improves 82 cases in the current sample data without making anything worse. Tests on a larger word list look good too. Partly addresses #161
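As a rough sketch of how a change like this can be checked against a word list: group the vocabulary by stem before and after the change, then look at the groups that only exist afterwards. Everything here is an assumption for illustration (the helper names, and the toy stemmers standing in for the old and new algorithm); it is not the script used in this thread.

```python
from collections import defaultdict
from typing import Callable, Iterable, Set, FrozenSet


def conflation_classes(words: Iterable[str],
                       stem: Callable[[str], str]) -> Set[FrozenSet[str]]:
    """Return the sets of distinct words that share a stem (size > 1 only)."""
    groups = defaultdict(set)
    for w in words:
        groups[stem(w)].add(w)
    return {frozenset(g) for g in groups.values() if len(g) > 1}


def newly_conflated(words: Iterable[str],
                    old_stem: Callable[[str], str],
                    new_stem: Callable[[str], str]) -> Set[FrozenSet[str]]:
    """Conflation classes produced by new_stem that old_stem did not produce."""
    words = list(words)
    return conflation_classes(words, new_stem) - conflation_classes(words, old_stem)


# Toy stand-ins (hypothetical): the "old" stemmer does nothing,
# the "new" one drops a final -n.
def toy_old(word: str) -> str:
    return word


def toy_new(word: str) -> str:
    return word[:-1] if word.endswith("n") else word


for group in newly_conflated(["artikel", "artikeln", "system"], toy_old, toy_new):
    print(sorted(group))
```

Each reported group still needs a human judgment: "artikel"/"artikeln" is a wanted conflation, whereas something like "Keller"/"Kellner" would show up the same way and is not.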
I had another look at this, and tried
This gives:
Of those, Overall this doesn't seem a worthwhile change. (I tried the extra There's only one motivating example for this change, and no feedback when I asked for it some time ago now. We could start an exception list of stems to not remove |
The change to remove @OlgaGuselnikova Do you have more examples of cases that the |
Hello, Snowball developers team!
I work on developing translation software. We use the Snowball algorithms in our product to find inflected forms of terms in texts. We have gathered feedback from our customers on the German stemming algorithm and developed some changes.
Example (word - stem by Snowball demo - stem by customized algorithm):
Förderer - ford - ford
Förderers - forder - ford
Förderern - ford - ford
-erinnen is replaced with -erin
There are already some discussions on feminine endings in German (#153, #85). We have opted to let our customers decide for themselves how a gendered word in German should be translated into a different language. Our addition to the algorithm simply provides a way to stem plural feminine nouns and singular feminine nouns in the same manner.
Example (word - stem by Snowball demo - stem by customized algorithm):
Politikerin - politikerin - politikerin
Politikerinnen - politikerinn - politikerin
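A minimal Python sketch of the -erinnen to -erin replacement, for illustration only: the function name `replace_erinnen` is hypothetical, and the sketch works on an already lowercased word and ignores the region checks the real stemmer applies.

```python
def replace_erinnen(word: str) -> str:
    """Map the feminine plural ending -erinnen to the singular -erin.

    Hypothetical helper for illustration; not the Snowball implementation.
    """
    if word.endswith("erinnen"):
        # Drop the trailing 'nen', leaving '...erin'
        return word[: -len("nen")]
    return word


for w in ("politikerin", "politikerinnen"):
    print(w, "->", replace_erinnen(w))
```

Both forms end up as "politikerin", so singular and plural feminine nouns share a stem while staying distinct from the masculine form.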
Example (word - stem by Snowball demo - stem by customized algorithm):
morgenstern - morgen - morgen
morgensterne - morgenstern - morgen
That change does lead to occasional overstemming. However, the word "systems" is often used in CS and engineering terminology, so it is crucial for our customers to find words like "...system" when searching for "...systems".
Example (word - stem by Snowball demo - stem by customized algorithm):
system - syst - syst
systems - system - syst
Example (word - stem by Snowball demo - stem by customized algorithm):
artikel - artikel - artikel
artikeln - artikeln - artikel
We have implemented those changes (including updating word lists), so if, after discussion, you find the changes (or some of them) useful, I can create a PR.
Standard suffix algorithms with the changes described above
Thank you for your time!