German stemmer possible improvements #161

Open
OlgaGuselnikova opened this issue Dec 22, 2021 · 10 comments

@OlgaGuselnikova

Hello, Snowball developers team!

I work on translation software. We use the Snowball algorithms in our product to find inflected forms of terms in texts. We have gathered feedback from our customers on the German stemming algorithm and developed some changes.

  1. Remove ending -ers

Example (word - stem by Snowball demo - stem by customized algorithm):
Förderer - ford - ford
Förderers - forder - ford
Förderern - ford - ford

  2. Feminine nouns

-erinnen is replaced with -erin

There are already some discussions on feminine endings in German (#153, #85). We have opted to let our customers decide for themselves how a gendered word in German should be translated into a different language. Our addition to the algorithm simply stems plural feminine nouns and singular feminine nouns in the same manner.

Example (word - stem by Snowball demo - stem by customized algorithm):
Politikerin - politikerin - politikerin
Politikerinnen - politikerinn - politikerin

  3. Remove -stern

Example (word - stem by Snowball demo - stem by customized algorithm):
morgenstern - morgen - morgen
morgensterne - morgenstern - morgen

  4. Remove ending -em

This change does lead to occasional overstemming. However, the word "systems" is common in CS and engineering terminology, so it is crucial for our customers to find words like "...system" when searching for "...systems".

Example (word - stem by Snowball demo - stem by customized algorithm):
system - syst - syst
systems - system - syst

  5. -ln replaced with -l

Example (word - stem by Snowball demo - stem by customized algorithm):
artikel - artikel - artikel
artikeln - artikeln - artikel
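
As a quick cross-check, the example rows above can be tabulated to see which word groups end up with a single stem under each algorithm. This is only a sketch in Python: the stems are copied from the examples in this issue rather than computed, and the names `EXAMPLES` and `conflated` are made up for illustration.

```python
# (word, stem by Snowball demo, stem by customized algorithm), copied from the
# examples in this issue. Nothing here runs the real stemmer.
EXAMPLES = {
    "-ers": [
        ("Förderer", "ford", "ford"),
        ("Förderers", "forder", "ford"),
        ("Förderern", "ford", "ford"),
    ],
    "-erinnen": [
        ("Politikerin", "politikerin", "politikerin"),
        ("Politikerinnen", "politikerinn", "politikerin"),
    ],
    "-stern": [
        ("morgenstern", "morgen", "morgen"),
        ("morgensterne", "morgenstern", "morgen"),
    ],
    "-em": [
        ("system", "syst", "syst"),
        ("systems", "system", "syst"),
    ],
    "-ln": [
        ("artikel", "artikel", "artikel"),
        ("artikeln", "artikeln", "artikel"),
    ],
}


def conflated(rows, column):
    """True if every row in the group has the same stem in the given column."""
    return len({row[column] for row in rows}) == 1


for rule, rows in EXAMPLES.items():
    print(rule, "demo:", conflated(rows, 1), "customized:", conflated(rows, 2))
```

For every group the demo column contains more than one stem, while the customized column contains exactly one.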

We have implemented these changes (including updating the word lists), so if after discussion you find the changes (or some of them) useful, I can create a PR.

Standard suffix algorithm with the changes described above:
define standard_suffix as (
    // proposed: remove -ers
    do (
        [substring] R1 among(
            'ers'
            (   delete
            )
        )
    )
    do (
        [substring] R1 among(
            // proposed: -erinnen -> -erin
            'erinnen'
            (   <- 'erin'
            )
            'em' 'ern' 'er'
            (   delete
            )
            'e' 'en' 'es'
            (   delete
                try (['s'] 'nis' delete)
            )
            's'
            (   s_ending delete
            )
        )
    )
    do (
        [substring] R1 among(
            // proposed: remove -stern
            'stern'
            (   delete
            )
            'en' 'er' 'est' 'em'
            (   delete
            )
            'st'
            (   st_ending hop 3 delete
            )
        )
    )
    do (
        [substring] R2 among(
            'end' 'ung'
            (   delete
                try (['ig'] not 'e' R2 delete)
            )
            'ig' 'ik' 'isch'
            (   not 'e' delete
            )
            'lich' 'heit'
            (   delete
                try (
                    ['er' or 'en'] R1 delete
                )
            )
            'keit'
            (   delete
                try (
                    [substring] R2 among(
                        'lich' 'ig'
                        (   delete
                        )
                    )
                )
            )
        )
    )
    // proposed: -ln -> -l
    do (
        [substring] R1 among(
            'ln'
            (   <- 'l'
            )
        )
    )
)

Thank you for your time!

@ojwb
Member

ojwb commented Aug 21, 2023

Thanks for submitting this and sorry for taking an age to get to it.

Some thoughts:

Remove ending -ers

I think I need to look into this one more.

-erinnen is replaced with -erin

This looks good (or we could apply the change from #85 to remove both -erinnen and -erin).

Remove -stern

Maybe it would be better to not stem morgenstern instead? The current conflation of morgenstern and morgen seems wrong really (morning and morningstar are related concepts but different enough that conflation seems unhelpful).

Remove ending -em

If system -> syst is the problematic case, maybe it would be better to prevent that happening instead? It's not conflating with another word like morgenstern, but I think it's good to consider if there's a better way to address this.

I notice this appears to be due to -st before -em and the -stern case to be -st before -ern, but a simple restriction to only remove -em and -ern if not preceded by -st seems to affect cases we probably don't want to change. Are there any other

-ln replaced with -l

This seems good too.

@ojwb
Member

ojwb commented Sep 20, 2023

> -erinnen is replaced with -erin
>
> This looks good (or we could apply the change from #85 to remove both -erinnen and -erin).

#85 was particularly motivated by job listings and wanted to conflate e.g. "Verkäufer" and "Verkäuferin" by stemming them to the same stem. That makes sense there but maybe in other contexts that's less helpful, and understemming is the safer option when unsure.

On the "for #85" side, "Verkäuferin" and "Verkäufer" are arguably closer in meaning than "Verkäufer" is to other words we also currently stem to "verkauf" such as "verkaufen". Also the modern trend seems to be away from gendered language (an example from English is that "actor" tends to be used regardless of gender and "actress" gets used less) and assuming similar trends in German (which from #153 I gather is the case) that also tends to argue for the change from #85.

I'm unsure what's best here, but having set down the points above I'm leaning towards the change #85.

@ojwb
Member

ojwb commented Sep 21, 2023

The -ers removal change only changes the stems for two words in the current german/voc.txt:

  • "regiereres" changes stem from "regier" to "regi" which means it now conflates with "regieren", "regierer" etc which seems better (though currently it conflates with "Regierung", which it wouldn't with the change).
  • "weltmeers" changes stem from "weltme" to "weltm" - this means it no longer stems the same way as other forms of "Weltmeer" (https://en.wiktionary.org/wiki/Weltmeer#German) which seems worse.

So I think this needs more investigation - if we can find more examples ending -ers and look at which are better and which worse, maybe we can come up with a rule like this but with a condition (e.g. "remove '-ers' unless preceded by a vowel" would work for the 3 examples we currently have) or in a different place in the algorithm.
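
To make the suggested condition concrete, here is a rough Python sketch of the "unless preceded by a vowel" check on its own. It is not Snowball code: it ignores the R1 restriction and every other step of the stemmer, `would_strip_ers` is a hypothetical helper name, and the vowel set is assumed to be the one the German stemmer uses (a, e, i, o, u, y, ä, ö, ü).

```python
# Assumed vowel set, matching the German stemmer's definition of a vowel.
VOWELS = set("aeiouyäöü")


def would_strip_ers(word):
    """True if the word ends in -ers and the letter before -ers is not a vowel.

    Models only the candidate guard, not R1 or any other stemming step.
    """
    w = word.lower()
    return w.endswith("ers") and len(w) > 3 and w[-4] not in VOWELS


for word in ("Förderers", "Weltmeers", "Förderer"):
    print(word, would_strip_ers(word))
```

Under this guard "Förderers" would lose -ers (the preceding letter is "r"), while "Weltmeers" would be left alone (the preceding letter is "e").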

@ojwb
Member

ojwb commented Oct 27, 2023

Re the "-em" removal to help words ending "system" vs "systems", I wonder if a better approach is to suppress the removal of "-em" when the word ends "-system" (or is "system"):

diff --git a/algorithms/german.sbl b/algorithms/german.sbl
index cd303b15..7dfa9c62 100644
--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -84,7 +84,10 @@ backwardmode (
     define standard_suffix as (
         do (
             [substring] R1 among(
-                'em' 'ern' 'er'
+                'em'
+                (   not 'syst' delete
+                )
+                'ern' 'er'
                 (   delete
                 )
                 'e' 'en' 'es'

@OlgaGuselnikova That addresses all the "system" cases and should avoid the overstemming you mention. Were there any other cases this approach doesn't address?
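
For readers less familiar with Snowball, the effect of the `not 'syst'` guard can be sketched in Python. This models only the single 'em' rule from the diff above, with a simplified R1 calculation that ignores the stemmer's prelude; `r1_start` and `strip_em` are hypothetical names, not part of any Snowball API.

```python
VOWELS = set("aeiouyäöü")


def r1_start(word):
    """Start of R1: just past the first non-vowel that follows a vowel,
    moved right if needed so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)


def strip_em(word, guard=True):
    """Remove a final -em lying inside R1, unless guarded by a preceding 'syst'."""
    cut = len(word) - len("em")
    if not word.endswith("em") or cut < r1_start(word):
        return word
    if guard and word[:cut].endswith("syst"):
        return word  # don't remove -em from words ending -system
    return word[:cut]


print(strip_em("system", guard=False))  # syst (behaviour without the guard)
print(strip_em("system"))               # system
print(strip_em("betriebssystem"))       # betriebssystem
print(strip_em("einem"))                # ein
```

The guard leaves every other -em word untouched by design, so only words ending in -system change.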

ojwb added a commit to snowballstem/snowball-data that referenced this issue Nov 9, 2023
ojwb added a commit that referenced this issue Nov 9, 2023
Previously we would overstem words ending -system.  This change means
we now conflate e.g. "system" and "systemen".

See #161
@ojwb
Member

ojwb commented Nov 9, 2023

> Re the "-em" removal to help words ending "system" vs "systems", I wonder if a better approach is to suppress the removal of "-em" when the word ends "-system" (or is "system"):

I've gone ahead and merged this change.

@ojwb
Member

ojwb commented Nov 9, 2023

> -ln replaced with -l
>
> This seems good too.

Looking more closely, I see one problematic case in our test vocabulary: "Keller" (cellar) and "Kellner" (waiter) are conflated by this change. It looks to me like this is because the latter originally meant something like "person who looks after a cellar" and the meaning has evolved, rather than this being a sign that this check is being added in the wrong place. It'd be nicer to avoid this and maybe there are other cases like this, but OTOH it improves many cases so making one worse might be acceptable.

@ojwb
Member

ojwb commented Oct 3, 2024

> -ln replaced with -l
>
> Looking more closely, I see one problematic case in our test vocabulary: "Keller" (cellar) and "Kellner" (waiter) are conflated by this change. It looks to me like this is because the latter originally meant something like "person who looks after a cellar" and the meaning has evolved, rather than this being a sign that this check is being added in the wrong place. It'd be nicer to avoid this and maybe there are other cases like this, but OTOH it improves many cases so making one worse might be acceptable.

If we handle this removal in step 1 then we avoid conflating Keller and Kellner without making anything else worse (at least in the sample vocabulary list snowball-data/german.voc.txt). This is the change I tested (removing -lns too means we still conflate rasseln and rasselns but doesn't affect anything else in the sample vocabulary):

--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -98,6 +98,9 @@ backwardmode (
                 's'
                 (   s_ending delete
                 )
+                'ln' 'lns'
+                (   <- 'l'
+                )
             )
         )
         do (
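
One way to see why the placement matters is a small Python model of the two orderings. This is a sketch under simplifying assumptions, not the real algorithm: only the 'er', 'ln' and 'lns' suffixes are modelled, the steps between step 1 and a trailing -ln rule are assumed not to fire for these words, and all function names are hypothetical.

```python
VOWELS = set("aeiouyäöü")


def r1_start(word):
    """Start of R1: just past the first non-vowel that follows a vowel,
    moved right if needed so that at least 3 letters precede it."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return max(i + 1, 3)
    return len(word)


def in_r1(word, suffix):
    """True if the word ends with suffix and the whole suffix lies in R1."""
    return word.endswith(suffix) and len(word) - len(suffix) >= r1_start(word)


def step1(word, with_ln):
    """Apply at most one suffix rule, longest suffix first (like a single among)."""
    rules = [("er", "")]
    if with_ln:
        rules += [("lns", "l"), ("ln", "l")]
    for suffix, replacement in sorted(rules, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            if in_r1(word, suffix):
                return word[: len(word) - len(suffix)] + replacement
            return word
    return word


def late_ln(word):
    """-ln -> -l as a separate rule applied after step 1 has already run."""
    stem = step1(word, with_ln=False)
    return stem[:-2] + "l" if in_r1(stem, "ln") else stem


def early_ln(word):
    """-ln / -lns handled inside step 1, competing with the other suffixes."""
    return step1(word, with_ln=True)


for word in ("keller", "kellner", "artikeln", "rasseln", "rasselns"):
    print(word, late_ln(word), early_ln(word))
```

In this model the late rule sees "kelln" (after -er has been removed from "kellner") and turns it into "kell", the same stem as "keller". Inside step 1, "kellner" matches 'er' rather than 'ln', so it stays "kelln" and the two words are no longer conflated.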

I'm running a script to collate a larger (and perhaps more modern) German wordlist from a de.wikipedia.org dump - then we can see how this looks for a more comprehensive vocabulary list (and see if -lns is worthwhile - if it really affects just a single word, it is probably not worthwhile).

ojwb added a commit to snowballstem/snowball-data that referenced this issue Oct 10, 2024
@ojwb
Member

ojwb commented Oct 10, 2024

Testing on a larger list, there are more cases where, if we do -ln -> -l, also doing -lns -> -l is useful. These are mostly nouns which happen to end -ln, so we now generate a stem which isn't linguistically correct for them, but that's not a problem in the intended domain of use: the only concern is whether it introduces unwanted conflation, and in practice it doesn't seem to. Pushing changes to implement that.

ojwb added a commit that referenced this issue Oct 10, 2024
This improves 82 cases in the current sample data without making
anything worse.  Tests on a larger word list look good too.

Partly addresses #161
ojwb added a commit to snowballstem/snowball-website that referenced this issue Oct 10, 2024
ojwb added a commit to snowballstem/snowball-data that referenced this issue Oct 11, 2024
ojwb added a commit that referenced this issue Oct 11, 2024
This conflates singular and plural female versions of nouns with
the male versions.

Fixes #85
Partly addresses #161
ojwb added a commit to snowballstem/snowball-website that referenced this issue Oct 11, 2024
@ojwb
Member

ojwb commented Oct 11, 2024

> Remove -stern
>
> Maybe it would be better to not stem morgenstern instead? The current conflation of morgenstern and morgen seems wrong really (morning and morningstar are related concepts but different enough that conflation seems unhelpful).

I had another look at this, and tried:

diff --git a/algorithms/german.sbl b/algorithms/german.sbl
index c0973c72..6de281f8 100644
--- a/algorithms/german.sbl
+++ b/algorithms/german.sbl
@@ -88,7 +88,11 @@ backwardmode (
                 (   not 'syst' // don't remove -em from words ending -system
                     delete
                 )
-                'ern' 'er'
+                'ern'
+                (   not ('st' R1) // don't remove -stern from morgenstern, etc
+                    delete
+                )
+                'er'
                 'erin' 'erinnen' // conflate female versions of nouns
                 (   delete
                 )

This gives:

A total of 21 words changed stem
* 10 words changed stem but aren't interesting
  1 merges of groups of stems:
  { morgenstern } + { morgensterne }
* 9 splits of groups of stems:
  { abend abende abendlichen abends | abendstern }
  { eisgeschwister | eisgeschwistern }
  { finster finstere finsteren finsteres | finstern }
  { geschwister | geschwistern }
  { höllengeister | höllengeistern }
  { leit leite leiten leiter leiterin leiters leitest | leitstern }
  { naturgeister | naturgeistern }
  { philister | philistern }
  { vorg | vorgestern }

Of those, morgenstern, abendstern and leitstern are all something-star cases which seem minor improvements; vorg doesn't seem to be a word; the other splits seem unhelpful. Also fenstern is now conflated with fensternische instead of with fenster+fensters, which seems slightly worse but reasonable.

Overall this doesn't seem a worthwhile change. (I tried the extra R1 check to try to reduce unwanted changes but it's not really helping - without it there were about twice as many changes, but not worthwhile overall either.)
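
For anyone wanting to reproduce this kind of merge/split summary on their own word list, the comparison can be sketched in Python as below. This is not the script used to produce the report above; `groups`, `partition` and `compare` are hypothetical names. The stems for morgenstern and morgensterne under "old" are the Snowball demo stems quoted earlier in this issue; every other stem in the sample data is an illustrative placeholder, not real stemmer output.

```python
from collections import defaultdict


def groups(stems):
    """Group words by stem. `stems` maps word -> stem."""
    by_stem = defaultdict(set)
    for word, stem in stems.items():
        by_stem[stem].add(word)
    return list(by_stem.values())


def partition(group, stems):
    """Split one group of words into sorted sub-lists by stem under `stems`."""
    parts = defaultdict(list)
    for word in sorted(group):
        parts[stems[word]].append(word)
    return sorted(parts.values())


def compare(old, new):
    """Return (changed words, merges, splits); old and new must share keys."""
    changed = sorted(w for w in old if old[w] != new[w])
    merges, splits = [], []
    for group in groups(new):  # a new group built from several old groups
        parts = partition(group, old)
        if len(parts) > 1:
            merges.append(parts)
    for group in groups(old):  # an old group now spread over several stems
        parts = partition(group, new)
        if len(parts) > 1:
            splits.append(parts)
    return changed, sorted(merges), sorted(splits)


# Sample data: mostly placeholder stems, see the note above.
old = {"morgenstern": "morgen", "morgensterne": "morgenstern",
       "geschwister": "geschwist", "geschwistern": "geschwist"}
new = {"morgenstern": "morgenstern", "morgensterne": "morgenstern",
       "geschwister": "geschwist", "geschwistern": "geschwistern"}

changed, merges, splits = compare(old, new)
print(len(changed), "words changed stem")
for parts in merges:
    print("merge:", " + ".join("{ " + " ".join(p) + " }" for p in parts))
for parts in splits:
    print("split:", "{ " + " | ".join(" ".join(p) for p in parts) + " }")
```

A merge here is a group under the new stems whose words came from more than one old group; a split is the reverse.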

There's only one motivating example for this change, and no feedback when I asked for it some time ago now.

We could start an exception list of stems to not remove stern from, but I'm not convinced they're common enough to really justify it. My current conclusion is not to try to address this one, but I'm open to discussion, especially if there are more examples which fall into a pattern.

@ojwb
Member

ojwb commented Oct 11, 2024

The change to remove -ers seems the wrong approach to me - the problem here is really that we're overstemming Förderer and Förderern - it would be better to stem those to forder rather than the current ford (which collides with a car brand), though I'm not sure how practical that is to resolve satisfactorily. Fighting overstemming with more overstemming seems problematic though.

@OlgaGuselnikova Do you have more examples of cases that the -ers change helps?
