Hello! I hope this is the correct place to report this sort of thing. I am a librarian and work with Koha (the library system) and Elasticsearch and we use snowball_french as the stemmer for the search engine.
My go-to search is "chat" (means "cat", and as you can see from my avatar, I like cats 😸) and I noticed that it returns items titled "châtiment". "Chat" should not be the stem for "châtiment", it should be "châti". All the conjugations of the verb "châtier" should also stem to "châti".
But it isn't! Currently "châtiment" is stemmed to "chât" (with an accent on the "a"). If Elasticsearch or Koha is stripping accents from Snowball's stems, then that part is a problem with whichever is doing the stripping, not a problem with Snowball.
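To illustrate the point about accent stripping, here is a minimal sketch in Python. It does not call Snowball, Elasticsearch, or Koha: the stem "chât" is hard-coded from the result reported above, and `strip_accents` is a hypothetical stand-in for whatever accent-folding step might run after stemming in the search pipeline.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition.

    Hypothetical stand-in for an accent-folding step applied after
    stemming; not the actual Elasticsearch or Koha implementation.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Stem reported in this thread for "châtiment"; hard-coded, not computed here.
stem_of_chatiment = "chât"
query_term = "chat"

# The stem as Snowball produces it does not collide with the query term.
print(stem_of_chatiment == query_term)

# Once accents are folded away, the two become indistinguishable.
print(strip_accents(stem_of_chatiment) == query_term)
```

If something like this happens downstream of the stemmer, a search for "chat" would match "châtiment" even though the two stems differ.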
> it should be "châti"
It looks like we currently map forms of "châtier" to either "chât" or "châti", at least mostly. I didn't test all the forms on the page you linked to, but e.g. "châtions" is stemmed to "châtion", which I think is probably the issue noted in https://snowballstem.org/algorithms/romance.html:
_"In French the verb endings ent and ons cannot be removed without unacceptable overstemming. The ons form is rarer, but ent forms are quite common, and will appear regularly throughout a stemmed vocabulary."_
Nothing unrelated seems to stem to either "chât" or "châti" (unless the output is further mangled, but that's outside our control), so we're just missing out on the opportunity to conflate some forms of a word here, which is pretty much inevitable for an algorithmic stemmer for a human language. Conflating only some forms is still going to be an improvement over not stemming at all.
We may be able to do better. If this only affects one verb it's probably not worth the complication, but if other verbs conjugate like "châtier" we can try to tweak the rules to handle them better without negatively affecting other cases.