Handle x plural forms for French #91
Comments
It’s hard to stem such words properly without overstemming words like roux and faux. Do you have any suggestion for how to do it? |
Maybe this needs a dictionary solution rather than an algorithmic one! Only seven For For words ending in |
Hello! How can we move this forward? |
Hello, I have an issue with the adjective 'français' which has no masculine singular form while the feminine form has. The stemming result using PostgreSQL 11 is the following: 'français' -> 'franc'. I would expect them all to be 'franc' or 'français'. I guess that a dictionary approach could help in such a case, as with the issue started here. Or would you prefer that I open a separate issue to track mine? Note: on PostgreSQL I can solve the issue by creating my own synonym dictionary and changing the text search configuration to use this dictionary before the Snowball stemmer, but I believe it makes more sense to have this fixed on the Snowball side. |
I think
Oh so this means I could solve the issue with jeux, hiboux, etc? |
Yes, you could create a synonym dictionary and have it used in the pipeline before french_stem. I recommend reading the PostgreSQL documentation on text search dictionaries for how to do that. The documentation is also available in French ;) But since your issue and mine are so basic as far as French is concerned, I would much prefer to have it solved here. |
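For readers who want to see the shape of that workaround, here is a minimal Python sketch of a lookup that runs before the stemmer. It is not PostgreSQL configuration and not Snowball code: `SYNONYMS`, `placeholder_stem` and `normalize` are hypothetical names, the table entries are illustrative rather than a vetted list, and the stand-in stemmer only strips a final "s".

```python
# Hypothetical synonym table: irregular plural -> singular.
# These entries are illustrative only, not a reviewed exception list.
SYNONYMS = {
    "jeux": "jeu",
    "hiboux": "hibou",
    "choux": "chou",
}


def placeholder_stem(word: str) -> str:
    """Stand-in for a real stemmer (assumption): strips a final "s" only."""
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def normalize(token: str) -> str:
    """Consult the synonym table first; fall through to the stemmer otherwise."""
    word = token.lower()
    if word in SYNONYMS:
        # A table hit ends the pipeline; the stemmer never sees the word.
        return SYNONYMS[word]
    return placeholder_stem(word)


if __name__ == "__main__":
    for token in ["Hiboux", "roux", "chats"]:
        print(token, "->", normalize(token))
```

Words that are not in the table, such as "roux", reach the stemmer unchanged, so the lookup cannot overstem them.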
Please don't hijack a ticket with unrelated issues - open your own ticket if you have something to report. |
ping! how can I help solve this? |
The purpose of these stemmers is for use with Information Retrieval ("text search" in less formal terms) - stemming words essentially allows conflating different forms of the "same" word, which usually improves recall (as in https://en.wikipedia.org/wiki/Precision_and_recall). The risk is it tends to reduce precision - that's somewhat inherent as different forms of a word can convey subtle differences in meaning, but the more problematic cases are where forms of different words get conflated. For example, the original English Porter stemmer ("porter" in snowball) stemmed both "skies" and "skis" to "ski", so a search for "ski" would find a document which wasn't connected to skiing at all but actually about "skies" - that's much worse than if it didn't stem "skies" at all. (The snowball "english" stemmer addresses this and stems "skies" to "sky".) A stemmer for a human language is probably inevitably going to be imperfect, but it's good to keep in mind the end goal is improving retrieval rather than reducing every word to its linguistic root. So to "solve" this we need a concrete plan for how to stem plural nouns which end with "x" without adversely affecting other words which end with "x". It may be there just isn't a sensible way to do that, but then not stemming such words is a decent status quo - using a stemmer which doesn't stem such words is no worse than not using a stemmer at all. |
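The "skies"/"skis" example can be made concrete with a toy search, sketched below under stated assumptions. The stem tables are hand-written lookups rather than calls to any real stemmer: the "porter"-like entries and the "skies" -> "sky" entry follow the comment above, while "skis" -> "ski" in the second table is assumed for the demo. `DOCS` and `search` are hypothetical.

```python
# Hand-written stem tables; no real stemmer is invoked here.
PORTER_LIKE = {"ski": "ski", "skis": "ski", "skies": "ski"}
# "skies" -> "sky" is from the thread; "skis" -> "ski" is an assumption.
ENGLISH_LIKE = {"ski": "ski", "skis": "ski", "skies": "sky"}

# Two tiny hypothetical documents, already tokenized.
DOCS = {
    "doc_skiing": ["new", "skis", "for", "winter"],
    "doc_weather": ["clear", "skies", "tonight"],
}


def search(query, stems):
    """Return sorted ids of documents that share a stem with the query."""
    target = stems.get(query, query)
    return sorted(
        doc_id
        for doc_id, words in DOCS.items()
        if any(stems.get(word, word) == target for word in words)
    )


if __name__ == "__main__":
    print(search("ski", PORTER_LIKE))   # both documents match
    print(search("ski", ENGLISH_LIKE))  # only the skiing document matches
```

With the first table the weather document is returned for a "ski" query, which is the loss of precision described above; with the second table recall for "skis" is kept without that false match.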
The |
There isn't currently such an exception list, though you obviously could add one. But the first thing we need is to come up with a proposed list of such replacements, and carefully check it for potential problems - for example, "baux" -> "bail" seems problematic, as "baux" is also the plural of "bau" (https://en.wiktionary.org/wiki/baux#French). Then we need a patch that implements those. And also such exceptions should all be added to the test vocabulary if they aren't already in it, so that we have good test coverage for the change. And finally the algorithm description on the website needs updating to match. |
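The review steps above could be sketched roughly as follows. This is a hypothetical Python helper, not part of the Snowball codebase: `CANDIDATES`, `split_candidates` and `missing_from_vocabulary` are invented names, and the candidate entries are examples from this thread rather than a checked list.

```python
# Hypothetical candidates: plural form -> every singular it could belong to.
CANDIDATES = {
    "jeux": {"jeu"},
    "hiboux": {"hibou"},
    "choux": {"chou"},
    "baux": {"bail", "bau"},  # ambiguous, as noted in the comment above
}


def split_candidates(candidates):
    """Separate unambiguous replacements from ambiguous ones."""
    accepted = {}
    rejected = []
    for plural, singulars in sorted(candidates.items()):
        if len(singulars) == 1:
            accepted[plural] = next(iter(singulars))
        else:
            rejected.append(plural)
    return accepted, rejected


def missing_from_vocabulary(accepted, vocabulary):
    """List words touched by the exception list that the test vocabulary lacks."""
    needed = set(accepted) | set(accepted.values())
    return sorted(needed - set(vocabulary))


if __name__ == "__main__":
    accepted, rejected = split_candidates(CANDIDATES)
    print("accepted:", accepted)
    print("needs discussion:", rejected)
    # A made-up three-word vocabulary, just to show the coverage check.
    print("add to vocabulary:", missing_from_vocabulary(accepted, ["jeu", "jeux", "chou"]))
```

Ambiguous plurals are set aside for discussion instead of being mapped, which matches the preference for understemming over overstemming.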
@bkazez I already outlined what's needed in a comment just above yours. |
«We need a patch» doesn’t provide much guidance for people who don’t know the codebase 🙂 |
I didn't just say "we need a patch" though. The first thing we need is a workable plan for the change we're going to make. This needs to consider all the effects of the proposed change, to make sure we aren't making things worse for other cases (or if we are, that such unwanted consequences are definitely outweighed by the benefits). The patch itself is literally an implementation detail. But I should note that these algorithms are intended for use in text search systems, where stemming is a common way to improve recall. For the intended use, over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. (If you want to always reduce words to a root form then Snowball's stemming algorithms likely aren't the right answer as that isn't the purpose they are designed for.) |
I have never contributed to Snowball. I'm not a native French speaker, but I know some and am happy to start the exception list. Where do you store such a list for remote collaboration? Also, should I start such a list for #139 too or is that one different? |
Apologies, you did provide a list of steps to be taken. PostgreSQL's full-text search system relies on this stemmer, and defining a custom dictionary or lexer is not possible in environments where you don't have full control of the database server, so I still think this would be best fixed here. Again, I would limit the scope to a very small list of well-defined irregular plurals, to avoid overreaching and introducing regressions. But I believe it is not controversial to want
I am not a C programmer so I could try but without guarantee of success.
This is such a rare word that a false positive does not worry me (but my use cases tend to favor more results over fewer, more exact matches). I will consult some dictionaries and propose a list!
I wouldn’t want to change code without having tests!
Sure. I found the file to be edited at https://github.com/snowballstem/snowball-website/blob/master/algorithms/french/stemmer.tt ; I'll just need someone to validate the results, as I don't have Java installed and the repo doesn't seem to have CI that builds a preview. |
@merwok Did you get anywhere with this?
Indeed - it may be fine for your case, but we do need to consider that a change will affect everyone using the stemmer, so we need to think about whether it's problematic for someone's situation. My view on this case is that it's better to just leave "baux" alone. That follows the general principle that overstemming is much more harmful than understemming, and it leaves behaviour the same as with the existing stemmer code (and the same as if you didn't use stemming at all). It seems "baux" is going to be a fairly rare word whether it's the plural of "bau" or "bail". |
I haven’t made a list yet, but am still interested in helping to improve this! Agreed on leaving |
Hello! I hope this is the right place to report a problem I found with an app that uses PostgreSQL 10's full-text search.
There is a class of French nouns that form their plural in x: jeux, hiboux, choux, aulx, baux, etc. Testing with PG and reading the doc at https://snowballstem.org/algorithms/french/stemmer.html make me think that these are not handled.