Is it normal that comparatives and superlatives are not stemmed? #172
I've moved this ticket because the question is really about the code in the snowball repo (pystemmer is just a thin wrapper layer on top of it).

This point doesn't seem to be explicitly covered in the algorithm documentation on the website, but I think these endings aren't handled because the obvious rules for them would also trigger in cases where they'd be harmful. In the intended domain of use (generating index terms for information retrieval), overstemming is much more problematic than understemming, at least when it causes collisions between unrelated words, so we tend to err on the side of understemming in such cases.

For example, … There is actually a rule to remove an … If we were to add …
I had a look for past discussion of this and found Martin posted about -est in https://lists.tartarus.org/pipermail/snowball-discuss/2003-December/000548.html (just under 20 years ago!):
Similar comments also from Martin slightly more recently in https://lists.tartarus.org/pipermail/snowball-discuss/2009-November/001137.html :
It's true that these comparative and superlative endings are only used on shorter adjectives, and it seems impossible to implement a rule for those that isn't just a huge list of such adjectives. However, there are a number of cases where this "short" overlaps with Snowball's R2, so removing the endings there still seems worth considering, even though it only addresses a minority of cases. Here's an analysis of the changes for the sample vocabulary for removing -est in R2 (adding …
Looking at this again, I'm not so convinced that's the right conclusion: we really don't want to remove … I looked at the slightly more restricted change of removing …
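For readers unfamiliar with R2, here is a minimal Python sketch of what "removing -est in R2" could look like. It is not the Snowball implementation: it uses the general R1/R2 definition (the region after the first non-vowel following a vowel, applied twice), treats every `y` as a vowel, and ignores the English stemmer's special-case prefixes. The function names and the sample words are hypothetical and chosen only for illustration.

```python
VOWELS = set("aeiouy")  # simplification: every "y" is treated as a vowel


def region_start(word, start=0):
    """Return the index just past the first non-vowel that follows a vowel,
    scanning from `start`. Returns len(word) (an empty region) if none exists."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)


def r1_r2(word):
    """Start offsets of R1 and R2; R2 is the same rule applied inside R1."""
    r1 = region_start(word)
    r2 = region_start(word, r1)
    return r1, r2


def strip_est_naive(word):
    """The 'obvious' rule: drop a trailing -est wherever it occurs."""
    if word.endswith("est") and len(word) > 3:
        return word[:-3]
    return word


def strip_est_in_r2(word):
    """Hypothetical restricted rule: drop -est only if the suffix lies wholly in R2."""
    _, r2 = r1_r2(word)
    cut = len(word) - len("est")
    if word.endswith("est") and cut >= r2:
        return word[:cut]
    return word


if __name__ == "__main__":
    # Illustrative words only; they are not taken from the sample vocabulary.
    for w in ["cleverest", "biggest", "forest", "interest"]:
        print(w, strip_est_naive(w), strip_est_in_r2(w))
```

In this sketch the R2 restriction leaves `forest` and `biggest` alone and turns `cleverest` into `clever`, but it still turns `interest` into `inter`. So even the restricted rule can conflate unrelated words, at least under these simplified region definitions.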