Merge Czech stemmer #151
base: master
do (
    gopast non-v setmark pV
    gopast non-v gopast v setmark p1
)
This isn't the usual Snowball definition of R1 (https://snowballstem.org/texts/r1r2.html) - I tried changing this to keep RV as above but make R1 match the usual definition:
do ( gopast non-v setmark pV )
do ( gopast v gopast non-v setmark p1 )
I also made a file with all the example Czech words mentioned in the paper, plus examples of all the noun and adjective declensions: czech-words.txt
Comparing the output of ./stemwords -l cs < czech-words.txt using this custom R1 definition vs the standard R1 definition, it seems pretty clear the standard R1 does a much better job for at least these cases.
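To make the difference between the two region definitions concrete, here is a rough Python sketch (not the Snowball code itself). It assumes the vowel set shown below and that p1 starts out at the end of the word, so a failed gopast leaves R1 empty; both are assumptions for illustration, and the function names are made up.

```python
# Assumed vowel set for illustration; the grouping used in the PR may differ.
VOWELS = set("aeiouyáéěíóúůý")


def gopast(word, pos, want_vowel):
    """Mimic `gopast v` / `gopast non-v`: return the index just after the
    first matching character at or after pos, or None if there is none."""
    for i in range(pos, len(word)):
        if (word[i] in VOWELS) == want_vowel:
            return i + 1
    return None


def standard_r1(word):
    """Usual R1: starts after the first non-vowel that follows a vowel."""
    pos = gopast(word, 0, True)
    if pos is not None:
        pos = gopast(word, pos, False)
    # Assumption: p1 defaults to the end of the word when the scan fails.
    return len(word) if pos is None else pos


def pr_r1(word):
    """R1 as in the PR snippet: gopast non-v, gopast non-v, gopast v."""
    pos = 0
    for want_vowel in (False, False, True):
        pos = gopast(word, pos, want_vowel)
        if pos is None:
            return len(word)
    return pos


if __name__ == "__main__":
    for w in ("beautiful", "vlnou"):
        print(w, repr(w[standard_r1(w):]), repr(w[pr_r1(w):]))
```

The standard definition reproduces the examples on the r1r2 page (e.g. R1 of "beautiful" is "iful"), while the PR's definition starts the region one or more characters later or earlier depending on the leading consonants.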
@jimregan Can you remember the origin of this?
No, I can't remember.
I think I may have worked out why - consider e.g. vlna (wool) which has declension vlna, vlny, vlne, vlnu, vlne, vlnou - the aim would be to conflate these to the same stem (likely vln). The normal Snowball R1 definition relies on vowels to measure the stem which doesn't work for these consonant clusters.
https://en.wikipedia.org/wiki/Czech_language#Consonants says:
The consonants /r/, /l/, and /m/ can be syllabic, acting as syllable nuclei in place of a vowel. Strč prst skrz krk ("Stick [your] finger through [your] throat") is a well-known Czech tongue twister using syllabic consonants but no vowels.
And https://en.wikipedia.org/wiki/Czech_phonology#Consonant_chart says:
Sonorants /r/, /l/ become syllabic between two consonants or after a consonant at the end of a word.
and https://en.wikipedia.org/wiki/Czech_phonology#Phonotactics:
The syllabic nucleus is usually formed by vowels or diphthongs, but in some cases syllabic sonorants (/r/ and /l/, rarely also /m/ and /n/) can be found in the nucleus, e.g. vlk [vl̩k] ('wolf'), krk [kr̩k] ('neck'), osm [osm̩] ('eight').
So I wonder if the R1 definition for Czech needs to consider r and l (and perhaps m and n?) preceded by a consonant and not followed by a vowel as effectively implying a vowel.
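A minimal Python sketch of that idea, for illustration only: it treats l or r as a syllable nucleus when the previous letter is not a vowel and the next letter (if any) is not a vowel, and then applies the usual R1 rule. The vowel set, the choice to handle only l and r, and the function names are assumptions, not the PR's code.

```python
# Assumed vowel set and syllabic consonants, for illustration only.
VOWELS = set("aeiouyáéěíóúůý")
SYLLABIC = set("lr")


def is_nucleus(word, i, syllabic=True):
    """True if word[i] is a vowel, or (optionally) an l/r that is preceded
    by a non-vowel and not followed by a vowel."""
    c = word[i]
    if c in VOWELS:
        return True
    if syllabic and c in SYLLABIC and i > 0 and word[i - 1] not in VOWELS:
        # word[i+1:i+2] is "" at the end of the word, which is not a vowel.
        return word[i + 1:i + 2] not in VOWELS
    return False


def r1_start(word, syllabic=True):
    """Index where R1 starts: just after the first non-nucleus letter that
    follows a nucleus, or len(word) if there is no such letter."""
    n = len(word)
    i = 0
    while i < n and not is_nucleus(word, i, syllabic):
        i += 1
    j = i + 1
    while j < n and is_nucleus(word, j, syllabic):
        j += 1
    return j + 1 if j < n else n


if __name__ == "__main__":
    for w in ("vlna", "vlnou", "slovo"):
        print(w, repr(w[r1_start(w, syllabic=False):]), repr(w[r1_start(w):]))
```

With vowels only, R1 of vlna is empty so no ending can be removed; with the syllabic rule R1 becomes "a". A word like slovo is unaffected because its l is followed by a vowel.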
@ojwb, @jimregan Hi guys, what is the state of this PR? Can we help with it somehow? Another implementation of a Czech Snowball stemmer can be found here: https://www.fit.vut.cz/research/product/133/.en (GNU GPL)
@jan-zajic It needs the points above resolving, but I think that's just a case of me finding the time to. I'm trying to clear the backlog of Snowball tickets, so hopefully soon. We couldn't really merge a GNU GPL stemmer as currently Snowball has a BSD-style licence - moving to a mixed licence situation would make things harder to understand and manage for users. From a quick look this other stemmer appears to use the usual R1 definition (which is good), but it is quite a lot more complex (which is bad unless it does a better job as a result). Do you know how it compares in effectiveness to the one in this PR? If it's better, do you know if the copyright holders might consider relicensing it for inclusion in Snowball releases?
Force-pushed from fcc1b03 to 3afee58.
I had a bit more of a look at the GPL snowball stemmer. I noticed the
However fundamentally we can't use this implementation without an agreement to relicense. The source download has
So I went back to looking at the Dolamic stemmer. Comparing the snowball implementation with the Java implementations http://members.unine.ch/jacques.savoy/clef/CzechStemmerLight.txt and http://members.unine.ch/jacques.savoy/clef/CzechStemmerAggressive.txt I spotted some inconsistencies (code snippets in the same snowball/light/aggressive order):
vs
vs
Note that the light stemmer comment says There's another inconsistency in
vs
vs
Here the comments
I tried changing the first case in the snowball code and the differences look plausible but unfortunately I don't know the Czech language to a useful extent. I didn't try the second case yet. @jimregan @jan-zajic Any thoughts?
Here's a scripted analysis of the effects of the various changes to palatalise I covered above:
There's one remaining inconsistency I've spotted, this one's in Here the light stemmer removes The older version of the light stemmer listed in the original paper removes all four suffixes. Changing to removing all 4 gives:
The Java code removes this ending but it was missing from the Snowball version. Looking at the changes resulting from this, it seems a clear improvement so I've concluded it was an accidental omission. See snowballstem#151
Three more notes: Comparing the code I noticed that I also noticed that there's a bug in the Java versions in one group of palatalise rules:
Here we check The final thing I noticed is that the Snowball version applies the palatalise step rather differently to the Java versions. E.g. consider
This changes
Almost every case is handled like this in snowball, except for
The
That at least makes things more similar, but fundamentally it seems the palatalise step in snowball will be much less effective as the final character will often have already been removed. The code in the paper (which seems to be pseudo-code for an earlier version of the light stemmer) removes the vowels like the snowball version does, then unconditionally performs
(This also may mean that the conclusions in the paper about the light vs aggressive stemmers may not entirely apply to the Java versions we have access to, but in the absence of a comparison of the Java versions going with the light stemmer still seems sensible.)
A further difference is that in the snowball implementation if
It looks like this could be a deliberate change, as the snowball code does
However, the cursor doesn't get reset before
We can fix just the latter with
Any progress on this issue? As we understand it, there is some kind of analysis comparing two implementations -- one of which cannot be used anyway because of licensing and there are some tradeoffs on both sides? Maybe the original (simpler?) contributed algorithm (with an acceptable license) is good enough? Can we somehow help to move this forward? I reviewed the issues above and at this moment they are too technical for me (I'm not familiar with the stemming problem domain), but maybe I could provide feedback on something as a Czech speaker.
Progress stalled on needing input from someone who knows Czech reasonably well. I thought I'd found someone who could help (this was probably late 2023/early 2024) but they never got back to me and I failed to chase it up. If you're a Czech speaker and wanting to get this resolved, that would definitely be useful.
There is a GPL implementation of a different algorithm mentioned above, which indeed would need relicensing as Snowball uses a 3-clause BSD licence. That one would also need to be rewritten in Snowball as well as relicensed. However the comparisons are against a Java implementation that's meant to be of the same algorithm (and this Java implementation is 2-clause BSD so compatible, see: http://members.unine.ch/jacques.savoy/clef/).
We don't want to just merge something with unresolved issues because that's likely to need significant changes later, and those are disruptive for typical users of these stemmers (because you need to rebuild your whole search database).
I'll need to review the discussion as it's been 9 months, but I think we should be able to resolve this together.
Ok thanks for clarification. Count me in if you need help.
@hauktoma Great. There are a few points to resolve, so I'll cover one at a time.
The first question is really about syllables in Czech. I'll try to give some background to what we're doing and why. If you don't follow please say and I can clarify. (I'm also happy to do this on chat or a video or phone call if you think it would be easier to do it interactively.)
We want to avoid the stemming algorithm removing suffixes too aggressively and mapping words to the same stem which aren't actually related (or are somewhat related but really have too different a meaning). Most of the Snowball stemmers make use of a simple idea to help with this, which is to define regions at the end of the word from which a suffix can be removed. For most languages these are defined by counting the number of spans of vowel, then of non-vowel, etc - https://snowballstem.org/texts/r1r2.html shows some examples. As well as R1 and R2 there's also an RV for some languages which that page doesn't mention.
This is essentially approximating counting syllables, while the original Czech stemming algorithm this implementation is based on used a cruder character-counting approach instead. In his original Snowball implementation jimoregan essentially retrofitted use of R1 and RV which I think was a good idea.
However it seems in Czech that clusters of just consonants can form a syllable, so probably our R1 and RV definitions for Czech ought to take that into account. See my comment above for what led me to this conclusion, but the key point is this quote:
And the actual question is for the purposes of determining these regions, should we consider
And if so, should
To be honest I am not entirely sure about the idea handling the I'll try to sum the points up here and then provide examples at the end:
My betting/statistical impression is that implementing this may have more of a negative effect than a positive one. Especially for the
@ojwb can you please review my reasoning about this and provide feedback whether it is correct? If you think this may be worth a bit more investigating or that the examples provided below are not good enough to make a decision, I can try to consult some colleagues or dig up some more formal materials about this.
@ojwb maybe one quick question and clarification: you mentioned R1, which means that by default the stem approximation default algorithm for language (unless specified otherwise by knowing language and implementing it differently) is to remove one suffix? R2 means remove two suffixes? Can the number of suffixes removed be variable under certain conditions? What is the setting/strategy for Czech (R1 or R2) and where did it come from?
Note: I have no problem with discussing this real-time on some call but maybe keep it as an option for when we hit a wall on something or some complex clarification is needed. As a total layman in stemming/linguistics I am not sure if I would be able to have a real-time conversation on this topic. But if you get the feeling that explaining something would be too much trouble in written/async form, let's do it.
Example of more complex word for
Hi @ojwb, @hauktoma, The current discussion in this thread is beyond my time and expertise, so I decided to try to contact and find experts from
I will try to reach people who could help more with this topic and I will let you know how it turns out. I think that if there is support for the Czech language in Snowball, it must be done as well as possible, since the impact will be great on a large number of open source projects and the solutions built on top of them.
Thanks. I need to work through this in detail, but a couple of notes:
I think we'd probably just do something like work left to right (or perhaps right to left if that turns out to work better) and if a consonant is determined to be a syllabic consonant then it would not be regarded as a consonant for the letter which follows.
No, they're just different regions, and the region which is appropriate for each suffix is chosen based on considering the language's structure, and also empirically what seems to work better. It's typically better to lean towards being conservative in when to remove since overstemming is more problematic than understemming.
There are often conditions on whether a particular suffix is removed, and there's often an order suffixes are considered in, so removing one suffix may expose another that can then be removed too. I think jimregan came up with the current region setting for Czech, presumably based on the Java implementation's cruder character counts.
I think trying to resolve some of the simpler points above will help us resolve the others, as they're somewhat interconnected (if nothing else it'll be some progress!).
I tried comparing the CzechStemmerLight java stemmer as downloaded and with this fix applied:
I've compiled a list of things to resolve at the top of the ticket.
Testing strongly shows
Changing the Snowball implementation makes no difference here (probably due to the oddness around when to remove a character vs calling
I noticed another oddity in This version leaves the first character of a removed suffix behind when calling
This means
Testing changing this to handle these suffixes like others where we call
To check there weren't any further discrepancies between the Java and Snowball versions, I tried adjusting the Snowball version to use the same stem-length checks as the Java code (with the various fixes) instead of R1 and RV:
Doing this, I found we can split palatalise to simplify things. The main point of note though is
instead of
The difference is that the former will remove the longest of the suffixes that is in R1, while the latter will find the longest of the suffixes and only remove it if it is in R1 (e.g. Need to actually test which works better, but the former is what the Java code does. Update: Testing show
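The two behaviours described here (listed in the points to resolve as setlimit tomark p1 for ([substring]) vs [substring] R1) can be sketched in Python roughly as follows. This is only an illustration: the function names, the made-up word and the suffix list are hypothetical, and r1 is the index where R1 starts.

```python
def remove_longest_within_r1(word, r1, suffixes):
    """Look only inside R1 and remove the longest suffix found there, so a
    shorter suffix can still match when a longer one starts before R1."""
    region = word[r1:]
    for suffix in sorted(suffixes, key=len, reverse=True):
        if region.endswith(suffix):
            return word[:len(word) - len(suffix)]
    return word


def remove_longest_if_in_r1(word, r1, suffixes):
    """Find the longest suffix of the whole word, then remove it only if it
    lies entirely within R1; otherwise leave the word unchanged."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix):
            if len(word) - len(suffix) >= r1:
                return word[:len(word) - len(suffix)]
            return word
    return word


if __name__ == "__main__":
    # Made-up word and suffixes, chosen only to show where the two differ.
    print(remove_longest_within_r1("abcde", 3, ["cde", "de"]))  # abc
    print(remove_longest_if_in_r1("abcde", 3, ["cde", "de"]))   # abcde
```

The two only disagree when the longest matching suffix starts before R1 but a shorter one fits inside it.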
I've been looking at using the palatalise approach from the previous comment with R1 based on vowels. It causes a lot of changes, the vast majority for the better:
Based on the above, it seems clear we should adjust palatalise in this way, but then take a look at the splits and see if we can eliminate most of them.
This has been on the web site since 2012, but never actually got included in the code distribution.
This helps avoid overstemming.
Co-authored-by: Jim O’Regan
The "aggressive" version is known to overstem. According to the original paper, the aggressive version performs slightly better, but the difference isn't statistically significant and conflation from overstemming can be problematic.
Co-authored-by: Jim O’Regan
The Java implementation removes the latter but has incorrect comments saying it removes the former. Changing the Snowball implementation makes no difference here (probably due to the oddness around when to remove a character vs calling do_palatalise) but changing Java to use the Snowball suffixes here leads to a clear regression, so adjust the Snowball implementation to match Java implementation.
This case was inconsistent with all the other cases where we call palatalise as we remove the whole suffix here but leave the first character in every other case. Checking the vocabulary list, this means palatalise will almost never match one of the suffixes, as the only words with this as an ending in the list are these, which look like they're actually English words (except "abies"): abies cookies hippies series studies. This means palatalise will just remove the last character, which seems odd. This change changes a lot of stems but seems to be an improvement in pretty much every instance I checked in Google Translate.
There are two issues here: One seems clearly unintentional, which is that the cursor position from do_case wasn't reset. The other is that do_possessive was only called if do_case did something which does not match the Java implementation. It seems likely this was not intended, and testing suggests it's not a helpful change.
For the test vocabulary, this results in 1877 merges of groups of stems (all seem reasonable), 427 splits (all seem unhelpful) and 300 reshufflings of stems between existing groups (all seem neutral). Overall this seems a very clear improvement, but we should see if we can address the splits.
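For anyone wanting to reproduce this kind of count, here is one possible way to classify changes between two stemmer versions in Python. It is a sketch under assumptions, not the script used for the numbers above: it assumes a "merge" is a new group that is exactly the union of two or more old groups, a "split" is the reverse, and any other changed group is a reshuffle. The function names and the toy stemmers are made up.

```python
from collections import defaultdict


def stem_groups(words, stem):
    """Partition words into groups that share a stem."""
    by_stem = defaultdict(set)
    for word in words:
        by_stem[stem(word)].add(word)
    return {frozenset(group) for group in by_stem.values()}


def is_union(group, candidates):
    """True if group is exactly the union of two or more candidate groups."""
    pieces = [c for c in candidates if c <= group]
    return len(pieces) >= 2 and frozenset().union(*pieces) == group


def compare_stemmers(words, old_stem, new_stem):
    """Count merges, splits and other changed groups (reshuffles)."""
    old = stem_groups(words, old_stem)
    new = stem_groups(words, new_stem)
    old_only, new_only = old - new, new - old
    merged = [g for g in new_only if is_union(g, old_only)]
    split = [g for g in old_only if is_union(g, new_only)]
    reshuffled = [
        g for g in new_only
        if g not in merged and not any(g <= s for s in split)
    ]
    return {"merges": len(merged), "splits": len(split),
            "reshuffles": len(reshuffled)}


if __name__ == "__main__":
    vocab = ["vlna", "vlny", "vlnu", "deska", "desky", "dešti"]
    # Toy stemmers: identity vs dropping the final character.
    print(compare_stemmers(vocab, lambda w: w, lambda w: w[:-1]))
```

Each group then still needs a manual look to judge whether the merge or split is reasonable, which is the part that needs knowledge of Czech.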
In order to try to better understand this I compared the suffixes with those listed at https://en.wikipedia.org/wiki/Czech_declension (which I'd expect to be a reliable source for something like this, but if there's a better one please point me at it). Suffixes we remove but which wikipedia's list doesn't seem to support:
I could perhaps believe There are also two suffixes we don't remove but wikipedia lists:
@hauktoma Can you help resolve any of these?
Perhaps tired and/or old eyes mistaking the accent for a simple dot when reading a declension list is a plausible explanation though...
BTW if it's useful there's a list of 58133 words in
I did a quick grep and the suffixes that don't appear in the wikipedia list all seem to be pretty rare - most common is 44 for
I think it might be useful to hammer out the last few details, but let's see. Don't worry too much about not having formal linguistic training - these stemmers are ultimately meant to be practical aids to information retrieval rather than exercises in linguistics. Understanding the grammar/suffix structure of the language is useful to inform the design, but if you speak it natively you should have that (though that knowledge may be rather implicit in your mind so you might need to think about it more than you usually do). I'll put where I think R1 would start (marked with
Without
Note that R1 only defines a region within which suffixes can be removed, not the cut point to remove anything after. So in this case R1 is indeed at
R1 would be
If
If
Yes,
The light stemmer wouldn't remove a suffix regardless for either of these. The aggressive stemmer would remove
Without syllabic consonant handling we remove
So I don't think there's anything very compelling either way for whether to treat
Additionally: I tried enforcing a minimum length of 3 characters before the start of R1 (which the German, Danish and Dutch algorithms have) in addition to the special handling of
@hauktoma It occurred to me to simply try removing each of the suspect suffixes and see what the
Dropping
Dropping
Looking up these words, they do seem to indeed be examples of these two suffixes and
Dropping
Dropping Dropping Dropping
Dropping
I think this at least resolves that
The sample vocabulary might be too small though - it's 58133 words, which was all words which occurred at least 100 times in Czech wikipedia on 2021-08-21 - most Czech words have a lot of different forms so that might be too small a list. I could generate a larger one by using a lower threshold and/or a more recent wikipedia dump as it's likely grown a bit in 3 years. Or if someone knows of a suitably licensed Czech word list we could use that instead (or merge with the existing list).
Looking at the sample vocabulary, these are the entries which end
As best I can make out, the appropriate stem for the first is These are the entries which end
With the current R1 definition, we leave
Statistical tools
I tried to review the above and got a feeling that guessing the language rules (and tradeoffs) might not be the optimal approach. It seems to me that since we are taking the algorithmic approach, there will be tradeoffs, and IMO I would bet on some kind of statistical evidence to evaluate the tradeoffs more than on native-language speaking skills. I also noticed the following comment that seems to point in a similar direction:
So I tried to dig for something and maybe stumbled upon something useful. @ojwb can you please check the links below to see if there is something there you think might be useful to us? Licenses might be good at least for analysis (https://creativecommons.org/licenses/by/4.0/ for corpora). Chances are that some of the tools below will provide means to enhance your experimenting workflow significantly. I'll try to dig further into whether they will be usable on some of the problems above, e.g. blacklisting particular suffixes or
Note: all of the websites below seem to be switchable to a native English variant, so they should be approachable.
Czech national corpus applications
https://www.korpus.cz/ and especially https://www.korpus.cz/apps
This is some kind of Czech academic project that provides multiple applications for analyzing language statistically. There are at least 10 different online apps that specialize in different use-cases.
Czech national corpus raw data
https://wiki.korpus.cz/doku.php/en:cnk:uvod
These are Czech text corpora, the largest has 935M Czech words (although it is from 2013). The recent ones have e.g. 100M words.
MorfFlex Czech morphological dictionary
https://ufal.mff.cuni.cz/morfflex
This is some kind of dictionary that consists of a list of "lemma-tag-wordform" entries and it should somehow contain declension metadata/relations. Seems powerful, but it seems it will be hard to use initially.
Note on diacritics
Regarding the following:
I'm not 100% sure about this, but from the perspective of the real use-case of fulltext search (e.g. doing some searches using something like https://www.elastic.co/elasticsearch and having snowball set there as the cz stemmer), I would say that the stemmer should probably not consider diacritics and should work without them at all times. The reason is that the input to be stemmed will come from the user (the user will type something into some kind of search box) and I would bet that a significant amount of that text will not contain any diacritics. I would say that diacritics are used for proper text and formal communication, but for informal communication (mails, messengers and similar) or practical use (googling something), a Czech person will not bother with diacritics.
The counter-argument to this might be that the functionality of diacritics-suffix removal would be purposefully applied only in the case when the user intentionally uses it, e.g.:
Or maybe analysis is needed to check whether enabling the suffix removal in both forms (with and without diacritics) will break something significant, and do this only for suffixes where it is safe.
I did a quick check by extracting sets of words which differ only by diacritics and pasted some of them into google translate - the vast majority of these sets appear to have the same (or similar enough) meanings, so that's promising. That doesn't take into account the interaction with stemming. I think it makes sense to try to resolve most of the remaining points and come back to this once the stemming rules are mostly finalised.
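A check like this can be sketched in Python with the standard unicodedata module; this is an illustration rather than the script actually used, and the function names are made up.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word):
    """Decompose to NFD and drop combining marks (caron, acute, ring)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_variant_sets(words):
    """Return sorted sets of words that differ only by diacritics."""
    by_plain = defaultdict(set)
    for word in words:
        by_plain[strip_diacritics(word)].add(word)
    return [sorted(group) for group in by_plain.values() if len(group) > 1]


if __name__ == "__main__":
    print(diacritic_variant_sets(["vlne", "vlně", "vlna", "dešti"]))
```

Each resulting set can then be checked by hand (or via a translation tool) to see whether the variants mean the same thing.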
Thanks, will take a look. CC licences are fine for test data so long as they aren't the NC (non-commercial) or ND (no derivatives) variants.
Testing seems to show this was never helpful and sometimes harmful.
-es seems to be a valid suffix (e.g. diabetes) but there seem to be more cases where it is harmful to remove. -ich seems to only be a suffix for two pronouns. -iho doesn't seem to be a valid suffix and removing it makes no difference on the test vocabulary.
We wouldn't currently stem
There's also
(And
Overall it seems
This is a valid Czech suffix and removing it seems beneficial (88 cases in the sample vocabulary, all seem to be improvements).
Use a definition of R1 more like the usual Snowball one, but take syllabic consonants 'l' and 'r' into account. It seems 'm' and 'n' can also be syllabic consonants but are much rarer so we ignore these for now at least. Testing suggests enforcing a minimum of 3 characters before R1 (like the Danish, Dutch and German stemmers do) helps so we do that here too. See snowballstem#151
We can just handle the first character specially - after that we know the previous character is a consonant because otherwise we'd have already stopped. See snowballstem#151
I've merged the new R1 with
Adding
There seems no benefit from having a separate region we can remove possessive suffixes in. See snowballstem#151
@hauktoma I'm curious about mzda - it seems the natural stem would be "mzd" but there's no vowel or syllabic consonant in the first 3 characters so R1 gets set to start at the end of the word and so no suffix can be removed. I'm not very familiar with IPA - my reading of the pronunciation in wiktionary is there's stress on the "m" but how many syllables would you pronounce this word as? Not stemming this single isolated word is not a big problem in itself, but if it indicated a problem with our R1 definition there could perhaps be many more cases hiding. I tried to grep to find more and also found "sklo" but that was only looking for ones which had a
[palatalise change]
Changes since have fixed 7 of these 17 splits, leaving 10 of which I don't see any pattern here we could exploit and this doesn't affect very many cases anyway.
I noticed this in https://en.wikipedia.org/wiki/Czech_declension#Nouns:
It'd be good to handle these cases as they seem fairly common (617 in the sample vocabulary though my incomplete checking suggests a small number are actually better as-is). The obvious approach is to add a rule to remove 'e' for words where we don't remove a case ending which end 'e' and this does help fold in a lot of cases. Unfortunately it also splits out a lot of cases where the stem just happens to end in this way, not due to a "floating e" - 569 splits - I didn't spot any which were improvements, but a few may be. Overall it seems the gains and losses are comparable, but perhaps there's some way to do it (maybe restricted to a subset of cases) such that it's worth doing. I can't see any pattern to distinguish the floating 'e' cases though. In my test I added this to the
We're working backwards here, so that requires the word to end in a non-vowel, before that an 'e', both must be in R1, and before them must be another non-vowel - if that's all true we delete the 'e'. My logic for requiring the 'e' to be in R1 is that the stem should presumably have contained a vowel or a syllabic consonant before the floating 'e' was added, and checking we're in R1 achieves that - swapping
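The rule as described can be written out in Python like this. It is only a sketch of the logic, not the Snowball code from the test: the vowel set and function name are assumptions, and r1 is the index where R1 starts.

```python
# Assumed vowel set for illustration.
VOWELS = set("aeiouyáéěíóúůý")


def drop_floating_e(word, r1):
    """Delete an 'e' that sits between two non-vowels at the end of the
    word, provided the 'e' (and so the final letter too) is inside R1."""
    if len(word) < 3:
        return word
    before, e, last = word[-3], word[-2], word[-1]
    if last in VOWELS or e != "e" or before in VOWELS:
        return word
    if len(word) - 2 < r1:
        # The 'e' starts before R1, so leave the word alone.
        return word
    return word[:-2] + last


if __name__ == "__main__":
    # With the usual vowel-based R1, R1 of "zámek" starts at index 3.
    print(drop_floating_e("zámek", 3))  # zámk
    # "pes" has an empty R1 (it starts at index 3), so nothing is removed.
    print(drop_floating_e("pes", 3))    # pes
```

As noted above, nothing in this check can tell a genuine floating 'e' apart from a stem that just happens to end this way, which is where the splits come from.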
Looked into
However if we're aiming to handle queries without diacritics better, perhaps we should remove
I've also looked at making the stemmer strip diacritics as a first step with the rules adjusted suitably, but it definitely seems problematic - I think we probably want to try diacritic-free versions of suffixes where removing them doesn't seem problematic, which will at least handle some cases where the user omits diacritics from their query (or where text being indexed lacks them, which might be the case for more datasets of more informal text).
Looking at conditions on palatalise, if we require the suffixes
Both merges seem better as does the first split; other splits seem less good, so on this wordlist that's 3 better, 4 worse, but the first split fixes an unwanted conflation which is arguably worth more - still not a huge improvement. Comparing
Requiring the other palatalise rules to be in R1 changes more cases - including the above is 337 words change stem, 80 not interesting, 65 merges, 40 splits, 77 words move between stem groups. Overall this seems about neutral too. I also looked at requiring either/both to be in a region starting one character before R1 but that doesn't help.
This has been on the web site since 2012, but never actually got
included in the code distribution.
Points to resolve:
* č suffix in snowball vs če in Java (Snowball seems to have copied -č typo in Java comment)
* čtí/ští in Java vs čté/šté in Snowball (again seems to be due to Java comment typo)
* len - 2 instead of len - 3 for Java ště/šti/ští check. Seems fairly clear improvement.
* palatalise.
* palatalise doesn't otherwise match.
* do_case doesn't make a replacement then do_possessive won't get called, but in the java code, removePossessives is always called. Merge Czech stemmer #151 (comment)
* palatalise except for -es/-ém/-ím
* setlimit tomark p1 for ([substring]) vs [substring] R1
* -ětem isn't listed by https://en.wikipedia.org/wiki/Czech_declension but seems to be valid from e.g. https://en.wiktionary.org/wiki/hrab%C4%9B https://en.wiktionary.org/wiki/markrab%C4%9B and https://en.wiktionary.org/wiki/ml%C3%A1d%C4%9B
* -os, -es, -iho, -imu aren't listed by https://en.wikipedia.org/wiki/Czech_declension
* -ich seems to only be a suffix for two pronouns
* -ima? Probably not.
* -ímu (with a diacritic on the i)? Yes.
* -ěte and -ěti while the aggressive stemmer removes -ete and -eti (no caron on the e). The snowball implementation follows the light stemmer. The older version of the light stemmer listed in the original paper removes all four suffixes. Analysis in Merge Czech stemmer #151 (comment) suggests maybe to leave as-is? Probably this was trying to make the stemmer partly ignore diacritics, see next point.
* { desce desk deska deskami deskou desková deskové deskový deskových desku desky deskách } + { dešti deštích deště } - seems to be conflating "plate" and "rain"; simple tests suggest this (and numerous other conflations due to palatalise) are fixable by imposing some sort of region check on the palatalise step, but need to experiment to determine what region definition is appropriate (and whether it should be the same for all palatalise replacements)