Add ukrainian stemmer #178

Open
wants to merge 13 commits into
base: master

abratashov

No description provided.

@abratashov abratashov force-pushed the Add-Ukrainian-stemmer branch from eda2f49 to ff6c73a Compare June 10, 2023 18:01
@abratashov abratashov force-pushed the Add-Ukrainian-stemmer branch from ff6c73a to fd30acf Compare June 10, 2023 18:03
@ojwb
Member

ojwb commented Sep 20, 2023

@abratashov Have you finished working on this? It seems additional changes get pushed from time to time, and I can still see commented out code and questions in comments in the .sbl file...

Also, can you clarify how this relates to the Ukrainian stemmer in #144?

It seems they've been separately developed, both starting from the Snowball Russian stemming algorithm.

The original author of the code in #144 made some comments about it - notably that it doesn't try to remove prefixes (as best I can tell yours doesn't either?), and uses a cruder length check than the usual Snowball R1/R2/RV approach which the Russian stemmer and yours use.

Comparing output on the sample vocabulary from snowballstem/snowball-data#18 I can see quite a few cases which the older submission appears to handle better (I can't read Ukrainian though, so maybe these are incorrect conflation of similar words with different meanings), e.g. here's an annotated screenshot with your stemmer on the right:

[Image: ukrainian stemmer output comparison]

I've marked in green vs red where it looks to me like one stemmer is doing a better job.

In this screenful there's one word where yours seems better, but the other stemmer seems better overall. This varies as I page through the file, but if I had to pick, the stemmer from #144 seems a bit better. However, I should reiterate that's an impression I've formed without any knowledge of what the words I'm looking at actually mean!

One likely flaw I spotted with the other stemmer is it can reduce words to a single letter, which is not necessarily always wrong, but is liable to conflate unrelated words given there are only 33 possible single letter stems - I suspect that's a result of using an initial length check instead of restricting removal to suffixes in R1/R2.

@abratashov
Author

@ojwb thanks for your checks on this PR, yes I'm polishing it!

With the help of others from Ukraine and the international community, this year I've dived deeper into the Snowball stemmer and this area in general.

Currently, this PR contains the latest version of the UA stemmer and some dev tools that facilitate development (utf <=> sbl converter), as well as some files with test words.

In the near future, I'll explore the stemmer in #144. As far as I know, that PR was opened by @tggo, who just took (if I'm not wrong, because I couldn't contact him) the original SBL https://github.com/Tapkomet/UAStemming/blob/master/stem_ukr.sbl from @Tapkomet. @Tapkomet created his UA stemmer for educational purposes, so I'll make use of all its advantages soon too.

Main questions:

  1. What should the PR look like? Should it be only the one ukrainian.sbl file?
  2. How do I estimate the quality of a stemmer? Are there any tools for that? CC: @arysin , @amakukha
  3. Where should I keep test sets of words (*.txt, *.yml etc)? I can't find any test cases in the original Snowball repository.

Thanks!

@Tapkomet

Tapkomet commented Sep 20, 2023

> @ojwb thanks for your checks on this PR, yes I'm polishing it!
>
> Main questions:
>
> 1. What should the PR look like? Should it be only the one `ukrainian.sbl` file?
> 2. How do I estimate the quality of a stemmer? Are there any tools for that?
> 3. Where should I keep test sets of words (*.txt, *.yml etc)? I can't find any test cases in the original Snowball repository.
>
> Thanks!

I believe I can help a bit with questions 2 and 3. When I worked on this, I built a Java project - I believe there are instructions on how to do it on the Snowball website. IIRC I had to rebuild it whenever I made edits to the .sbl file. (I should note that the project would come out slightly wrong, with incorrectly set imports, but when I fixed that it would be workable).

Afterwards, I simply had a text file in the project folder with a bunch of Ukrainian text (I copy-pasted a bunch of Ukrainian Wikipedia articles into the file as source material), and the program would output the results to a results text file.

For measuring output of the stemmer, I would simply go through a significant amount of results at random (like a hundred or two) and tally up the number of errors. Obviously I had to judge by myself what was an error and what wasn't, so it was subjective in some cases.

If you want to see examples, I am attaching the txt file containing source text, and the results file. The results file pairs each stemmed word with its original form (first stemmed, then original), e.g. авторств авторство

testUkrainian.txt
Results.txt
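The manual spot check described above could be sketched roughly as follows. This is only an illustration, assuming the results file holds one whitespace-separated pair per line with the stem first, as in the example "авторств авторство". The function names, the second sample word, and the fixed random seed are made up for this sketch and are not part of any attached project.

```python
import random
from collections import defaultdict


def parse_results(lines):
    """Parse 'stem original' pairs, one per line; other lines are skipped."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def group_by_stem(pairs):
    """Map each stem to the set of original words that were reduced to it."""
    groups = defaultdict(set)
    for stem, word in pairs:
        groups[stem].add(word)
    return dict(groups)


def sample_for_review(pairs, k, seed=0):
    """Pick up to k pairs at random for a human to judge (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def error_rate(judgements):
    """judgements: list of bools, True where the reviewer judged the stem wrong."""
    return sum(judgements) / len(judgements) if judgements else 0.0


# Hypothetical demo data: the first line is the example from the comment above,
# the second is an invented companion word for illustration only.
demo_lines = ["авторств авторство", "авторств авторства"]
demo_pairs = parse_results(demo_lines)
for stem, word in sample_for_review(demo_pairs, k=2):
    print(f"{word} -> {stem}")
```

Grouping by stem as well as sampling pairs makes it easier to spot unrelated words that end up conflated under one stem, which a pair-by-pair check can miss.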

@ojwb
Member

ojwb commented Sep 20, 2023

> In the near future, I'll explore the stemmer in #144. As far as I know, that PR was opened by @tggo, who just took (if I'm not wrong, because I couldn't contact him) the original SBL https://github.com/Tapkomet/UAStemming/blob/master/stem_ukr.sbl from @Tapkomet. @Tapkomet created his UA stemmer for educational purposes, so I'll make use of all its advantages soon too.

#144 is the "UAStemming" code with one change - it uses the newer {U+nnnn} notation for Unicode codepoints instead of hex nnnn (the way hex is specified means you need a modified version of the Snowball source to support single byte character sets, whereas the newer syntax allows us to have a single version of the source of each algorithm - I don't know if KOI8-U is still relevant, but if it were it would help for that).

> Main questions:
>
> 1. What should the PR look like? Should it be only the one `ukrainian.sbl` file?

This is detailed in CONTRIBUTING.rst, but essentially just the new file and an update to modules.txt. Everything should automatically work from that.

Test coverage is provided via the data files in snowball-data (which make check, make check_java etc in snowball will use automatically), which are in a separate repo as they're much larger than the code itself. These provide test coverage for all languages Snowball can generate code for, so they're a better approach than writing test scripts in one particular language, which would need writing 9 times, with any update applied in 9 places.

Please keep each PR to one purpose - make dev tools, etc their own PR(s). Reviewing a larger PR is harder and takes longer, and everything ends up blocked by a blocker in one part.

> 2. How do I estimate the quality of a stemmer? Are there any tools for that? CC: @arysin , @amakukha

Looking at the output of ./stemtest -l ukrainian -p2 < some-ukrainian-word-list.txt gives an idea (the screenshot above is just that output for the two stemmers compared in vimdiff). We don't have anything more sophisticated.

I'm (very) slowly working on a script which attempts to describe the changes resulting from a proposed code change to a stemming algorithm, which is sort of related but different.

> 3. Where should I keep test sets of words (*.txt, *.yml etc)? I can't find any test cases in the original Snowball repository.

snowball-data (again, read CONTRIBUTING.rst).

There's a wordlist extracted from Ukrainian wikipedia in snowballstem/snowball-data#22 (I think the submitter closed it after realising the algorithm had already been submitted, but the earlier submission had a wordlist that seems much too short so I'd suggest this one unless you have a better one which is suitably licensed).

This was referenced Sep 20, 2023
@abratashov
Author

Now everything is clear, thanks for the answers, will do it!

@ojwb
Member

ojwb commented Oct 5, 2023

> I'm (very) slowly working on a script which attempts to describe the changes resulting from a proposed code change to a stemming algorithm, which is sort of related but different.

This is now in the snowball-data repo as scripts/stemmer-compare - you might find it useful for evaluating potential changes you're considering making to the algorithm.

It takes a vocabulary list and two output files with stemmed versions and attempts to describe the changes. It can spot and describe some simple cases of merged or split groups of stems, and some cases where a stem moves between groups. Testing so far suggests it does better than I'd hoped for evaluating small tweaks to an algorithm, but it does less well for comparing "porter" vs "english" (where the latter evolved from the former) and isn't really useful for "dutch" vs "kraaij_pohlmann" (which are two separately developed Dutch stemming algorithms). It'll likely improve with time.

Sample excerpts of output for a recent tweak to the swedish stemmer:

A total of 342 words changed stem

* 273 words changed stem but aren't interesting:
  altröst, amitiöst, anderöster, andraröster, [...]

* 53 merges of groups of stems:
  { ambitiöst } + { ambitiös, ambitiösa, ambitiösare, ambitiösaste, ambitiöse }
  { amoröst } + { amorös, amorösa, amoröse }
  { avlöst, avlösta, avlöste, avlöstes, avlösts } + { avlösa, avlösande, avlösare, avlösas, avlöser, avlöses }
[...]
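To give a feel for the kind of comparison described above, here is a minimal sketch that takes a vocabulary list and two parallel lists of stems, reports which words changed stem, and detects the simplest kind of merge (a new group that is exactly the union of two or more old groups). It is not the stemmer-compare script and makes no claim to match its logic or output format. The function names are invented, and the stems in the demo are made up for illustration rather than taken from the real Swedish stemmer.

```python
from collections import defaultdict


def stem_groups(vocab, stems):
    """Return the set of word groups, where each group shares one stem."""
    by_stem = defaultdict(set)
    for word, stem in zip(vocab, stems):
        by_stem[stem].add(word)
    return {frozenset(words) for words in by_stem.values()}


def changed_words(vocab, old_stems, new_stems):
    """Words whose stem differs between the two runs."""
    return [w for w, a, b in zip(vocab, old_stems, new_stems) if a != b]


def merged_groups(vocab, old_stems, new_stems):
    """Find new groups that are exactly the union of two or more old groups."""
    old = stem_groups(vocab, old_stems)
    merges = []
    for new_group in stem_groups(vocab, new_stems):
        parts = [g for g in old if g <= new_group]
        if len(parts) >= 2 and frozenset().union(*parts) == new_group:
            merges.append(sorted(sorted(p) for p in parts))
    return sorted(merges)


# Demo: the words come from the excerpt above; the stems are hypothetical.
vocab = ["ambitiöst", "ambitiös", "ambitiösa"]
old_stems = ["ambitiöst", "ambitiös", "ambitiös"]
new_stems = ["ambitiös", "ambitiös", "ambitiös"]
print(changed_words(vocab, old_stems, new_stems))
print(merged_groups(vocab, old_stems, new_stems))
```

Splits can be found the same way by swapping the old and new stem lists.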

@abratashov abratashov force-pushed the Add-Ukrainian-stemmer branch from 2ab61be to 2e638b9 Compare December 12, 2023 21:25
@abratashov
Author

@ojwb I've updated the current stemmer with new rules, and also opened a PR with test words: snowballstem/snowball-data#24

I hope to polish it into a production-ready release during the next month!


// Apostrophe-like symbols
// stringdef a_apostrophe '{U+0027}' // '
// stringdef a_grave_accent U+0060 // ` cannot to remove system char in Snowball
Member

I don't understand the comment here - there's nothing special about this character in Snowball. Maybe you were just missing the '{ and }' around it?

Author

Ok, I'll remove unnecessary apostrophe-like symbols

do repeat ( goto (['{a_lsq_mark}']) delete )
do repeat ( goto (['{a_rsq_mark}']) delete )
do repeat ( goto (['{a_shr9q_mark}']) delete )
do repeat ( goto (['{a_prime}']) delete )
Member

Do all these actually occur in real-world Ukrainian text in place of an apostrophe? There's an overhead to checking for them so I'm dubious about handling characters just because they look kind of like an apostrophe if they don't actually get used in practice.

Possibly Snowball should have a more efficient way to transliterate (or delete) a set of characters from a string. Currently the above is a reasonable approach, but it involves scanning the input once for each character.

Copy link

Until about 10 years ago, there was a lack of Ukrainian keyboard layouts with proper apostrophes, and also a lack of OCR software that supported Ukrainian symbols correctly. That resulted in a huge number of texts in which lots of different Unicode characters that look similar to the apostrophe were used.
In the last decade though the situation improved quite a bit, so now it's mostly down to 3: U+0027, U+02BC, U+2019
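For anyone who prefers to handle this before the text reaches the stemmer, the three code points listed above can be removed in a single pass over each word. The sketch below is an assumption about how a caller might preprocess input in Python. It is not part of the stemmer in this PR, and the names `APOSTROPHES` and `strip_apostrophes` are invented for illustration.

```python
# The three apostrophe-like code points named above:
# U+0027 APOSTROPHE, U+02BC MODIFIER LETTER APOSTROPHE,
# U+2019 RIGHT SINGLE QUOTATION MARK.
APOSTROPHES = "\u0027\u02bc\u2019"

# str.maketrans with a third argument builds a table that deletes those characters.
_DELETE_TABLE = str.maketrans("", "", APOSTROPHES)


def strip_apostrophes(word):
    """Delete all three apostrophe variants in one scan of the word."""
    return word.translate(_DELETE_TABLE)


print(strip_apostrophes("п\u2019ять"))
```

A translation table handles any number of characters in one scan, so adding further apostrophe look-alikes for old OCR data would not add extra passes.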

Member

So I guess it might matter for some cases (i.e. users with a lot of textual data created by OCR over 10 years ago which they've not managed to clean up).

I'm happy for people familiar with the situation to decide what's appropriate - mostly I just wanted to flag this in case this was an instance of attempting theoretical completeness without realising it would add overhead.

Copy link

Well, the situation with texts has got much better lately. Also, with old/unreliable sources I'd expect some text cleaning to happen before they're used anywhere anyway. I don't have a strong feeling either way, but if I had to choose I'd say those 3 should be enough for most cases (maybe adding a note to the stemmer's README).

define remove_vowel_before_vowel as (
[substring] among (
'{a}' '{e}' '{ye}' '{y}' '{i}' '{yi}' '{i`}' '{o}' '{u}' '{soft}' '{iu}' '{ia}'
('{a}' or '{e}' or '{ye}' or '{y}' or '{i}' or '{yi}' or '{i`}' or '{o}' or '{u}' or '{soft}' or '{iu}' or '{ia}' delete )
Member

A long or chain is less efficient - better to replace this line with an among which can check for a set of n strings in O(log(n)) instead of O(n):

      ( among ('{a}' '{e}' '{ye}' '{y}' '{i}' '{yi}' '{i`}' '{o}' '{u}' '{soft}' '{iu}' '{ia}') delete )

Member

Looking at this again, more efficient still would be to use a grouping. Above add vowel to the groupings list, then define it as:

define vowel v + '{i`}{soft}'

(Maybe vowel is a bad name for this if v is the "real" vowels. Or maybe these two should actually just be in v anyway?)

Then this function becomes:

define remove_vowel_before_vowel as (
    [vowel] vowel delete
)

The other among uses where it's just a list of individual characters with a single common action could be done similarly.

The snowball compiler could be smarter and turn such an among into a grouping but the Snowball code for the grouping version actually seems clearer.

(It looks to me like this function is a bit misnamed as it actually seems to remove a vowel which is after a vowel since it's working in backwardmode, but if I follow the code it probably would be both clearer and more efficient to eliminate this function and make remove_last_2_vowels just do [vowel vowel] delete).
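As a rough model of what `[vowel vowel] delete` would do in backward mode, the Python sketch below removes the last two characters of a word when both belong to the grouping, and otherwise leaves the word alone. The mapping from the stringdef names to Cyrillic letters is an assumption based on the names used in the among list, and the sketch ignores any R1/R2/RV region limits the real routine may run under.

```python
# Assumed letters for a, e, ye, y, i, yi, o, u, iu, ia (the "real" vowels, v).
V = set("аеєиіїоуюя")
# The wider grouping suggested above: v plus i` (й) and soft (ь), also assumed.
VOWEL = V | set("йь")


def remove_last_2_vowels(word):
    """Model of `[vowel vowel] delete` in backward mode (cursor at end of word).

    Both of the last two characters must be in the grouping; if either test
    fails, the word is returned unchanged.
    """
    if len(word) >= 2 and word[-1] in VOWEL and word[-2] in VOWEL:
        return word[:-2]
    return word


print(remove_last_2_vowels("радіо"))
```

A set membership test here plays the role of the grouping: each character is checked in constant time, rather than being compared against every alternative in turn as the `or` chain does.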
