Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarification on if snowball (specifically python implemenation) is not thread-safe #162

Closed
DBCerigo opened this issue Dec 23, 2021 · 8 comments

Comments

@DBCerigo
Copy link
Contributor

Hi, we've been experiencing intermittent inconsistent outputs (i.e. bug) when using snowball with dask multiprocessing. We can stop these bugs occurring by any of a) using a single process/thread, b) removing the stemming, or c) moving the instantiation of the stemmer inside the function which is being applied within threads.

Could someone with expertise input on whether snowball is thread-safe or not?

Might be related to #146 which seems to imply that the C# implementation is not thread safe.

@DBCerigo
Copy link
Contributor Author

Found it in the docs https://github.com/zvelo/libstemmer/blob/78c149a3a6f262a35c7f7351d3f77b725fc646cf/README#L45

Stemmers are re-entrant, but not threadsafe.  In other words, if
you wish to access the same stemmer object from multiple threads,
you must ensure that all access is protected by a mutex or similar
device.

Apologies for the extra noise.

@ojwb
Copy link
Member

ojwb commented Dec 23, 2021

I should point out that's not our repo you've linked to - it's a git repo created from an ancient tarball from the old website (snowball.tartarus.org) - it's essentially a decade-old (and, judging from their git history, barely maintained) fork, and definitely not a canonical source of information about current Snowball releases from our repo. You're also looking at the README for the C libstemmer, not Python.

Your conclusion is right though, and I think it's currently true for all the target languages. The issue is that the Snowball cursor position, limits, slice, and any Snowball variables are stored in the object (or C struct or whatever). It would probably be possible to avoid this, but it would need to be done without adding overhead per word stemmed as we expect a lot of words to need stemming in the intended domain of use.

@DBCerigo
Copy link
Contributor Author

I should point out that's not our repo you've linked to - it's a git repo created from an ancient tarball from the old website (snowball.tartarus.org) - it's essentially a decade-old (and, judging from their git history, barely maintained) fork, and definitely not a canonical source of information about current Snowball releases from our repo. You're also looking at the README for the C libstemmer, not Python.

Thanks for the clarification, appreciated.

Your conclusion is right though, and I think it's currently true for all the target languages. The issue is that the Snowball cursor position, limits, slice, and any Snowball variables are stored in the object (or C struct or whatever). It would probably be possible to avoid this, but it would need to be done without adding overhead per word stemmed as we expect a lot of words to need stemming in the intended domain of use.

I couldn't find a similar reference to Snowball not being thread-safe within this repo, but it is likely I have just not managed to find it. If there isn't anywhere in the docs that state that Snowball instances are not thread-safe and that to use in a multi-thread/process run one should instantiate a Snowball stemmer instance within each thread (as opposed to sharing one instance across multiple threads), then that could be a valuable addition to the docs to help future uses avoid such race-condition intermittent bugs that are particularly hard to track down.

Thanks for all your time and energy that goes into this package. It is appreciated!

@ojwb
Copy link
Member

ojwb commented Dec 23, 2021

doc/libstemmer_c_README is the equivalent of the file you were looking at in the other repo.

As you say, we should document this in all the target language README files though.

@ojwb ojwb reopened this Dec 23, 2021
@DBCerigo
Copy link
Contributor Author

DBCerigo commented Dec 23, 2021

@ojwb if you point me to where that is I can have a go at adding it if that would be helpful.

I misread you message, it's clear what is requested.

@ojwb
Copy link
Member

ojwb commented Dec 23, 2021

A patch would be lovely. The target language README files are either in doc/ or the language subdirectory - see output of: git ls-files '*/*README*'

One thought - I don't write a lot of heavily multithreaded code, but I think it would probably be better advice to suggest creating one stemmer object per thread wanting to stem words - currently we talk about sharing one stemmer object and protecting it with a mutex or similar, but that adds contention on a single resource without a good reason to.

@DBCerigo
Copy link
Contributor Author

DBCerigo commented Dec 23, 2021

#163 :)

I concur(ed) with your thought on advising user to instantiate per thread is more user friendly than advising them to use a mutex/lock.

@ojwb
Copy link
Member

ojwb commented Nov 9, 2022

Fixed by #163.

@ojwb ojwb closed this as completed Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants