
Increase accuracy by weighting probs with softmax #9

Open
bminixhofer wants to merge 1 commit into master

Conversation

bminixhofer

Hi!

I just came across this library. Great idea, and great implementation!

This PR improves the score on GermaNet compounds from 95.20% to 95.91% on my machine. This is done by:

  1. considering all ngrams for the prefix and suffix, as in the original thesis, instead of only the longest ngram.
  2. aggregating the prefix and suffix scores with a softmax over the ngram lengths as weights, instead of taking the maximum.

Intuitively, this works because longer ngrams are more informative, but shorter ngrams still carry some signal, so including them with a lower weight instead of discarding them entirely improves the scores.

Note that softmax weighting did not improve the GermaNet score for infixes in my tests, so I left the aggregation method for infixes unchanged.
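
For concreteness, here is a minimal sketch of the length-weighted softmax aggregation described above. The function name, the (ngram, score) input format, and the use of NumPy are my illustration; the actual code in splitter.py may differ.

import numpy as np

def softmax_weighted_score(ngram_scores):
    """Aggregate (ngram, score) pairs, weighting each score by a
    softmax over the ngram lengths: longer ngrams dominate, but
    shorter ones still contribute."""
    lengths = np.array([len(ngram) for ngram, _ in ngram_scores], dtype=float)
    scores = np.array([score for _, score in ngram_scores], dtype=float)
    weights = np.exp(lengths - lengths.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return float(np.dot(weights, scores))

# The 4-gram gets ~66% of the weight, the 2-gram only ~9%:
print(softmax_weighted_score([("bahn", 0.9), ("ahn", 0.4), ("hn", 0.1)]))  # ~0.71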

Concerns

The downside of this PR is speed. As you mentioned in the comments, considering only the longest ngrams improves speed; this PR undoes that improvement.

Here are snippets of the GermaNet evaluation run without and with this PR:

Without:

In [2]: %%time
   ...: splitter.germanet_evaluation()
10000 Accuracy (9520/10000):  95.2
20000 Accuracy (19048/20000):  95.24
30000 Accuracy (28591/30000):  95.30333333333333
40000 Accuracy (38118/40000):  95.295
50000 Accuracy (47653/50000):  95.306
60000 Accuracy (57188/60000):  95.31333333333333
70000 Accuracy (66708/70000):  95.29714285714286
80000 Accuracy (76255/80000):  95.31875
82325 Accuracy (78377/82325):  95.20437291223809
CPU times: user 5.82 s, sys: 161 ms, total: 5.99 s
Wall time: 5.98 s

With:

In [2]: %%time
   ...: splitter.germanet_evaluation()
10000 Accuracy (9591/10000):  95.91
20000 Accuracy (19201/20000):  96.005
30000 Accuracy (28826/30000):  96.08666666666667
40000 Accuracy (38417/40000):  96.0425
50000 Accuracy (47990/50000):  95.98
60000 Accuracy (57589/60000):  95.98166666666667
70000 Accuracy (67213/70000):  96.01857142857143
80000 Accuracy (76825/80000):  96.03125
82325 Accuracy (78960/82325):  95.91254175523838
CPU times: user 13.6 s, sys: 124 ms, total: 13.8 s
Wall time: 13.8 s

So it takes roughly 2x as long. I am not sure the additional ~0.7% accuracy is worth it in all cases, so the softmax weighting should maybe be put behind a feature flag.
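
For illustration, the flag could be a keyword argument on the aggregation step. This is a sketch only: the parameter name use_softmax and the function aggregate are hypothetical, not part of the current splitter.py API.

def aggregate(ngram_scores, use_softmax=False):
    """Combine (ngram, score) pairs. With use_softmax off (the default),
    keep the current fast behaviour of using only the longest ngram."""
    if use_softmax:
        # slower, slightly more accurate path proposed in this PR
        return softmax_weighted_score(ngram_scores)  # defined in the sketch above
    # fast path: use only the longest ngram's score (the current behaviour)
    return max(ngram_scores, key=lambda pair: len(pair[0]))[1]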

bminixhofer (Author) commented Apr 5, 2020

On another note, I have been thinking about incorporating compound splitting into my library NNSplit and using this library to generate the labels.

This would require:

  • Making it language-agnostic (the only language-specific part I have seen so far is the Fugen-S handling, which would have to be removed or enabled only for German).
  • A rewrite of splitter.py in Rust (but with Python bindings) to make it fast enough for generating training data.
  • Returning the parts directly without titlecasing.
  • Splitting into the smallest possible parts (e.g. Autobahnraststätte -> [Auto, bahn, rast, stätte]).

Would you be interested in a PR for this as well, or is that out of scope for this library?

dtuggener (Owner)

This totally slipped my attention, sorry for getting back so late (2 years!), and thanks for the PR.
Using softmax is an interesting idea. As you said, the slowdown might not be worth the small increase in accuracy, depending on the situation. If you're still willing to update the PR and put the feature behind a flag/param, I can accept it.
As for your second comment, I think a Rust implementation would be outside the scope of this repo. I would recommend forking it.
