
Increase accuracy by weighting probs with softmax #9

Open
bminixhofer wants to merge 1 commit into master

Conversation

bminixhofer

Hi!

I just came across this library. Great idea, and great implementation!

This PR improves the score on GermaNet compounds from 95.20% to 95.91% on my machine. This is done by:

  1. considering all ngrams for the prefix and suffix, as in the original thesis, instead of only the longest ngram.
  2. aggregating the prefix and suffix scores with a softmax over the ngram lengths as weights, instead of taking the maximum.

Intuitively, this works because longer ngrams are more informative, but shorter ngrams still carry some signal, so including them with a lower weight instead of discarding them entirely improves the scores.

Note that softmax weighting did not improve the GermaNet score for infixes in my tests, so I left the aggregation method for infixes unchanged.
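
For concreteness, here is a minimal sketch of the length-weighted softmax aggregation described above. The function name, the (ngram, score) input format, and the use of NumPy are my illustration; the actual code in splitter.py may differ.

import numpy as np

def softmax_weighted_score(ngram_scores):
    """Aggregate (ngram, score) pairs, weighting each score by a
    softmax over the ngram lengths: longer ngrams dominate, but
    shorter ones still contribute."""
    lengths = np.array([len(ngram) for ngram, _ in ngram_scores], dtype=float)
    scores = np.array([score for _, score in ngram_scores], dtype=float)
    weights = np.exp(lengths - lengths.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return float(np.dot(weights, scores))

# The 4-gram gets ~66% of the weight, the 2-gram only ~9%:
print(softmax_weighted_score([("bahn", 0.9), ("ahn", 0.4), ("hn", 0.1)]))  # ~0.71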

Concerns

The downside of this PR is speed. As you mentioned in the comments, considering only the longest ngrams improves speed; this PR undoes that improvement.

Here are snippets of the GermaNet evaluation run without and with this PR:

Without:

In [2]: %%time
   ...: splitter.germanet_evaluation()
10000 Accuracy (9520/10000):  95.2
20000 Accuracy (19048/20000):  95.24
30000 Accuracy (28591/30000):  95.30333333333333
40000 Accuracy (38118/40000):  95.295
50000 Accuracy (47653/50000):  95.306
60000 Accuracy (57188/60000):  95.31333333333333
70000 Accuracy (66708/70000):  95.29714285714286
80000 Accuracy (76255/80000):  95.31875
82325 Accuracy (78377/82325):  95.20437291223809
CPU times: user 5.82 s, sys: 161 ms, total: 5.99 s
Wall time: 5.98 s

With:

In [2]: %%time
   ...: splitter.germanet_evaluation()
10000 Accuracy (9591/10000):  95.91
20000 Accuracy (19201/20000):  96.005
30000 Accuracy (28826/30000):  96.08666666666667
40000 Accuracy (38417/40000):  96.0425
50000 Accuracy (47990/50000):  95.98
60000 Accuracy (57589/60000):  95.98166666666667
70000 Accuracy (67213/70000):  96.01857142857143
80000 Accuracy (76825/80000):  96.03125
82325 Accuracy (78960/82325):  95.91254175523838
CPU times: user 13.6 s, sys: 124 ms, total: 13.8 s
Wall time: 13.8 s

So it takes roughly 2x as long. I am not sure the additional ~0.7% accuracy is worth it in all cases, so the softmax weighting should maybe be put behind a feature flag.
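
For illustration, the flag could be a keyword argument on the aggregation step. This is a sketch only: the parameter name use_softmax and the function aggregate are hypothetical, not part of the current splitter.py API.

def aggregate(ngram_scores, use_softmax=False):
    """Combine (ngram, score) pairs. With use_softmax off (the default),
    keep the current fast behaviour of using only the longest ngram."""
    if use_softmax:
        # slower, slightly more accurate path proposed in this PR
        return softmax_weighted_score(ngram_scores)  # defined in the sketch above
    # fast path: use only the longest ngram's score (the current behaviour)
    return max(ngram_scores, key=lambda pair: len(pair[0]))[1]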

bminixhofer (Author) commented Apr 5, 2020

On another note, I have been thinking about incorporating compound splitting into my library NNSplit and using this library to generate the labels.

This would require:

  • Making it language-agnostic (the only language-specific part I have seen so far is the Fugen-S handling, which would have to be removed or enabled only for German).
  • A rewrite of splitter.py in Rust (but with Python bindings) to make it fast enough for generating training data.
  • Returning the parts directly without titlecasing.
  • Splitting into the smallest possible parts (e.g. Autobahnraststätte -> [Auto, bahn, rast, stätte]).

Would you be interested in a PR for this as well, or is that out of scope for this library?

dtuggener (Owner)

This totally slipped my attention, sorry for getting back so late (2 years!), and thanks for the PR.
Using softmax is an interesting idea. As you said, the slowdown might not be worth the small increase in accuracy, depending on the situation. If you're still willing to update the PR and put the feature behind a flag/param, I can accept it.
As for your second comment, I think a Rust implementation would be outside the scope of this repo. I would recommend forking it.
