Slow calculation of ambiguity feature in MLLM #822
Thanks @RietdorfC! I'll see if anything can be done to speed up the slow ambiguity calculation, but this is also a symptom of matching having gone wrong in other ways.
Hi @RietdorfC , I've now implemented a new, hopefully much faster method for calculating the ambiguity feature in PR #825. Could you please test the code in that branch? I'm especially interested in
Hi @osma, I have found the token that was responsible for the large number of matches (and the corresponding matches). We will investigate this issue further. I will test your new method and report back to you as soon as possible. Best regards
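For anyone following along, here is a minimal sketch of the kind of computation involved. This is not Annif's actual implementation; it only illustrates (under assumed half-open `(start, end)` span semantics) how an ambiguity-like feature, i.e. how many candidate matches overlap each other, can be computed in O(n log n) with a sweep line instead of an O(n²) pairwise check, which is one plausible reason a document with a pathological token producing very many matches becomes slow:

```python
def ambiguity(spans):
    """Count overlapping pairs among half-open (start, end) spans.

    Hypothetical helper, not part of Annif: a sweep over sorted span
    endpoints counts, for each span that opens, how many spans are
    currently open. Spans that merely touch at an endpoint do not count
    as overlapping because a close event sorts before an open event at
    the same coordinate.
    """
    events = []
    for start, end in spans:
        events.append((start, 1))  # 1 = open (sorts after a close at the same point)
        events.append((end, 0))    # 0 = close
    events.sort()

    open_count = 0
    overlaps = 0
    for _, kind in events:
        if kind == 1:
            overlaps += open_count  # new span overlaps every currently open span
            open_count += 1
        else:
            open_count -= 1
    return overlaps

print(ambiguity([(0, 5), (3, 8), (10, 12)]))  # -> 1 (only the first two overlap)
```

With many thousands of matches triggered by a single token, the difference between quadratic and near-linear handling of such pair counts would easily account for processing times in the hundreds of seconds.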
Dear Osma, dear annif-team,
As discussed in the annif-users group (https://groups.google.com/g/annif-users/c/8d3AL4LAzBQ), I have added the debugging lines and performed the suggest operation with an MLLM model trained on the full GND vocabulary set we use (1.4M subjects), on a document with a long processing time (305.72 sec.). Please find the zipped tsets.jsonl file attached to this issue.
Best regards
Clemens
tsets.zip