Slow calculation of ambiguity feature in MLLM #822

Open
RietdorfC opened this issue Dec 16, 2024 · 3 comments · May be fixed by #825

Comments

@RietdorfC

Dear Osma, dear annif-team,

As discussed in the annif-users group (https://groups.google.com/g/annif-users/c/8d3AL4LAzBQ), I added the debugging lines and ran the suggest operation with an MLLM model trained on the full GND vocabulary we use (1.4M subjects), on a document with a long processing time (305.72 sec.). Please find the zipped tsets.jsonl file attached to this issue.

Best regards
Clemens

tsets.zip

@osma
Member

osma commented Dec 16, 2024

Thanks @RietdorfC !

The tsets.jsonl file is quite revealing: you have some matches with extreme repetition, especially token id 194284. I'm not sure what it is without having access to the model internals, but it seems to be some word that matches a lot of different GND subjects (2628 to be exact). It could be a common name like "Smith"; if you have a lot of names like "Smith, A." and "Smith, B." (even as altLabels) in GND, then the analyzer in Annif will likely discard the initials (because they are too short to be considered words) and MLLM will just see a lot of concepts all having the same label "smith", which are potential matches every time the word "Smith" appears in the document text.
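To illustrate the effect described above, here is a minimal sketch of how dropping short tokens can collapse many distinct labels into one. It is not Annif's actual analyzer or MLLM code: the `normalize_label` and `build_index` helpers, the `MIN_TOKEN_LENGTH` threshold of 3, and the sample labels are all assumptions made up for this example.

```python
import re
from collections import defaultdict

# Assumed threshold for illustration; the real analyzer setting may differ.
MIN_TOKEN_LENGTH = 3


def normalize_label(label: str) -> tuple[str, ...]:
    """Lowercase a label and drop tokens shorter than MIN_TOKEN_LENGTH."""
    tokens = re.findall(r"\w+", label.lower())
    return tuple(t for t in tokens if len(t) >= MIN_TOKEN_LENGTH)


def build_index(labels: dict[str, str]) -> dict[tuple[str, ...], set[str]]:
    """Map each normalized label to the set of concept ids that share it."""
    index: dict[tuple[str, ...], set[str]] = defaultdict(set)
    for concept_id, label in labels.items():
        index[normalize_label(label)].add(concept_id)
    return dict(index)


# Hypothetical sample labels, not taken from GND.
labels = {
    "c1": "Smith, A.",
    "c2": "Smith, B.",
    "c3": "Smith, C.",
    "c4": "Machine learning",
}
index = build_index(labels)

# All three "Smith" concepts end up under the same single-token key.
print(sorted(index[("smith",)]))
```

With thousands of such labels, every occurrence of the shared word in a document would become a candidate match for every one of those concepts, which is consistent with the repetition seen in tsets.jsonl.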

I'll see if anything can be done to speed up the slow ambiguity calculation, but this is a symptom of matching gone wrong in other ways as well.

@osma osma linked a pull request Dec 20, 2024 that will close this issue
@osma
Member

osma commented Dec 20, 2024

Hi @RietdorfC, I've now implemented a new, hopefully much faster method for calculating the ambiguity feature in PR #825. Could you please test the code in that branch? I'm especially interested in the following:

  1. Does the code run in your environment?
  2. Does it reduce the train and suggest time for MLLM?
  3. Does it achieve the same level of quality?

@RietdorfC
Author

Hi @osma,
Thank you for your quick and helpful reply and for implementing the new method!

I have found the token that was responsible for the large number of matches, along with the corresponding matches. We will investigate this issue further.

I will test your new method and report back to you as soon as possible.

Best regards
Clemens
