Describe the bug
The `TokenClfDataset` is initialized without a `model_name` parameter and therefore defaults to `bert-base-multilingual-cased`, meaning that incorrect special tokens are used in llmlingua-2, i.e. `[CLS]`/`[SEP]`/`[PAD]` instead of `<s>`/`</s>`/`<pad>`. The tokenizer simply treats these wrong special tokens (bos/eos/pad) as unknown tokens; I don't know what effect that has exactly. The difference in compression is not very significant, but there is some difference.
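For illustration, here is a small standalone sketch (using Hugging Face `transformers` directly, not the LLMLingua code) of why the mismatch matters: the BERT-style special-token strings are not in the XLM-RoBERTa vocabulary, so they all resolve to `<unk>`:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-large")

# The BERT-style special tokens are unknown pieces to the XLM-RoBERTa vocab,
# so each of them maps to the <unk> id.
print(tok.convert_tokens_to_ids(["[CLS]", "[SEP]", "[PAD]"]))
print(tok.unk_token_id)

# The special tokens the model was actually trained with.
print(tok.cls_token, tok.sep_token, tok.pad_token)  # <s> </s> <pad>
```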
Steps to reproduce
Add `print(tokenized_text)` in line 57 of `utils.py` to see the wrong tokens used for the xlm-roberta-large based compression model.
Expected Behavior
The correct special tokens should be used for the respective compression model.
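A minimal sketch of the expected fix, assuming the dataset is constructed somewhere along the llmlingua-2 compression path roughly as below; the surrounding argument names (`chunks`, `self.tokenizer`, `self.max_seq_len`, `self.model_name`) are illustrative assumptions, and the only point is that `model_name` gets forwarded:

```python
# Hypothetical call site; passing model_name= is the suggested change.
dataset = TokenClfDataset(
    chunks,                      # the texts to score (name assumed)
    tokenizer=self.tokenizer,    # the compression model's tokenizer (assumed)
    max_len=self.max_seq_len,    # existing length limit (assumed)
    model_name=self.model_name,  # e.g. the xlm-roberta-large checkpoint in use
)
```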
Additional Information
LLMLingua version: 0.2.2