[Bug]: LLMLingua-2 uses wrong special tokens by default #181

Open
cornzz opened this issue Sep 13, 2024 · 2 comments · May be fixed by #182

cornzz commented Sep 13, 2024

Describe the bug

TokenClfDataset is initialized without a model_name argument and therefore falls back to its default, bert-base-multilingual-cased. As a result, LLMLingua-2 uses incorrect special tokens when the compression model is based on xlm-roberta-large, i.e.

    if "bert-base-multilingual-cased" in model_name:
        self.cls_token = "[CLS]"
        self.sep_token = "[SEP]"
        self.unk_token = "[UNK]"
        self.pad_token = "[PAD]"
        self.mask_token = "[MASK]"

instead of

    elif "xlm-roberta-large" in model_name:
        self.bos_token = "<s>"
        self.eos_token = "</s>"
        self.sep_token = "</s>"
        self.cls_token = "<s>"
        self.unk_token = "<unk>"
        self.pad_token = "<pad>"
        self.mask_token = "<mask>"

The tokenizer simply treats these wrong special tokens (used in place of bos/eos/pad) as unknown tokens. I don't know exactly what effect that has: the difference in compression is not very significant, but there is some difference.
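
To illustrate the mechanism, here is a minimal, self-contained sketch. It assumes the dataset wraps the text tokens with string special tokens and then converts them to ids by vocabulary lookup. TOY_VOCAB, convert_tokens_to_ids and wrap are hypothetical stand-ins written for this example, not the LLMLingua or tokenizer code.

```python
# Toy stand-in for an XLM-RoBERTa-style vocabulary (NOT the real vocab).
TOY_VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "▁hello": 4, "▁world": 5}
UNK_ID = TOY_VOCAB["<unk>"]


def convert_tokens_to_ids(tokens):
    """Map each token to its id, falling back to the unknown-token id."""
    return [TOY_VOCAB.get(tok, UNK_ID) for tok in tokens]


def wrap(tokens, cls_token, sep_token, pad_token, max_len):
    """Add start/end markers and right-pad the sequence to max_len."""
    wrapped = [cls_token] + tokens + [sep_token]
    return wrapped + [pad_token] * (max_len - len(wrapped))


text_tokens = ["▁hello", "▁world"]

# BERT-style markers are not in the toy vocab, so they all become <unk>.
wrong_ids = convert_tokens_to_ids(wrap(text_tokens, "[CLS]", "[SEP]", "[PAD]", max_len=6))
# XLM-RoBERTa-style markers resolve to their own dedicated ids.
right_ids = convert_tokens_to_ids(wrap(text_tokens, "<s>", "</s>", "<pad>", max_len=6))

print(wrong_ids)  # [3, 4, 5, 3, 3, 3]
print(right_ids)  # [0, 4, 5, 2, 1, 1]
```

In this sketch the start, end and padding positions all collapse to the same unknown id, so the model can no longer tell them apart from each other or from genuinely unknown words.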

Steps to reproduce

Add print(tokenized_text) at line 57 of utils.py to see the wrong tokens being used for the xlm-roberta-large based compression model.

Expected Behavior

The correct special tokens should be used for the respective compression model.
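
One possible way to get this behavior is sketched below. It is only an illustration, not necessarily what the linked pull request does. SPECIAL_TOKENS and special_tokens_for are hypothetical names; the token values are copied from the two branches quoted above.

```python
# Hypothetical lookup table; values mirror the two branches quoted in this issue.
SPECIAL_TOKENS = {
    "bert-base-multilingual-cased": {
        "cls_token": "[CLS]",
        "sep_token": "[SEP]",
        "unk_token": "[UNK]",
        "pad_token": "[PAD]",
        "mask_token": "[MASK]",
    },
    "xlm-roberta-large": {
        "bos_token": "<s>",
        "eos_token": "</s>",
        "sep_token": "</s>",
        "cls_token": "<s>",
        "unk_token": "<unk>",
        "pad_token": "<pad>",
        "mask_token": "<mask>",
    },
}


def special_tokens_for(model_name):
    """Return the special tokens for the model family contained in model_name."""
    for family, tokens in SPECIAL_TOKENS.items():
        if family in model_name:
            return tokens
    # Fail loudly instead of silently falling back to a default family.
    raise ValueError(f"No special tokens known for model: {model_name}")


xlmr = special_tokens_for("microsoft/llmlingua-2-xlm-roberta-large-meetingbank")
bert = special_tokens_for("microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank")
print(xlmr["cls_token"], bert["cls_token"])  # <s> [CLS]
```

Raising an error for an unrecognized model name is a deliberate choice in this sketch: a silent default is what allowed the mismatch described above to go unnoticed.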

Additional Information

  • LLMLingua version: 0.2.2

cornzz added the bug label Sep 13, 2024
cornzz linked a pull request Sep 13, 2024 that will close this issue
iofu728 self-assigned this Oct 22, 2024

iofu728 (Contributor) commented Oct 22, 2024

Hi @cornzz, thanks for your feedback. I will review this PR ASAP.

cornzz (Author) commented Dec 12, 2024

Hi @iofu728, any update? Did you get a chance to check whether this finding is valid? 😅
