[Bug]: LLMLingua-2 uses wrong special tokens by default #181

cornzz · 2024-09-13T11:54:47Z

Describe the bug

TokenClfDataset is initialized without a model_name argument, so it defaults to bert-base-multilingual-cased. As a result, LLMLingua-2 uses the wrong special tokens, i.e.

    if "bert-base-multilingual-cased" in model_name:
        self.cls_token = "[CLS]"
        self.sep_token = "[SEP]"
        self.unk_token = "[UNK]"
        self.pad_token = "[PAD]"
        self.mask_token = "[MASK]"

instead of

    elif "xlm-roberta-large" in model_name:
        self.bos_token = "<s>"
        self.eos_token = "</s>"
        self.sep_token = "</s>"
        self.cls_token = "<s>"
        self.unk_token = "<unk>"
        self.pad_token = "<pad>"
        self.mask_token = "<mask>"

The tokenizer simply treats these wrong special tokens (bos/eos/pad) as unknown tokens; I don't know exactly what effect that has. The difference in compression is not very significant, but there is some difference.
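To make the "treated as unknown tokens" point concrete, here is a minimal sketch using a toy vocabulary in place of the real tokenizer. The vocabulary, the ids, and the `convert_tokens_to_ids` helper below are illustrative assumptions, not the library's actual code; the sketch only shows the general mechanism of a token string missing from the vocabulary falling back to the unknown id.

```python
# Toy vocabulary standing in for an XLM-RoBERTa-style tokenizer.
# The entries and ids are illustrative assumptions, not the real vocabulary.
TOY_VOCAB = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "▁hello": 4, "▁world": 5}
UNK_ID = TOY_VOCAB["<unk>"]


def convert_tokens_to_ids(tokens):
    """Map each token string to its id, falling back to the unknown id."""
    return [TOY_VOCAB.get(tok, UNK_ID) for tok in tokens]


body = ["▁hello", "▁world"]

# BERT-style special tokens are not in the vocabulary, so all three become <unk>.
wrong = convert_tokens_to_ids(["[CLS]"] + body + ["[SEP]", "[PAD]"])

# XLM-RoBERTa-style special tokens resolve to their own ids.
right = convert_tokens_to_ids(["<s>"] + body + ["</s>", "<pad>"])

print(wrong)  # [3, 4, 5, 3, 3]
print(right)  # [0, 4, 5, 2, 1]
```

In this toy setup the model would see the unknown id at the start, end, and padding positions instead of the ids it was trained with, which may explain a small but nonzero difference in output.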

Steps to reproduce

Add print(tokenized_text) at line 57 of utils.py to see the wrong tokens being used for the xlm-roberta-large based compression model.

Expected Behavior

The correct special tokens should be used for the respective compression model.
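One possible shape for this behavior, sketched under assumptions: the `SPECIAL_TOKENS` table and the `special_tokens_for` helper are hypothetical names introduced for illustration, and this is not necessarily how the linked pull request fixes it. The idea is to select the token set from the model name and to fail loudly on an unrecognized name rather than silently defaulting to the BERT tokens.

```python
# Hypothetical lookup table; the token values are copied from the snippets above.
SPECIAL_TOKENS = {
    "bert-base-multilingual-cased": {
        "cls_token": "[CLS]",
        "sep_token": "[SEP]",
        "unk_token": "[UNK]",
        "pad_token": "[PAD]",
        "mask_token": "[MASK]",
    },
    "xlm-roberta-large": {
        "bos_token": "<s>",
        "eos_token": "</s>",
        "sep_token": "</s>",
        "cls_token": "<s>",
        "unk_token": "<unk>",
        "pad_token": "<pad>",
        "mask_token": "<mask>",
    },
}


def special_tokens_for(model_name):
    """Return the special tokens whose key is a substring of model_name."""
    for key, tokens in SPECIAL_TOKENS.items():
        if key in model_name:
            return tokens
    # No silent default: an unknown model name is reported to the caller.
    raise ValueError(f"No special tokens known for model: {model_name!r}")


tokens = special_tokens_for("microsoft/llmlingua-2-xlm-roberta-large-meetingbank")
print(tokens["cls_token"], tokens["pad_token"])  # <s> <pad>
```

The caller would still need to pass the compression model's name through to wherever the dataset is constructed, which is the step the report says is currently missing.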

Additional Information

LLMLingua version: 0.2.2

iofu728 · 2024-10-22T12:45:22Z

Hi @cornzz, thanks for your feedback. I will review this PR ASAP.

cornzz · 2024-12-12T13:33:37Z

Hi @iofu728, any update? Did you get a chance to check whether this finding is valid? 😅

cornzz added the bug label Sep 13, 2024

cornzz added a commit to cornzz/LLMLingua that referenced this issue Sep 13, 2024: Fix(LLMLingua-2): fix wrong special tokens being used (microsoft#181) (c8709e6)

cornzz linked a pull request Sep 13, 2024 that will close this issue: Fix wrong special tokens being used for llmlingua-2 #182 (Open)

iofu728 self-assigned this Oct 22, 2024
