[MLS-272] Fix Special Token Encode Difference #201
Merged
Currently in MLC Serve, we may encounter input tokens that differ slightly from the expected ones due to the handling of special tokens. Take the recent CodeLlama 70B model as an example: the official document gives a chat example, and after applying the tokenizer's chat template with

`tokenizer.apply_chat_template(chat, tokenize=False)`

it becomes a single rendered prompt string, and the official reference code then encodes that prompt into a reference token sequence.
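For reference, a minimal sketch of that flow (the model id is the CodeLlama 70B Instruct checkpoint on the Hugging Face Hub; the chat messages below are stand-ins, not the exact example from the official document):

```python
from transformers import AutoTokenizer

# Stand-in chat, not the exact example from the official document.
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-70b-Instruct-hf")
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a function that computes fib(n)."},
]

# Render the chat template to a plain prompt string (no tokenization yet).
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
print(prompt)
```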
However, in our implementation, even though the generated prompt is the same, the encoded token sequence has an extra token `1` (the BOS token) at the beginning. The difference comes from the `add_special_tokens` option used when doing tokenizer encoding: the `transformers` lib's `apply_chat_template` function uses `add_special_tokens=False` by default when tokenizing (code can be found here). This PR uses the same option default and gets the same generated tokens as the official reference code.
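A small sketch of the difference, assuming the `tokenizer` and `prompt` from the snippet above (the exact token ids depend on the prompt):

```python
# With the default add_special_tokens=True, the Llama tokenizer prepends BOS (id 1).
with_special = tokenizer.encode(prompt)
# With add_special_tokens=False, the encoding matches the official reference tokens.
without_special = tokenizer.encode(prompt, add_special_tokens=False)

print(with_special[:5])     # begins with an extra 1 (the BOS token <s>)
print(without_special[:5])  # begins directly with the prompt tokens
```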
It also remains a question whether we should adopt the `apply_chat_template` function as a potential way to do both chat template application and tokenization using the tokenizer directly (a rough sketch of that usage is at the end of this description).

CC @sunggg
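A rough sketch of that alternative, assuming the same `tokenizer` and `chat` as above:

```python
# Let apply_chat_template render the template and tokenize in one call.
# Per the default described above, this encodes with add_special_tokens=False.
token_ids = tokenizer.apply_chat_template(chat, tokenize=True)
print(token_ids)  # a list of token ids
```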