A question from a beginner: how are Chinese chars converted in the vocab? #19

Open
Adenialzz opened this issue Dec 24, 2024 · 0 comments

I noticed that the DeepSeek (LLaMA) vocab contains no Chinese characters; instead it has strings like åIJ¦, which can be decoded to Chinese. I wonder how this happens. I have checked the source code in transformers and did not find the implementation of this conversion.

(Pdb) mllm.tokenizer.convert_ids_to_tokens(3636)
'åIJ¦'
(Pdb) mllm.tokenizer.decode([3636])
'否'
(Pdb)
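
For context, here is a minimal sketch of the mechanism I suspect: byte-level BPE tokenizers (GPT-2 style, which LLaMA/DeepSeek appear to use) map each raw byte value to a printable Unicode character before tokens are stored in the vocab, so the UTF-8 bytes of a Chinese character show up as Latin-looking glyphs. The bytes_to_unicode helper below is the actual mapping from the GPT-2 tokenizer in transformers; whether DeepSeek's tokenizer goes through this exact code path is an assumption on my part.

# Byte-level BPE stores printable stand-ins for raw bytes, not the bytes themselves.
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

byte_encoder = bytes_to_unicode()                 # dict: byte value (int) -> printable char
byte_decoder = {v: k for k, v in byte_encoder.items()}

# Encode: the UTF-8 bytes of '否' (0xE5 0x90 0xA6) become 'å', 'IJ', '¦'.
token = "".join(byte_encoder[b] for b in "否".encode("utf-8"))
print(token)                                      # åIJ¦

# Decode: map each glyph back to its byte value and re-interpret as UTF-8.
raw = bytes(byte_decoder[ch] for ch in token)
print(raw.decode("utf-8"))                        # 否

If this is right, åIJ¦ is simply the byte-encoded form of 否 (UTF-8 bytes E5 90 A6), which is why decode() recovers the Chinese character even though it never appears in the vocab file.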