[Question] what to do when model doesn't have tokenizer.model? #2212
Comments
In your case specifically, you can use the original Llama 3.2 1B tokenizer.model from https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct (if the unsloth version is based off the instruct model, use the base one otherwise). If unsloth modified any of the special tokens, then you will need a new I don't believe you can load in the tokenizer without the
@RdoubleA Thanks for the explanation, I understand that case now. For deepseek-ai/DeepSeek-V3, though, I have no idea what should be done here.
@joecummings @RdoubleA I ran into this while working on the Phi4 PR. There are several possible solutions, but I would love to get your comments first.
So if I understand correctly, this is basically a function of torchtune not integrating with the Hugging Face tokenizers library, correct? In most of the examples listed above, I believe there are tokenizer.json and tokenizer_config.json files that HF uses to build the tokenizer. I think we could consider building a utility that parses a given HF tokenizer and wraps it in a format compatible with torchtune. This would require a fair bit of discussion, though, as there are a lot of details we'd need to iron out. cc @joecummings @RdoubleA for your thoughts
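To make the idea a bit more concrete, here is a minimal sketch of the parsing half of such a utility. It is not torchtune code: the function name `load_hf_tokenizer_files` and the shape of the returned dict are hypothetical, and it assumes the common `tokenizer.json` layout (`model.vocab`, `model.merges`, `added_tokens`) used by BPE-style tokenizers. Wrapping the result in a torchtune-compatible tokenizer is the part that would still need the discussion mentioned above.

```python
import json
from pathlib import Path


def load_hf_tokenizer_files(model_dir):
    """Hypothetical helper: read tokenizer.json (and tokenizer_config.json,
    if present) from model_dir and return the pieces as a plain dict."""
    model_dir = Path(model_dir)
    with open(model_dir / "tokenizer.json", encoding="utf-8") as f:
        tok = json.load(f)

    model = tok.get("model", {})
    # Assumption: BPE/WordPiece-style layout, where vocab maps token -> id.
    vocab = model.get("vocab", {})
    merges = model.get("merges", [])

    # Added tokens carry their own ids; keep only those flagged as special.
    special_tokens = {
        t["content"]: t["id"]
        for t in tok.get("added_tokens", [])
        if t.get("special", False)
    }

    # tokenizer_config.json is optional; it usually names the BOS/EOS tokens.
    config = {}
    config_path = model_dir / "tokenizer_config.json"
    if config_path.exists():
        with open(config_path, encoding="utf-8") as f:
            config = json.load(f)

    return {
        "type": model.get("type"),
        "vocab": vocab,
        "merges": merges,
        "special_tokens": special_tokens,
        "bos_token": config.get("bos_token"),
        "eos_token": config.get("eos_token"),
    }
```

Calling `load_hf_tokenizer_files("/path/to/model_dir")` on a directory that contains `tokenizer.json` returns the vocabulary, merges, and special tokens without needing `tokenizer.model`. Tokenizers that store their vocabulary differently (for example as a list of token and score pairs) would need extra handling.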
@krammnic Took a look at your PR. I agree we need a better solution here. We are working on integrating with HF better so it's easier to port over new models, tokenizers being a major pain point. A few options:
The other thing to consider is, once a new model tokenizer is added we don't need to "convert" from HF anymore because users can just instantiate the added model tokenizer. Or maybe we'll just need to load from some base tokenizer.model each time. Open to other solutions.
tokenizer.model is required in the YAML config, but many models don't have a tokenizer.model (example: unsloth/Llama-3.2-1B). In these cases, how can we use tokenizer.json or tokenizer_config.json, which are included in almost all models, instead of tokenizer.model?