Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping #1042

Open
Crista23 opened this issue Oct 5, 2024 · 9 comments

Comments

@Crista23

Crista23 commented Oct 5, 2024

My code is throwing the error below:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/net/scratch/user/miniconda3/envs/vllm_guidance/lib/python3.9/site-packages/guidance/models/transformers/_transformers.py:150: UserWarning: Could not build_byte tokens from the tokenizer by encoding token strings: Round-trip encoding of tokens [!] failed! Got [128000, 0]
 warnings.warn(
Warning: lexer error: too many states: 10406 >= 10000; stopping

I can see this error is thrown in the code here:
https://github.com/guidance-ai/guidance/blob/main/guidance/models/transformers/_transformers.py#L233
It looks like a tokenizer issue; however, I am calling the guidance library without specifying a tokenizer:

llm = models.Transformers(args.model_path, device_map="auto", trust_remote_code=True)

I am wondering how to fix this. Any advice appreciated, thanks!

@Harsha-Nori
Collaborator

Hi @Crista23, sorry you're dealing with this! Which version of the package are you using? Are you on our release candidate / installing from source?

Even if a tokenizer isn't explicitly specified, we do need one for guidance to work properly. For transformers-based models, we try to load it automatically from the model config. However, sometimes this can act up, especially if new tokens were added to a model's vocabulary via fine-tuning (and not updated in the config...).

Are you using a public/oss model? Do you mind sharing the link to it so that we can try to debug it on our side?

@Crista23
Author

Crista23 commented Oct 6, 2024

Hi @Harsha-Nori, thanks a lot for your answer! I installed guidance with pip using --pre, and the installed version is 0.2.0rc1. I am using it with publicly available models such as LLAMA-8B-Instruct, instantiated in the code using

llm = models.Transformers(args.model_path, device_map="auto", trust_remote_code=True)

outputs = llm + prompt_eval(item[key])

It worked for a couple of examples before crashing with this error: "Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping". It looks like a tokenizer issue, and it still fails even though I tried replacing "!" with the empty string in the input.

I would appreciate your thoughts on how to fix this, thank you!

@Crista23
Author

Crista23 commented Oct 8, 2024

@Harsha-Nori Any thoughts? Sorry to ask again, it's a pressing issue.

@Harsha-Nori
Collaborator

Hi @Crista23, I can't seem to replicate this with a llama-8B model :(. Could you share some more details about your code, including the exact huggingface model and/or details of the prompt_eval method?

The error message can happen if the grammar you're constraining against is particularly complex, but I can't seem to replicate it on my side :(. Happy to also collaborate via email if you can't share publicly.

@hudson-ai
Collaborator

@Crista23 if you can't share details of your prompt, would you be able to share the full traceback? Thanks!

@jtbuter

jtbuter commented Oct 18, 2024

@Harsha-Nori @hudson-ai I get a similar warning when initializing the llama 8b instruct model with guidance 0.1.16 and transformers 4.45.2

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
import guidance.models
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

llama3 = guidance.models.Transformers(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

The warning is the following:
UserWarning: Could not build_byte tokens from the tokenizer by encoding token strings: Round-trip encoding of tokens [!] failed! Got [128000, 0]

Could it be because the tokenizer encodes ! as [128000, 0], which decodes to <|begin_of_text|>!, so that this check fails?

if len(encoded_str) != 1:
    raise ValueError(f"Round-trip encoding of tokens [{token}] failed! Got {encoded_str}")
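
A minimal sketch of that hypothesis, using a hypothetical toy tokenizer rather than the real Llama 3 one. The class `ToyTokenizer` and the helper `round_trip_ok` are made up for illustration and are not part of guidance or transformers; the only thing taken from the thread is the id pair [128000, 0]. The `add_special_tokens` parameter is named after the Hugging Face keyword of the same name, on the assumption that a prepended BOS token is what makes the length check fail.

```python
BOS_ID = 128000  # id reported in the warning; assumed to be <|begin_of_text|>


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer that prepends a BOS token by default."""

    vocab = {"<|begin_of_text|>": BOS_ID, "!": 0}

    def encode(self, text, add_special_tokens=True):
        ids = [self.vocab[text]]
        # With special tokens enabled, the BOS id is placed in front of the real ids.
        return [BOS_ID] + ids if add_special_tokens else ids


def round_trip_ok(tokenizer, token, **encode_kwargs):
    """Same idea as the quoted check: one token string should encode to exactly one id."""
    return len(tokenizer.encode(token, **encode_kwargs)) == 1


tok = ToyTokenizer()
print(tok.encode("!"))                                     # [128000, 0]
print(round_trip_ok(tok, "!"))                             # False
print(round_trip_ok(tok, "!", add_special_tokens=False))   # True
```

If the real tokenizer behaves like this toy one, the check fails because of the extra BOS id and not because "!" itself is mis-tokenized.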

@hudson-ai
Collaborator

@jtbuter thanks for the repro -- I am able to reproduce the warning with transformers 4.45.2 (interestingly, not with my previously installed 4.44.0 version). We have a few methods for converting tokens into a form that we need in order to support constrained decoding, and the warning here is just saying that our preferred approach is failing and falling back to an alternative approach. Will definitely look into what's going on under the hood here -- thank you for the suggestion on where to look. I think you have the right idea.

Are you experiencing any downstream problems after seeing this warning?

That said, the "lexer error: too many states: 10406 >= 10000; stopping" warning that @Crista23 is seeing "shouldn't" be caused by this -- it seems that our parser finds the particular grammar they are constraining against disagreeable for whatever reason. I've seen something similar happen in some grammars where the parse tree is exceptionally ambiguous.

@Crista23 are you able to share any details about the constraints you are using? I would love to see us (1) improve robustness and (2) provide more helpful exceptions and warnings. A concrete example of what's causing this would really help to that end.
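
As a generic illustration of how a state limit like the one in the message can be hit, here is a sketch of the textbook subset construction for the regex (a|b)*a(a|b){n-1} ("the n-th symbol from the end is a"). It is not guidance's lexer, and nothing in the thread confirms that this is the mechanism behind the reported error; the function `count_dfa_states`, the exception `TooManyStates`, and the default limit of 10,000 are assumptions chosen to mirror the wording of the message.

```python
from collections import deque


class TooManyStates(ValueError):
    """Raised when the construction exceeds the configured state budget."""


def count_dfa_states(n, limit=10_000):
    """Count DFA states produced by subset construction for (a|b)*a(a|b){n-1}.

    NFA states are 0..n: state 0 loops on both symbols and also moves to 1 on 'a';
    state i (1 <= i < n) moves to i + 1 on either symbol; state n is accepting.
    """

    def step(subset, symbol):
        nxt = set()
        for s in subset:
            if s == 0:
                nxt.add(0)
                if symbol == "a":
                    nxt.add(1)
            elif s < n:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for symbol in "ab":
            target = step(current, symbol)
            if target not in seen:
                seen.add(target)
                if len(seen) >= limit:
                    raise TooManyStates(f"too many states: {len(seen)} >= {limit}")
                queue.append(target)
    return len(seen)


print(count_dfa_states(3))   # 8
print(count_dfa_states(10))  # 1024
try:
    count_dfa_states(14)     # would need 2**14 = 16384 states
except TooManyStates as err:
    print(err)
```

The state count doubles with each increment of n, so a pattern that looks short on paper can exceed a fixed budget quickly.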

@jtbuter

jtbuter commented Oct 21, 2024

Thank you for the reply. I did not experience any other problems after this warning.

@guillaume-requena

Hello @hudson-ai, @Harsha-Nori, I am getting the same lexer state error as @Crista23:

ValueError: lexer error: too many states: 10882 >= 10000

Since this issue is still open, I'll ask for details here.

Here is my code, which simply uses models and gen from guidance:

language_model = models.Transformers(
     "google/gemma-2-9b-it",
     device_map="auto",
     max_memory={0: '30GiB', 'cpu': "80GiB"}
)
result = language_model + f'''
        Q: {prompt}
        A: {gen(
            name="answer",
            max_tokens=200
        )}
        '''

This error happens during inference on big prompts (roughly more than 3600 tokens, if you want to reproduce). It looks like the grammar has a size limit, but I can't find where to change the 10k state limit. See the stack trace below. Is there a way to bypass this limit while still using guidance (using a subgrammar or something else)?

...
File ".venv/lib/python3.10/site-packages/guidance/_parser.py", line 78, in advance
    return self._generator.send(engine_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File ".venv/lib/python3.10/site-packages/guidance/_parser.py", line 153, in _parse
    backtrack, ff_tokens = self.ll_interpreter.commit_token(
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: lexer error: too many states: 10882 >= 10000

Thanks for your help!
