
[Question]: Token indices sequence length is longer than the specified maximum sequence length for this model (614 > 512). Running this sequence through the model will result in indexing errors #165

Open · lifengyu2005 opened this issue Jun 19, 2024 · 2 comments
Labels: question (Further information is requested)

@lifengyu2005
Describe the issue

I use the following configuration; why is it throwing an error? I see many 512 settings in the llmlingua installation path. Do I need to retrain the model, or is this an issue with the llmlingua version?

self.model_compress = PromptCompressor(
    model_name="/xxx/llmlingua/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,  # Whether to use llmlingua-2
    llmlingua2_config={
        "max_batch_size": 100,
        "max_force_token": 4096,
    },
)

llmlingua version 0.2.2

lifengyu2005 added the question label on Jun 19, 2024
iofu728 (Contributor) commented Jun 20, 2024

Hi @lifengyu2005, thanks for your support. These logs appear to be warnings. Did your program crash because of these warnings? Please provide more details to help us identify the issue.

cornzz commented Sep 11, 2024

@lifengyu2005 This warning comes from the tokenizer, not from the model itself; you can reproduce it as shown below.
The model used in LLMLingua-2 can only handle inputs of up to 512 tokens, so the compressor divides the prompt into 512-token chunks and compresses each chunk separately. Even when you see this warning, everything is working as intended.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/llmlingua-2-xlm-roberta-large-meetingbank")
tokens = tokenizer.encode("Loooong prompt...")
# Warning logged by the tokenizer:
# Token indices sequence length is longer than the specified maximum sequence length for this model (1500 > 512). Running this sequence through the model will result in indexing errors
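The chunking behaviour described above can be sketched as follows. This is only an illustration of the idea, not llmlingua's actual implementation; `chunk_token_ids` is a hypothetical helper:

```python
def chunk_token_ids(ids, max_len=512):
    """Split a list of token ids into chunks of at most max_len tokens,
    so each chunk fits within the model's 512-token limit."""
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]

# A 614-token sequence (the length from the warning in this issue)
# is split into one full 512-token chunk plus a 102-token remainder.
chunks = chunk_token_ids(list(range(614)))
print([len(c) for c in chunks])  # -> [512, 102]
```

Each chunk is then passed through the model separately, which is why the tokenizer's length warning is harmless here.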
