[Question] Resolving the context length issue in tgi-service when running ChatQNA #1394

Open

ajaykallepalli opened this issue Jan 16, 2025 · 3 comments

@ajaykallepalli

Priority

Undecided

OS type

Ubuntu

Hardware type

Xeon-other (Please let us know in description)

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

tgi-service:
    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu

Description

Environment

  • Single machine deployment
  • Using docker-compose
  • Model: Intel/neural-chat-7b-v3-3
  • TGI version: 2.4.0-intel-cpu

Attempted Solutions

Tried increasing the context length through launch arguments; even the maximum of 8192 total tokens is not sufficient.

  • Current configuration includes:
    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-tokens 4096 --max-total-tokens 8192
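
As a quick sanity check, the limits the running router actually enforces can be read back from TGI's /info endpoint. This is a minimal sketch: the host port 9009 is an assumption, so substitute whatever port your compose file publishes for tgi-service, and note that the exact field names (e.g. max_input_tokens vs. max_input_length) vary between TGI versions.

# Inspect the effective token limits of the running tgi-service
curl -s http://localhost:9009/info | python3 -m json.tool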

Questions

  1. How can we increase the context length while staying with Intel/neural-chat-7b-v3-3?
  2. Are there alternative models with similar resource requirements but longer context recommended for the TGI service?

Impact

Unable to get comprehensive responses from the model due to context length limitations.

Reproduce steps

Start Services

cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon
source ./set_env.sh
docker-compose up

Steps to Reproduce

  1. Access the UI
  2. Upload documents through the interface
  3. Attempt to query with context that exceeds token limit

Error Log

2025-01-15T23:32:15.409004Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:114: `inputs` must have less than 4096 tokens. Given: 4883
2025-01-15T23:33:44.379589Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:114: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 12032 `inputs` tokens and 1024 `max_new_tokens`
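
For reference, a rough way to check how many tokens a given prompt consumes with this model's own tokenizer (a minimal sketch, assuming the transformers library is installed on the host):

# Count tokens for a prompt with the model's tokenizer
python3 -c "from transformers import AutoTokenizer; \
tok = AutoTokenizer.from_pretrained('Intel/neural-chat-7b-v3-3'); \
print(len(tok('paste the offending prompt here')['input_ids']))"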

@ajaykallepalli ajaykallepalli added the bug Something isn't working label Jan 16, 2025
@xiguiw xiguiw self-assigned this Jan 16, 2025
@xiguiw
Collaborator

xiguiw commented Jan 16, 2025

  • Current configuration includes:
    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-tokens 4096 --max-total-tokens 8192

@ajaykallepalli

Welcome to OPEA~

The configuration "--max-input-tokens 4096 --max-total-tokens 8192" limits the input to at most 4096 LLM tokens.
The log confirms it, so this is not a bug.

2025-01-15T23:32:15.409004Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:114: `inputs` must have less than 4096 tokens. Given: 4883
2025-01-15T23:33:44.379589Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:114: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 12032 `inputs` tokens and 1024 `max_new_tokens`

Questions

  1. How can we increase the context length while staying with Intel/neural-chat-7b-v3-3?
    You can set larger token limits. In your case the log shows "Given: 12032 inputs tokens and 1024 max_new_tokens",
    so with 12032 input tokens you could set "--max-input-tokens 13000 --max-total-tokens 14024".

Please also check the model's context window to make sure it supports the chosen --max-total-tokens.
For "Intel/neural-chat-7b-v3-3" it is "max_position_embeddings": 32768; you can read it from the model's config.json, as for any Transformers LLM (see the one-liner after this list).

  2. Are there alternative models with similar resource requirements but longer context recommended for the TGI service?
    You can browse Hugging Face for models with a longer context window and comparable resource requirements.
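
A minimal way to read that context-window value yourself, assuming the transformers library is installed locally:

# Print the model's maximum context window (max_position_embeddings) from its config
python3 -c "from transformers import AutoConfig; \
print(AutoConfig.from_pretrained('Intel/neural-chat-7b-v3-3').max_position_embeddings)"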

@ajaykallepalli
Author

@xiguiw thanks for the prompt reply. Setting --max-input-tokens to anything higher than 4096 gives this error:

tgi-service                    | ValueError: The backend ipex does not support sliding window attention that is used by the model type mistral. To use this model nonetheless with the ipex backend, please launch TGI with the argument `--max-input-tokens` smaller than sliding_window=4096 (got here max_input_tokens=13000). rank=0
tgi-service                    | 2025-01-16T03:59:29.439705Z ERROR text_generation_launcher: Shard 0 failed to start
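
For reference, the 4096 cap on ipex comes from the sliding_window entry in the model's config; a minimal sketch to confirm it, assuming transformers is available:

# Show the model type and its declared sliding window (the source of the 4096 cap on ipex)
python3 -c "from transformers import AutoConfig; \
cfg = AutoConfig.from_pretrained('Intel/neural-chat-7b-v3-3'); \
print(cfg.model_type, getattr(cfg, 'sliding_window', None))"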

@xiguiw
Collaborator

xiguiw commented Jan 17, 2025

@ajaykallepalli

This is an inference backend issue, so it's out of OPEA's scope.
Please try the vLLM inference backend instead.
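
A minimal sketch of serving the same model with vLLM's OpenAI-compatible server and a larger context window (the port and the --max-model-len value are assumptions, and on Xeon you would need a CPU build/install of vLLM):

# Serve the model with vLLM's OpenAI-compatible API and a 16k context window
python3 -m vllm.entrypoints.openai.api_server \
  --model Intel/neural-chat-7b-v3-3 \
  --max-model-len 16384 \
  --port 8000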
