[Question] Resolving the context length issue in tgi-service when running ChatQNA #1394

Open

ajaykallepalli opened this issue Jan 16, 2025 · 3 comments

@ajaykallepalli

Priority

Undecided

OS type

Ubuntu

Hardware type

Xeon-other (Please let us know in description)

Installation method

  • Pull docker images from hub.docker.com
  • Build docker images from source

Deploy method

  • Docker compose
  • Docker
  • Kubernetes
  • Helm

Running nodes

Single Node

What's the version?

tgi-service:
    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu

Description

Environment

  • Single machine deployment
  • Using docker-compose
  • Model: Intel/neural-chat-7b-v3-3
  • TGI version: 2.4.0-intel-cpu

Attempted Solutions

Tried increasing the context length through launch arguments; even the maximum of 8192 total tokens is not sufficient.

  • Current configuration includes:
    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-tokens 4096 --max-total-tokens 8192
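
As a quick sanity check, the limits the running router actually enforces can be read back from TGI's /info endpoint. This is a minimal sketch: the host port 9009 is an assumption, so substitute whatever port your compose file publishes for tgi-service, and note that the exact field names (e.g. max_input_tokens vs. max_input_length) vary between TGI versions.

# Inspect the effective token limits of the running tgi-service
curl -s http://localhost:9009/info | python3 -m json.tool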

Questions

  1. How can we increase the context length while staying with Intel/neural-chat-7b-v3-3?
  2. Are there alternative models with similar resource requirements but longer context recommended for the TGI service?

Impact

Unable to get comprehensive responses from the model due to context length limitations.

Reproduce steps

Start Services

cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon
source ./set_env.sh
docker-compose up

Steps to Reproduce

  1. Access the UI
  2. Upload documents through the interface
  3. Attempt to query with context that exceeds token limit

Error Log

2025-01-15T23:32:15.409004Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:114: `inputs` must have less than 4096 tokens. Given: 4883
2025-01-15T23:33:44.379589Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:114: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 12032 `inputs` tokens and 1024 `max_new_tokens`
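
For reference, a rough way to check how many tokens a given prompt consumes with this model's own tokenizer (a minimal sketch, assuming the transformers library is installed on the host):

# Count tokens for a prompt with the model's tokenizer
python3 -c "from transformers import AutoTokenizer; \
tok = AutoTokenizer.from_pretrained('Intel/neural-chat-7b-v3-3'); \
print(len(tok('paste the offending prompt here')['input_ids']))"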

@ajaykallepalli ajaykallepalli added the bug Something isn't working label Jan 16, 2025
@xiguiw xiguiw self-assigned this Jan 16, 2025
@xiguiw
Collaborator

xiguiw commented Jan 16, 2025

  • Current configuration includes:
    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-tokens 4096 --max-total-tokens 8192

@ajaykallepalli

Welcome to OPEA~

The configuration "--max-input-tokens 4096 --max-total-tokens 8192" limits the input to at most 4096 LLM tokens.
The log confirms it, so this is not a bug.

2025-01-15T23:32:15.409004Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:114: `inputs` must have less than 4096 tokens. Given: 4883
2025-01-15T23:33:44.379589Z ERROR chat_completions:async_stream:generate_stream: text_generation_router::infer: router/src/infer/mod.rs:114: `inputs` tokens + `max_new_tokens` must be <= 8192. Given: 12032 `inputs` tokens and 1024 `max_new_tokens`

Questions

  1. How can we increase the context length while staying with Intel/neural-chat-7b-v3-3?
    You can set larger token limits. In your case the log shows "Given: 12032 inputs tokens and 1024 max_new_tokens",
    so with 12032 input tokens you could set "--max-input-tokens 13000 --max-total-tokens 14024".

Please also check the model's context window to make sure it supports the chosen --max-total-tokens.
For "Intel/neural-chat-7b-v3-3" it is "max_position_embeddings": 32768; you can read it from the model's config.json, as for any Transformers LLM (see the one-liner after this list).

  2. Are there alternative models with similar resource requirements but longer context recommended for the TGI service?
    You can browse Hugging Face for models with a longer context window and comparable resource requirements.
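
A minimal way to read that context-window value yourself, assuming the transformers library is installed locally:

# Print the model's maximum context window (max_position_embeddings) from its config
python3 -c "from transformers import AutoConfig; \
print(AutoConfig.from_pretrained('Intel/neural-chat-7b-v3-3').max_position_embeddings)"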

@ajaykallepalli
Author

@xiguiw thanks for the prompt reply. Setting --max-input-tokens to anything higher than 4096 gives this error:

tgi-service                    | ValueError: The backend ipex does not support sliding window attention that is used by the model type mistral. To use this model nonetheless with the ipex backend, please launch TGI with the argument `--max-input-tokens` smaller than sliding_window=4096 (got here max_input_tokens=13000). rank=0
tgi-service                    | 2025-01-16T03:59:29.439705Z ERROR text_generation_launcher: Shard 0 failed to start
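
For reference, the 4096 cap on ipex comes from the sliding_window entry in the model's config; a minimal sketch to confirm it, assuming transformers is available:

# Show the model type and its declared sliding window (the source of the 4096 cap on ipex)
python3 -c "from transformers import AutoConfig; \
cfg = AutoConfig.from_pretrained('Intel/neural-chat-7b-v3-3'); \
print(cfg.model_type, getattr(cfg, 'sliding_window', None))"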

@xiguiw
Collaborator

xiguiw commented Jan 17, 2025

@ajaykallepalli

This is an inference backend issue, so it's out of OPEA's scope.
Please try the vLLM inference backend instead.
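
A minimal sketch of serving the same model with vLLM's OpenAI-compatible server and a larger context window (the port and the --max-model-len value are assumptions, and on Xeon you would need a CPU build/install of vLLM):

# Serve the model with vLLM's OpenAI-compatible API and a 16k context window
python3 -m vllm.entrypoints.openai.api_server \
  --model Intel/neural-chat-7b-v3-3 \
  --max-model-len 16384 \
  --port 8000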
