[ChatQnA] Switch to vLLM as default llm backend on Gaudi #1404

wangkl2 · 2025-01-16T12:48:06Z

Description

This PR switches the default LLM serving backend for the ChatQnA example on Gaudi from TGI to vLLM to improve performance. Benchmarks on a Gaudi2 server compared the vLLM and TGI backends for the LLM component across different ISL/OSL settings and various numbers of queries and concurrency levels. On a 7B model, the geomean of the measured LLMServe performance shows vLLM improving on TGI for average total latency, average TPOT, and throughput, while the geomean of average TTFT does not increase significantly. TGI remains available as a deployment option for LLM serving. vLLM also replaces TGI as the LLM backend in the other provided E2E ChatQnA pipelines, including the without-rerank pipeline and the megaservice with guardrails. This PR also aligns the LLM service parameters in all ChatQnA test scripts with those in the README.
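
For readers unfamiliar with the metrics named in the description, the sketch below shows one common way to derive total latency, TTFT, TPOT, and output throughput from per-token arrival timestamps of a streamed response. It is illustrative only: `RequestTrace` and the helper functions are hypothetical names, and the exact definitions used by the benchmark tooling may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timestamps in seconds for one streamed request (hypothetical record type)."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.token_times[0] - trace.sent_at


def total_latency(trace: RequestTrace) -> float:
    """End-to-end latency: delay between sending and the last output token."""
    return trace.token_times[-1] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token, averaged over the tokens after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def output_tokens_per_second(traces: List[RequestTrace]) -> float:
    """Aggregate throughput: all output tokens over the wall-clock window."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return sum(len(t.token_times) for t in traces) / (end - start)


# Made-up example: first token after 0.5 s, then one token every 0.1 s.
example = RequestTrace(sent_at=0.0, token_times=[0.5, 0.6, 0.7, 0.8, 0.9])
print(ttft(example), total_latency(example), tpot(example))
```

Under these definitions, a backend can have a higher TTFT yet a lower total latency when its TPOT is lower and the output is long enough.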

Issues

#1213

Type of change

New feature (non-breaking change which adds new functionality)
Others (enhancement, documentation, validation, etc.)

Dependencies

n/a

Tests

TGI-Gaudi version: 2.0.6
vLLM-fork version: 0.6.3.dev910+g3c39626f

The LLMServe performance was benchmarked and compared on a Gaudi2 server with the OOB-vLLM and Tuned-TGI backends via GenAIEval. The table below shows the reference geomean performance ratio on a 7B LLM for 4 sets of ISL/OSL, measured across different num_queries/concurrency settings, including 32/8 and 128/32. Using vLLM as the LLM backend shows a 1.14X-6.39X speedup on 3 metrics (average total latency, average TPOT, and output tokens/s), with no significant drop on average TTFT (0.87X).

| ISL/OSL, LLM Backend | Geomean of Normalized Avg Total Latency | Geomean of Normalized Avg TTFT | Geomean of Normalized Avg TPOT | Geomean of Normalized Output Tokens/s |
| --- | --- | --- | --- | --- |
| 128/128, TGI | 1.00 | 1.00 | 1.00 | 1.00 |
| 128/128, vLLM | 0.72 | 0.97 | 0.83 | 4.41 |
| 128/1024, TGI | 1.00 | 1.00 | 1.00 | 1.00 |
| 128/1024, vLLM | 0.97 | 0.94 | 0.97 | 5.10 |
| 1024/128, TGI | 1.00 | 1.00 | 1.00 | 1.00 |
| 1024/128, vLLM | 1.11 | 1.91 | 0.69 | 3.41 |
| 1024/1024, TGI | 1.00 | 1.00 | 1.00 | 1.00 |
| 1024/1024, vLLM | 0.77 | 1.01 | 0.87 | 21.76 |
| Overall Geomean of Normalized Metric with vLLM | 0.88 | 1.15 | 0.83 | 6.39 |
| Overall Geomean Perf Speedup with vLLM over TGI | 1/0.88=1.14X | 1/1.15=0.87X | 1/0.83=1.20X | 6.39/1=6.39X |
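
The two summary rows can be reproduced from the per-ISL/OSL vLLM rows. The sketch below is a minimal illustration, not the script used for the benchmark: it takes the geometric mean of the four normalized values per metric, then converts it to a speedup by inverting the lower-is-better metrics (latency, TTFT, TPOT) and using the throughput ratio directly. The dictionary keys and function names are chosen here for illustration.

```python
import math

# Normalized vLLM values (TGI = 1.00) copied from the table above, in the
# order 128/128, 128/1024, 1024/128, 1024/1024.
normalized_vllm = {
    "avg_total_latency": [0.72, 0.97, 1.11, 0.77],
    "avg_ttft": [0.97, 0.94, 1.91, 1.01],
    "avg_tpot": [0.83, 0.97, 0.69, 0.87],
    "output_tokens_per_s": [4.41, 5.10, 3.41, 21.76],
}

# Metrics where a larger value is better; the others are lower-is-better.
HIGHER_IS_BETTER = {"output_tokens_per_s"}


def geomean(values):
    """Geometric mean computed in log space to avoid overflow on long inputs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speedup(metric, overall):
    """Speedup of vLLM over TGI given the overall normalized geomean."""
    return overall if metric in HIGHER_IS_BETTER else 1.0 / overall


for metric, values in normalized_vllm.items():
    overall = geomean(values)
    print(f"{metric}: geomean={overall:.2f}, speedup={speedup(metric, overall):.2f}X")
```

The geometric mean is used rather than the arithmetic mean because the inputs are ratios; it keeps a 2X gain and a 0.5X loss symmetric around 1.00.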

github-actions · 2025-01-16T12:48:20Z

Dependency Review

✅ No vulnerabilities or license issues found.

wangkl2 added 8 commits January 15, 2025 23:21

Switch to vllm llm backend for wo-rerank and guardrails pipe

2e5bc5c

Signed-off-by: Wang, Kai Lawrence <[email protected]>

update ut scripts for gaudi

978aaaf

Signed-off-by: Wang, Kai Lawrence <[email protected]>

solve conflicts

460cff6

Signed-off-by: Wang, Kai Lawrence <[email protected]>

solve conflicts

64fb275

Signed-off-by: Wang, Kai Lawrence <[email protected]>

Update readme

c7fe0dd

Signed-off-by: Wang, Kai Lawrence <[email protected]>

Resolve conflicts

e77faad

Signed-off-by: Wang, Kai Lawrence <[email protected]>

Fix ci issues

eb5880c

Signed-off-by: Wang, Kai Lawrence <[email protected]>

wangkl2 requested review from lvliang-intel and letonghan as code owners January 16, 2025 12:48

pre-commit-ci bot and others added 6 commits January 16, 2025 12:48

[pre-commit.ci] auto fixes from pre-commit.com hooks

9b78ed7

for more information, see https://pre-commit.ci

Fix ci issues

f1198f5

Signed-off-by: Wang, Kai Lawrence <[email protected]>

Merge branch 'vllm-default-gaudi' of https://github.com/wangkl2/GenAI…

13843ea

…Examples into vllm-default-gaudi

Merge branch 'vllm-default-gaudi' of https://github.com/wangkl2/GenAI…

97eb79b

…Examples into vllm-default-gaudi Signed-off-by: Wang, Kai Lawrence <[email protected]>

Merge branch 'vllm-default-gaudi' of https://github.com/wangkl2/GenAI…

4cafaa5

…Examples into vllm-default-gaudi

Fix ci issues for guardrails

8c976b7

Signed-off-by: Wang, Kai Lawrence <[email protected]>

joshuayao requested review from yao531441 and XinyuYe-Intel January 17, 2025 08:13

yao531441 approved these changes Jan 17, 2025

XinyuYe-Intel approved these changes Jan 17, 2025

chensuyue merged commit 00e9da9 into opea-project:main Jan 17, 2025
17 checks passed
