
[Feature] Enhance Gaudi Performance by Adopting vLLM as Default Serving framework (ChatQnA) #1213

Open
lvliang-intel opened this issue Nov 29, 2024 · 3 comments

lvliang-intel (Collaborator) commented Nov 29, 2024

Priority: P1-Stopper
OS type: Ubuntu
Hardware type: Gaudi2
Running nodes: Single Node

Description

Feature Objective:
Set vLLM as the default serving framework on Gaudi to leverage its optimized performance characteristics, thereby improving throughput and reducing latency in inference tasks.

Feature Details:

• Replace TGI with vLLM as the default serving backend for inference on Gaudi devices.
• Update serving configurations to align with vLLM's architecture for inference (a client sketch follows this list).
• Perform performance benchmarking to validate vLLM's advantages in time to first token (TTFT), time per output token (TPOT), and scalability on Gaudi hardware (a measurement sketch follows the Expected Outcome below).
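
As a minimal illustration of what the backend swap looks like from a client's perspective, the sketch below assumes vLLM is exposed through its OpenAI-compatible API; the endpoint URL, model name, and prompt are placeholders, not the actual ChatQnA configuration.

```python
# Sketch only: the endpoint URL, model name, and prompt below are illustrative
# placeholders, not the actual ChatQnA configuration.
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API, so the LLM microservice can point its
# base_url at the vLLM container instead of a TGI endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "What is OPEA ChatQnA?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```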

Expected Outcome:
Adopting vLLM as the default framework improves the user experience by significantly lowering latency while exceeding the current TGI throughput levels on Gaudi.
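
For the latency metrics mentioned above, here is a rough, hedged sketch of how TTFT and TPOT could be measured for a single streamed request against an OpenAI-compatible endpoint; the URL and model name are placeholders, and it assumes the stream returns at least one content chunk.

```python
# Sketch only: times one streamed request against an OpenAI-compatible
# endpoint; the URL and model name are placeholders.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
num_chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize OPEA in one sentence."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries a content delta once generation starts.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        num_chunks += 1
end = time.perf_counter()

ttft = first_token_time - start  # time to first token
# TPOT approximated as the average time per streamed chunk after the first one
# (chunks are used as a rough proxy for tokens).
tpot = (end - first_token_time) / max(num_chunks - 1, 1)
print(f"TTFT: {ttft:.3f} s, TPOT: {tpot * 1000:.1f} ms/token")
```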

@lvliang-intel lvliang-intel added the feature New feature or request label Nov 29, 2024
@lvliang-intel lvliang-intel added this to the v1.2 milestone Nov 29, 2024
@joshuayao joshuayao moved this to In progress in OPEA Dec 1, 2024
wangkl2 (Collaborator) commented Jan 7, 2025

  • Target only the ChatQnA example in v1.2.
  • Evaluate on both Gaudi2 and Xeon.

@joshuayao joshuayao changed the title [Feature] Enhance Gaudi Performance by Adopting vLLM as Default Serving framework [Feature] Enhance Gaudi Performance by Adopting vLLM as Default Serving framework (ChatQnA) Jan 7, 2025
joshuayao (Collaborator) commented

@wangkl2, could we mark it as Done?

wangkl2 (Collaborator) commented Jan 13, 2025

@joshuayao Still conducting the performance benchmark comparisons and tests. After that, I will create the PRs and link them here.

wangkl2 added a commit to wangkl2/GenAIExamples that referenced this issue Jan 15, 2025
Switch from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to improve performance. Benchmarking the LLM component on a Xeon server with vLLM and TGI backends, across different input/output sequence lengths (ISL/OSL) and various numbers of queries and concurrency levels, shows that the geomean of measured LLM serving performance on a 7B model improves with vLLM over TGI on several metrics, including average total latency, average TTFT, average TPOT, and throughput. TGI remains available as a deployment option for LLM serving. In addition, vLLM replaces TGI in the other provided end-to-end ChatQnA pipelines, including the without-rerank pipeline and the pipelines using Pinecone and Qdrant as the vector DB.

Implement opea-project#1213

Signed-off-by: Wang, Kai Lawrence <[email protected]>
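
For illustration only (not code from the linked commits), this is a minimal sketch of how per-configuration results might be aggregated with a geometric mean when comparing vLLM against TGI across ISL/OSL and concurrency settings; all numbers are made-up placeholders, not measured results.

```python
# Sketch only: aggregates per-configuration TGI-vs-vLLM ratios with a geometric
# mean, mirroring the comparison described above. The numbers are invented
# placeholders, not measured results.
from math import prod


def geomean(values):
    """Geometric mean of a sequence of positive numbers."""
    return prod(values) ** (1.0 / len(values))


# One ratio per (ISL, OSL, concurrency) configuration: TGI metric divided by
# vLLM metric, so values above 1.0 favor vLLM for latency-style metrics.
ttft_ratios = [1.10, 0.98, 1.25, 1.05]  # placeholder ratios
tpot_ratios = [1.30, 1.22, 1.18, 1.40]  # placeholder ratios

print(f"Geomean TTFT speedup: {geomean(ttft_ratios):.2f}x")
print(f"Geomean TPOT speedup: {geomean(tpot_ratios):.2f}x")
```
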
wangkl2 added a commit to wangkl2/GenAIExamples that referenced this issue Jan 16, 2025
Switch from TGI to vLLM as the default LLM serving backend on Gaudi for the ChatQnA example to improve performance. Benchmarking the LLM component on a Gaudi2 server with vLLM and TGI backends, across different input/output sequence lengths (ISL/OSL) and various numbers of queries and concurrency levels, shows that the geomean of measured LLM serving performance on a 7B model improves with vLLM over TGI on several metrics, including average total latency, average TPOT, and throughput, while the geomean of average TTFT does not increase significantly. TGI remains available as a deployment option for LLM serving. In addition, vLLM replaces TGI in the other provided end-to-end ChatQnA pipelines, including the without-rerank pipeline and the megaservice with guardrails.

Implement opea-project#1213

Signed-off-by: Wang, Kai Lawrence <[email protected]>
chensuyue pushed a commit that referenced this issue Jan 17, 2025
Switch from TGI to vLLM as the default LLM serving backend on Gaudi for the ChatQnA example to improve performance.

#1213
Signed-off-by: Wang, Kai Lawrence <[email protected]>
chensuyue pushed a commit that referenced this issue Jan 17, 2025
Switch from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to improve performance.

#1213
Signed-off-by: Wang, Kai Lawrence <[email protected]>