
[Feature] Enhance Gaudi Performance by Adopting vLLM as Default Serving framework (ChatQnA) #1213

Open
lvliang-intel opened this issue Nov 29, 2024 · 3 comments

lvliang-intel (Collaborator) commented Nov 29, 2024

Priority: P1-Stopper
OS type: Ubuntu
Hardware type: Gaudi2
Running nodes: Single Node

Description

Feature Objective:
Set vLLM as the default serving framework on Gaudi to leverage its optimized performance characteristics, thereby improving throughput and reducing latency in inference tasks.

Feature Details:

• Replace TGI with vLLM as the default serving backend for inference on Gaudi devices.
• Update serving configurations to align with vLLM's architecture for inference (a client sketch follows this list).
• Perform performance benchmarking to validate vLLM's advantages in time to first token (TTFT), time per output token (TPOT), and scalability on Gaudi hardware (a measurement sketch follows the Expected Outcome below).
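
As a minimal illustration of what the backend swap looks like from a client's perspective, the sketch below assumes vLLM is exposed through its OpenAI-compatible API; the endpoint URL, model name, and prompt are placeholders, not the actual ChatQnA configuration.

```python
# Sketch only: the endpoint URL, model name, and prompt below are illustrative
# placeholders, not the actual ChatQnA configuration.
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API, so the LLM microservice can point its
# base_url at the vLLM container instead of a TGI endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "What is OPEA ChatQnA?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```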

Expected Outcome:
Adopting vLLM as the default framework improves the user experience by significantly lowering latency while exceeding the current TGI throughput levels on Gaudi.
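
For the latency metrics mentioned above, here is a rough, hedged sketch of how TTFT and TPOT could be measured for a single streamed request against an OpenAI-compatible endpoint; the URL and model name are placeholders, and it assumes the stream returns at least one content chunk.

```python
# Sketch only: times one streamed request against an OpenAI-compatible
# endpoint; the URL and model name are placeholders.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_time = None
num_chunks = 0

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize OPEA in one sentence."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries a content delta once generation starts.
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        num_chunks += 1
end = time.perf_counter()

ttft = first_token_time - start  # time to first token
# TPOT approximated as the average time per streamed chunk after the first one
# (chunks are used as a rough proxy for tokens).
tpot = (end - first_token_time) / max(num_chunks - 1, 1)
print(f"TTFT: {ttft:.3f} s, TPOT: {tpot * 1000:.1f} ms/token")
```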

@lvliang-intel lvliang-intel added the feature New feature or request label Nov 29, 2024
@lvliang-intel lvliang-intel added this to the v1.2 milestone Nov 29, 2024
@joshuayao joshuayao moved this to In progress in OPEA Dec 1, 2024
wangkl2 (Collaborator) commented Jan 7, 2025

  • Target only the ChatQnA example in v1.2.
  • Evaluate on both Gaudi2 and Xeon.

@joshuayao joshuayao changed the title [Feature] Enhance Gaudi Performance by Adopting vLLM as Default Serving framework [Feature] Enhance Gaudi Performance by Adopting vLLM as Default Serving framework (ChatQnA) Jan 7, 2025
joshuayao (Collaborator) commented

@wangkl2, could we mark it as Done?

wangkl2 (Collaborator) commented Jan 13, 2025

@joshuayao Still conducting the performance benchmark comparisons and tests. After that, I will create the PRs and link them here.

wangkl2 added a commit to wangkl2/GenAIExamples that referenced this issue Jan 15, 2025
Switch from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to improve performance. Benchmarking the LLM component on a Xeon server with vLLM and TGI backends, across different input/output sequence lengths (ISL/OSL) and various numbers of queries and concurrency levels, shows that the geomean of measured LLM serving performance on a 7B model improves with vLLM over TGI on several metrics, including average total latency, average TTFT, average TPOT, and throughput. TGI remains available as a deployment option for LLM serving. In addition, vLLM replaces TGI in the other provided end-to-end ChatQnA pipelines, including the without-rerank pipeline and the pipelines using Pinecone and Qdrant as the vector DB.

Implement opea-project#1213

Signed-off-by: Wang, Kai Lawrence <[email protected]>
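
For illustration only (not code from the linked commits), this is a minimal sketch of how per-configuration results might be aggregated with a geometric mean when comparing vLLM against TGI across ISL/OSL and concurrency settings; all numbers are made-up placeholders, not measured results.

```python
# Sketch only: aggregates per-configuration TGI-vs-vLLM ratios with a geometric
# mean, mirroring the comparison described above. The numbers are invented
# placeholders, not measured results.
from math import prod


def geomean(values):
    """Geometric mean of a sequence of positive numbers."""
    return prod(values) ** (1.0 / len(values))


# One ratio per (ISL, OSL, concurrency) configuration: TGI metric divided by
# vLLM metric, so values above 1.0 favor vLLM for latency-style metrics.
ttft_ratios = [1.10, 0.98, 1.25, 1.05]  # placeholder ratios
tpot_ratios = [1.30, 1.22, 1.18, 1.40]  # placeholder ratios

print(f"Geomean TTFT speedup: {geomean(ttft_ratios):.2f}x")
print(f"Geomean TPOT speedup: {geomean(tpot_ratios):.2f}x")
```
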
wangkl2 added a commit to wangkl2/GenAIExamples that referenced this issue Jan 16, 2025
Switch from TGI to vLLM as the default LLM serving backend on Gaudi for the ChatQnA example to improve performance. Benchmarking the LLM component on a Gaudi2 server with vLLM and TGI backends, across different input/output sequence lengths (ISL/OSL) and various numbers of queries and concurrency levels, shows that the geomean of measured LLM serving performance on a 7B model improves with vLLM over TGI on several metrics, including average total latency, average TPOT, and throughput, while the geomean of average TTFT does not increase significantly. TGI remains available as a deployment option for LLM serving. In addition, vLLM replaces TGI in the other provided end-to-end ChatQnA pipelines, including the without-rerank pipeline and the megaservice with guardrails.

Implement opea-project#1213

Signed-off-by: Wang, Kai Lawrence <[email protected]>
chensuyue pushed a commit that referenced this issue Jan 17, 2025
Switch from TGI to vLLM as the default LLM serving backend on Gaudi for the ChatQnA example to improve performance.

#1213
Signed-off-by: Wang, Kai Lawrence <[email protected]>
chensuyue pushed a commit that referenced this issue Jan 17, 2025
Switch from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to improve performance.

#1213
Signed-off-by: Wang, Kai Lawrence <[email protected]>