[Feature] Enhance Gaudi Performance by Adopting vLLM as Default Serving framework (ChatQnA) #1213
Comments
joshuayao changed the title from "[Feature] Enhance Gaudi Performance by Adopting vLLM as Default Serving framework" to "[Feature] Enhance Gaudi Performance by Adopting vLLM as Default Serving framework (ChatQnA)" on Jan 7, 2025.
@wangkl2, could we mark it as Done?
@joshuayao Still conducting the performance benchmark comparison and tests. After that, I will create the PRs and link them here.
wangkl2 added a commit to wangkl2/GenAIExamples that referenced this issue on Jan 15, 2025:
Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance performance. Benchmarking the LLM component on a Xeon server with both vLLM and TGI backends, across different ISL/OSL combinations and various numbers of queries and concurrency levels, the geomean of measured LLM serving performance on a 7B model shows vLLM improving over TGI on several metrics, including average total latency, average TTFT, average TPOT, and throughput. TGI is still offered as a deployment option for LLM serving. In addition, vLLM also replaces TGI for the other provided end-to-end ChatQnA pipelines, including the without-rerank pipeline and the pipelines with Pinecone or Qdrant as the vector DB. Implement opea-project#1213 Signed-off-by: Wang, Kai Lawrence <[email protected]>
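The commit summarizes the vLLM-vs-TGI comparison as a geometric mean across ISL/OSL and concurrency configurations. Below is a minimal illustrative sketch (not the project's actual benchmark harness) of how such a geomean summary could be computed; the per-configuration ratio values are placeholders, not measured numbers.

```python
# Minimal sketch: summarize vLLM-vs-TGI results as a geometric mean across
# test configurations. The ratio values below are placeholders, not data
# from this issue.
from math import exp, log


def geomean(values):
    """Geometric mean of positive values, e.g. per-config latency ratios."""
    return exp(sum(log(v) for v in values) / len(values))


# Hypothetical per-configuration ratios (vLLM metric / TGI metric) for one
# metric such as average TTFT; each entry is one ISL/OSL/concurrency combo.
ratios = [0.85, 0.92, 0.78, 0.88]
print(f"geomean vLLM/TGI ratio: {geomean(ratios):.3f}")  # < 1.0 means vLLM is faster
```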
wangkl2 added a commit to wangkl2/GenAIExamples that referenced this issue on Jan 16, 2025:
Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance performance. Benchmarking the LLM component on a Gaudi2 server with both vLLM and TGI backends, across different ISL/OSL combinations and various numbers of queries and concurrency levels, the geomean of measured LLM serving performance on a 7B model shows vLLM improving over TGI on several metrics, including average total latency, average TPOT, and throughput, while the geomean of average TTFT does not increase significantly. TGI is still offered as a deployment option for LLM serving. In addition, vLLM also replaces TGI for the other provided end-to-end ChatQnA pipelines, including the without-rerank pipeline and the megaservice with guardrails. Implement opea-project#1213 Signed-off-by: Wang, Kai Lawrence <[email protected]>
wangkl2 added a commit to wangkl2/GenAIExamples that referenced this issue on Jan 16, 2025:
Switching from TGI to vLLM as the default LLM serving backend on Gaudi for the ChatQnA example to enhance performance. Benchmarking the LLM component on a Gaudi2 server with both vLLM and TGI backends, across different ISL/OSL combinations and various numbers of queries and concurrency levels, the geomean of measured LLM serving performance on a 7B model shows vLLM improving over TGI on several metrics, including average total latency, average TPOT, and throughput, while the geomean of average TTFT does not increase significantly. TGI is still offered as a deployment option for LLM serving. In addition, vLLM also replaces TGI for the other provided end-to-end ChatQnA pipelines, including the without-rerank pipeline and the megaservice with guardrails. Implement opea-project#1213 Signed-off-by: Wang, Kai Lawrence <[email protected]>
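With vLLM as the default backend, the ChatQnA LLM component talks to a vLLM serving container, which exposes an OpenAI-compatible API. The sketch below shows a direct request to that endpoint for sanity-checking a deployment; the host, port, and model name are assumptions and must match your own setup rather than anything specified in this issue.

```python
# Minimal sketch: query the vLLM serving container directly through its
# OpenAI-compatible chat completions API. Host, port, and model id are
# placeholders for illustration only.
import requests

VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed port
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    "messages": [{"role": "user", "content": "What is the OPEA ChatQnA example?"}],
    "max_tokens": 128,
}

resp = requests.post(VLLM_ENDPOINT, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```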
chensuyue pushed a commit that referenced this issue on Jan 17, 2025:
Switching from TGI to vLLM as the default LLM serving backend on Gaudi for the ChatQnA example to enhance performance. #1213 Signed-off-by: Wang, Kai Lawrence <[email protected]>
chensuyue pushed a commit that referenced this issue on Jan 17, 2025:
Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance performance. #1213 Signed-off-by: Wang, Kai Lawrence <[email protected]>
Priority: P1-Stopper
OS type: Ubuntu
Hardware type: Gaudi2
Running nodes: Single Node
Description
Feature Objective:
Set vLLM as the default serving framework on Gaudi to leverage its optimized performance characteristics, thereby improving throughput and reducing latency in inference tasks.
Feature Details:
Replace TGI with vLLM as the default serving backend for inference on Gaudi devices.
Update serving configurations to align with vLLM's architecture for inference.
Perform performance benchmarking to validate vLLM's advantage in TTFT, TPOT, and scalability on Gaudi hardware (a measurement sketch follows below).
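As a rough illustration of how TTFT and TPOT can be measured against either backend, the sketch below streams tokens from an OpenAI-compatible completions endpoint and times the first token and the per-token cadence. The endpoint URL and model name are placeholders, this is not the benchmark tool behind the numbers reported in the commits above, and per-chunk token counting is only an approximation.

```python
# Minimal sketch: measure TTFT and TPOT against an OpenAI-compatible
# streaming endpoint (e.g. a vLLM or TGI deployment). Endpoint and model
# id are placeholders.
import json
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed port
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model id
    "prompt": "Explain retrieval-augmented generation in one paragraph.",
    "max_tokens": 256,
    "stream": True,
}

start = time.perf_counter()
first_token_time = None
token_count = 0
with requests.post(ENDPOINT, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events arrive as lines prefixed with "data: ".
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        if chunk["choices"][0].get("text"):
            token_count += 1  # approximation: one streamed chunk ~ one token
            if first_token_time is None:
                first_token_time = time.perf_counter()
end = time.perf_counter()

if first_token_time is None:
    raise RuntimeError("no tokens received from the endpoint")

ttft = first_token_time - start
tpot = (end - first_token_time) / max(token_count - 1, 1)
print(f"TTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms/token")
```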
Expected Outcome:
Adopting vLLM as the default framework is expected to improve the user experience by significantly lowering latency while exceeding current TGI throughput levels on Gaudi.