Set max concurrent requests #2961

Open · wants to merge 5 commits into main

Conversation

AllentDan (Collaborator)

No description provided.

@AllentDan (Collaborator, Author)

With this setting:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     8405      
Benchmark duration (s):                  160.24    
Total input tokens:                      1950646   
Total generated tokens:                  1677697   
Total generated tokens (retokenized):    1677977   
Request throughput (req/s):              52.45     
Input token throughput (tok/s):          12173.05  
Output token throughput (tok/s):         10469.70  
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   83107.15  
Median E2E Latency (ms):                 84316.44  
---------------Time to First Token----------------
Mean TTFT (ms):                          77769.83  
Median TTFT (ms):                        79269.43  
P99 TTFT (ms):                           149355.97 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          182.14    
Median TPOT (ms):                        24.02     
P99 TPOT (ms):                           4690.68   
---------------Inter-token Latency----------------
Mean ITL (ms):                           325.74    
Median ITL (ms):                         154.95    
P99 ITL (ms):                            2165.01   
==================================================

Without this setting:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     8005      
Benchmark duration (s):                  181.26    
Total input tokens:                      1854522   
Total generated tokens:                  1601078   
Total generated tokens (retokenized):    1591855   
Request throughput (req/s):              44.16     
Input token throughput (tok/s):          10231.09  
Output token throughput (tok/s):         8832.88   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   95447.55  
Median E2E Latency (ms):                 99336.90  
---------------Time to First Token----------------
Mean TTFT (ms):                          90024.57  
Median TTFT (ms):                        89196.52  
P99 TTFT (ms):                           174165.09 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          333.14    
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           11715.11  
---------------Inter-token Latency----------------
Mean ITL (ms):                           539.17    
Median ITL (ms):                         0.01      
P99 ITL (ms):                            3571.25   
==================================================

AllentDan closed this Dec 26, 2024
AllentDan reopened this Jan 9, 2025
@@ -1035,6 +1049,12 @@ def serve(model_path: str,
        proxy_url (str): The proxy url to register the api_server.
        max_log_len (int): Max number of prompt characters or prompt tokens
            being printed in log. Default: Unlimited
        concurrency_pressure: This refers to the ratio between the maximum
Collaborator

Is this default value based on empirical experience?

Collaborator Author

Yes, just an empirical number.

        super().__init__(app)
        self.semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def dispatch(self, request: Request, call_next):
Collaborator

How does it work? Any reference?

Collaborator Author

It is used in `BaseHTTPMiddleware.__call__`.
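
(For context, a minimal sketch of the pattern under discussion: `BaseHTTPMiddleware.__call__` awaits `dispatch()` once per request, so wrapping `call_next` in a semaphore caps how many requests are processed concurrently. Class and parameter names below are illustrative, not necessarily the PR's exact code.)

```python
import asyncio

from fastapi import FastAPI
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request


class ConcurrencyLimitMiddleware(BaseHTTPMiddleware):
    """Queue requests once `max_concurrent_requests` are already in flight."""

    def __init__(self, app, max_concurrent_requests: int):
        super().__init__(app)
        self.semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def dispatch(self, request: Request, call_next):
        # BaseHTTPMiddleware.__call__ awaits dispatch() for every request,
        # so acquiring the semaphore here blocks any request beyond the cap
        # until an in-flight one finishes.
        async with self.semaphore:
            return await call_next(request)


app = FastAPI()
app.add_middleware(ConcurrencyLimitMiddleware, max_concurrent_requests=512)
```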

@@ -139,6 +139,16 @@ def add_parser_api_server():
        type=str,
        default=None,
        help='The proxy url for api server.')
    parser.add_argument(
        '--concurrency-pressure',
Collaborator

`--max-concurrency`, default `None`, type `int`. Let users decide whether to enable it and what value is appropriate.

Collaborator

cc @lzhangzz
any comments?

Collaborator Author

> `--max-concurrency`, default `None`, type `int`. Let users decide whether to enable it and what value is appropriate.

How about `--max-concurrent-requests`, as in TGI?

Collaborator

Sure.
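
(A hypothetical sketch of what the renamed option could look like; the default, help text, and surrounding parser here are illustrative, not the PR's actual code.)

```python
import argparse

parser = argparse.ArgumentParser(description='api_server options (sketch)')
parser.add_argument(
    '--max-concurrent-requests',
    type=int,
    default=None,
    help='Maximum number of requests handled at once; further requests wait '
    'until an in-flight one finishes. Default: None (no limit).')

# Example: parse a value; leaving the flag unset means no concurrency cap.
args = parser.parse_args(['--max-concurrent-requests', '512'])
print(args.max_concurrent_requests)  # -> 512
```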
