Set max concurrent requests #2961

Open · wants to merge 5 commits into main

Conversation

AllentDan (Collaborator)

No description provided.

@AllentDan (Collaborator, Author)

With this setting:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     8405      
Benchmark duration (s):                  160.24    
Total input tokens:                      1950646   
Total generated tokens:                  1677697   
Total generated tokens (retokenized):    1677977   
Request throughput (req/s):              52.45     
Input token throughput (tok/s):          12173.05  
Output token throughput (tok/s):         10469.70  
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   83107.15  
Median E2E Latency (ms):                 84316.44  
---------------Time to First Token----------------
Mean TTFT (ms):                          77769.83  
Median TTFT (ms):                        79269.43  
P99 TTFT (ms):                           149355.97 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          182.14    
Median TPOT (ms):                        24.02     
P99 TPOT (ms):                           4690.68   
---------------Inter-token Latency----------------
Mean ITL (ms):                           325.74    
Median ITL (ms):                         154.95    
P99 ITL (ms):                            2165.01   
==================================================

Without this setting:

============ Serving Benchmark Result ============
Backend:                                 lmdeploy  
Traffic request rate:                    inf       
Successful requests:                     8005      
Benchmark duration (s):                  181.26    
Total input tokens:                      1854522   
Total generated tokens:                  1601078   
Total generated tokens (retokenized):    1591855   
Request throughput (req/s):              44.16     
Input token throughput (tok/s):          10231.09  
Output token throughput (tok/s):         8832.88   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   95447.55  
Median E2E Latency (ms):                 99336.90  
---------------Time to First Token----------------
Mean TTFT (ms):                          90024.57  
Median TTFT (ms):                        89196.52  
P99 TTFT (ms):                           174165.09 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          333.14    
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           11715.11  
---------------Inter-token Latency----------------
Mean ITL (ms):                           539.17    
Median ITL (ms):                         0.01      
P99 ITL (ms):                            3571.25   
==================================================

AllentDan closed this Dec 26, 2024
AllentDan reopened this Jan 9, 2025
@@ -1035,6 +1049,12 @@ def serve(model_path: str,
        proxy_url (str): The proxy url to register the api_server.
        max_log_len (int): Max number of prompt characters or prompt tokens
            being printed in log. Default: Unlimited
        concurrency_pressure: This refers to the ratio between the maximum
Collaborator

Is this default value based on empirical experience?

Collaborator Author

Yes, just an empirical number.

        super().__init__(app)
        self.semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def dispatch(self, request: Request, call_next):
Collaborator

How does it work? Any reference?

Collaborator Author

It is used in `BaseHTTPMiddleware.__call__`.
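
(For context, a minimal sketch of the pattern under discussion: `BaseHTTPMiddleware.__call__` awaits `dispatch()` once per request, so wrapping `call_next` in a semaphore caps how many requests are processed concurrently. Class and parameter names below are illustrative, not necessarily the PR's exact code.)

```python
import asyncio

from fastapi import FastAPI
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request


class ConcurrencyLimitMiddleware(BaseHTTPMiddleware):
    """Queue requests once `max_concurrent_requests` are already in flight."""

    def __init__(self, app, max_concurrent_requests: int):
        super().__init__(app)
        self.semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def dispatch(self, request: Request, call_next):
        # BaseHTTPMiddleware.__call__ awaits dispatch() for every request,
        # so acquiring the semaphore here blocks any request beyond the cap
        # until an in-flight one finishes.
        async with self.semaphore:
            return await call_next(request)


app = FastAPI()
app.add_middleware(ConcurrencyLimitMiddleware, max_concurrent_requests=512)
```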

@@ -139,6 +139,16 @@ def add_parser_api_server():
        type=str,
        default=None,
        help='The proxy url for api server.')
    parser.add_argument(
        '--concurrency-pressure',
Collaborator

`--max-concurrency`, default `None`, type `int`. Let users decide whether to enable it and what value is appropriate.

Collaborator

cc @lzhangzz
any comments?

Collaborator Author

> `--max-concurrency`, default `None`, type `int`. Let users decide whether to enable it and what value is appropriate.

How about `--max-concurrent-requests`, as in TGI?

Collaborator

Sure.
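
(A hypothetical sketch of what the renamed option could look like; the default, help text, and surrounding parser here are illustrative, not the PR's actual code.)

```python
import argparse

parser = argparse.ArgumentParser(description='api_server options (sketch)')
parser.add_argument(
    '--max-concurrent-requests',
    type=int,
    default=None,
    help='Maximum number of requests handled at once; further requests wait '
    'until an in-flight one finishes. Default: None (no limit).')

# Example: parse a value; leaving the flag unset means no concurrency cap.
args = parser.parse_args(['--max-concurrent-requests', '512'])
print(args.max_concurrent_requests)  # -> 512
```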
