Cancel hung requests #93
Conversation
Looks good
@@ -318,6 +324,25 @@ def _adjust_batch(self):
            self.cache_manager.allocate(state.request_id, num_tokens)
            self.current_batch[state.request_id] = state

        if not self.current_batch:
Do we understand why this would occur? My concern is that the container could get into a state where it continuously rejects all requests.
I'd say we don't. If we knew, we would eliminate that possibility. We tried our best in #89, but we can never be 100% sure that a hang will never occur.
Force-pushed from c7e4c60 to c226fc4
Thank you @masahi and @elvin-n for tackling this very important and tricky issue.

I guess the main problem we are seeing is engine idling when dealing with a long-sequence request: we kick it out of the batch when the current context length of a sequence (prompt + generated tokens) exceeds `max_num_batched_tokens`, and then we repeatedly try to put it back into the batch but fail.

To prevent this engine idling, I'm wondering if we can make the following statement true: all requests that the engine cannot handle should be canceled properly. To achieve this, we need the following sub-statements.

S1. If a request has no chance of being processed at all (e.g., prompt length > `max_num_batched_tokens`), it should not enter the queue in the first place, and we respond to the user immediately saying it is canceled due to an engine limit.

S2. If a request turns out to exceed `max_num_batched_tokens` during generation, it should be canceled immediately.

I think we already have S1 but not S2 (a rough sketch of both checks follows). Wondering what you guys think.
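For concreteness, here is a minimal sketch of the two checks; `RequestState`, its field names, and the limit parameter are assumptions for illustration, not the engine's actual API:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestState:
    """Assumed shape of a request's state; token_ids holds prompt + generated tokens."""
    request_id: int
    prompt_token_ids: List[int]
    token_ids: List[int] = field(default_factory=list)


def should_admit(state: RequestState, max_num_batched_tokens: int) -> bool:
    # S1: a prompt longer than max_num_batched_tokens can never be scheduled,
    # so reject it at submission time instead of queueing it forever.
    return len(state.prompt_token_ids) <= max_num_batched_tokens


def should_cancel_mid_generation(state: RequestState, max_num_batched_tokens: int) -> bool:
    # S2: the context (prompt + generated tokens) has outgrown what one step
    # can batch, so cancel it instead of letting the engine idle on retries.
    return len(state.token_ids) > max_num_batched_tokens
```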
    len(self.current_batch) == 0
    and num_new_batched_tokens > self.max_num_batched_tokens
):
    state.token_ids = state.token_ids[: self.max_num_batched_tokens]
If we have already reached `max_num_batched_tokens`, is there any chance for further generation? I'm wondering if we should just return the response immediately.
We need to define what `max_num_batched_tokens` stands for. Currently it limits the maximum number of tokens to be processed simultaneously. If we figure out how to handle this, we can decode up to the context length. I do not see a reason to cancel just because we reached the `max_num_batched_tokens` limit.
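To make that per-step meaning concrete, a rough sketch of the token accounting (names assumed, not the engine's actual code): prefill contributes a whole prompt chunk, while decode contributes one token per sequence per step, so the limit bounds how much work one step does rather than how long a context may eventually grow.

```python
from typing import List


def tokens_added_this_step(context_len: int, num_processed: int, is_prefill: bool) -> int:
    # Prefill: the whole unprocessed part of the prompt enters the batch.
    # Decode: each running sequence adds exactly one new token per step.
    return context_len - num_processed if is_prefill else 1


def step_fits(per_sequence_new_tokens: List[int], max_num_batched_tokens: int) -> bool:
    # A step is schedulable while the total new tokens stay under the limit,
    # which decode steps rarely approach.
    return sum(per_sequence_new_tokens) <= max_num_batched_tokens
```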
Agreed with @elvin-n. After we recover from cache eviction, discard the most recent cache entries, and refill the cache for the first `max_num_batched_tokens` tokens, decode can proceed as long as there is free space in the cache.
But I just wondered about the following case: if there is only one sequence in the batch and there are no entries in the cache for other previous sequences, what happens in the generation stage if we max out the free space before that single sequence finishes? We cannot evict, since we would end up in an infinite loop of evict -> generate -> evict ...
> We need to define what `max_num_batched_tokens` stands for. Currently it limits the maximum number of tokens to be processed simultaneously.

Yes, exactly. So if the context length (i.e., `len(state.token_ids)`) of the request reaches `max_num_batched_tokens`, then in the next engine step I guess the engine wouldn't select the sequence to put into the batch, since it has already reached the batched-token limit? So I suspect it will just sit in the queue.

> But I just wondered about the following case: if there is only one sequence in the batch and there are no entries in the cache for other previous sequences, what happens in the generation stage if we max out the free space before that single sequence finishes? We cannot evict, since we would end up in an infinite loop of evict -> generate -> evict ...

Yes, great point. To prevent this kind of case, it seems like we should also consider free cache space in S2 above (see the sketch below).
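A possible way to fold free cache space into S2, again only a sketch with an assumed `total_cache_slots` capacity (measured in token slots) rather than the real cache-manager API:

```python
def should_cancel_mid_generation(context_len: int,
                                 max_num_batched_tokens: int,
                                 total_cache_slots: int) -> bool:
    # Cancel when the context exceeds the per-step batching limit, or when the
    # sequence alone can no longer fit in the KV cache; the latter is the
    # evict -> generate -> evict loop described above.
    exceeds_batch_limit = context_len > max_num_batched_tokens
    cannot_fit_in_cache = context_len >= total_cache_slots
    return exceeds_batch_limit or cannot_fit_in_cache
```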
Force-pushed from 12adafc to 452c3ef
Force-pushed from abca5ce to 0d86159
A follow-up to #91. Instead of merely emitting a warning upon hang detection, we should just cancel the hung requests.
I confirmed that the detection works on the test case added in #89 when the engine changes from that PR are removed (thus manually causing a hang). Please merge #89 first.
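As a rough illustration of the idea (the detection itself lives in #91; the timeout value, the progress bookkeeping, and the names below are assumptions, not this PR's actual code):

```python
import time
from typing import Dict, List, Optional

HANG_TIMEOUT_SECONDS = 60.0  # assumed threshold


def find_hung_requests(last_progress_time: Dict[int, float],
                       now: Optional[float] = None) -> List[int]:
    # last_progress_time maps request_id -> timestamp of the last generated token.
    # Any request with no progress within the timeout is treated as hung and
    # gets cancelled instead of merely triggering a warning.
    now = time.time() if now is None else now
    return [request_id for request_id, t in last_progress_time.items()
            if now - t > HANG_TIMEOUT_SECONDS]
```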
I didn't update `SyncEngine` for now, since it has different cancellation logic than the staging engine, and I think the additional complexity is not worth it given its limited use (mostly for debugging).

@jroesch @elvin-n @sunggg