
Cancel hung requests #93

Closed
wants to merge 1 commit

Conversation

@masahi (Member) commented Nov 30, 2023

A follow-up to #91. Instead of merely emitting a warning upon hang detection, we should just cancel hung requests.

I confirmed that the detection works on the test case added in #89 when the engine changes from that PR are removed (thus manually causing a hang). Please merge #89 first.

I didn't update SyncEngine for now, since it has a different cancellation logic from the staging engine, and I think the additional complexity is not worth it given its limited use (mostly for debugging).

@jroesch @elvin-n @sunggg
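
For context, here is a rough sketch of the behavior this PR aims for, based on the visible diff. The class and helper names below are hypothetical and do not match the actual staging engine worker in serve/mlc_serve/engine/staging_engine_worker.py.

```python
from collections import deque

# Hypothetical sketch of "cancel instead of warn" on hang detection;
# illustrative names only, not the real engine classes.
class WorkerSketch:
    def __init__(self, cancel_request):
        self.queue = deque()        # pending request states
        self.current_batch = {}     # request_id -> request state
        self.cancel_request = cancel_request

    def _adjust_batch(self):
        # Try to move requests from self.queue into self.current_batch,
        # subject to max_num_batched_tokens and KV-cache free space.
        ...

    def step(self):
        self._adjust_batch()
        if not self.current_batch and self.queue:
            # Nothing could be scheduled even though requests are waiting.
            # #91 only emitted a warning here; this PR cancels the hung
            # requests so clients get a terminated response instead of
            # waiting forever.
            while self.queue:
                state = self.queue.popleft()
                self.cancel_request(state.request_id)
```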

@elvin-n left a comment

Looks good

serve/mlc_serve/engine/staging_engine_worker.py
@@ -318,6 +324,25 @@ def _adjust_batch(self):
            self.cache_manager.allocate(state.request_id, num_tokens)
            self.current_batch[state.request_id] = state

        if not self.current_batch:
@7andrew7 commented Nov 30, 2023


Do we understand why this would occur? My concern is that the container could get into a state where it continuously rejects all requests.

@masahi (Member, Author) commented Nov 30, 2023


I'd say we don't. If we knew, we would eliminate that possibility. We tried our best in #89, but we can never be 100% sure that a hang will never occur.

@masahi force-pushed the cancel-hung-request branch from c7e4c60 to c226fc4 on November 30, 2023 19:11
@masahi marked this pull request as ready for review on November 30, 2023 20:11
@sunggg (Member) left a comment


Thank you @masahi and @elvin-n for tackling this very important and tricky issue.

I guess the main problem we are seeing is the engine idling when dealing with a long-sequence request: we kick it out of the batch when the current context length of the sequence (prompt + generated tokens) is above max_num_batched_tokens, and then we repeatedly try to put it back into the batch but fail.

To prevent this engine idling, I'm wondering if we can make the following statement true:
All requests that the engine cannot handle should be canceled properly. To achieve this, we need the following sub-statements.
S1. If a request has no chance of being processed at all (e.g., prompt len > max_num_batched_tokens), it should not enter the queue in the first place, and we respond to users immediately saying it is canceled due to an engine limit.
S2. If a request turns out to exceed max_num_batched_tokens during generation, it should be canceled immediately.

I think we already have S1 but not S2. Wonder what you guys think.
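
A rough sketch of the S1/S2 checks proposed above (all names and values here are hypothetical, not the actual mlc_serve interfaces; note that later comments in this thread argue that decode can often continue instead of cancelling for S2):

```python
# Hypothetical sketch of S1/S2 as proposed; not the real engine code.
MAX_NUM_BATCHED_TOKENS = 4096  # made-up value for illustration

def try_admit(request, respond) -> bool:
    # S1: reject before queueing if the prompt alone can never be batched.
    if len(request.prompt_token_ids) > MAX_NUM_BATCHED_TOKENS:
        respond(request.request_id,
                error="prompt exceeds the engine's max_num_batched_tokens limit")
        return False
    return True

def maybe_cancel_during_generation(state, cancel) -> bool:
    # S2: cancel if prompt + generated tokens exceed the limit mid-generation.
    if len(state.token_ids) > MAX_NUM_BATCHED_TOKENS:
        cancel(state.request_id)
        return True
    return False
```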

serve/mlc_serve/engine/staging_engine_worker.py
if (
    len(self.current_batch) == 0
    and num_new_batched_tokens > self.max_num_batched_tokens
):
    state.token_ids = state.token_ids[: self.max_num_batched_tokens]
A member commented:

If we have already reached max_num_batched_tokens, is there any chance for further generation? I'm wondering if we should just return the response immediately.

Reply:

We need to define what max_num_batched_tokens stands for. Currently it limits the maximum number of tokens processed simultaneously. If we figure out how to handle this, we can decode up to the context length. I do not see a reason to cancel, at least not just because we reached the max_num_batched_tokens limit.
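
To make that distinction concrete, here is a hypothetical illustration (the numbers and the max_context_len name are made up, not engine config): max_num_batched_tokens bounds how many tokens one engine step processes across the whole batch, which is separate from how long any single sequence's context may grow.

```python
# Hypothetical numbers illustrating the two different limits under discussion.
max_num_batched_tokens = 4096   # tokens processed together in one engine step
max_context_len = 16384         # assumed per-sequence context limit (made up)

# One prefill of a 3000-token prompt plus decode steps for 8 running sequences
# consumes 3000 + 8 = 3008 batched tokens in that step, well under the per-step
# limit, even though a single sequence may keep growing toward max_context_len.
step_tokens = 3000 + 8
assert step_tokens <= max_num_batched_tokens

# A sequence with a 5000-token context exceeds max_num_batched_tokens but not
# max_context_len; per this comment, that alone is no reason to cancel it.
context_len = 5000
assert context_len > max_num_batched_tokens and context_len <= max_context_len
```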

@masahi (Member, Author) commented Dec 4, 2023


Agreed with @elvin-n. After we recover from cache eviction, i.e., discard the most recent cache entries and refill the cache with the first max_num_batched_tokens tokens, decode can proceed as long as there is free space in the cache.
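
As a hypothetical worked example of this recovery path (the numbers are made up): with max_num_batched_tokens = 4096 and a sequence whose context is 4500 tokens at restore time, only the first 4096 tokens are prefilled in one batch, and the rest of the context is then rebuilt by ordinary decode steps while the KV cache has free space.

```python
# Made-up numbers illustrating the restore-after-eviction path described above.
max_num_batched_tokens = 4096

token_ids = list(range(4500))               # prompt + generated tokens so far

# Mirrors the truncation in the diff: refill the cache with the first
# max_num_batched_tokens tokens in a single prefill batch...
prefill_chunk = token_ids[:max_num_batched_tokens]

# ...and let decode rebuild the rest one token per step, as long as there is
# free space in the cache.
remaining_decode_steps = len(token_ids) - len(prefill_chunk)
print(len(prefill_chunk), remaining_decode_steps)   # 4096 404
```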

@masahi (Member, Author) commented:

But I just wondered about the following case: if there is only one sequence in the batch and there are no entries in the cache from other, previous sequences, what happens in the generation stage if we max out the free space before that single sequence finishes? We cannot evict, since we would end up in an infinite loop of evict -> generate -> evict ...

A member replied:

> We need to define what max_num_batched_tokens stands for. Currently it limits the maximum number of tokens processed simultaneously.

Yes, exactly. So if the context length (i.e., len(state.token_ids)) of the request reaches max_num_batched_tokens, then in the next engine step I guess the engine wouldn't select the sequence to put into the batch, since it has already reached the batched-token limit? So I suspect it will just sit in the queue.

> But I just wondered about the following case: if there is only one sequence in the batch and there are no entries in the cache from other, previous sequences, what happens in the generation stage if we max out the free space before that single sequence finishes? We cannot evict, since we would end up in an infinite loop of evict -> generate -> evict ...

Yes, great point. To prevent this kind of case, it seems like we should also consider free space in S2 above.
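
A hypothetical sketch of what "also consider free space" could look like (illustrative names only, not the actual cache manager API): if even an otherwise empty cache cannot hold the lone sequence to completion, cancel it up front instead of entering the evict -> generate -> evict loop.

```python
# Illustrative only; not the real mlc_serve cache manager interface.
def can_ever_finish(state, total_cache_tokens: int, max_new_tokens: int) -> bool:
    # Worst-case context the lone sequence may need to keep in the KV cache.
    worst_case_context = len(state.prompt_token_ids) + max_new_tokens
    return worst_case_context <= total_cache_tokens

def maybe_cancel_lone_sequence(state, total_cache_tokens, max_new_tokens, cancel):
    # If the sequence can never fit even with the whole cache to itself,
    # evicting it would only repeat forever, so cancel the request instead.
    if not can_ever_finish(state, total_cache_tokens, max_new_tokens):
        cancel(state.request_id)
```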

@masahi force-pushed the cancel-hung-request branch from abca5ce to 0d86159 on December 6, 2023 10:28
@masahi closed this Dec 13, 2023
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Jan 30, 2024