Cancel hung requests #93
Conversation
Looks good
@@ -318,6 +324,25 @@ def _adjust_batch(self):
            self.cache_manager.allocate(state.request_id, num_tokens)
            self.current_batch[state.request_id] = state

        if not self.current_batch:
Do we understand why this would occur? My concern is that the container could get into a state where it continuously rejects all requests.
I'd say we don't. If we knew, we would eliminate that possibility. We tried our best in #89, but we can never be 100% sure that a hang will never occur.
Force-pushed from c7e4c60 to c226fc4
Thank you @masahi and @elvin-n for tackling this very important and tricky issue.

I guess the main problem we are seeing is engine idling when dealing with a long-sequence request: we kick it out of the batch when the current context length of a sequence (prompt + generated tokens) exceeds `max_num_batched_tokens`, and then we repeatedly try to put it back into the batch but fail.

To prevent this engine idling, I'm wondering if we can make the following statement true: all requests that the engine cannot handle should be canceled properly. To achieve this, we need the following sub-statements.

S1. If a request has no chance of being processed at all (e.g., prompt length > `max_num_batched_tokens`), it should not enter the queue in the first place, and we respond to the user immediately saying it is canceled due to an engine limit.

S2. If a request turns out to exceed `max_num_batched_tokens` during generation, it should be canceled immediately.

I think we already have S1 but not S2 (a rough sketch of both checks follows). Wondering what you guys think.
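For concreteness, here is a minimal sketch of the two checks; `RequestState`, its field names, and the limit parameter are assumptions for illustration, not the engine's actual API:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestState:
    """Assumed shape of a request's state; token_ids holds prompt + generated tokens."""
    request_id: int
    prompt_token_ids: List[int]
    token_ids: List[int] = field(default_factory=list)


def should_admit(state: RequestState, max_num_batched_tokens: int) -> bool:
    # S1: a prompt longer than max_num_batched_tokens can never be scheduled,
    # so reject it at submission time instead of queueing it forever.
    return len(state.prompt_token_ids) <= max_num_batched_tokens


def should_cancel_mid_generation(state: RequestState, max_num_batched_tokens: int) -> bool:
    # S2: the context (prompt + generated tokens) has outgrown what one step
    # can batch, so cancel it instead of letting the engine idle on retries.
    return len(state.token_ids) > max_num_batched_tokens
```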
    len(self.current_batch) == 0
    and num_new_batched_tokens > self.max_num_batched_tokens
):
    state.token_ids = state.token_ids[: self.max_num_batched_tokens]
If we have already reached `max_num_batched_tokens`, is there any chance for further generation? I'm wondering if we should just return the response immediately.
We need to define what `max_num_batched_tokens` stands for. Currently it limits the maximum number of tokens to be processed simultaneously. If we figure out how to handle this, we can decode up to the context length. I do not see a reason to cancel just because we reached the `max_num_batched_tokens` limit.
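To make that per-step meaning concrete, a rough sketch of the token accounting (names assumed, not the engine's actual code): prefill contributes a whole prompt chunk, while decode contributes one token per sequence per step, so the limit bounds how much work one step does rather than how long a context may eventually grow.

```python
from typing import List


def tokens_added_this_step(context_len: int, num_processed: int, is_prefill: bool) -> int:
    # Prefill: the whole unprocessed part of the prompt enters the batch.
    # Decode: each running sequence adds exactly one new token per step.
    return context_len - num_processed if is_prefill else 1


def step_fits(per_sequence_new_tokens: List[int], max_num_batched_tokens: int) -> bool:
    # A step is schedulable while the total new tokens stay under the limit,
    # which decode steps rarely approach.
    return sum(per_sequence_new_tokens) <= max_num_batched_tokens
```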
Agreed with @elvin-n. After we recover from cache eviction, discard the most recent cache entries, and refill the cache for the first `max_num_batched_tokens` tokens, decode can proceed as long as there is free space in the cache.
But I just wondered about the following case: if there is only one sequence in the batch and there are no entries in the cache for other previous sequences, what happens in the generation stage if we max out the free space before that single sequence finishes? We cannot evict, since we would end up in an infinite loop of evict -> generate -> evict ...
> We need to define what `max_num_batched_tokens` stands for. Currently it limits the maximum number of tokens to be processed simultaneously.

Yes, exactly. So if the context length (i.e., `len(state.token_ids)`) of the request reaches `max_num_batched_tokens`, then in the next engine step I guess the engine wouldn't select the sequence to put into the batch, since it has already reached the batched-token limit? So I suspect it will just sit in the queue.

> But I just wondered about the following case: if there is only one sequence in the batch and there are no entries in the cache for other previous sequences, what happens in the generation stage if we max out the free space before that single sequence finishes? We cannot evict, since we would end up in an infinite loop of evict -> generate -> evict ...

Yes, great point. To prevent this kind of case, it seems like we should also consider free cache space in S2 above (see the sketch below).
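A possible way to fold free cache space into S2, again only a sketch with an assumed `total_cache_slots` capacity (measured in token slots) rather than the real cache-manager API:

```python
def should_cancel_mid_generation(context_len: int,
                                 max_num_batched_tokens: int,
                                 total_cache_slots: int) -> bool:
    # Cancel when the context exceeds the per-step batching limit, or when the
    # sequence alone can no longer fit in the KV cache; the latter is the
    # evict -> generate -> evict loop described above.
    exceeds_batch_limit = context_len > max_num_batched_tokens
    cannot_fit_in_cache = context_len >= total_cache_slots
    return exceeds_batch_limit or cannot_fit_in_cache
```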
Force-pushed from 12adafc to 452c3ef
Force-pushed from abca5ce to 0d86159
A follow-up to #91. Instead of merely emitting a warning upon hang detection, we should just cancel the hung requests.
I confirmed that the detection works on the test case added in #89 when the engine changes from that PR are removed (thus manually causing a hang). Please merge #89 first.
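As a rough illustration of the idea (the detection itself lives in #91; the timeout value, the progress bookkeeping, and the names below are assumptions, not this PR's actual code):

```python
import time
from typing import Dict, List, Optional

HANG_TIMEOUT_SECONDS = 60.0  # assumed threshold


def find_hung_requests(last_progress_time: Dict[int, float],
                       now: Optional[float] = None) -> List[int]:
    # last_progress_time maps request_id -> timestamp of the last generated token.
    # Any request with no progress within the timeout is treated as hung and
    # gets cancelled instead of merely triggering a warning.
    now = time.time() if now is None else now
    return [request_id for request_id, t in last_progress_time.items()
            if now - t > HANG_TIMEOUT_SECONDS]
```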
I didn't update `SyncEngine` for now, since it has different cancellation logic than the staging engine, and I think the additional complexity is not worth it given its limited use (mostly for debugging).

@jroesch @elvin-n @sunggg