Fix sync engine finish reason output #122
Conversation
    prompt_len: int,
    max_context_length: int,
    max_tokens: Optional[int],
) -> bool:
    # TODO: currently, we simply return true for both stopping reasons.
    # in the future, we can differentiate these two.
    # this include prompt tokens and gen tokens so far
I wrote this comment. I wondered if it would be useful to return more information about whether generation stopped because of the max context length or max_tokens. It is more like a NOTE than a TODO, but now that I think about it, we can remove it.
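For illustration, a minimal sketch of what differentiating the two stopping reasons could look like; LengthStopReason and check_length_stop are hypothetical names for this sketch, not the actual engine API:

```python
from enum import Enum
from typing import Optional


class LengthStopReason(Enum):
    # Hypothetical enum; the real check currently returns a plain bool.
    NOT_STOPPED = 0
    MAX_CONTEXT_LENGTH = 1  # prompt + generated tokens filled the context window
    MAX_TOKENS = 2          # generated tokens reached the user-supplied max_tokens


def check_length_stop(
    prompt_len: int,
    num_generated_tokens: int,
    max_context_length: int,
    max_tokens: Optional[int],
) -> LengthStopReason:
    # Context tokens include prompt tokens and generated tokens so far,
    # matching the comment in the diff above.
    num_context_tokens = prompt_len + num_generated_tokens
    if num_context_tokens >= max_context_length:
        return LengthStopReason.MAX_CONTEXT_LENGTH
    if max_tokens is not None and num_generated_tokens >= max_tokens:
        return LengthStopReason.MAX_TOKENS
    return LengthStopReason.NOT_STOPPED
```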
if seq.is_finished:
    assert seq.finish_reason == FinishReason.Stop, f"{seq.finish_reason.name}"
    assert not seq.delta
@sunggg Due to the change in sync_engine.py, we are no longer sending an empty sequence after a request finishes. seq.is_finished becomes True when the last token is generated, so this test needed to be updated.
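A rough sketch of what the updated expectation amounts to (the output type here is a stand-in for illustration, not the real test harness):

```python
from enum import Enum
from typing import List, NamedTuple, Optional


class FinishReason(Enum):
    Stop = "stop"
    Length = "length"


class SeqOutput(NamedTuple):  # stand-in for the engine's per-sequence output
    delta: Optional[str]
    is_finished: bool
    finish_reason: Optional[FinishReason]


def check_final_output(outputs: List[SeqOutput]) -> None:
    # Old behavior: a trailing empty output followed the last token, so the
    # test asserted on that extra, delta-less sequence.
    # New behavior: the output carrying the last token is itself final.
    last = outputs[-1]
    assert last.is_finished
    assert last.finish_reason is not None
```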
# These tests are broken since we are now imposing no length limit
# if max_tokens = None. The tests do not finish in a reasonable time.
# _test_max_context_length(model_artifact_path, use_staging_engine=True)
# _test_max_context_length(model_artifact_path, use_staging_engine=False)
As of #105, if max_tokens is None, we impose no limit on the number of generated tokens, so we let generation continue even after max_context_len is reached.
This change in engine behavior was not intended in #105, but after discussion with @sunggg we agreed that this is the behavior we want going forward. The tests above don't set max_tokens, so they could potentially run forever; I disabled them.
However, this made me wonder if this "generation without length limit" is a safe thing to do - we might end up with a non-terminating request. Do we need to worry about such a case in practice, or can we rely on stop tokens for termination? @sunggg @elvin-n
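If it does become a problem, one possible safeguard would be to clamp the generation budget server-side. This is a sketch only; server_cap is a hypothetical operator-set ceiling, not an existing engine config:

```python
from typing import Optional


def effective_max_new_tokens(
    user_max_tokens: Optional[int],
    prompt_len: int,
    max_context_length: int,
    server_cap: Optional[int] = None,  # hypothetical operator-set ceiling
) -> int:
    # Remaining room in the context window. For sliding-window models this
    # bound does not really exist, which is exactly the tricky case noted
    # in the reply below.
    remaining = max_context_length - prompt_len
    candidates = [remaining]
    if user_max_tokens is not None:
        candidates.append(user_max_tokens)
    if server_cap is not None:
        candidates.append(server_cap)
    return min(candidates)
```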
Yep, I have a similar concern: https://github.com/octoml/mlc-llm/pull/122/files/63a3f19ff6d9e593d0394d5850b8a1db4b89de68#r1430465659
I would like to add something related to the engine capability, but the sliding window is a bit tricky since it basically allows no limit.
)
self.current_batch.pop(state.request_id)
self.cache_manager.free_request(state)
del self.num_sequences_per_requests[state.request_id]
Here I removed the empty responses that we have been sending after a request finishes. As I described in the PR description, it is incorrect to send the same finish_reason for all sequences.
I believe this is the right thing to do, but it is potentially a breaking change, so I request a careful review of this.
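To illustrate why a single shared finish_reason is wrong, a minimal sketch of the per-sequence bookkeeping this implies (class and field names are illustrative, not the actual mlc-serve types):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FinishReason(Enum):
    Stop = "stop"
    Length = "length"


@dataclass
class GenSeq:
    generated_token_ids: List[int] = field(default_factory=list)
    finish_reason: Optional[FinishReason] = None  # set when THIS sequence ends

    @property
    def is_finished(self) -> bool:
        return self.finish_reason is not None


def request_is_finished(seqs: List[GenSeq]) -> bool:
    # A request ends only when every sequence has its own finish_reason.
    # A sequence that hit a stop token earlier keeps Stop, while others may
    # later finish with Length - one shared reason for all would be wrong.
    return all(s.is_finished for s in seqs)
```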
Thank you for the note! Is there any need for different logic in the sync engine here? If not, can we further consolidate it with the staging engine and have a common function? Based on a quick look, they seem pretty similar:
for state in list(self.current_batch.values()):
The current code is similar but buggy, so it is not relevant.
The sync and staging engines need slightly different logic for the termination check and output creation, since for the latter the stopping-word and length checks are done in different places (the main vs the worker process).
So for example, if a request is terminated early due to stopping words, even after my upcoming PR the staging worker still needs to send an empty sequence with finish_reason = Stop, due to the delay in the stopping-word detection. The new code in the staging engine worker looks like this: https://github.com/masahi/mlc-llm/blob/ea2a931354fe78c79dbf688f2fd5a991438d16e5/serve/mlc_serve/engine/staging_engine_worker.py#L180-L190 (UPDATE: now finish_reason = Stop is also sent with the last valid delta from the staging engine as well.)
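For reference, the general shape of that worker-side logic, sketched with hypothetical types (the linked file is the authoritative version):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FinishReason(Enum):
    Stop = "stop"
    Length = "length"


@dataclass
class SequenceOutput:  # hypothetical stand-in for the worker's output type
    delta: str
    finish_reason: Optional[FinishReason]


def final_output_after_stop_detection(last_delta: str, stop_hit: bool) -> SequenceOutput:
    # The stop check runs one step behind generation in the worker, so when
    # a stop word is detected the worker must still emit one more output:
    # an empty delta carrying finish_reason = Stop for that sequence.
    if stop_hit:
        return SequenceOutput(delta="", finish_reason=FinishReason.Stop)
    return SequenceOutput(delta=last_delta, finish_reason=None)
```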
Just sent #124 for the staging engine fix.
for gen_seq in state.generation_sequences:
    if gen_seq.is_finished:
        continue
    num_context_tokens = prompt_len + len(gen_seq.generated_token_ids)
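A self-contained sketch of what this hard stop amounts to, inferred from the surrounding discussion rather than copied from the diff (simplified types; the real loop lives in the engine):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FinishReason(Enum):
    Stop = "stop"
    Length = "length"


@dataclass
class GenerationSequence:  # simplified stand-in for the engine's type
    generated_token_ids: List[int] = field(default_factory=list)
    is_finished: bool = False
    finish_reason: Optional[FinishReason] = None


def apply_context_length_cap(
    gen_seq: GenerationSequence, prompt_len: int, max_context_length: int
) -> None:
    # The hard stop being debated below: mark a sequence finished with a
    # Length reason once prompt + generated tokens fill the context window.
    num_context_tokens = prompt_len + len(gen_seq.generated_token_ids)
    if num_context_tokens >= max_context_length:
        gen_seq.is_finished = True
        gen_seq.finish_reason = FinishReason.Length
```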
Seems like I missed it when we discussed this over Slack last time.
I still think a hard stop at max_context_length is an operational decision rather than an engine limitation, so I would like to change this to something more related to the engine configs, such as max_input_len or max_num_batched_token. I wonder what you guys think.
"Seems like I missed it"

What is "it"?
The fact that the engine still hard-stops at the max context length.
I think we need to follow what other providers do, if there is a standard convention.
63a3f19 to fc2639b
Yes, it is ready.
Fixes two issues with the sync engine:

1. finish_reason is never returned to the user, even if a sequence finishes.
2. After a request finishes, we send n empty sequences with the same finish reason. This is clearly incorrect since some of the sequences might have stopped earlier than those that finish with a length limit.

The second issue exists for the staging engine as well, but it is more difficult to fix since the stop check is done on the main process while the length and termination checks are done by the worker process. Right now we are sending a "stop request" signal from the main to the worker process when a request finishes (https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/staging_engine.py#L229-L231), but this is incorrect since the stop check needs to be applied per sequence, not per request. I'll follow up with a fix for the staging engine next week.
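To make the per-request vs per-sequence distinction concrete, a small sketch with illustrative types:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    finished: bool = False


@dataclass
class RequestState:
    sequences: List[Sequence] = field(default_factory=list)


# Incorrect (the current staging-engine behavior described above): one stop
# signal per request marks every sequence finished at once, even those that
# should keep generating.
def stop_request(state: RequestState) -> None:
    for seq in state.sequences:
        seq.finished = True


# Correct: stop only the sequence whose stop condition actually fired; the
# request ends once all of its sequences have finished individually.
def stop_sequence(state: RequestState, index: int) -> None:
    state.sequences[index].finished = True
```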
@sunggg @elvin-n