
Fix sync engine finish reason output #122

Merged: sunggg merged 6 commits from the sync-engine-finish-reason branch into octoml:batch-serving on Dec 19, 2023.

Conversation

masahi (Member) commented on Dec 15, 2023:

Fixes two issues with sync engine:

  • finish_reason is never returned to the user, even when a sequence finishes.
  • After all sequences in a request finish, we return n empty sequences with the same finish reason. This is clearly incorrect, since some of the sequences might have stopped earlier than those that finish due to the length limit.

The second issue exists for the staging engine as well, but it is more difficult to fix since the stop check is done on the main process while the length and termination checks are done by the worker process. Right now we are sending a "stop request" signal from the main to the worker process when a request finishes (https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/staging_engine.py#L229-L231), but this is incorrect since the stop check needs to be applied per sequence, not per request. I'll follow up with a fix for the staging engine next week.

@sunggg @elvin-n
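
For reference, the per-sequence check described above is conceptually something like the sketch below. All names (GenSeq, FinishReason, check_finish) are illustrative and are not the actual mlc_serve types.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FinishReason(Enum):
    Stop = "stop"
    Length = "length"


@dataclass
class GenSeq:
    generated_token_ids: List[int] = field(default_factory=list)
    is_finished: bool = False
    finish_reason: Optional[FinishReason] = None


def check_finish(seqs: List[GenSeq], stop_token_id: int,
                 max_tokens: Optional[int]) -> None:
    # The check is applied per sequence: one sequence hitting a stop token
    # must not finish its siblings, which may later stop for length instead.
    for seq in seqs:
        if seq.is_finished:
            continue
        if seq.generated_token_ids and seq.generated_token_ids[-1] == stop_token_id:
            seq.is_finished = True
            seq.finish_reason = FinishReason.Stop
        elif max_tokens is not None and len(seq.generated_token_ids) >= max_tokens:
            seq.is_finished = True
            seq.finish_reason = FinishReason.Length
```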

prompt_len: int,
max_context_length: int,
max_tokens: Optional[int],
) -> bool:
# TODO: currently, we simply return true for both stopping reasons.
# in the future, we can differentiate these two.
# this include prompt tokens and gen tokens so far
masahi (Member Author):

@sunggg @elvin-n I don't know who wrote this comment originally, but is this still relevant? I don't understand what it is saying

Member:

I wrote this comment. I wondered whether it would be useful to return more information about whether generation stopped because of the max context length or max_tokens. It is more of a NOTE than a TODO, but now that I think about it, we can remove it.

if seq.is_finished:
assert seq.finish_reason == FinishReason.Stop, f"{seq.finish_reason.name}"
assert not seq.delta
masahi (Member Author) commented on Dec 15, 2023:

@sunggg Due to the change in sync_engine.py, we are no longer sending an empty sequence after a request finishes. seq.is_finished becomes True when the last token is generated. So this test needed to be updated.
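
To illustrate the change (a made-up sketch, not the actual test code): the output that marks a sequence as finished now also carries its last delta, so the old assumption of a trailing empty output no longer holds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SeqOutput:
    delta: str
    is_finished: bool
    finish_reason: Optional[str]


# The finishing step now delivers the final text and the finish reason together,
# instead of a separate trailing output with an empty delta.
final = SeqOutput(delta=" The end.", is_finished=True, finish_reason="stop")
assert final.is_finished and final.finish_reason == "stop"
assert final.delta  # unlike before, delta can be non-empty here
```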

# These tests are broken since we are now imposing no length limit
# if max_tokens = None. The tests do not finish in a reasonable time.
# _test_max_context_length(model_artifact_path, use_staging_engine=True)
# _test_max_context_length(model_artifact_path, use_staging_engine=False)
masahi (Member Author):

As of #105, if max_tokens is None, we impose no limit on the number of generated tokens, so we let generation continue even if max_context_length is reached.

This change in the engine behavior was not intended in #105, but after discussion with @sunggg we agreed that this is the behavior we want going forward. The tests above don't set max_tokens, so they could potentially run forever; I disabled them.
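
A minimal sketch of the length criterion implied by #105 (the helper name is assumed, not the actual engine code): with max_tokens=None no length limit is applied, so termination relies entirely on stop tokens.

```python
from typing import Optional


def should_stop_for_length(num_generated_tokens: int,
                           max_tokens: Optional[int]) -> bool:
    if max_tokens is None:
        # No limit requested: generation continues even past max_context_length,
        # and termination relies entirely on stop tokens.
        return False
    return num_generated_tokens >= max_tokens


assert should_stop_for_length(100_000, None) is False
assert should_stop_for_length(128, max_tokens=128) is True
```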

However, this made me wonder whether this "generation without length limit" is a safe thing to do: we might end up with a non-terminating request. Do we need to worry about such a case in practice, or can we rely on stop tokens for termination? @sunggg @elvin-n

Member:

Yep, I also have a similar concern: https://github.com/octoml/mlc-llm/pull/122/files/63a3f19ff6d9e593d0394d5850b8a1db4b89de68#r1430465659

I would like to add something related to the engine capability, but the sliding window is a bit tricky since it basically allows no limit.

)
self.current_batch.pop(state.request_id)
self.cache_manager.free_request(state)
del self.num_sequences_per_requests[state.request_id]
masahi (Member Author) commented on Dec 15, 2023:

Here I removed the empty responses that we have been sending after a request finishes. As I described in the PR description, it is incorrect to send the same finish_reason for all sequences.

I believe this is the right thing to do, but it is potentially a breaking change, so I'm requesting a careful review of this part.
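
To make the intended behavior concrete, here is a made-up illustration (not actual engine output) of a request with n=2: each sequence reports its own finish reason on the step where it actually finishes, and no trailing empty responses are emitted for the request.

```python
# Hypothetical per-step outputs for one request with n=2 (illustrative values only).
outputs_per_step = [
    # step 3: sequence 0 hits a stop token; sequence 1 keeps generating
    [
        {"seq": 0, "delta": " The end.", "finish_reason": "stop"},
        {"seq": 1, "delta": " and", "finish_reason": None},
    ],
    # step 5: sequence 1 reaches its length limit; nothing more is sent for sequence 0
    [
        {"seq": 1, "delta": " so on", "finish_reason": "length"},
    ],
]
```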

Member:

Thank you for the note! Is there any need for different logic in the sync engine here? If not, can we consolidate it with the staging engine into a common function? Based on a quick look, they seem pretty similar:

for state in list(self.current_batch.values()):

masahi (Member Author) commented on Dec 18, 2023:

The current code is similar, but it is buggy, so the similarity is not really relevant.

The sync and staging engines need slightly different logic for the termination check and output creation, since for the latter the stop-word and length checks are done in different places (main vs worker process).

So, for example, if a request is terminated early due to stop words, even after my upcoming PR the staging worker still needs to send an empty sequence with finish_reason = Stop because of the delay in stop-word detection. The new code in the staging engine worker looks like this: https://github.com/masahi/mlc-llm/blob/ea2a931354fe78c79dbf688f2fd5a991438d16e5/serve/mlc_serve/engine/staging_engine_worker.py#L180-L190 (UPDATE: finish_reason = Stop is now also sent with the last valid delta from the staging engine.)
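
A very rough, self-contained sketch of that worker-side handling (made-up names; see the linked code for the real staging_engine_worker logic): the worker learns about a stop-word hit from the main process after the fact, marks only that sequence as finished, and reports Stop together with its last valid delta instead of cancelling the whole request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkerSeq:
    seq_id: int
    last_delta: str = ""
    is_finished: bool = False
    finish_reason: Optional[str] = None


def on_stop_notification(seq: WorkerSeq) -> dict:
    # Only this sequence is finished; sibling sequences of the same request
    # keep generating until they hit their own stop token or length limit.
    seq.is_finished = True
    seq.finish_reason = "stop"
    return {
        "seq_id": seq.seq_id,
        "delta": seq.last_delta,  # the Stop reason rides along with the last valid delta
        "finish_reason": seq.finish_reason,
    }
```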

masahi (Member Author):

Just sent #124 for the staging engine fix.

for gen_seq in state.generation_sequences:
if gen_seq.is_finished:
continue
num_context_tokens = prompt_len + len(gen_seq.generated_token_ids)
Member:

Seems like I missed it when we discussed this over slack last time.

I still think a hard stop at max_context_length is an operational decision rather than an engine limitation, so I would like to change this to something more related to the engine configs, such as max_input_len or max_num_batched_token. I wonder what you all think.

masahi (Member Author):

> Seems like I missed it

What is "it"?

Member:

The fact that the engine still hard-stops at the max context length.

masahi (Member Author):

I think we need to follow what other providers do, if there is a standard convention.

masahi force-pushed the sync-engine-finish-reason branch from 63a3f19 to fc2639b on December 18, 2023 at 19:32.
sunggg (Member) left a comment:

LGTM, thanks @masahi!
Since we'd like to land #124 quickly, I'm okay with discussing the desirable engine behavior around max_context_length as a follow-up. Please let me know when you are ready so that we can merge this.

masahi (Member Author) commented on Dec 19, 2023:

Yes, it is ready.

sunggg merged commit bab5993 into octoml:batch-serving on Dec 19, 2023. 1 check passed.