Fix sync engine finish reason output #122
Conversation
    prompt_len: int,
    max_context_length: int,
    max_tokens: Optional[int],
) -> bool:
    # TODO: currently, we simply return true for both stopping reasons.
    # in the future, we can differentiate these two.
    # this include prompt tokens and gen tokens so far
I wrote this comment. I wondered if it would be useful to return more information about whether generation stopped because of the max context length or max_tokens. It is more like a NOTE than a TODO, but now that I think about it, we can remove it.
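For illustration, a minimal sketch of what differentiating the two stopping reasons could look like; LengthStopReason and check_length_stop are hypothetical names for this sketch, not the actual engine API:

```python
from enum import Enum
from typing import Optional


class LengthStopReason(Enum):
    # Hypothetical enum; the real check currently returns a plain bool.
    NOT_STOPPED = 0
    MAX_CONTEXT_LENGTH = 1  # prompt + generated tokens filled the context window
    MAX_TOKENS = 2          # generated tokens reached the user-supplied max_tokens


def check_length_stop(
    prompt_len: int,
    num_generated_tokens: int,
    max_context_length: int,
    max_tokens: Optional[int],
) -> LengthStopReason:
    # Context tokens include prompt tokens and generated tokens so far,
    # matching the comment in the diff above.
    num_context_tokens = prompt_len + num_generated_tokens
    if num_context_tokens >= max_context_length:
        return LengthStopReason.MAX_CONTEXT_LENGTH
    if max_tokens is not None and num_generated_tokens >= max_tokens:
        return LengthStopReason.MAX_TOKENS
    return LengthStopReason.NOT_STOPPED
```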
if seq.is_finished:
    assert seq.finish_reason == FinishReason.Stop, f"{seq.finish_reason.name}"
    assert not seq.delta
@sunggg Due to the change in sync_engine.py, we are no longer sending an empty sequence after a request finishes. seq.is_finished becomes True when the last token is generated, so this test needed to be updated.
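A rough sketch of what the updated expectation amounts to (the output type here is a stand-in for illustration, not the real test harness):

```python
from enum import Enum
from typing import List, NamedTuple, Optional


class FinishReason(Enum):
    Stop = "stop"
    Length = "length"


class SeqOutput(NamedTuple):  # stand-in for the engine's per-sequence output
    delta: Optional[str]
    is_finished: bool
    finish_reason: Optional[FinishReason]


def check_final_output(outputs: List[SeqOutput]) -> None:
    # Old behavior: a trailing empty output followed the last token, so the
    # test asserted on that extra, delta-less sequence.
    # New behavior: the output carrying the last token is itself final.
    last = outputs[-1]
    assert last.is_finished
    assert last.finish_reason is not None
```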
# These tests are broken since we are now imposing no length limit
# if max_tokens = None. The tests do not finish in a reasonable time.
# _test_max_context_length(model_artifact_path, use_staging_engine=True)
# _test_max_context_length(model_artifact_path, use_staging_engine=False)
As of #105, if max_tokens is None, we impose no limit on the number of generated tokens, so we let generation continue even after max_context_len is reached.
This change in engine behavior was not intended in #105, but after discussion with @sunggg we agreed that this is the behavior we want going forward. The tests above don't set max_tokens, so they could potentially run forever; I disabled them.
However, this made me wonder if this "generation without length limit" is a safe thing to do - we might end up with a non-terminating request. Do we need to worry about such a case in practice, or can we rely on stop tokens for termination? @sunggg @elvin-n
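If it does become a problem, one possible safeguard would be to clamp the generation budget server-side. This is a sketch only; server_cap is a hypothetical operator-set ceiling, not an existing engine config:

```python
from typing import Optional


def effective_max_new_tokens(
    user_max_tokens: Optional[int],
    prompt_len: int,
    max_context_length: int,
    server_cap: Optional[int] = None,  # hypothetical operator-set ceiling
) -> int:
    # Remaining room in the context window. For sliding-window models this
    # bound does not really exist, which is exactly the tricky case noted
    # in the reply below.
    remaining = max_context_length - prompt_len
    candidates = [remaining]
    if user_max_tokens is not None:
        candidates.append(user_max_tokens)
    if server_cap is not None:
        candidates.append(server_cap)
    return min(candidates)
```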
Yep, I have a similar concern: https://github.com/octoml/mlc-llm/pull/122/files/63a3f19ff6d9e593d0394d5850b8a1db4b89de68#r1430465659
I would like to add something related to the engine capability, but the sliding window is a bit tricky since it basically allows no limit.
)
self.current_batch.pop(state.request_id)
self.cache_manager.free_request(state)
del self.num_sequences_per_requests[state.request_id]
Here I removed the empty responses that we have been sending after a request finishes. As I described in the PR description, it is incorrect to send the same finish_reason for all sequences.
I believe this is the right thing to do, but it is potentially a breaking change, so I request a careful review of this.
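To illustrate why a single shared finish_reason is wrong, a minimal sketch of the per-sequence bookkeeping this implies (class and field names are illustrative, not the actual mlc-serve types):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FinishReason(Enum):
    Stop = "stop"
    Length = "length"


@dataclass
class GenSeq:
    generated_token_ids: List[int] = field(default_factory=list)
    finish_reason: Optional[FinishReason] = None  # set when THIS sequence ends

    @property
    def is_finished(self) -> bool:
        return self.finish_reason is not None


def request_is_finished(seqs: List[GenSeq]) -> bool:
    # A request ends only when every sequence has its own finish_reason.
    # A sequence that hit a stop token earlier keeps Stop, while others may
    # later finish with Length - one shared reason for all would be wrong.
    return all(s.is_finished for s in seqs)
```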
Thank you for the note! Is there any need for different logic in the sync engine here? If not, can we further consolidate it with the staging engine and have a common function? Based on a quick look, they seem pretty similar:
for state in list(self.current_batch.values()):
The current code is similar but buggy, so it is not relevant.
The sync and staging engines need slightly different logic for the termination check and output creation, since for the latter the stopping-word and length checks are done in different places (the main vs the worker process).
So for example, if a request is terminated early due to stopping words, even after my upcoming PR the staging worker still needs to send an empty sequence with finish_reason = Stop, due to the delay in the stopping-word detection. The new code in the staging engine worker looks like this: https://github.com/masahi/mlc-llm/blob/ea2a931354fe78c79dbf688f2fd5a991438d16e5/serve/mlc_serve/engine/staging_engine_worker.py#L180-L190 (UPDATE: now finish_reason = Stop is also sent with the last valid delta from the staging engine as well.)
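For reference, the general shape of that worker-side logic, sketched with hypothetical types (the linked file is the authoritative version):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FinishReason(Enum):
    Stop = "stop"
    Length = "length"


@dataclass
class SequenceOutput:  # hypothetical stand-in for the worker's output type
    delta: str
    finish_reason: Optional[FinishReason]


def final_output_after_stop_detection(last_delta: str, stop_hit: bool) -> SequenceOutput:
    # The stop check runs one step behind generation in the worker, so when
    # a stop word is detected the worker must still emit one more output:
    # an empty delta carrying finish_reason = Stop for that sequence.
    if stop_hit:
        return SequenceOutput(delta="", finish_reason=FinishReason.Stop)
    return SequenceOutput(delta=last_delta, finish_reason=None)
```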
Just sent #124 for the staging engine fix.
for gen_seq in state.generation_sequences:
    if gen_seq.is_finished:
        continue
    num_context_tokens = prompt_len + len(gen_seq.generated_token_ids)
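A self-contained sketch of what this hard stop amounts to, inferred from the surrounding discussion rather than copied from the diff (simplified types; the real loop lives in the engine):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FinishReason(Enum):
    Stop = "stop"
    Length = "length"


@dataclass
class GenerationSequence:  # simplified stand-in for the engine's type
    generated_token_ids: List[int] = field(default_factory=list)
    is_finished: bool = False
    finish_reason: Optional[FinishReason] = None


def apply_context_length_cap(
    gen_seq: GenerationSequence, prompt_len: int, max_context_length: int
) -> None:
    # The hard stop being debated below: mark a sequence finished with a
    # Length reason once prompt + generated tokens fill the context window.
    num_context_tokens = prompt_len + len(gen_seq.generated_token_ids)
    if num_context_tokens >= max_context_length:
        gen_seq.is_finished = True
        gen_seq.finish_reason = FinishReason.Length
```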
Seems like I missed it when we discussed this over Slack last time.
I still think a hard stop at max_context_length is an operational decision rather than an engine limitation, so I would like to change this to something more related to the engine configs, such as max_input_len or max_num_batched_token. I wonder what you guys think.
"Seems like I missed it"

What is "it"?
The fact that the engine still hard-stops at the max context length.
I think we need to follow what other providers do, if there is a standard convention.
63a3f19 to fc2639b
Yes, it is ready.
Fixes two issues with the sync engine:

1. finish_reason is never returned to the user, even if a sequence finishes.
2. After a request finishes, we send n empty sequences with the same finish reason. This is clearly incorrect since some of the sequences might have stopped earlier than those that finish with a length limit.

The second issue exists for the staging engine as well, but it is more difficult to fix since the stop check is done on the main process while the length and termination checks are done by the worker process. Right now we are sending a "stop request" signal from the main to the worker process when a request finishes (https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/staging_engine.py#L229-L231), but this is incorrect since the stop check needs to be applied per sequence, not per request. I'll follow up with a fix for the staging engine next week.
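To make the per-request vs per-sequence distinction concrete, a small sketch with illustrative types:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sequence:
    finished: bool = False


@dataclass
class RequestState:
    sequences: List[Sequence] = field(default_factory=list)


# Incorrect (the current staging-engine behavior described above): one stop
# signal per request marks every sequence finished at once, even those that
# should keep generating.
def stop_request(state: RequestState) -> None:
    for seq in state.sequences:
        seq.finished = True


# Correct: stop only the sequence whose stop condition actually fired; the
# request ends once all of its sequences have finished individually.
def stop_sequence(state: RequestState, index: int) -> None:
    state.sequences[index].finished = True
```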
@sunggg @elvin-n