Fix calculation of remaining number of cache slots (prompt tokens not accounted) #126

Merged
2 commits merged into octoml:batch-serving on Dec 19, 2023

Conversation

masahi
Copy link
Member

@masahi masahi commented Dec 19, 2023

@sunggg @elvin-n

The hang happens when we run out of cache slots and need to evict some requests: we were not accounting for the prompt token counts when calculating the number of remaining free blocks, so we failed to detect the need for eviction when it arose.

for seq_id, tokens in self.allocated_decode_tokens.items():
    prompt_seq_id = get_prompt_sequence_id(seq_id.request_id)
    prompt_tokens = self.allocated_prompt_tokens[prompt_seq_id]
    # Count the prompt tokens together with the decode tokens so the
    # free-block calculation sees each sequence's full footprint.
    total_tokens.append(prompt_tokens + tokens)
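
For context, a minimal sketch of how per-sequence totals like these could be turned into a remaining-block count. The names `count_free_blocks`, `num_blocks`, and `block_size` are illustrative assumptions, not necessarily the repo's actual API:

import math
from typing import List

def count_free_blocks(total_tokens: List[int], num_blocks: int, block_size: int) -> int:
    # Each sequence occupies ceil(tokens / block_size) cache blocks,
    # covering both its prompt and its generated tokens.
    used_blocks = sum(math.ceil(t / block_size) for t in total_tokens)
    return num_blocks - used_blocks

If the prompt tokens are omitted from `total_tokens`, the used-block estimate comes out too low and the eviction check never fires, which is the bug this PR fixes.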
masahi (Member, Author) commented:

This code runs on a hot path, and the naive loop incurs a noticeable perf regression (5.89 -> 5.78 req / sec) for 13B. However, I suggest merging this as is and following up with a non-regressing solution later.
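
One possible non-regressing follow-up, sketched here purely as an illustration and not what was merged: keep a running per-sequence total so the hot path no longer joins prompt and decode counts in a Python-level loop. All names below (`TokenAccounting` and its methods) are hypothetical.

class TokenAccounting:
    def __init__(self):
        self.allocated_prompt_tokens = {}  # prompt_seq_id -> prompt token count
        self.allocated_total_tokens = {}   # seq_id -> prompt + decode token count

    def add_prompt(self, prompt_seq_id, num_tokens):
        self.allocated_prompt_tokens[prompt_seq_id] = num_tokens

    def start_decode(self, seq_id, prompt_seq_id):
        # Seed the per-sequence total with the prompt length once, up front.
        self.allocated_total_tokens[seq_id] = self.allocated_prompt_tokens[prompt_seq_id]

    def append_decode_token(self, seq_id):
        # O(1) update per generated token instead of a full re-scan on the hot path.
        self.allocated_total_tokens[seq_id] += 1

    def total_tokens(self):
        # The eviction check can read the combined totals directly.
        return list(self.allocated_total_tokens.values())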

sunggg (Member) left a comment:

Confirmed that this resolves the hang reported by @elvin-n. Thank you @masahi for the hot fix and @elvin-n for spotting the danger!

sunggg merged commit f32375a into octoml:batch-serving on Dec 19, 2023. 1 check passed.