
Add sampling penalties and logit bias #125

Merged · 6 commits · Dec 20, 2023

Conversation

@cyx-6 commented Dec 19, 2023

This PR adds the frequency and presence sampling penalties, as well as logit bias.
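
For reference, these parameters usually follow the OpenAI API semantics: the frequency penalty scales with how many times a token has already been generated, the presence penalty applies once to any token that has appeared at all, and logit bias adds a fixed per-token offset before sampling. A minimal NumPy sketch of that behavior (the function name and signature are illustrative, not this PR's actual implementation):

```python
import numpy as np

def apply_penalties_and_bias(
    logits: np.ndarray,       # (vocab_size,) raw logits for one request
    token_freqs: np.ndarray,  # (vocab_size,) counts of tokens generated so far
    frequency_penalty: float,
    presence_penalty: float,
    logit_bias: dict[int, float] | None = None,
) -> np.ndarray:
    # Frequency penalty: subtract once per prior occurrence of each token.
    logits = logits - frequency_penalty * token_freqs
    # Presence penalty: subtract once for any token that has appeared at all.
    logits = logits - presence_penalty * (token_freqs > 0)
    # Logit bias: fixed additive offset for selected token ids.
    if logit_bias:
        for token_id, bias in logit_bias.items():
            logits[token_id] += bias
    return logits
```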

@sunggg (Member) previously requested changes · Dec 20, 2023


Thank you @cyx-6!
I'm getting an error when I run serve/benchmarks/benchmark_throughput.py.

/opt/bin/cuda-reserve.py --num-gpu 1 python3 serve/benchmarks/benchmark_throughput.py --local-id llama-2-7b-chat-hf-q0f16-presharded-1gpu --use-staging-engine --dataset /opt/models/dataset/ShareGPT_V3_unfiltered_cleaned_split.json
Traceback (most recent call last):
  File "/home/spark/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py", line 374, in run_generation_loop_worker
    output = worker.step()
  File "/home/spark/mlc-llm/serve/mlc_serve/engine/staging_engine_worker.py", line 224, in step
    results = self.text_generator.generate(requests, self.cache_manager.get_cache())
  File "/home/spark/mlc-llm/serve/mlc_serve/model/paged_cache_model.py", line 605, in generate
    out.extend(self.model.generate(decode_requests, kv_cache))
  File "/home/spark/mlc-llm/serve/mlc_serve/model/paged_cache_model.py", line 479, in generate
    next_tokens = sample(logits, sampling_params, self.vocab_size, appeared_token_freqs=self.appeared_token_freqs)
  File "/home/spark/mlc-llm/serve/mlc_serve/model/paged_cache_model.py", line 99, in sample
    freq = appeared_token_freqs[i]
IndexError: list index out of range

Would you take a look? Once this is fixed, it would also be nice to share before/after numbers on H100 to understand this PR's performance impact.
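
For context, the traceback points at the classic failure mode where per-request sampling state falls out of sync with the running batch: `sample` indexes `appeared_token_freqs` by batch position, so the list must grow and shrink exactly as requests enter and leave the batch. A minimal sketch of that invariant (only `appeared_token_freqs` and the failing indexing line come from the trace; the helper names are hypothetical):

```python
# One entry of token-frequency state per in-flight request.
appeared_token_freqs: list[dict[int, int]] = []

def add_request() -> None:
    # Hypothetical hook: grow the penalty state in lockstep with the batch.
    appeared_token_freqs.append({})

def remove_request(i: int) -> None:
    # Hypothetical hook: shrink it when a request finishes or is evicted.
    appeared_token_freqs.pop(i)

def sample_step(batch_size: int) -> None:
    for i in range(batch_size):
        # This is the line that raised: it assumes the state list and the
        # batch have the same length, which breaks if they drift apart.
        freq = appeared_token_freqs[i]
        print(f"request {i}: {len(freq)} distinct tokens seen so far")
```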

Review comment on serve/mlc_serve/model/paged_cache_model.py (outdated, resolved)
@masahi merged commit 624a99a into octoml:batch-serving on Dec 20, 2023. 1 check passed.