
[FEAT] Improved PagedAttention FP8 (faster kvcache dequant v2) #347

Closed

Conversation

@tjtanaa commented Dec 27, 2024

Description

This PR merges the optimized attention.cu kernel from https://github.com/ROCm/vllm/blob/shsanyal_develop_cpa_fp8 into the llama_fp8_12062024 branch.

CAVEAT

Currently the attention.cu kernel does not support a block size of 32 or a head size of 64.
The vLLM model unit tests are failing because they use small models (e.g., Gemma, Llama) that have a head size of 64.

Performance compared with feature PR #346, which is another implementation of faster kv-cache dequantization

The following are benchmark_throughput results for Llama-3.1-70B with FP8 dynamic quantization and kv-cache-dtype fp8_e4m3, using an input sequence length of 2048 tokens and an output length of 2048 tokens:

| Branch of vLLM ROCm fork | Req/s | Total tokens/s | Output tokens/s |
| --- | --- | --- | --- |
| main | 0.29 | 1196.2 | 598.1 |
| llama-fp8-12062024 | 0.28 | 1152.46 | 576.23 |
| paged-attn-fp8 (#346) | 0.47 | 1932.74 | 966.37 |
| this PR | 0.62 | 2537.03 | 1268.51 |
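For reference, here is a minimal sketch of how the benchmark configuration above could be reproduced with vLLM's offline API. The checkpoint name, tensor-parallel size, and prompt set are illustrative assumptions, not taken from this PR:

```python
# Hypothetical sketch of the benchmark setup described above: Llama-3.1-70B with
# FP8 dynamic quantization and an fp8_e4m3 KV cache, 2048 input / 2048 output tokens.
# Model path, tensor_parallel_size, and the prompt set are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed checkpoint
    quantization="fp8",                         # FP8 dynamic weight quantization
    kv_cache_dtype="fp8_e4m3",                  # FP8 KV cache, as in the benchmark
    tensor_parallel_size=8,                     # assumed GPU count
    max_model_len=4096,                         # room for 2048 input + 2048 output tokens
)

# Force fixed-length generation so every request produces 2048 output tokens.
sampling_params = SamplingParams(max_tokens=2048, ignore_eos=True)

prompts = ["..."]  # placeholder for the 2048-token prompts used by benchmark_throughput
outputs = llm.generate(prompts, sampling_params)
```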

@tjtanaa changed the title [FEAT] Improved PagedAttention FP8 (faster kvcache dequant) [FEAT] Improved PagedAttention FP8 (faster kvcache dequant v2) Dec 27, 2024
@tjtanaa marked this pull request as draft December 27, 2024 16:34
@tjtanaa commented Jan 24, 2025

This PR has been dropped in favor of #385 and #372.

@tjtanaa closed this Jan 24, 2025