[Feature]: Will vLLM support flash-attention 3? #11372
Comments
@WoosukKwon can you help answer this?
Also see #11194
Yes, we are working directly with the authors to bring Flash Attention 3 to vLLM.
Really appreciate the effort, guys! Any estimate of when this will be available?
Once the FA3 refactor branch lands (I believe currently at https://github.com/Dao-AILab/flash-attention/tree/decode), it should be ready to integrate. Hopefully within a few weeks.
@mgoin I remember that integrating FA2 with vLLM introduced a bunch of issues with block sizes less than 256. I presume the same issues will have to be addressed with FA3.
I think we will support small block sizes from the start of FA3 support; this is part of the reason it is taking longer, as we have this as a requirement.
@mgoin @DarkLight1337 is there any progress on this?
We have landed initial support for FA3 with PR #12093! Currently only Hopper support is included due to binary size concerns, but look forward to it in the upcoming release.
Tracking remaining features in #12429.
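With initial FA3 support merged, here is a minimal sketch of how one might try the new backend from Python. It assumes a vLLM build with FA3 compiled in, a Hopper GPU, and that a `VLLM_FLASH_ATTN_VERSION` environment variable is the knob exposed for pinning the FA version; the model name is only an example, so check the release notes of your vLLM version for the exact configuration.

```python
# Sketch: selecting the flash-attn backend and (assumed knob) requesting FA3.
# Set the environment variables before importing vLLM so backend selection sees them.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # use the flash-attn backend
os.environ["VLLM_FLASH_ATTN_VERSION"] = "3"          # assumption: pin FA3 (Hopper-only)

from vllm import LLM, SamplingParams

# Example model; any model supported by vLLM should work the same way.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain FlashAttention-3 in one sentence."], params)
print(outputs[0].outputs[0].text)
```

On non-Hopper GPUs, or with wheels built without FA3, vLLM should fall back to FA2 or another available backend, so the same script stays portable.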
🚀 The feature, motivation and pitch
Flash Attention 3 seems very promising for running efficient LLM inference on NVIDIA Hopper cards. Are there any plans to support it in the future?
https://github.com/Dao-AILab/flash-attention
Alternatives
No response
Additional context
No response