
[Feature]: Will vLLM support flash-attention 3? #11372

Closed
jorgeantonio21 opened this issue Dec 20, 2024 · 10 comments

Comments

@jorgeantonio21

🚀 The feature, motivation and pitch

Flash Attention 3 seems very promising for running efficient LLM inference on NVIDIA Hopper cards. Are there any plans to support it in the future?

https://github.com/Dao-AILab/flash-attention
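For readers unfamiliar with the API, here is a minimal sketch of calling FlashAttention from the flash-attn Python package. This uses the FA2-style `flash_attn_func` interface; FA3 exposes a similar function, but treat the exact import path for FA3 as an assumption, since its packaging was still in flux at the time.

```python
# Minimal sketch of calling FlashAttention via the flash-attn package.
# Requires a CUDA GPU and fp16/bf16 tensors; FA3 targets Hopper (sm_90).
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 16, 128
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Fused attention in a single kernel; causal=True for decoder-style LLM inference.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, seqlen, nheads, headdim)
```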

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@DarkLight1337
Member

@WoosukKwon can you help answer this?

@DarkLight1337
Member

Also see #11194

@mgoin
Member

mgoin commented Dec 20, 2024

Yes, we are working directly with the authors to bring Flash Attention 3 to vLLM.

@jorgeantonio21
Author

Really appreciate the effort, guys! Any estimate of when this will be available?

@mgoin
Member

mgoin commented Dec 20, 2024

Once the FA3 refactor branch lands (I believe currently at https://github.com/Dao-AILab/flash-attention/tree/decode), it should be ready to integrate. Hopefully within a few weeks.

@jorgeantonio21
Author

@mgoin I remember that integrating FA2 with vLLM introduced a number of issues with block sizes smaller than 256. I presume the same issues will have to be addressed with FA3.

@mgoin
Member

mgoin commented Dec 28, 2024

I think we will support small block sizes from the start of FA3 support. This is part of the reason it is taking longer, since we have it as a requirement.
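For context, the block size in question is the page size of vLLM's paged KV cache, and the attention kernel must support whatever size is configured. A minimal sketch of how it is set, assuming vLLM's `block_size` engine argument (the model name is just a placeholder):

```python
# Sketch: vLLM's paged KV cache stores tokens in fixed-size blocks;
# the attention backend's kernels must support the configured block size,
# which is why small-block support was a requirement for FA3.
from vllm import LLM, SamplingParams

# block_size is the number of tokens per KV-cache page (commonly 16);
# the exact supported values depend on the attention backend and release.
llm = LLM(model="facebook/opt-125m", block_size=16)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```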

@jorgeantonio21
Author

@mgoin @DarkLight1337 is there any progress on this?

@mgoin
Member

mgoin commented Jan 25, 2025

We have landed initial support for FA3 with this PR! #12093

Currently only Hopper support is included due to binary size concerns, but look forward to it in the upcoming release.
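Since the initial FA3 binaries target Hopper only, a quick way to check whether a given GPU qualifies is its compute capability (Hopper is sm_90). The fallback note in the comment below is an assumption about vLLM's backend selection, not a documented guarantee:

```python
# Check whether the current GPU is Hopper (compute capability 9.0),
# the only architecture covered by the initial FA3 binaries in vLLM.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) == (9, 0):
        print("Hopper (sm_90) detected: eligible for FlashAttention 3.")
    else:
        # Assumption: vLLM falls back to another backend (e.g. FA2) here.
        print(f"sm_{major}{minor} detected: FA3 path not available.")
else:
    print("No CUDA device found.")
```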

@mgoin mgoin closed this as completed Jan 25, 2025
@mgoin
Member

mgoin commented Jan 25, 2025

Tracking remaining features here: #12429
