[Feature]: Will vLLM support flash-attention 3? #11372
Comments
@WoosukKwon can you help answer this?
Also see #11194
Yes, we are working directly with the authors to bring Flash Attention 3 to vLLM.
Really appreciate the effort, guys! Any estimate of when this will be available?
Once the FA3 refactor branch lands (I believe currently at https://github.com/Dao-AILab/flash-attention/tree/decode), it should be ready to integrate. Hopefully within a few weeks.
@mgoin I remember that integrating FA2 with vLLM introduced a bunch of issues with block sizes less than 256. I presume the same issues will have to be addressed with FA3.
I think we will support small block sizes from the start of FA3 support; this is part of the reason it is taking longer, as we have this as a requirement.
@mgoin @DarkLight1337 is there any progress on this?
We have landed initial support for FA3 with PR #12093! Currently only Hopper support is included due to binary size concerns, but look forward to it in the upcoming release.
Tracking remaining features in #12429.
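With initial FA3 support merged, here is a minimal sketch of how one might try the new backend from Python. It assumes a vLLM build with FA3 compiled in, a Hopper GPU, and that a `VLLM_FLASH_ATTN_VERSION` environment variable is the knob exposed for pinning the FA version; the model name is only an example, so check the release notes of your vLLM version for the exact configuration.

```python
# Sketch: selecting the flash-attn backend and (assumed knob) requesting FA3.
# Set the environment variables before importing vLLM so backend selection sees them.
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # use the flash-attn backend
os.environ["VLLM_FLASH_ATTN_VERSION"] = "3"          # assumption: pin FA3 (Hopper-only)

from vllm import LLM, SamplingParams

# Example model; any model supported by vLLM should work the same way.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Explain FlashAttention-3 in one sentence."], params)
print(outputs[0].outputs[0].text)
```

On non-Hopper GPUs, or with wheels built without FA3, vLLM should fall back to FA2 or another available backend, so the same script stays portable.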
🚀 The feature, motivation and pitch
Flash Attention 3 seems very promising for running efficient LLM inference on NVIDIA Hopper cards. Are there any plans to support it in the future?
https://github.com/Dao-AILab/flash-attention
Alternatives
No response
Additional context
No response