FA3 KV Cache is slower than FA2 KV Cache #1465

Open
DD-DuDa opened this issue Jan 26, 2025 · 2 comments

Comments


DD-DuDa commented Jan 26, 2025

The GPU I am using is an NVIDIA H100 80GB HBM3, and I am trying to benchmark the performance of flash-decoding.

The figure below shows results with the following parameters:

batch_size = 1
q_len = 1
nheads_q = 32
nheads_kv = 8
dim = 128

[Figure: decoding benchmark plot — FA2 vs. FA3 latency across sequence lengths]

I've also checked other settings and cannot reproduce the results of PR #1236.

tridao (Contributor) commented Jan 26, 2025

Can you post a short script to benchmark the two?

DD-DuDa (Author) commented Jan 26, 2025

Hi, thanks for the quick response!

Here is the code. (Actually, I don't know how to install FA2 and FA3 in the same conda env, so I swap the import at the top and run the script once per version.)

import torch
import triton

# Toggle between these two imports to benchmark FA3 vs. FA2:
# from flash_attn_interface import flash_attn_with_kvcache  # FA3
from flash_attn import flash_attn_with_kvcache  # FA2


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["seq_len"],
        x_vals=[2**i for i in range(10, 20)],
        line_arg="provider",  # Argument name whose value corresponds to a different line in the plot.
        line_vals=["flash-attn-v3"],  # Possible values for `line_arg`.
        line_names=["flash-attn-v3"],  # Label name for the lines.
        styles=[("red", "-")],  # Line color and style.
        plot_name="decoding benchmark",
        args={},  # Values for function arguments not in `x_names` and `y_name`.
    )
)
def benchmark(seq_len, provider):
    torch.random.manual_seed(0)
    device = "cuda"
    dtype = torch.float16

    # Single-token decode step with grouped-query attention:
    # batch size 1, 32 query heads, 8 KV heads, head dim 128.
    batch_size = 1
    nheads = 32
    nheads_k = 8
    d = 128

    q = torch.randn(batch_size, 1, nheads, d, device=device, dtype=dtype)
    k_cache = torch.randn(batch_size, seq_len, nheads_k, d, device=device, dtype=dtype)
    v_cache = torch.randn(batch_size, seq_len, nheads_k, d, device=device, dtype=dtype)

    quantiles = [0.5, 0.2, 0.8]

    if provider == "flash-attn-v3":
        ms, min_ms, max_ms = triton.testing.do_bench(
            lambda: flash_attn_with_kvcache(q, k_cache, v_cache), quantiles=quantiles
        )

    # Report latency in milliseconds (median, 20th, and 80th percentiles).
    return ms, min_ms, max_ms


benchmark.run(show_plots=True, print_data=True)
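
In case it helps: FA2 and FA3 should be able to coexist in one env, since FA2 installs as the flash_attn package (pip install flash-attn) while FA3 is built from the repo's hopper/ directory and imports as flash_attn_interface, so the module names don't collide. Assuming both are installed and both expose the flash_attn_with_kvcache call used above, here is a sketch (untested) that benchmarks the two in a single run:

import torch
import triton

# Assumes both packages are installed side by side:
#   FA2: pip install flash-attn            -> imports as flash_attn
#   FA3: built from flash-attention/hopper -> imports as flash_attn_interface
from flash_attn import flash_attn_with_kvcache as fa2_flash_attn_with_kvcache
from flash_attn_interface import flash_attn_with_kvcache as fa3_flash_attn_with_kvcache

PROVIDERS = {
    "flash-attn-v2": fa2_flash_attn_with_kvcache,
    "flash-attn-v3": fa3_flash_attn_with_kvcache,
}


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["seq_len"],
        x_vals=[2**i for i in range(10, 20)],
        line_arg="provider",
        line_vals=list(PROVIDERS),
        line_names=list(PROVIDERS),
        styles=[("blue", "-"), ("red", "-")],
        plot_name="decoding benchmark: FA2 vs FA3",
        args={},
    )
)
def benchmark(seq_len, provider):
    torch.random.manual_seed(0)
    device, dtype = "cuda", torch.float16
    batch_size, nheads, nheads_k, d = 1, 32, 8, 128

    q = torch.randn(batch_size, 1, nheads, d, device=device, dtype=dtype)
    k_cache = torch.randn(batch_size, seq_len, nheads_k, d, device=device, dtype=dtype)
    v_cache = torch.randn(batch_size, seq_len, nheads_k, d, device=device, dtype=dtype)

    fn = PROVIDERS[provider]
    # Median / 20th / 80th percentile latency in ms for one decode step.
    ms, min_ms, max_ms = triton.testing.do_bench(
        lambda: fn(q, k_cache, v_cache), quantiles=[0.5, 0.2, 0.8]
    )
    return ms, min_ms, max_ms


benchmark.run(show_plots=True, print_data=True)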
