LocalLLaMA highlights FlashAttention-4 gains on Blackwell and the limits for everyday GPUs

Original: FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference.

LLM · Mar 24, 2026 · By Insights AI (Reddit) · 1 min read

A LocalLLaMA discussion posted on March 24, 2026, pushed FlashAttention-4 back into the center of the inference-performance conversation, reaching 132 upvotes and 39 comments. The Reddit summary connected the paper's raw benchmark numbers to a practical question the community cares about: who actually benefits today, and who has to wait for the ideas to trickle down?

The FlashAttention-4 paper argues that Blackwell changes the bottleneck profile for attention. Tensor-core throughput scales faster than shared-memory bandwidth and the exponential units, so simply porting older kernels is not enough. The authors report up to 1,613 TFLOP/s on B200 BF16 attention, roughly 71% utilization, with up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton. They get there with redesigned asynchronous pipelines, conditional softmax rescaling, software-emulated exponentials, tensor memory, and 2-CTA MMA support. A notable implementation detail: the kernels are written in Python using CuTe-DSL, with compile times reportedly 20-30x faster than the older C++ template style.
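To make "conditional softmax rescaling" concrete, here is a toy single-query online-softmax attention loop in plain Python. It is an illustrative sketch of the general online-softmax idea the paper builds on, not the actual FA-4 kernel: the running accumulator is rescaled only when a new score raises the running maximum, so the expensive rescale work is skipped in the common case where the max is unchanged. All names are invented for this example.

```python
import math

def online_softmax_attention(q, K, V):
    """Toy one-query attention with online softmax (lists of floats).

    Illustrates conditional rescaling: the accumulator `acc` and the
    normalizer `l` are rescaled only when a new score exceeds the
    running max `m`; otherwise that work is skipped entirely.
    """
    m = float("-inf")        # running max of attention scores
    l = 0.0                  # running sum of exp(score - m)
    acc = [0.0] * len(V[0])  # unnormalized output accumulator

    for k_row, v_row in zip(K, V):
        s = sum(qi * ki for qi, ki in zip(q, k_row))  # dot(q, k)
        if s > m:
            # Rescale previous state to the new max. This branch is
            # the "conditional" part: it only runs when the max moves.
            scale = math.exp(m - s) if m != float("-inf") else 0.0
            l *= scale
            acc = [a * scale for a in acc]
            m = s
        p = math.exp(s - m)
        l += p
        acc = [a + p * v for a, v in zip(acc, v_row)]

    return [a / l for a in acc]  # normalize at the end
```

The result matches a naive softmax-then-weighted-sum computed in one pass, but the loop never materializes the full score vector, which is what lets the real kernels tile over long sequences.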

  • The Reddit post notes that vLLM 0.17.0 already integrates FA-4 automatically on B200.
  • PyTorch FlexAttention also has an FA-4 backend, and the post highlights support for GQA, MQA, and sliding-window attention.
  • The bad news is hardware coverage: the biggest gains are mainly for Hopper and Blackwell, not A100 or consumer GPUs.
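The attention variants named above (GQA, MQA, sliding-window) differ only in which key/value head a query head reads and which positions it may attend to. A minimal sketch of both rules in plain Python, with illustrative function names (this is not the FlexAttention API, just the underlying index logic):

```python
def gqa_kv_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Grouped-query attention: each contiguous group of query heads
    shares one KV head. MQA is the special case n_kv_heads == 1."""
    assert n_q_heads % n_kv_heads == 0
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def sliding_window_allowed(q_pos: int, kv_pos: int, window: int) -> bool:
    """Causal sliding-window mask: a query attends only to the last
    `window` positions, up to and including its own position."""
    return 0 <= q_pos - kv_pos < window
```

Because both rules are cheap index predicates, a kernel can support all three variants from one code path, which is roughly what backends like FlexAttention expose to users.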

That last point explains the LocalLLaMA tone. The community likes the engineering direction, but many users cannot deploy it yet. Even so, faster kernel iteration and the algorithmic ideas around reducing non-matmul work could influence future inference stacks beyond flagship datacenter hardware.

Primary source: FlashAttention-4 paper on arXiv. Community discussion: LocalLLaMA.
