LocalLLaMA highlights FlashAttention-4 gains on Blackwell and the limits for everyday GPUs
Original post: FlashAttention-4: 1613 TFLOPs/s, 2.7x faster than Triton, written in Python. What it means for inference.
A LocalLLaMA discussion posted on March 24, 2026, pushed FlashAttention-4 back into the center of the inference-performance conversation, reaching 132 upvotes and 39 comments. The Reddit summary connected the paper's raw benchmark numbers to a practical question the community cares about: who actually benefits today, and who has to wait for the ideas to trickle down?
The FlashAttention-4 paper argues that Blackwell changes the bottleneck profile for attention. Tensor core throughput scales faster than shared-memory bandwidth and exponential units, so simply porting older kernels is not enough. The authors report up to 1,613 TFLOPs/s on B200 BF16 attention, around 71% utilization, with up to 1.3x speedup over cuDNN 9.13 and 2.7x over Triton. They get there with redesigned asynchronous pipelines, conditional softmax rescaling, software-emulated exponentials, tensor memory, and 2-CTA MMA support. Another notable detail is implementation: the kernels are written in Python using CuTe-DSL, with compile times reportedly 20-30x faster than the older C++ template style.
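The conditional-rescaling idea can be illustrated in plain NumPy: an online (streaming) softmax keeps a running max and rescales its accumulator only when a new block actually raises that max, skipping some of the non-matmul work the paper targets. This is a toy single-query sketch under assumed shapes, with a hypothetical function name; it is not the FA-4 kernel.

```python
import numpy as np

def online_softmax_attention(q, K, V, block=2):
    """Single-query attention computed one K/V block at a time.

    Toy sketch (not the FA-4 kernel): stream over K/V blocks keeping a
    running max `m`, running denominator `l`, and un-normalized
    accumulator `acc`. The accumulator is rescaled only when a block
    raises the running max -- the "conditional rescaling" idea of
    skipping the multiply when the max is unchanged.
    """
    d = q.shape[0]
    m = -np.inf                                  # running max of scores
    l = 0.0                                      # running softmax denominator
    acc = np.zeros_like(V[0], dtype=np.float64)  # un-normalized output
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q / np.sqrt(d)               # scores for this block
        m_new = max(m, s.max())
        if m_new > m:                            # rescale only if the max changed
            scale = np.exp(m - m_new)
            acc *= scale
            l *= scale
            m = m_new
        p = np.exp(s - m)                        # block weights (un-normalized)
        l += p.sum()
        acc += p @ v_blk
    return acc / l
```

Computed block by block, the result matches a one-shot softmax over all keys, which is what lets the real kernels tile K/V through fast memory.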
- The Reddit post notes that vLLM 0.17.0 already integrates FA-4 automatically on B200.
- PyTorch FlexAttention also has an FA-4 backend, and the post highlights support for GQA, MQA, and sliding-window attention.
- The bad news is hardware coverage: the biggest gains are mainly for Hopper and Blackwell, not A100 or consumer GPUs.
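As a concrete illustration of the GQA/MQA pattern mentioned in the list above, here is a minimal NumPy reference: query heads are partitioned into groups that each share a single K/V head, and MQA is the one-K/V-head special case. The function name and tensor layout are assumptions for illustration, not the FlexAttention or FA-4 API.

```python
import numpy as np

def gqa_attention(Q, K, V):
    """Grouped-query attention reference (toy, hypothetical layout).

    Q has shape (h_q, n, d); K and V have shape (h_kv, n, d) with
    h_q % h_kv == 0. Each group of h_q // h_kv query heads shares one
    K/V head, cutting KV-cache size by that factor. h_kv == 1 is MQA.
    """
    h_q, n, d = Q.shape
    h_kv = K.shape[0]
    group = h_q // h_kv
    out = np.empty_like(Q)
    for h in range(h_q):
        k, v = K[h // group], V[h // group]      # shared K/V head for this group
        s = Q[h] @ k.T / np.sqrt(d)              # (n, n) scores
        p = np.exp(s - s.max(axis=-1, keepdims=True))
        p /= p.sum(axis=-1, keepdims=True)       # row-wise softmax
        out[h] = p @ v
    return out
```

The kernel-level point is that all heads in a group read the same K/V tiles, which is why fused attention kernels can support GQA and MQA with little extra machinery.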
That last point explains the LocalLLaMA tone. The community likes the engineering direction, but many users cannot deploy it yet. Even so, faster kernel iteration and the algorithmic ideas around reducing non-matmul work could influence future inference stacks beyond flagship datacenter hardware.
Primary source: FlashAttention-4 paper on arXiv. Community discussion: LocalLLaMA.
Related Articles
A LocalLLaMA thread spotlights FlashAttention-4, which reports up to 1613 TFLOPs/s on B200 BF16 and introduces pipeline and memory-layout changes tuned for Blackwell constraints.
At GTC on March 16, 2026, NVIDIA announced Dynamo 1.0 as a production-grade open source inference stack for generative and agentic AI. NVIDIA says Dynamo can boost Blackwell inference performance by up to 7x while integrating with major frameworks and cloud providers.
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.