FlashAttention-4 targets Blackwell bottlenecks with overlap-first kernel design


LLM · Mar 6, 2026 · By Insights AI (Reddit)

A Reddit thread in r/LocalLLaMA surfaced Together AI's release of FlashAttention-4, an attention-kernel redesign aimed at NVIDIA Blackwell GPUs. The central claim is that newer accelerators are increasingly asymmetric: tensor-core throughput scales faster than shared-memory bandwidth and special-function-unit (SFU) throughput, so kernel strategy has to optimize for overlap rather than raw GEMM speed alone.
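To make the asymmetry concrete, a back-of-envelope comparison helps. Assuming a head dimension of 128 and an H100-class SFU exponential rate of roughly 4e12 exp/s (both are illustrative assumptions, not figures from the post), the single exponential behind each attention score takes about as long as all of the matmul work that produces and consumes it:

```python
# Back-of-envelope: matmul FLOPs vs. softmax exponentials in attention.
# head_dim and sfu_exp_per_sec are illustrative assumptions, not vendor specs.

head_dim = 128                         # assumed attention head dimension
mma_flops_per_score = 4 * head_dim     # ~2*d for Q@K^T plus ~2*d for P@V

tensor_flops_per_sec = 2.25e15         # ~2.25 PFLOPS dense BF16 (B200, per the post)
sfu_exp_per_sec = 4e12                 # assumed H100-class SFU exp rate, held flat

t_mma = mma_flops_per_score / tensor_flops_per_sec  # tensor-core time per score
t_exp = 1.0 / sfu_exp_per_sec                       # SFU time for the one exp per score

print(f"MMA time per score: {t_mma:.2e} s")
print(f"exp time per score: {t_exp:.2e} s")
print(f"exp/MMA ratio: {t_exp / t_mma:.2f}")  # ~1.1: exp alone rivals the matmuls
```

Under those assumptions, hiding the exponentials behind the matmuls roughly halves the serial cost, which is the overlap argument in a nutshell.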

The authors state that from H100 to B200, dense BF16 tensor-core throughput rises from roughly 1.0 to 2.25 PFLOPS while SFU count and shared-memory bandwidth stay flat. Based on that profile, FlashAttention-4 focuses on two pressure points:

  • Forward pass: overlap MMA work with softmax exponential costs, including a hybrid hardware/software exp path (see the toy overlap sketch after this list).
  • Backward pass: reduce shared-memory pressure via TMEM placement and Blackwell 2-CTA MMA modes.
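The write-up's ping-pong tile schedule is, at its core, a two-stage software pipeline: while one tile's exponentials run on the SFUs, the next tile's MMA runs on the tensor cores, so neither unit sits idle in steady state. A toy model of that effect, with assumed, arbitrary per-tile stage costs rather than measured numbers:

```python
# Toy model of a ping-pong tile schedule (assumed per-tile costs, arbitrary units).
T_MMA, T_EXP = 1.0, 0.9   # per-tile cost of the MMA stage and the exp stage
n_tiles = 8

serial = n_tiles * (T_MMA + T_EXP)    # no overlap: each unit idles in turn
# Perfect ping-pong: steady state is bounded by the slower stage,
# plus one tile of pipeline fill/drain.
overlapped = (T_MMA + T_EXP) + (n_tiles - 1) * max(T_MMA, T_EXP)

print(f"serial:     {serial:.1f}")
print(f"overlapped: {overlapped:.1f} ({serial / overlapped:.2f}x faster)")
```

When the two stage costs are comparable, as the arithmetic above suggests they are on Blackwell, overlapping approaches a 2x win over a serial schedule.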

The post links to an in-depth technical write-up that details ping-pong tile schedules, conditional online softmax rescaling, TMEM reuse plans, and DSMEM exchange for dQ-related decomposition. It also describes a deterministic mode for the backward reduction order, reporting around 85-90% of the nondeterministic mode's throughput in their benchmarks.
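The conditional-rescaling idea is easiest to see in the standard online-softmax recurrence from the FlashAttention line of work: the running sum and accumulator only need a correction multiply when a new tile actually raises the running max. A minimal NumPy sketch along those lines (not the authors' kernel code):

```python
import numpy as np

def online_softmax_matvec(score_tiles, value_tiles):
    """Streaming computation of softmax(scores) @ values over tiles.

    Online softmax with conditional rescaling: the running sum and
    accumulator are rescaled only when a tile raises the running max.
    """
    m = -np.inf                              # running max of all scores seen
    l = 0.0                                  # running sum of exp(score - m)
    acc = np.zeros(value_tiles[0].shape[1])  # running unnormalized output

    for s, v in zip(score_tiles, value_tiles):
        tile_max = float(s.max())
        if tile_max > m:                     # conditional rescale: max moved
            correction = np.exp(m - tile_max)  # exp(-inf) == 0.0 on first tile
            l *= correction
            acc *= correction
            m = tile_max
        p = np.exp(s - m)                    # unnormalized tile probabilities
        l += p.sum()
        acc += p @ v
    return acc / l

# Verify against a direct softmax over all scores at once.
rng = np.random.default_rng(0)
scores, values = rng.normal(size=16), rng.normal(size=(16, 4))
tiles = range(0, 16, 4)
out = online_softmax_matvec([scores[i:i+4] for i in tiles],
                            [values[i:i+4] for i in tiles])
w = np.exp(scores - scores.max()); ref = (w / w.sum()) @ values
assert np.allclose(out, ref)
```

In a kernel, skipping the rescale when the max is unchanged saves a multiply over the whole accumulator tile, which matters once the exponential path is the bottleneck.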

Reported performance numbers are strong: up to 1605 TFLOPS (about 71% utilization) on B200 in BF16, 1.1-1.3x faster than cuDNN 9.13 in the forward pass, and 2.1-2.7x faster than Triton in the same setting. The article also notes ongoing collaboration with NVIDIA's cuDNN team and compares against newer cuDNN versions in its benchmark sections.
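For reference, the ~71% figure squares with the ~2.25 PFLOPS dense BF16 peak quoted earlier:

```python
# Reported throughput vs. the ~2.25 PFLOPS dense BF16 peak cited above.
peak_tflops = 2250.0      # B200 dense BF16 peak (from the post's H100->B200 comparison)
achieved_tflops = 1605.0  # reported FlashAttention-4 forward throughput
print(f"utilization: {achieved_tflops / peak_tflops:.1%}")  # -> 71.3%
```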

Implementation details are also notable for practitioners. FlashAttention-4 is described as written entirely in CuTe-DSL (the CUTLASS Python DSL), with the team claiming roughly 20-30x faster compile times than heavyweight C++ template workflows.

As always, these are author-published benchmarks and should be validated on workload-specific shapes and masks. Still, the release is a meaningful signal for LLM training and long-context inference stacks where attention remains a first-order cost.

Community source: r/LocalLLaMA thread
Original article: Together AI FlashAttention-4 post
