FlashAttention-4 targets Blackwell bottlenecks with overlap-first kernel design
A Reddit thread in r/LocalLLaMA surfaced the release of FlashAttention-4 by Together AI and the FlashAttention research team, an attention-kernel redesign aimed at NVIDIA Blackwell GPUs. The central claim is that newer accelerators are increasingly asymmetrical: tensor-core throughput is scaling faster than shared-memory bandwidth and SFU (special function unit) throughput, so kernel strategy has to optimize for overlap rather than raw GEMM speed alone.
The authors state that from H100 to B200, BF16 tensor throughput rises from roughly 1.0 to 2.25 PFLOPS while SFU count and shared-memory bandwidth stay essentially flat. Based on that profile, FlashAttention-4 focuses on two pressure points:
- Forward pass: overlap MMA work with softmax exponential costs, including a hybrid hardware/software exp path.
- Backward pass: reduce shared-memory pressure via TMEM placement and Blackwell 2-CTA MMA modes.
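The forward-pass pressure point can be made concrete with back-of-envelope arithmetic: per attention tile, MMA work scales with the head dimension while exp work does not, so as tensor throughput roughly doubles and SFU throughput stays flat, the exp share of tile time grows. The hardware rates below are illustrative placeholders, not measured specs.

```python
# Why softmax exp becomes the bottleneck as tensor cores outpace SFUs.
# Per Q.K^T tile: MMA flops scale with head_dim, exp ops do not.
TILE_M, TILE_N, HEAD_D = 128, 128, 128

mma_flops = 2 * TILE_M * TILE_N * HEAD_D   # multiply-add count for one tile
exp_ops = TILE_M * TILE_N                  # one exp per score entry

def time_fraction_exp(tensor_tflops, sfu_gexp_per_s):
    """Fraction of tile time spent in exp if MMA and exp ran serially."""
    t_mma = mma_flops / (tensor_tflops * 1e12)
    t_exp = exp_ops / (sfu_gexp_per_s * 1e9)
    return t_exp / (t_mma + t_exp)

# Illustrative numbers: BF16 tensor throughput ~1.0 -> ~2.25 PFLOPS
# (per the article), SFU exp rate held constant (placeholder 4 Texp/s).
frac_h100 = time_fraction_exp(1000, 4000)
frac_b200 = time_fraction_exp(2250, 4000)
print(f"exp share of serial tile time: H100 ~{frac_h100:.0%}, B200 ~{frac_b200:.0%}")
```

The growing exp share is exactly what an overlap-first schedule hides: if exp for one tile runs while MMA for the next proceeds, the slower unit stops serializing the pipeline.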
The post links to an in-depth technical write-up that details ping-pong tile schedules, conditional online softmax rescaling, TMEM reuse plans, and DSMEM exchange for dQ-related decomposition. It also describes a deterministic mode for backward reduction order, reporting around 85-90% of nondeterministic throughput in their benchmarks.
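The conditional-rescaling idea mentioned above can be sketched in plain Python: online softmax keeps a running max, normalizer, and accumulator per query row, and only rescales the running state when a new block raises the row max. This is an illustrative scalar sketch of the technique, not the FlashAttention-4 kernel.

```python
import math

def online_softmax_row(score_blocks, value_blocks):
    """Attention for one query row, processing key/value blocks online.
    Rescales running state only when the row max changes (conditional
    rescaling); result matches a full-row softmax-weighted sum."""
    m = float("-inf")   # running max of scores seen so far
    l = 0.0             # running sum of exp(score - m)
    acc = 0.0           # running sum of exp(score - m) * value
    for scores, values in zip(score_blocks, value_blocks):
        block_max = max(scores)
        if block_max > m:                     # conditional rescale
            scale = math.exp(m - block_max)   # exp(-inf) == 0.0 on first block
            l *= scale
            acc *= scale
            m = block_max
        p = [math.exp(s - m) for s in scores]
        l += sum(p)
        acc += sum(pi * vi for pi, vi in zip(p, values))
    return acc / l
```

Skipping the rescale when the max is unchanged saves exp and multiply work on the common path, which matters precisely because SFU throughput is the scarce resource.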
Reported performance numbers are strong: up to 1605 TFLOPS (about 71% of peak BF16 utilization) on B200, 1.1-1.3x faster than cuDNN 9.13 in the forward pass, and 2.1-2.7x faster than Triton in the same setting. The article also notes ongoing collaboration with NVIDIA's cuDNN team and includes comparisons against newer cuDNN versions in its benchmark sections.
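The utilization figure follows directly from the two numbers the article quotes, which is a quick consistency check worth doing on any vendor benchmark:

```python
# Reported achieved throughput vs. the ~2.25 PFLOPS BF16 peak quoted earlier.
achieved_tflops = 1605
peak_tflops = 2250
utilization = achieved_tflops / peak_tflops
print(f"{utilization:.1%}")  # ~71.3%, consistent with the ~71% claim
```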
Implementation details are also notable for practitioners. FlashAttention-4 is described as being written entirely in CuTe-DSL (the CUTLASS Python DSL), with the team claiming roughly 20-30x faster compile times compared with heavy C++ template workflows.
As always, these are author-published benchmarks and should be validated on workload-specific shapes and masks. Still, the release is a meaningful signal for LLM training and long-context inference stacks where attention remains a first-order cost.
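For the "validate on your own shapes" advice, even a minimal harness helps frame expectations: time a reference attention over the (seq_len, head_dim) pairs your workload actually uses, then compare the optimized kernel against the same shapes. The shapes and the naive NumPy kernel below are assumptions for illustration; real validation would time FlashAttention-4 or cuDNN on the GPU with your masks applied.

```python
import time
import numpy as np

def naive_attention(q, k, v):
    """Reference single-head attention: softmax(q k^T / sqrt(d)) v."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    s -= s.max(axis=-1, keepdims=True)   # subtract row max for stability
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)   # row-wise softmax
    return p @ v

# Placeholder workload shapes; substitute your model's real ones.
for seq_len, head_dim in [(512, 64), (1024, 128)]:
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((seq_len, head_dim)).astype(np.float32)
               for _ in range(3))
    t0 = time.perf_counter()
    out = naive_attention(q, k, v)
    dt = time.perf_counter() - t0
    print(f"seq={seq_len} d={head_dim}: {dt * 1e3:.2f} ms, out {out.shape}")
```

The reference kernel doubles as a correctness oracle: the optimized kernel's output on the same inputs should match it to within the dtype's tolerance.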
Community source: r/LocalLLaMA thread
Original article: Together AI FlashAttention-4 post