FlashAttention-4 targets Blackwell bottlenecks with overlap-first kernel design
A Reddit thread in r/LocalLLaMA surfaced the release of FlashAttention-4 by Together AI and the FlashAttention research team, an attention-kernel redesign aimed at NVIDIA Blackwell GPUs. The central claim is that newer accelerators are increasingly asymmetrical: tensor-core throughput is scaling faster than shared-memory bandwidth and SFU (special function unit) throughput, so kernel strategy has to optimize for overlap rather than raw GEMM speed alone.
The authors state that from H100 to B200, BF16 tensor throughput rises from roughly 1.0 to 2.25 PFLOPS while SFU count and shared-memory bandwidth stay essentially flat. Based on that profile, FlashAttention-4 focuses on two pressure points:
- Forward pass: overlap MMA work with softmax exponential costs, including a hybrid hardware/software exp path.
- Backward pass: reduce shared-memory pressure via TMEM placement and Blackwell 2-CTA MMA modes.
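The forward-pass pressure point can be made concrete with back-of-envelope arithmetic: per attention tile, MMA work scales with the head dimension while exp work does not, so as tensor throughput roughly doubles and SFU throughput stays flat, the exp share of tile time grows. The hardware rates below are illustrative placeholders, not measured specs.

```python
# Why softmax exp becomes the bottleneck as tensor cores outpace SFUs.
# Per Q.K^T tile: MMA flops scale with head_dim, exp ops do not.
TILE_M, TILE_N, HEAD_D = 128, 128, 128

mma_flops = 2 * TILE_M * TILE_N * HEAD_D   # multiply-add count for one tile
exp_ops = TILE_M * TILE_N                  # one exp per score entry

def time_fraction_exp(tensor_tflops, sfu_gexp_per_s):
    """Fraction of tile time spent in exp if MMA and exp ran serially."""
    t_mma = mma_flops / (tensor_tflops * 1e12)
    t_exp = exp_ops / (sfu_gexp_per_s * 1e9)
    return t_exp / (t_mma + t_exp)

# Illustrative numbers: BF16 tensor throughput ~1.0 -> ~2.25 PFLOPS
# (per the article), SFU exp rate held constant (placeholder 4 Texp/s).
frac_h100 = time_fraction_exp(1000, 4000)
frac_b200 = time_fraction_exp(2250, 4000)
print(f"exp share of serial tile time: H100 ~{frac_h100:.0%}, B200 ~{frac_b200:.0%}")
```

The growing exp share is exactly what an overlap-first schedule hides: if exp for one tile runs while MMA for the next proceeds, the slower unit stops serializing the pipeline.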
The post links to an in-depth technical write-up that details ping-pong tile schedules, conditional online softmax rescaling, TMEM reuse plans, and DSMEM exchange for dQ-related decomposition. It also describes a deterministic mode for backward reduction order, reporting around 85-90% of nondeterministic throughput in their benchmarks.
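The conditional-rescaling idea mentioned above can be sketched in plain Python: online softmax keeps a running max, normalizer, and accumulator per query row, and only rescales the running state when a new block raises the row max. This is an illustrative scalar sketch of the technique, not the FlashAttention-4 kernel.

```python
import math

def online_softmax_row(score_blocks, value_blocks):
    """Attention for one query row, processing key/value blocks online.
    Rescales running state only when the row max changes (conditional
    rescaling); result matches a full-row softmax-weighted sum."""
    m = float("-inf")   # running max of scores seen so far
    l = 0.0             # running sum of exp(score - m)
    acc = 0.0           # running sum of exp(score - m) * value
    for scores, values in zip(score_blocks, value_blocks):
        block_max = max(scores)
        if block_max > m:                     # conditional rescale
            scale = math.exp(m - block_max)   # exp(-inf) == 0.0 on first block
            l *= scale
            acc *= scale
            m = block_max
        p = [math.exp(s - m) for s in scores]
        l += sum(p)
        acc += sum(pi * vi for pi, vi in zip(p, values))
    return acc / l
```

Skipping the rescale when the max is unchanged saves exp and multiply work on the common path, which matters precisely because SFU throughput is the scarce resource.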
Reported performance numbers are strong: up to 1605 TFLOPS (about 71% of peak BF16 utilization) on B200, 1.1-1.3x faster than cuDNN 9.13 in the forward pass, and 2.1-2.7x faster than Triton in the same setting. The article also notes ongoing collaboration with NVIDIA's cuDNN team and includes comparisons against newer cuDNN versions in its benchmark sections.
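The utilization figure follows directly from the two numbers the article quotes, which is a quick consistency check worth doing on any vendor benchmark:

```python
# Reported achieved throughput vs. the ~2.25 PFLOPS BF16 peak quoted earlier.
achieved_tflops = 1605
peak_tflops = 2250
utilization = achieved_tflops / peak_tflops
print(f"{utilization:.1%}")  # ~71.3%, consistent with the ~71% claim
```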
Implementation details are also notable for practitioners. FlashAttention-4 is described as being written entirely in CuTe-DSL (the CUTLASS Python DSL), with the team claiming roughly 20-30x faster compile times compared with heavy C++ template workflows.
As always, these are author-published benchmarks and should be validated on workload-specific shapes and masks. Still, the release is a meaningful signal for LLM training and long-context inference stacks where attention remains a first-order cost.
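For the "validate on your own shapes" advice, even a minimal harness helps frame expectations: time a reference attention over the (seq_len, head_dim) pairs your workload actually uses, then compare the optimized kernel against the same shapes. The shapes and the naive NumPy kernel below are assumptions for illustration; real validation would time FlashAttention-4 or cuDNN on the GPU with your masks applied.

```python
import time
import numpy as np

def naive_attention(q, k, v):
    """Reference single-head attention: softmax(q k^T / sqrt(d)) v."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    s -= s.max(axis=-1, keepdims=True)   # subtract row max for stability
    p = np.exp(s)
    p /= p.sum(axis=-1, keepdims=True)   # row-wise softmax
    return p @ v

# Placeholder workload shapes; substitute your model's real ones.
for seq_len, head_dim in [(512, 64), (1024, 128)]:
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((seq_len, head_dim)).astype(np.float32)
               for _ in range(3))
    t0 = time.perf_counter()
    out = naive_attention(q, k, v)
    dt = time.perf_counter() - t0
    print(f"seq={seq_len} d={head_dim}: {dt * 1e3:.2f} ms, out {out.shape}")
```

The reference kernel doubles as a correctness oracle: the optimized kernel's output on the same inputs should match it to within the dtype's tolerance.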
Community source: r/LocalLLaMA thread
Original article: Together AI FlashAttention-4 post