FlashAttention-4 targets Blackwell bottlenecks with overlap-first kernel design
A Reddit thread in r/LocalLLaMA surfaced the FlashAttention-4 release from Together AI and the FlashAttention research team, an attention-kernel redesign aimed at NVIDIA Blackwell GPUs. The central claim is that newer accelerators are increasingly asymmetrical: tensor-core throughput scales faster than shared-memory bandwidth and special-function-unit (SFU) throughput, so kernel strategy has to optimize for overlap rather than raw GEMM speed alone.
The authors state that from H100 to B200, BF16 tensor-core throughput rises from roughly 1.0 to 2.25 PFLOPS while SFU count and shared-memory bandwidth remain flat. Based on that profile, FlashAttention-4 focuses on two pressure points:
- Forward pass: overlap MMA work with softmax exponential costs, including a hybrid hardware/software exp path.
- Backward pass: reduce shared-memory pressure via TMEM placement and Blackwell 2-CTA MMA modes.
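The article does not spell out the hybrid exp path, but the general idea behind a "software exp" is well known: instead of issuing every exponential to the SFUs, evaluate a short polynomial on the FMA pipeline. A minimal NumPy sketch of that idea (Taylor coefficients and the range-reduction scheme are illustrative choices, not FlashAttention-4's actual implementation):

```python
import numpy as np

LOG2E = 1.4426950408889634  # log2(e), to reduce exp(x) to 2^t

# Degree-3 Taylor coefficients of 2^r around 0 (ln2^k / k!);
# adequate for |r| <= 0.5, where the relative error stays below ~5e-4.
C0, C1, C2, C3 = 1.0, 0.6931471805599453, 0.2402265069591007, 0.05550410866482158

def exp_sw(x):
    """Software exp: range-reduce, evaluate a cubic, rescale exactly."""
    t = x * LOG2E
    n = np.rint(t)                      # integer part: exact power of two
    r = t - n                           # fractional part in [-0.5, 0.5]
    p = C0 + r * (C1 + r * (C2 + r * C3))   # polynomial approx of 2^r (FMA-friendly)
    return np.ldexp(p, n.astype(np.int64))  # p * 2^n, an exact scaling
```

The point of such a path is scheduling, not accuracy: polynomial evaluation maps onto multiply-add units, so exp work can overlap with tensor-core MMA instead of serializing on a flat SFU budget.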
The post links to an in-depth technical write-up that details ping-pong tile schedules, conditional online-softmax rescaling, TMEM reuse plans, and DSMEM exchange for the dQ-related decomposition. It also describes a deterministic mode that fixes the backward-pass reduction order, reporting around 85-90% of the nondeterministic mode's throughput in their benchmarks.
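"Conditional online softmax rescaling" refers to skipping the accumulator rescale whenever a new key tile does not raise the running maximum. A compact NumPy sketch of the idea for a single query (tiling, names, and block size are illustrative, not the kernel's actual code):

```python
import numpy as np

def attention_online_softmax(q, K, V, block=2):
    """Streaming softmax(K @ q) weighted sum of V rows, processed in key tiles.
    The accumulator is rescaled only when the running max actually changes."""
    m = -np.inf                     # running max of scores seen so far
    l = 0.0                         # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])      # unnormalized output accumulator
    for s in range(0, K.shape[0], block):
        scores = K[s:s + block] @ q         # score tile for this key block
        tile_max = scores.max()
        if tile_max > m:                    # conditional rescale: often skipped
            scale = np.exp(m - tile_max)    # exp(-inf) = 0 handles the first tile
            acc *= scale
            l *= scale
            m = tile_max
        p = np.exp(scores - m)              # tile probabilities (unnormalized)
        l += p.sum()
        acc += p @ V[s:s + block]
    return acc / l
```

Because scores rarely set a new maximum after the first few tiles, the branch removes most rescaling multiplies and exponentials from the steady-state loop, which is exactly the kind of SFU and FMA traffic the overlap-first design tries to economize.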
Reported performance numbers are strong: up to 1,605 TFLOP/s (about 71% utilization) on B200 in BF16, 1.1-1.3x faster than cuDNN 9.13 in the forward pass, and 2.1-2.7x faster than Triton in the same setting. The article also notes ongoing collaboration with NVIDIA's cuDNN team and compares against newer cuDNN versions in the benchmark sections.
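The utilization figure is consistent with the peak quoted earlier: a quick arithmetic check against the stated 2.25 PFLOPS (2,250 TFLOP/s) B200 BF16 peak gives roughly the same percentage.

```python
# Sanity-check the reported utilization against the peak stated in the article.
peak_tflops = 2250          # 2.25 PFLOPS BF16, as stated for B200
achieved_tflops = 1605      # reported best attention throughput
utilization = achieved_tflops / peak_tflops
print(f"{utilization:.1%}")  # → 71.3%, matching the "about 71%" claim
```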
Implementation details are also notable for practitioners. FlashAttention-4 is described as being written entirely in CuTe-DSL (the CUTLASS Python DSL), with the team claiming a roughly 20-30x compile-time improvement over heavy C++ template workflows.
As always, these are author-published benchmarks and should be validated on workload-specific shapes and masks. Still, the release is a meaningful signal for LLM training and long-context inference stacks where attention remains a first-order cost.
Community source: r/LocalLLaMA thread
Original article: Together AI FlashAttention-4 post