Qwen's FlashQLA lifts linear attention speed 2-3x on Hopper
Original: Qwen released FlashQLA with 2-3x forward linear attention speedup
What the release actually shipped
Alibaba’s Qwen team used X to turn a low-level systems result into a concrete open-source release. The official account introduced FlashQLA, a linear-attention kernel library built on TileLang, and led with the numbers that matter: 2-3x forward speedup and 2x backward speedup. That is material because long-context and edge-side agent workloads are often constrained less by model quality than by whether attention kernels stay fast enough once sequence length and memory pressure climb.
“2-3× forward speedup. 2× backward speedup.”
The linked GitHub repository adds the engineering context behind the headline. FlashQLA targets GDN Chunked Prefill and benchmarks against the FLA Triton baseline on NVIDIA Hopper hardware across head configurations used in the Qwen3.5 and Qwen3.6 families. The README says the gains are especially strong in pretraining scenarios and edge-side agentic inference. Qwen attributes the improvement to three design choices: gate-driven automatic intra-card context parallelism, a hardware-friendly algebraic reformulation of the forward and backward flows, and fused warp-specialized kernels built in TileLang.
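For readers new to the pattern, chunked prefill for linear attention splits the sequence into blocks, carries a running kᵀv state across blocks, and handles causality only inside each block; that is the structure kernels like FlashQLA fuse and parallelize on-device. Below is a minimal, unnormalized PyTorch sketch of that generic pattern (illustrative only: the function name is ours, and it omits the gating, normalization, and algebraic reformulation the release describes, so it is not FlashQLA's code):

```python
import torch

def chunked_linear_attention(q, k, v, chunk_size=64):
    """Unnormalized causal linear attention, computed chunk by chunk.

    q, k: (seq, d_k); v: (seq, d_v). A running state S = sum(k^T v)
    lets each chunk attend to all earlier context through a small
    d_k x d_v matrix, with an explicit causal term inside the chunk.
    """
    seq, d_k = q.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v, dtype=q.dtype, device=q.device)   # cross-chunk state
    out = torch.empty(seq, d_v, dtype=q.dtype, device=q.device)
    for start in range(0, seq, chunk_size):
        end = min(start + chunk_size, seq)
        qc, kc, vc = q[start:end], k[start:end], v[start:end]
        inter = qc @ S                                  # contribution of earlier chunks
        scores = qc @ kc.T                              # intra-chunk scores
        mask = torch.tril(torch.ones(end - start, end - start,
                                     dtype=torch.bool, device=q.device))
        intra = scores.masked_fill(~mask, 0.0) @ vc     # causal intra-chunk term
        out[start:end] = inter + intra
        S = S + kc.T @ vc                               # fold this chunk into the state
    return out
```

A fused kernel performs the same math without materializing the intermediate score and mask tensors, which is where speedups of the reported magnitude typically come from.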
Why kernel releases like this matter to model strategy
This is the kind of infrastructure work that decides whether “run it locally” or “serve it cheaply” stays marketing talk or becomes practical. Qwen is explicitly pitching FlashQLA for long-context workloads, smaller models, TP-heavy setups, and personal-device agents, all places where inefficient kernels quickly erase the appeal of a model family. The public repo also means the release is inspectable rather than purely promotional: developers can read the code, check the benchmark setup, and test whether the speedups survive outside Qwen’s own stack.
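Checking that last point is mostly a matter of a careful timing harness. A generic sketch with CUDA events (the `kernel` callable and its signature are placeholders, not FlashQLA's API):

```python
import torch

def bench(kernel, q, k, v, warmup=10, iters=100):
    """Mean per-call latency in milliseconds, measured with device-side events."""
    for _ in range(warmup):            # warm up clocks, caches, and any JIT
        kernel(q, k, v)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        kernel(q, k, v)
    end.record()
    torch.cuda.synchronize()           # wait for all queued work to finish
    return start.elapsed_time(end) / iters
```

Running the same shapes through the FLA Triton baseline and through FlashQLA on identical hardware is what turns a reported 2-3x into a reproducible number.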
The Qwen account usually uses X for model and systems releases tied to real artifacts, and this launch follows that pattern. What to watch next is adoption: whether FlashQLA lands in broader open inference runtimes, whether the same gains hold beyond Hopper-class hardware, and whether the edge-side story proves out for real agent deployments instead of benchmark demos.

Source: Qwen source tweet · Qwen blog entry · GitHub repository
Related Articles
Kernel work can shift the cost curve faster than another small model launch, and Qwen is leaning into that angle. In its X post, the team claimed 2-3x forward speedups and 2x backward speedups for Hopper-based linear attention workloads, with code already live on GitHub.
The top comment went straight to the CP joke, but the post held because the technical claim was concrete: 2-3x forward speedups and 2x backward speedups for GDN chunked prefill, aimed at long-context and edge-side agentic inference.
LocalLLaMA lit up at the idea that a 27B model could tie Sonnet 4.6 on an agentic index, but the thread turned just as fast to benchmark gaming, real context windows, and what people can actually run at home.