Cursor details warp decode for Blackwell GPUs, claiming 1.84x faster MoE inference
Original: We rebuilt how MoE models generate tokens on Blackwell GPUs, resulting in 1.84x faster inference and more accurate outputs. These improvements directly contribute to how we train Composer, allowing us to ship improved versions of the model more often.
In an April 6 X post, Cursor said it rebuilt how mixture-of-experts models generate tokens on NVIDIA Blackwell GPUs, claiming 1.84x faster inference and more accurate outputs. The linked engineering post introduces "warp decode," a kernel design that reorganizes decode-time computation around outputs rather than experts. Cursor says the change directly helps how it trains and serves Composer, its coding model.
The technical argument is that conventional MoE decode pipelines are expert-centric: they gather tokens per expert, pad and scatter data, run the math, and then combine results. Cursor argues that this makes sense for prefill and large batches but wastes work at autoregressive decode time, especially with small batch sizes on Blackwell. Warp decode flips that design. Each warp is assigned one output scalar, streams the needed weight rows directly, accumulates over routed experts, and writes only the final result. Cursor says this lets the system eliminate padding, scatter, combine, and intermediate buffers, compressing the MoE layer into two kernels.
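The contrast between the two designs can be sketched in a few lines of NumPy. This is a toy illustration, not Cursor's kernel: the dimensions, routing weights, and variable names are invented, and the inner loop only mimics the idea of one warp owning one output scalar and streaming the weight rows it needs. Both paths compute the same result; the difference is that the output-centric path needs no per-expert intermediate buffers or combine step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 8, 4, 4   # toy sizes, chosen arbitrarily

x = rng.standard_normal(d_in)                      # one decode-time token
W = rng.standard_normal((n_experts, d_out, d_in))  # per-expert weight matrices
experts = np.array([0, 2])                         # hypothetical top-k routing
gates = np.array([0.6, 0.4])                       # hypothetical gate weights

# Expert-centric: run each routed expert fully, buffer its full output
# vector, then combine the buffered partials at the end.
partials = np.stack([W[e] @ x for e in experts])   # intermediate buffers
y_expert_centric = (gates[:, None] * partials).sum(axis=0)

# Output-centric ("warp decode" style): each iteration stands in for one
# warp that owns a single output scalar, streams one weight row per routed
# expert, accumulates across experts, and writes only the final value.
y_output_centric = np.zeros(d_out)
for o in range(d_out):
    acc = 0.0
    for g, e in zip(gates, experts):
        acc += g * (W[e, o] @ x)                   # stream one weight row
    y_output_centric[o] = acc                      # single final write

assert np.allclose(y_expert_centric, y_output_centric)
```

On real hardware the payoff comes from memory behavior rather than arithmetic: the output-centric loop never materializes padded per-expert batches or scattered partial outputs, which is the work Cursor says it eliminated.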
Why Cursor says the change matters for Composer
Cursor’s blog claims the new path improves both speed and numerical quality. On its internal inference system running a Qwen-3-style model on NVIDIA B200 GPUs, the company reports a flat 1.84x decode-throughput gain across context lengths and outputs 1.4x closer to a full FP32 reference than the classical path. It also says warp decode sustains 3.95 TB/s on B200, about 58% of measured peak memory-read throughput. Those are infrastructure-level gains, but Cursor explicitly ties them back to product velocity: faster and cleaner inference helps the company train Composer faster and ship improved versions more often.
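The bandwidth figures are internally consistent and imply a measured peak that can be back-calculated from the post's numbers. The division below is our arithmetic, not a figure Cursor states; the result lands near 6.8 TB/s, plausibly below B200's roughly 8 TB/s theoretical HBM3e bandwidth.

```python
# Figures reported in Cursor's post.
sustained_tb_s = 3.95      # sustained memory-read throughput on B200
fraction_of_peak = 0.58    # "about 58% of measured peak"

# Implied measured-peak bandwidth (our back-calculation, not a quoted figure).
implied_peak_tb_s = sustained_tb_s / fraction_of_peak
print(f"implied measured peak: {implied_peak_tb_s:.2f} TB/s")  # ~6.81 TB/s
```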
The broader takeaway is that AI model competition is not just about better training data or larger models. At this stage, low-level inference engineering can translate directly into faster iteration cycles and cheaper deployment for developer-facing products. Cursor is making the case that GPU kernel design itself is now a product lever.
Related Articles
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A technical LocalLLaMA thread translated the FlashAttention-4 paper into practical deployment guidance, emphasizing huge Blackwell gains, faster Python-based kernel development, and the fact that most A100 or consumer-GPU users cannot use the full benefits yet.
At GTC on March 16, 2026, NVIDIA announced Dynamo 1.0 as a production-grade open source inference stack for generative and agentic AI. NVIDIA says Dynamo can boost Blackwell inference performance by up to 7x while integrating with major frameworks and cloud providers.