Cursor details warp decode for Blackwell GPUs, claiming 1.84x faster MoE inference
Original: We rebuilt how MoE models generate tokens on Blackwell GPUs, resulting in 1.84x faster inference and more accurate outputs. These improvements directly contribute to how we train Composer, allowing us to ship improved versions of the model more often.
In an April 6 X post, Cursor said it rebuilt how mixture-of-experts models generate tokens on NVIDIA Blackwell GPUs, claiming 1.84x faster inference and more accurate outputs. The linked engineering post introduces "warp decode," a kernel design that reorganizes decode-time computation around outputs rather than experts. Cursor says the change directly helps how it trains and serves Composer, its coding model.
The technical argument is that conventional MoE decode pipelines are expert-centric: they gather tokens per expert, pad and scatter data, run the math, and then combine results. Cursor argues that this makes sense for prefill and large batches but wastes work at autoregressive decode time, especially with small batch sizes on Blackwell. Warp decode flips that design. Each warp is assigned one output scalar, streams the needed weight rows directly, accumulates over routed experts, and writes only the final result. Cursor says this lets the system eliminate padding, scatter, combine, and intermediate buffers, compressing the MoE layer into two kernels.
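The contrast between the two layouts can be illustrated with a small CPU sketch. This is not Cursor's kernel: it is plain Python, it assumes each expert is a single linear map (a real MoE expert has more stages), and the names `expert_centric`, `output_centric`, `weights`, and `routes` are hypothetical. It only shows that looping over output scalars and accumulating across routed experts yields the same result as the gather, compute, and combine path, without a per-expert intermediate buffer.

```python
import random

def expert_centric(x, weights, routes):
    """Classical layout: bucket tokens per expert, compute, then combine."""
    n_tokens, out_dim = len(x), len(weights[0])
    # Gather: group (token, gate) pairs by the expert they were routed to.
    buckets = {}
    for t, picks in enumerate(routes):
        for e, gate in picks:
            buckets.setdefault(e, []).append((t, gate))
    # Compute: one full partial output vector per (token, expert) pair.
    partials = []  # stands in for the intermediate buffer
    for e, members in buckets.items():
        for t, gate in members:
            y = [sum(w * v for w, v in zip(row, x[t])) for row in weights[e]]
            partials.append((t, gate, y))
    # Combine: scatter the weighted partials back into per-token outputs.
    out = [[0.0] * out_dim for _ in range(n_tokens)]
    for t, gate, y in partials:
        for j in range(out_dim):
            out[t][j] += gate * y[j]
    return out

def output_centric(x, weights, routes):
    """Output-centric layout: each output scalar is produced in one pass."""
    out_dim = len(weights[0])
    out = []
    for t, picks in enumerate(routes):
        token_out = []
        for j in range(out_dim):  # on a GPU, one warp would own this scalar
            acc = 0.0
            for e, gate in picks:  # accumulate over the routed experts
                # Stream row j of expert e and dot it with the token.
                acc += gate * sum(w * v for w, v in zip(weights[e][j], x[t]))
            token_out.append(acc)  # single write of the final result
        out.append(token_out)
    return out

# Tiny illustrative problem; all sizes and gate values are made up.
random.seed(0)
n_experts, in_dim, out_dim, top_k = 4, 8, 6, 2
weights = [[[random.uniform(-1, 1) for _ in range(in_dim)]
            for _ in range(out_dim)] for _ in range(n_experts)]
x = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(3)]
routes = [[(e, 1.0 / top_k) for e in random.sample(range(n_experts), top_k)]
          for _ in x]

a = expert_centric(x, weights, routes)
b = output_centric(x, weights, routes)
max_diff = max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
print("max abs difference:", max_diff)
```

The two functions agree up to floating-point rounding, since they sum the same terms in a possibly different order. In the output-centric version the bucketing, the `partials` list, and the separate combine loop all disappear, which is the kind of work the article says warp decode eliminates.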
Why Cursor says the change matters for Composer
Cursor’s blog claims the new path improves both speed and numerical quality. On its internal inference system running a Qwen-3-style model on NVIDIA B200 GPUs, the company reports a 1.84x decode-throughput gain that holds steady across context lengths, along with outputs that are 1.4x closer to a full FP32 reference than those of the classical path. It also says warp decode sustains 3.95 TB/s on B200, about 58% of measured peak memory-read throughput. Those are infrastructure-level gains, but Cursor explicitly ties them back to product velocity: faster and cleaner inference helps the company train Composer faster and ship improved versions more often.
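For a rough sense of scale, the reported figures can be turned into two derived numbers. The arithmetic below is a back-of-the-envelope sketch, not something Cursor published: the implied peak is only approximate because "about 58%" is rounded, and the per-token time reduction assumes throughput is the only thing that changed. The variable names are illustrative.

```python
# Figures as reported in the article.
sustained_tb_s = 3.95      # sustained memory-read throughput on B200
fraction_of_peak = 0.58    # "about 58%" of measured peak (rounded)
speedup = 1.84             # decode-throughput gain

# Derived (not reported): measured peak implied by the two bandwidth numbers.
implied_peak_tb_s = sustained_tb_s / fraction_of_peak

# Derived (not reported): fraction of per-token decode time saved,
# assuming the same work is simply done 1.84x faster.
time_saved = 1 - 1 / speedup

print(f"implied measured peak: ~{implied_peak_tb_s:.1f} TB/s")
print(f"decode time saved per token: ~{time_saved:.0%}")
```

Read this way, the figures imply a measured peak a little under 7 TB/s, and a 1.84x throughput gain corresponds to a bit less than half of the per-token decode time being removed.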
The broader takeaway is that AI model competition is not just about better training data or larger models. At this stage, low-level inference engineering can translate directly into faster iteration cycles and cheaper deployment for developer-facing products. Cursor is making the case that GPU kernel design itself is now a product lever.