Cursor details warp decode for Blackwell GPUs, claiming 1.84x faster MoE inference
Original: We rebuilt how MoE models generate tokens on Blackwell GPUs, resulting in 1.84x faster inference and more accurate outputs. These improvements directly contribute to how we train Composer, allowing us to ship improved versions of the model more often.
In an April 6 X post, Cursor said it rebuilt how mixture-of-experts models generate tokens on NVIDIA Blackwell GPUs, claiming 1.84x faster inference and more accurate outputs. The linked engineering post introduces "warp decode," a kernel design that reorganizes decode-time computation around outputs rather than experts. Cursor says the change directly helps how it trains and serves Composer, its coding model.
The technical argument is that conventional MoE decode pipelines are expert-centric: they gather tokens per expert, pad and scatter data, run the math, and then combine results. Cursor argues that this makes sense for prefill and large batches but wastes work at autoregressive decode time, especially with small batch sizes on Blackwell. Warp decode flips that design. Each warp is assigned one output scalar, streams the needed weight rows directly, accumulates over routed experts, and writes only the final result. Cursor says this lets the system eliminate padding, scatter, combine, and intermediate buffers, compressing the MoE layer into two kernels.
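The contrast between the two designs can be sketched in a few lines of NumPy. This is a toy illustration, not Cursor's kernel: the dimensions, routing weights, and variable names are invented, and the inner loop only mimics the idea of one warp owning one output scalar and streaming the weight rows it needs. Both paths compute the same result; the difference is that the output-centric path needs no per-expert intermediate buffers or combine step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 8, 4, 4   # toy sizes, chosen arbitrarily

x = rng.standard_normal(d_in)                      # one decode-time token
W = rng.standard_normal((n_experts, d_out, d_in))  # per-expert weight matrices
experts = np.array([0, 2])                         # hypothetical top-k routing
gates = np.array([0.6, 0.4])                       # hypothetical gate weights

# Expert-centric: run each routed expert fully, buffer its full output
# vector, then combine the buffered partials at the end.
partials = np.stack([W[e] @ x for e in experts])   # intermediate buffers
y_expert_centric = (gates[:, None] * partials).sum(axis=0)

# Output-centric ("warp decode" style): each iteration stands in for one
# warp that owns a single output scalar, streams one weight row per routed
# expert, accumulates across experts, and writes only the final value.
y_output_centric = np.zeros(d_out)
for o in range(d_out):
    acc = 0.0
    for g, e in zip(gates, experts):
        acc += g * (W[e, o] @ x)                   # stream one weight row
    y_output_centric[o] = acc                      # single final write

assert np.allclose(y_expert_centric, y_output_centric)
```

On real hardware the payoff comes from memory behavior rather than arithmetic: the output-centric loop never materializes padded per-expert batches or scattered partial outputs, which is the work Cursor says it eliminated.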
Why Cursor says the change matters for Composer
Cursor’s blog claims the new path improves both speed and numerical quality. On its internal inference system running a Qwen-3-style model on NVIDIA B200 GPUs, the company reports a flat 1.84x decode-throughput gain across context lengths and outputs 1.4x closer to a full FP32 reference than the classical path. It also says warp decode sustains 3.95 TB/s on B200, about 58% of measured peak memory-read throughput. Those are infrastructure-level gains, but Cursor explicitly ties them back to product velocity: faster and cleaner inference helps the company train Composer faster and ship improved versions more often.
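The bandwidth figures are internally consistent and imply a measured peak that can be back-calculated from the post's numbers. The division below is our arithmetic, not a figure Cursor states; the result lands near 6.8 TB/s, plausibly below B200's roughly 8 TB/s theoretical HBM3e bandwidth.

```python
# Figures reported in Cursor's post.
sustained_tb_s = 3.95      # sustained memory-read throughput on B200
fraction_of_peak = 0.58    # "about 58% of measured peak"

# Implied measured-peak bandwidth (our back-calculation, not a quoted figure).
implied_peak_tb_s = sustained_tb_s / fraction_of_peak
print(f"implied measured peak: {implied_peak_tb_s:.2f} TB/s")  # ~6.81 TB/s
```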
The broader takeaway is that AI model competition is not just about better training data or larger models. At this stage, low-level inference engineering can translate directly into faster iteration cycles and cheaper deployment for developer-facing products. Cursor is making the case that GPU kernel design itself is now a product lever.
Related Articles
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A technical LocalLLaMA thread translated the FlashAttention-4 paper into practical deployment guidance, emphasizing huge Blackwell gains, faster Python-based kernel development, and the fact that most A100 or consumer-GPU users cannot use the full benefits yet.
At GTC on March 16, 2026, NVIDIA announced Dynamo 1.0 as a production-grade open source inference stack for generative and agentic AI. NVIDIA says Dynamo can boost Blackwell inference performance by up to 7x while integrating with major frameworks and cloud providers.