Inception Labs Launches Mercury 2: Diffusion-Based LLM Hits 1,000 Tokens Per Second
A New Architecture Challenges the LLM Status Quo
On February 24, 2026, Inception Labs launched Mercury 2, the world's first production-grade diffusion-based reasoning language model. Unlike conventional autoregressive models that generate text one token at a time, Mercury 2 starts with a rough draft of the full output and iteratively refines multiple tokens in parallel through a diffusion process — akin to an editor revising an entire draft simultaneously rather than word by word.
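The parallel-refinement idea can be illustrated with a toy sketch. This is not Inception Labs' actual method (which is unpublished here); it is a minimal masked-denoising loop, assuming a stand-in random "denoiser", that shows the structural difference from one-token-at-a-time decoding: each step commits several positions of the draft at once.

```python
import random

random.seed(0)

VOCAB = ["the", "cat", "sat", "on", "mat", "a"]
MASK = "<mask>"

def diffusion_generate(length=8, steps=4):
    # Start from a fully masked "rough draft" of the whole output,
    # rather than an empty prefix as in autoregressive decoding.
    draft = [MASK] * length
    per_step = max(1, length // steps)
    for _ in range(steps):
        # Refine several positions in parallel each step. A real
        # diffusion LM would pick tokens with a learned denoiser;
        # here random choice stands in for that model.
        masked = [i for i, t in enumerate(draft) if t == MASK]
        for i in masked[:per_step]:
            draft[i] = random.choice(VOCAB)
    # Final pass: fill any position still masked.
    return [t if t != MASK else random.choice(VOCAB) for t in draft]

print(" ".join(diffusion_generate()))
```

Because each refinement step touches many positions, the number of sequential model calls scales with the step count, not the output length, which is the source of the throughput advantage claimed for this architecture.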
Speed and Cost Advantages
Mercury 2 achieves 1,009 tokens per second on NVIDIA Blackwell GPUs, roughly 11x faster than Claude Haiku 4.5 with reasoning (89 tokens/s) and 14x faster than GPT-5 Mini (71 tokens/s). End-to-end latency is just 1.7 seconds, versus 14.4 seconds for Gemini 3 Flash and 23.4 seconds for Claude Haiku 4.5.
Pricing is equally competitive: $0.25 per million input tokens and $0.75 per million output tokens, roughly half the price of Gemini 3 Flash and a quarter that of Claude Haiku 4.5.
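The published prices and throughput translate directly into per-request numbers. A small calculation, using an illustrative request size (2,000 input / 500 output tokens is an assumption, not a figure from the announcement):

```python
# Published figures: $0.25/M input tokens, $0.75/M output tokens,
# 1,009 output tokens per second.
INPUT_PRICE = 0.25 / 1_000_000   # dollars per input token
OUTPUT_PRICE = 0.75 / 1_000_000  # dollars per output token
THROUGHPUT = 1_009               # output tokens per second

def request_cost(input_tokens, output_tokens):
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

def generation_seconds(output_tokens):
    # Decode time only; ignores network and prompt-processing overhead.
    return output_tokens / THROUGHPUT

print(f"${request_cost(2_000, 500):.6f}")   # $0.000875
print(f"{generation_seconds(500):.2f} s")   # 0.50 s
```

At these rates, a hypothetical 2,000-in / 500-out request costs under a tenth of a cent and its output decodes in about half a second, which is the regime the latency figures above describe.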
Benchmark Performance
Mercury 2 scores 74 on GPQA Diamond, 67 on LiveCodeBench, and 71 on IFBench, demonstrating competitive reasoning quality while delivering unprecedented speed. The model supports a 128K context window, tool use, and JSON output.
Diffusion Meets Language Reasoning
Founded by researchers from Stanford, UCLA, and Cornell who pioneered diffusion techniques in image generation, Inception Labs is commercializing that paradigm for text. Mercury 2 is positioned for real-time AI agent workloads and high-frequency API applications where latency and cost are critical constraints.