Inception Labs Launches Mercury 2: Diffusion-Based LLM Hits 1,000 Tokens Per Second
A New Architecture Challenges the LLM Status Quo
On February 24, 2026, Inception Labs launched Mercury 2, the world's first production-grade diffusion-based reasoning language model. Unlike conventional autoregressive models that generate text one token at a time, Mercury 2 starts with a rough draft of the full output and iteratively refines multiple tokens in parallel through a diffusion process — akin to an editor revising an entire draft simultaneously rather than word by word.
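The parallel-refinement idea above can be illustrated with a toy sketch. This is an assumption-laden illustration of diffusion-style decoding in general, not Mercury 2's actual algorithm: we start from a fully masked "rough draft" and, over a few passes, commit batches of positions in parallel rather than one token at a time.

```python
import random

MASK = "_"

def toy_diffusion_decode(target, steps=4, seed=0):
    """Reveal `target` over several parallel refinement passes.

    Toy stand-in for diffusion decoding: each pass "denoises" a whole
    batch of masked positions at once, whereas an autoregressive model
    would commit exactly one token per step.
    """
    rng = random.Random(seed)
    draft = [MASK] * len(target)           # rough draft: everything masked
    masked = list(range(len(target)))
    per_step = max(1, len(target) // steps)
    while masked:
        # Pick a batch of positions and refine them simultaneously.
        batch = rng.sample(masked, min(per_step, len(masked)))
        for i in batch:
            draft[i] = target[i]           # stand-in for the model's prediction
            masked.remove(i)
    return "".join(draft)

print(toy_diffusion_decode("hello world"))  # → hello world
```

The point of the sketch is the loop structure: quality comes from repeated whole-sequence refinement, and speed from committing many positions per pass.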
Speed and Cost Advantages
Mercury 2 achieves 1,009 tokens per second on NVIDIA Blackwell GPUs — roughly 11x faster than Claude Haiku 4.5 with reasoning (89 tokens/s) and roughly 14x faster than GPT-5 Mini (71 tokens/s). End-to-end latency is just 1.7 seconds, compared to 14.4 seconds for Gemini 3 Flash and 23.4 seconds for Claude Haiku 4.5.
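The speedup multiples follow directly from the quoted throughput figures:

```python
# Speedup ratios implied by the throughput numbers quoted above.
mercury_tps = 1009    # Mercury 2, tokens/s
haiku_tps = 89        # Claude Haiku 4.5 with reasoning
gpt5_mini_tps = 71    # GPT-5 Mini

print(round(mercury_tps / haiku_tps, 1))      # → 11.3
print(round(mercury_tps / gpt5_mini_tps, 1))  # → 14.2
```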
Pricing is equally competitive: $0.25 per million input tokens and $0.75 per million output tokens — approximately half the cost of Gemini 3 Flash and a quarter the cost of Claude Haiku 4.5.
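At those per-token prices, workload cost is a straightforward calculation. A minimal sketch, using the quoted Mercury 2 rates; the request sizes and volume are hypothetical:

```python
# Quoted Mercury 2 prices, expressed per token.
INPUT_PRICE = 0.25 / 1_000_000    # $/input token
OUTPUT_PRICE = 0.75 / 1_000_000   # $/output token

def request_cost(input_tokens, output_tokens):
    """Dollar cost of one API request at the quoted rates."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Hypothetical workload: 1M requests, each 2,000 input + 500 output tokens.
total = 1_000_000 * request_cost(2_000, 500)
print(f"${total:,.2f}")  # → $875.00
```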
Benchmark Performance
Mercury 2 scores 74 on GPQA Diamond, 67 on LiveCodeBench, and 71 on IFBench, demonstrating competitive reasoning quality while delivering unprecedented speed. The model supports a 128K context window, tool use, and JSON output.
Diffusion Meets Language Reasoning
Founded by researchers from Stanford, UCLA, and Cornell who pioneered diffusion techniques in image generation, Inception Labs is commercializing that paradigm for text. Mercury 2 is positioned for real-time AI agent workloads and high-frequency API applications where latency and cost are critical constraints.
Related Articles
Hacker News reacted fast because I-DLM isn't promising faster text generation someday; it claims diffusion-style decoding can keep pace with autoregressive quality now. The thread quickly turned into a reality check on whether the 2.9x-4.1x throughput story can survive real inference stacks.
A high-ranking Hacker News thread amplified Apple's paper on simple self-distillation for code generation, a training recipe that improves pass@1 without verifier models or reinforcement learning.
Stanford's public CS25 course is again operating as an open lecture stream for Transformer research, with Zoom access, recordings, and a community layer that extends beyond campus.