Inception Labs Launches Mercury 2: Diffusion-Based LLM Hits 1,000 Tokens Per Second
A New Architecture Challenges the LLM Status Quo
On February 24, 2026, Inception Labs launched Mercury 2, the world's first production-grade diffusion-based reasoning language model. Unlike conventional autoregressive models that generate text one token at a time, Mercury 2 starts with a rough draft of the full output and iteratively refines multiple tokens in parallel through a diffusion process — akin to an editor revising an entire draft simultaneously rather than word by word.
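The parallel-refinement idea above can be illustrated with a toy sketch. This is an assumption-laden illustration of diffusion-style decoding in general, not Mercury 2's actual algorithm: we start from a fully masked "rough draft" and, over a few passes, commit batches of positions in parallel rather than one token at a time.

```python
import random

MASK = "_"

def toy_diffusion_decode(target, steps=4, seed=0):
    """Reveal `target` over several parallel refinement passes.

    Toy stand-in for diffusion decoding: each pass "denoises" a whole
    batch of masked positions at once, whereas an autoregressive model
    would commit exactly one token per step.
    """
    rng = random.Random(seed)
    draft = [MASK] * len(target)           # rough draft: everything masked
    masked = list(range(len(target)))
    per_step = max(1, len(target) // steps)
    while masked:
        # Pick a batch of positions and refine them simultaneously.
        batch = rng.sample(masked, min(per_step, len(masked)))
        for i in batch:
            draft[i] = target[i]           # stand-in for the model's prediction
            masked.remove(i)
    return "".join(draft)

print(toy_diffusion_decode("hello world"))  # → hello world
```

The point of the sketch is the loop structure: quality comes from repeated whole-sequence refinement, and speed from committing many positions per pass.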
Speed and Cost Advantages
Mercury 2 achieves 1,009 tokens per second on NVIDIA Blackwell GPUs — roughly 11x faster than Claude Haiku 4.5 with reasoning (89 tokens/s) and roughly 14x faster than GPT-5 Mini (71 tokens/s). End-to-end latency is just 1.7 seconds, compared to 14.4 seconds for Gemini 3 Flash and 23.4 seconds for Claude Haiku 4.5.
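The speedup multiples follow directly from the quoted throughput figures:

```python
# Speedup ratios implied by the throughput numbers quoted above.
mercury_tps = 1009    # Mercury 2, tokens/s
haiku_tps = 89        # Claude Haiku 4.5 with reasoning
gpt5_mini_tps = 71    # GPT-5 Mini

print(round(mercury_tps / haiku_tps, 1))      # → 11.3
print(round(mercury_tps / gpt5_mini_tps, 1))  # → 14.2
```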
Pricing is equally competitive: $0.25 per million input tokens and $0.75 per million output tokens — approximately half the cost of Gemini 3 Flash and a quarter the cost of Claude Haiku 4.5.
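At those per-token prices, workload cost is a straightforward calculation. A minimal sketch, using the quoted Mercury 2 rates; the request sizes and volume are hypothetical:

```python
# Quoted Mercury 2 prices, expressed per token.
INPUT_PRICE = 0.25 / 1_000_000    # $/input token
OUTPUT_PRICE = 0.75 / 1_000_000   # $/output token

def request_cost(input_tokens, output_tokens):
    """Dollar cost of one API request at the quoted rates."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Hypothetical workload: 1M requests, each 2,000 input + 500 output tokens.
total = 1_000_000 * request_cost(2_000, 500)
print(f"${total:,.2f}")  # → $875.00
```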
Benchmark Performance
Mercury 2 scores 74 on GPQA Diamond, 67 on LiveCodeBench, and 71 on IFBench, demonstrating competitive reasoning quality while delivering unprecedented speed. The model supports a 128K context window, tool use, and JSON output.
Diffusion Meets Language Reasoning
Founded by researchers from Stanford, UCLA, and Cornell who pioneered diffusion techniques in image generation, Inception Labs is commercializing that paradigm for text. Mercury 2 is positioned for real-time AI agent workloads and high-frequency API applications where latency and cost are critical constraints.
Related Articles
Hacker News reacted fast because I-DLM isn't promising faster text generation someday; it claims diffusion-style decoding can keep pace with autoregressive quality now. The thread quickly turned into a reality check on whether the 2.9x-4.1x throughput story can survive real inference stacks.
A high-ranking Hacker News thread amplified Apple's paper on simple self-distillation for code generation, a training recipe that improves pass@1 without verifier models or reinforcement learning.
Stanford's public CS25 course is again operating as an open lecture stream for Transformer research, with Zoom access, recordings, and a community layer that extends beyond campus.