Inception Labs Launches Mercury 2: Diffusion-Based LLM Hits 1,000 Tokens Per Second

LLM · Mar 2, 2026 · By Insights AI

A New Architecture Challenges the LLM Status Quo

On February 24, 2026, Inception Labs launched Mercury 2, the world's first production-grade diffusion-based reasoning language model. Unlike conventional autoregressive models that generate text one token at a time, Mercury 2 starts with a rough draft of the full output and iteratively refines multiple tokens in parallel through a diffusion process — akin to an editor revising an entire draft simultaneously rather than word by word.
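
To make the contrast concrete, here is a toy sketch of the two decoding styles. The vocabulary, the masking schedule, and the random "denoising" rule are illustrative assumptions, not Inception Labs' actual algorithm; the point is only that an autoregressive loop spends one step per token, while a diffusion-style loop refines many positions in each pass.

```python
# Toy contrast: autoregressive decoding vs. diffusion-style parallel refinement.
# The vocabulary and the trivial "denoise" rule below are illustrative stand-ins,
# not the real model's behavior.

import random

VOCAB = ["the", "cat", "sat", "on", "mat"]
MASK = "<mask>"

def autoregressive_decode(length: int) -> list[str]:
    """Generate one token per step, left to right (one model call per token)."""
    out = []
    for _ in range(length):
        out.append(random.choice(VOCAB))  # stand-in for a next-token prediction
    return out

def diffusion_decode(length: int, steps: int = 4) -> list[str]:
    """Start from a fully masked draft and refine a batch of positions per step."""
    draft = [MASK] * length
    for _ in range(steps):
        masked = [i for i, tok in enumerate(draft) if tok == MASK]
        # Commit roughly half of the remaining masked positions in parallel.
        for i in masked[: max(1, len(masked) // 2)]:
            draft[i] = random.choice(VOCAB)  # stand-in for a parallel denoising update
    return draft

print(autoregressive_decode(8))  # 8 sequential "model calls"
print(diffusion_decode(8))       # ~4 parallel refinement passes over the whole draft
```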

Speed and Cost Advantages

Mercury 2 achieves 1,009 tokens per second on NVIDIA Blackwell GPUs — roughly 10–14x faster than Claude Haiku 4.5 with reasoning (89 tokens/s) and GPT-5 Mini (71 tokens/s). End-to-end latency is just 1.7 seconds, compared to 14.4 seconds for Gemini 3 Flash and 23.4 seconds for Claude Haiku 4.5.

Pricing is equally competitive: $0.25 per million input tokens and $0.75 per million output tokens, approximately half the cost of Gemini 3 Flash and roughly a quarter of Claude Haiku 4.5's pricing.
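
As a rough illustration of what those figures mean for a single request, the sketch below plugs the quoted throughput and prices into a back-of-the-envelope calculation. The workload size (2,000 input tokens and 1,000 output tokens) is an assumed example, not a figure from the announcement.

```python
# Back-of-the-envelope cost and generation-time estimate using the quoted figures.
# The per-request token counts are assumed for illustration.

MERCURY_2 = {"in": 0.25, "out": 0.75, "tps": 1009}  # $/M input, $/M output, tokens/s

def request_cost(prices: dict, in_tokens: int, out_tokens: int) -> float:
    """Dollar cost of one request at the quoted per-million-token prices."""
    return in_tokens / 1e6 * prices["in"] + out_tokens / 1e6 * prices["out"]

def generation_seconds(prices: dict, out_tokens: int) -> float:
    """Seconds to emit the output at the quoted throughput."""
    return out_tokens / prices["tps"]

cost = request_cost(MERCURY_2, in_tokens=2_000, out_tokens=1_000)
secs = generation_seconds(MERCURY_2, out_tokens=1_000)
print(f"~${cost:.5f} per request, ~{secs:.2f}s to emit 1,000 tokens")
# -> ~$0.00125 per request, ~0.99s of generation time at 1,009 tokens/s
```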

Benchmark Performance

Mercury 2 scores 74 on GPQA Diamond, 67 on LiveCodeBench, and 71 on IFBench, demonstrating competitive reasoning quality while delivering unprecedented speed. The model supports a 128K context window, tool use, and JSON output.
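
The announcement does not detail the API surface, but a request for JSON-mode output might look roughly like the following, assuming an OpenAI-compatible chat endpoint. The URL, model identifier, and environment variable are illustrative assumptions; consult Inception Labs' documentation for the actual values.

```python
# Hedged sketch of requesting JSON output over an assumed OpenAI-compatible endpoint.
# Endpoint URL, model id, and API-key variable are placeholders, not confirmed values.

import os
import requests

resp = requests.post(
    "https://api.inceptionlabs.ai/v1/chat/completions",  # assumed endpoint
    headers={"Authorization": f"Bearer {os.environ['INCEPTION_API_KEY']}"},
    json={
        "model": "mercury-2",  # assumed model identifier
        "messages": [
            {"role": "user", "content": "List three uses of diffusion LLMs as a JSON object."}
        ],
        "response_format": {"type": "json_object"},  # JSON output mode noted in the article
        "max_tokens": 512,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```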

Diffusion Meets Language Reasoning

Founded by researchers from Stanford, UCLA, and Cornell who pioneered diffusion techniques in image generation, Inception Labs is commercializing that paradigm for text. Mercury 2 is positioned for real-time AI agent workloads and high-frequency API applications where latency and cost are critical constraints.

Source: Inception Labs — Introducing Mercury 2
