
DiffusionGemma cuts the token bottleneck with a 26B open model

Original: DiffusionGemma: 4x faster text generation

LLM | Jun 12, 2026 | By Insights AI

DiffusionGemma is aimed at one of local AI’s least glamorous constraints: waiting for tokens to arrive one after another. On June 10, 2026, Google DeepMind released DiffusionGemma, a 26B Mixture of Experts open model that uses text diffusion to generate blocks of text in parallel rather than decoding strictly left to right.

The architectural bet is simple to state and hard to make useful. Autoregressive models behave like typewriters, producing the next token in sequence. DiffusionGemma drafts a 256-token block, then iteratively refines placeholder tokens into final text. That gives every token access to the wider block during generation and shifts more work onto the GPU at once.
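The draft-then-refine loop can be illustrated with a toy sketch. This is not DiffusionGemma's algorithm or API: `toy_denoiser`, `refine_block`, the `<mask>` placeholder, the random confidence scores and the "commit the most confident slots first" schedule are all assumptions chosen to show the general shape of masked block decoding. A real model would predict tokens from the whole block rather than read them from a known target.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, for illustration only


def toy_denoiser(block, target, rng):
    """Stand-in for a real model (assumption).

    For every masked position it returns a proposed token and a made-up
    confidence score. A real diffusion LM would compute both from the
    full block, which is what gives each slot bidirectional context.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def refine_block(target, steps=4, seed=0):
    """Start from an all-mask block and commit tokens over a few passes."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = []
    for step in range(steps):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break
        # Commit an even share of the remaining masks on each pass,
        # so the final pass fills whatever is left.
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    final, passes = refine_block(words, steps=4)
    for n, state in enumerate(passes, start=1):
        print(n, " ".join(state))
```

The point of the sketch is the control flow: every pass looks at all positions in the block at once, so the work per pass is parallel, and the number of passes is much smaller than the number of tokens.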

Google’s performance claims are specific. The company says DiffusionGemma can deliver up to 4x faster token output on dedicated GPUs, including 1000+ tokens per second on a single NVIDIA H100 and 700+ tokens per second on an NVIDIA GeForce RTX 5090. Although the model totals 26B parameters, only 3.8B are active during inference, and a quantized version is designed to fit within 18GB VRAM on high-end consumer GPUs.
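A quick back-of-envelope check helps put those figures in context. The sketch below uses only the numbers quoted above; the bit widths are assumptions, since the article does not say which quantization format the 18GB figure refers to, and the estimate covers weights only, not activations or caches.

```python
TOTAL_PARAMS_B = 26.0   # total parameters, in billions (from the article)
ACTIVE_PARAMS_B = 3.8   # parameters active per inference step (from the article)
BLOCK_TOKENS = 256      # block size (from the article)


def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def block_seconds(tokens, tokens_per_second):
    """Time to emit one block at a given sustained throughput."""
    return tokens / tokens_per_second


active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

if __name__ == "__main__":
    print(f"active share of parameters: {active_fraction:.1%}")
    # Bit widths below are illustrative assumptions, not published specs.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: ~{weight_memory_gb(TOTAL_PARAMS_B, bits):.1f} GB")
    # Quoted throughputs are lower bounds ("1000+", "700+").
    for name, tps in (("H100", 1000), ("RTX 5090", 700)):
        print(f"{name}: {block_seconds(BLOCK_TOKENS, tps):.3f} s per 256-token block")
```

Under these assumptions, only a 4-bit-class format (about 13 GB of weights) would leave headroom inside an 18GB budget, and a full 256-token block would arrive in roughly a quarter to a third of a second at the quoted throughputs.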

The release matters because it is usable outside a closed demo. The weights are available under Apache 2.0, with paths through Hugging Face, Transformers, MLX, vLLM, NVIDIA NIM and Google’s own developer guide. NVIDIA optimization targets GeForce RTX 4090 and 5090 systems, RTX PRO hardware, DGX Spark and other local or deskside environments.

Google is also clear about the trade-off. Standard Gemma 4 remains the recommendation when output quality is the top priority. DiffusionGemma is more interesting for speed-sensitive, low-concurrency workflows: inline editing, rapid iteration, code infilling, amino acid sequences, mathematical graphs and other tasks where bidirectional attention over a small block can be more useful than pure next-token prediction.

The open question is whether diffusion language models can move from research novelty to everyday developer infrastructure. If the quality gap narrows while the latency advantage holds, local AI tools may start looking less like cloud chat clients and more like responsive editing engines.
