Google Releases Multi-Token Prediction Drafters for Gemma 4: Up to 3x Speedup

LLM · May 6, 2026 · By Insights AI (Reddit) · 1 min read

Gemma 4 Gets MTP Drafters

Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family. The announcement generated nearly 1,000 upvotes on r/LocalLLaMA, making it one of the week's most discussed local LLM releases.

Up to 3x Faster Without Quality Loss

MTP drafters use a specialized speculative decoding architecture: a smaller, faster draft model predicts several tokens ahead, which the target model verifies in parallel. Google reports up to a 3x speedup in tokens-per-second with no degradation in output quality or reasoning capability.
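The accept-or-correct logic at the heart of this scheme can be sketched with a toy greedy verifier. This is an illustrative assumption about how draft/verify speculative decoding works in general, not Google's actual MTP implementation, and the `verify_draft` helper and toy models below are hypothetical:

```python
# Toy sketch of greedy speculative decoding (illustrative assumption;
# the Gemma 4 MTP drafter internals are not described in this article).

def verify_draft(target_next_token, context, draft_tokens):
    """Accept the longest prefix of draft_tokens that the target model
    would itself have produced greedily, then append the target's own
    token at the first mismatch (or a bonus token after a full accept).

    In a real system the per-position checks happen in ONE parallel
    forward pass of the target model; the loop here is sequential only
    for clarity."""
    accepted = []
    ctx = list(context)
    for tok in draft_tokens:
        expected = target_next_token(ctx)    # target's greedy choice here
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)        # target's correction token
            return accepted
    accepted.append(target_next_token(ctx))  # bonus token: full accept
    return accepted

# Toy "target model": the next token is the previous token plus one.
target = lambda ctx: ctx[-1] + 1

print(verify_draft(target, [0], [1, 2, 9, 4]))  # → [1, 2, 3]
print(verify_draft(target, [0], [1, 2, 3]))     # → [1, 2, 3, 4]
```

Because rejected positions are replaced by the target's own choice, the output token stream is identical to what the target model alone would have generated, which is why speculative decoding is lossless with respect to quality.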

Technical Background

Standard LLM inference is memory-bandwidth bound — the processor spends most of its time moving billions of parameters from VRAM to compute units to generate each token. MTP relieves this bottleneck by reducing the number of full-model passes required, making better use of underutilized compute.
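A back-of-envelope calculation shows why bandwidth, not compute, sets the ceiling. The parameter count, precision, and bandwidth below are illustrative assumptions for a 30B-class dense model on a ~1 TB/s GPU, not measured Gemma 4 figures:

```python
# Back-of-envelope tokens/sec ceiling for memory-bandwidth-bound decoding.
# All numbers are illustrative assumptions, not measured Gemma 4 figures.

params = 30e9          # parameters in a 30B-class dense model
bytes_per_param = 2    # bf16 weights
bandwidth = 1000e9     # GPU memory bandwidth in bytes/sec (~1 TB/s class)

# Each decode step must stream the full weight set from VRAM once.
bytes_per_token = params * bytes_per_param
max_tokens_per_sec = bandwidth / bytes_per_token

print(round(max_tokens_per_sec, 1))  # ≈ 16.7 tokens/sec ceiling
```

Under these assumptions the hardware tops out near 17 tokens per second no matter how fast its ALUs are. If a verification pass can score several drafted tokens against one full weight read, the effective ceiling scales with the number of tokens accepted per pass, which is where the reported up-to-3x speedup comes from.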

Available Models and Platforms

The released drafters support Gemma 4 31B-IT, 26B-A4B-IT (MoE), E4B, and E2B. All are available on Hugging Face. Speed improvements have been tested on LiteRT-LM, MLX, Hugging Face Transformers, and vLLM. Gemma 4, Google's most capable open model to date, reached 60 million downloads within three weeks of launch.

