Google Releases Multi-Token Prediction Drafters for Gemma 4

Original: Accelerating Gemma 4: faster inference with multi-token prediction drafters

May 5, 2026 · By Insights AI (HN)

MTP Drafters for Gemma 4

Google released assistant models for Gemma 4 31B and 26B-A4B that act as multi-token prediction (MTP) drafters. Available on HuggingFace as gemma-4-31B-it-assistant and gemma-4-26B-A4B-it-assistant, they enable speculative decoding: the drafter proposes multiple tokens at once, and the base model verifies them in a single forward pass.
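In the Hugging Face `transformers` library, this draft-and-verify pattern is exposed through the `assistant_model` argument of `generate` (assisted generation). A minimal sketch of how the pairing might be used, assuming the article's model IDs live under the `google/` namespace and that the drafter plugs in as a standard assistant model (neither detail is confirmed by the source):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


def generate_with_drafter(prompt: str, max_new_tokens: int = 128) -> str:
    """Run Gemma 4 with its MTP drafter via assisted generation.

    The repo IDs below are assumptions based on the article; adjust to
    the actual HuggingFace repos once published.
    """
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")
    base = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it")
    drafter = AutoModelForCausalLM.from_pretrained("google/gemma-4-31B-it-assistant")

    inputs = tokenizer(prompt, return_tensors="pt")
    # The drafter proposes several candidate tokens per step; the base
    # model verifies them in one forward pass and keeps the matching prefix.
    output = base.generate(
        **inputs,
        assistant_model=drafter,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Assisted generation in `transformers` was originally built around separate, smaller draft models; whether these MTP drafters work unchanged or need dedicated runtime support is not stated in the article.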

How It Works

Correctly predicted tokens are accepted outright; at the first mismatch, the base model's own token is emitted instead. Because every emitted token is one the base model would have produced anyway, output quality is identical to standard inference, while throughput improves roughly 1.5-3x in low-batch, real-time scenarios where GPU utilization would otherwise be low.
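The accept-or-correct step can be sketched with a toy for the greedy-decoding case (real implementations use rejection sampling to preserve the sampled distribution; tokens and function names here are purely illustrative):

```python
def verify_draft(draft, target):
    """Compare drafter proposals against the base model's tokens.

    `draft`  -- k tokens proposed by the drafter
    `target` -- the tokens the base model assigns at those same
                positions, recovered in a single verification pass
    Returns the tokens actually emitted this step: the longest
    matching prefix, plus the base model's correction at the first
    mismatch.
    """
    emitted = []
    for d, t in zip(draft, target):
        if d == t:
            emitted.append(d)   # drafter was right: accepted for free
        else:
            emitted.append(t)   # mismatch: keep the base model's token
            break               # remaining draft tokens are discarded
    return emitted


# Drafter proposes 4 tokens; the base model agrees on the first 2,
# so 3 tokens are emitted from one verification pass.
print(verify_draft([5, 9, 2, 7], [5, 9, 4, 7]))  # -> [5, 9, 4]
```

Every emitted token matches the base model's choice, which is why quality is unchanged: speculation only decides how many tokens are confirmed per forward pass, not which tokens are produced.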

Growing Ecosystem

Qwen3.5+, DeepSeek V3, and GLM4.5+ already support MTP. Once MTP support lands in llama.cpp, local deployment will become broadly accessible. The LocalLLaMA community is tracking which MTP-capable models will be worth testing first once weights and tooling align.
