Google Releases Multi-Token Prediction Drafters for Gemma 4
Original: Accelerating Gemma 4: faster inference with multi-token prediction drafters
MTP Drafters for Gemma 4
Google released assistant models for Gemma 4 31B and 26B-A4B that act as multi-token prediction (MTP) drafters. Available on Hugging Face as gemma-4-31B-it-assistant and gemma-4-26B-A4B-it-assistant, they enable speculative decoding: the drafter proposes multiple tokens at once, and the base model verifies them in a single forward pass.
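For local experimentation, drafters like these typically plug into Hugging Face transformers' assisted-generation path, which implements this draft-and-verify loop through the `assistant_model` argument of `generate()`. A minimal sketch follows; the repository IDs are assumptions derived from the announced model names, and it assumes the drafter exposes a standard causal-LM interface:

```python
# Hedged sketch: speculative decoding via transformers' assisted generation.
# The repo IDs are assumptions based on the announced model names; check the
# actual Hugging Face model pages before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "google/gemma-4-31b-it"                # assumed repo path
DRAFTER_ID = "google/gemma-4-31b-it-assistant"   # assumed repo path

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(DRAFTER_ID, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one sentence.",
                   return_tensors="pt").to(base.device)

# assistant_model switches generate() into assisted generation: the drafter
# proposes candidate tokens and the base model verifies them in one pass.
output = base.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```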
How It Works
Tokens the drafter predicts correctly are accepted outright; at the first mismatch, the base model's own prediction is substituted and the rest of the draft is discarded. Because the base model has the final say on every emitted token, output quality is identical to standard inference, while throughput increases roughly 1.5-3x in low-batch, real-time scenarios where GPU utilization would otherwise be low.
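Under greedy decoding, this accept-or-correct rule reduces to prefix matching against the base model's own predictions. The sketch below shows one speculative step using hypothetical toy interfaces (nothing here is a real library API): `draft_model` maps a token sequence to its single next token, and `base_model` returns, from one forward pass, its next-token prediction at every draft position. An MTP head would emit all k draft tokens in one pass; the sketch loops token by token for simplicity.

```python
# Minimal sketch of one greedy speculative-decoding step, assuming toy
# callables `draft_model` and `base_model` (hypothetical interfaces).
def speculative_step(prompt, draft_model, base_model, k=4):
    # 1. Drafter proposes k cheap candidate tokens (an MTP head would
    #    produce them in a single pass; we loop for clarity).
    draft = list(prompt)
    for _ in range(k):
        draft.append(draft_model(draft))

    # 2. Base model scores the whole draft in ONE forward pass.
    #    verified[i] is its prediction after prompt + draft tokens 0..i-1,
    #    so it has k + 1 entries: k checks plus one bonus prediction.
    verified = base_model(draft)

    # 3. Accept the longest matching prefix; correct the first mismatch.
    accepted = []
    for i in range(k):
        proposed = draft[len(prompt) + i]
        if proposed == verified[i]:
            accepted.append(proposed)      # drafter was right: keep it
        else:
            accepted.append(verified[i])   # base model's correction
            return accepted                # discard the rest of the draft
    accepted.append(verified[k])           # all k accepted: free bonus token
    return accepted
```

Every step emits at least one base-model-approved token, which is why the output matches standard decoding exactly; the speedup comes from the single verification pass amortizing the cost of several tokens at once.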
Growing Ecosystem
Qwen3.5+, DeepSeek V3, and GLM4.5+ already support MTP. Once MTP support lands in llama.cpp, local deployment will become broadly accessible. The LocalLLaMA community is tracking which MTP-capable models will be worth testing first once weights and tooling align.
Related Articles
Google DeepMind’s April 2, 2026 X thread introduced Gemma 4 as a new open model family built for reasoning and agentic workflows. Google says the lineup spans E2B, E4B, 26B MoE, and 31B Dense, and adds native function calling, structured JSON output, and longer context windows.
Google's AI Edge team said on April 2, 2026 that Gemma 4 is bringing multi-step agentic workflows to phones, desktops, and edge hardware under an Apache 2.0 license. The launch combines open models, Agent Skills, and LiteRT-LM deployment tooling.
LocalLLaMA treated this less as a speed chart and more as a question of completion quality on a messy real-world prompt. On the same MacBook Pro M5 Max, Qwen 3.6 27B produced more text faster, but Gemma 4 31B finished the game logic in far fewer tokens.