Google Releases Multi-Token Prediction Drafters for Gemma 4
Original: Accelerating Gemma 4: faster inference with multi-token prediction drafters
MTP Drafters for Gemma 4
Google released assistant models for Gemma 4 31B and 26B-A4B that act as multi-token prediction (MTP) drafters. Available on Hugging Face as gemma-4-31B-it-assistant and gemma-4-26B-A4B-it-assistant, they enable speculative decoding: the drafter proposes several tokens ahead, and the base model verifies all of them in a single forward pass.
How It Works
Tokens the drafter predicts correctly are accepted outright; at the first mismatch, the base model's own token is substituted and drafting resumes from there. Output quality is therefore identical to standard inference, while throughput rises roughly 1.5-3x in low-batch, real-time scenarios where GPU utilization would otherwise be low.
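The accept/verify loop described above can be sketched with toy stand-in models. Everything here is illustrative: draft_model, base_model, and speculative_step are hypothetical functions invented for this sketch, not the Gemma implementation, and greedy (argmax) decoding is assumed throughout.

```python
def draft_model(context):
    # Hypothetical fast drafter: predicts a simple cyclic pattern.
    return (context[-1] + 1) % 5

def base_model(context):
    # Hypothetical slow verifier: mostly agrees with the drafter,
    # but insists on token 0 after any 3.
    if context[-1] == 3:
        return 0
    return (context[-1] + 1) % 5

def speculative_step(context, k=4):
    """Draft k tokens, then verify them against the base model.

    Matching tokens are accepted outright; at the first mismatch the
    base model's own token is substituted and the step ends, so the
    output is identical to running the base model alone.
    """
    # Draft phase: the cheap model proposes k tokens autoregressively.
    drafted = []
    ctx = list(context)
    for _ in range(k):
        token = draft_model(ctx)
        drafted.append(token)
        ctx.append(token)

    # Verify phase: in practice this is one batched forward pass of
    # the base model over all drafted positions; simulated per-token here.
    accepted = []
    ctx = list(context)
    for token in drafted:
        expected = base_model(ctx)
        if token == expected:
            accepted.append(token)
            ctx.append(token)
        else:
            accepted.append(expected)  # correction from the base model
            break
    return accepted
```

With the toy models above, `speculative_step([1], k=4)` drafts 2, 3, 4, 0, accepts 2 and 3, then replaces the mismatched 4 with the base model's 0, yielding [2, 3, 0]: three tokens emitted for the cost of one verification pass.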
Growing Ecosystem
Qwen3.5+, DeepSeek V3, and GLM4.5+ already support MTP. Once MTP support lands in llama.cpp, local deployment will become broadly accessible. The LocalLLaMA community is tracking which MTP-capable models will be worth testing first once weights and tooling align.
Related Articles
Google DeepMind’s April 2, 2026 X thread introduced Gemma 4 as a new open model family built for reasoning and agentic workflows. Google says the lineup spans E2B, E4B, 26B MoE, and 31B Dense, and adds native function calling, structured JSON output, and longer context windows.
Google's AI Edge team said on April 2, 2026 that Gemma 4 is bringing multi-step agentic workflows to phones, desktops, and edge hardware under an Apache 2.0 license. The launch combines open models, Agent Skills, and LiteRT-LM deployment tooling.
LocalLLaMA latched onto one detail immediately: a dense 128B parameter count. Mistral Medium 3.5 drew attention because it tries to bundle reasoning, coding, and agent work into a model people can still imagine self-hosting.