Google Releases Multi-Token Prediction Drafters for Gemma 4: Up to 3x Speedup
Gemma 4 Gets MTP Drafters
Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family. The announcement generated nearly 1,000 upvotes on r/LocalLLaMA, making it one of the week's most discussed local LLM releases.
Up to 3x Faster Without Quality Loss
MTP drafters use a specialized speculative decoding architecture: a smaller, faster draft model proposes several tokens ahead, and the target model verifies them all in a single parallel forward pass, keeping only the prefix it agrees with. Google reports up to a 3x speedup in tokens per second with no degradation in output quality or reasoning capability.
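To make the draft-then-verify loop concrete, here is a minimal sketch in Python. The "models" are toy next-token functions over a tiny vocabulary, and verification is shown as a loop rather than the single batched forward pass a real implementation would use; it is an illustration of the general speculative-decoding scheme, not Google's MTP architecture.

```python
# Toy greedy speculative decoding: draft k tokens cheaply, verify with
# the target, keep the agreed prefix plus one target-corrected token.

def draft_model(context):           # fast but sometimes wrong
    return context[-1] + 1 if context[-1] < 5 else 0

def target_model(context):          # slow but authoritative
    return (context[-1] + 1) % 7

def speculate(context, k=4):
    # 1. Draft phase: k cheap sequential calls to the small model.
    drafted, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify phase: the target checks every drafted position
    #    (one parallel forward pass in a real implementation).
    accepted, ctx = [], list(context)
    for t in drafted:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)          # draft matched: a "free" token
            ctx.append(t)
        else:
            accepted.append(expected)   # first mismatch: take target's token
            break
    else:
        # All k drafts accepted; the verify pass yields one bonus token.
        accepted.append(target_model(ctx))
    return accepted

tokens = [0]
while len(tokens) < 20:
    tokens.extend(speculate(tokens))
print(tokens)
```

Because the target only corrects the draft and never outputs a token it would not have produced itself, the generated sequence is identical to plain decoding; the speedup comes entirely from how many drafted tokens survive verification per pass.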
Technical Background
Standard LLM inference is memory-bandwidth bound: to generate each token, the processor spends most of its time streaming billions of parameters from VRAM to the compute units. MTP relieves this bottleneck by reducing the number of full-model passes required, so each trip through the weights can yield several accepted tokens instead of one, putting otherwise idle compute to work.
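A rough back-of-envelope calculation shows why this matters. The numbers below are illustrative assumptions, not figures from the announcement:

```python
# Tokens/sec ceiling for memory-bandwidth-bound decoding.
# Illustrative numbers only (not Google's benchmark figures).

bandwidth_gb_s = 1000      # assumed high-end GPU VRAM bandwidth, GB/s
weight_bytes_gb = 15.5     # ~31B parameters at 4-bit quantization

# Plain decoding: every token streams all weights from VRAM once.
ceiling_tps = bandwidth_gb_s / weight_bytes_gb
print(f"plain decoding ceiling: {ceiling_tps:.0f} tok/s")   # ~65 tok/s

# Speculative decoding: one full-model pass verifies several drafted
# tokens, so each trip through the weights yields multiple tokens.
accepted_per_pass = 3      # assumed average acceptance rate
print(f"speculative ceiling:    {ceiling_tps * accepted_per_pass:.0f} tok/s")
```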
Available Models and Platforms
The released drafters support Gemma 4 31B-IT, 26B-A4B-IT (MoE), E4B, and E2B. All are available on Hugging Face. Speed improvements have been tested on LiteRT-LM, MLX, Hugging Face Transformers, and vLLM. Gemma 4, Google's most capable open model to date, reached 60 million downloads in the three weeks since its launch.
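On the Transformers side, drafters typically plug in through assisted generation. A minimal sketch follows; the repository IDs are hypothetical placeholders, since the announcement's exact model names are not quoted here:

```python
# Sketch of pairing a target model with a drafter via Hugging Face
# Transformers assisted generation. Repo IDs below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "google/gemma-4-31b-it"           # hypothetical repo ID
drafter_id = "google/gemma-4-31b-it-drafter"  # hypothetical repo ID

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(drafter_id, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one paragraph.",
                   return_tensors="pt").to(target.device)

# assistant_model enables speculative decoding: the drafter proposes
# tokens and the target verifies them in parallel.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=200)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```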
Related Articles
Google has released open-weight MTP drafter models for Gemma 4 31B and 26B-A4B, enabling speculative decoding to significantly boost inference speed without affecting output quality.
A LocalLLaMA user has shared a detailed guide for running Qwen 3.6 27B with Multi-Token Prediction support in llama.cpp, achieving a 2.5x inference speedup and 262k context on 48GB of memory.
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Combined with maturing tensor-parallel support, it is expected to close most of the remaining token-generation speed gap between llama.cpp and vLLM.