Google Releases Multi-Token Prediction Drafters for Gemma 4
Original: Accelerating Gemma 4: faster inference with multi-token prediction drafters
MTP Drafters for Gemma 4
Google released assistant models for Gemma 4 31B and 26B-A4B that act as multi-token prediction (MTP) drafters. Available on Hugging Face as gemma-4-31B-it-assistant and gemma-4-26B-A4B-it-assistant, they enable speculative decoding: the drafter proposes multiple tokens at once, and the base model verifies them in a single forward pass.
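For local experimentation, drafters like these typically plug into Hugging Face transformers' assisted-generation path, which implements this draft-and-verify loop through the `assistant_model` argument of `generate()`. A minimal sketch follows; the repository IDs are assumptions derived from the announced model names, and it assumes the drafter exposes a standard causal-LM interface:

```python
# Hedged sketch: speculative decoding via transformers' assisted generation.
# The repo IDs are assumptions based on the announced model names; check the
# actual Hugging Face model pages before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "google/gemma-4-31b-it"                # assumed repo path
DRAFTER_ID = "google/gemma-4-31b-it-assistant"   # assumed repo path

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(DRAFTER_ID, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one sentence.",
                   return_tensors="pt").to(base.device)

# assistant_model switches generate() into assisted generation: the drafter
# proposes candidate tokens and the base model verifies them in one pass.
output = base.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```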
How It Works
Tokens the drafter predicts correctly are accepted outright; at the first mismatch, the base model's own prediction is substituted and the rest of the draft is discarded. Because the base model has the final say on every emitted token, output quality is identical to standard inference, while throughput increases roughly 1.5-3x in low-batch, real-time scenarios where GPU utilization would otherwise be low.
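Under greedy decoding, this accept-or-correct rule reduces to prefix matching against the base model's own predictions. The sketch below shows one speculative step using hypothetical toy interfaces (nothing here is a real library API): `draft_model` maps a token sequence to its single next token, and `base_model` returns, from one forward pass, its next-token prediction at every draft position. An MTP head would emit all k draft tokens in one pass; the sketch loops token by token for simplicity.

```python
# Minimal sketch of one greedy speculative-decoding step, assuming toy
# callables `draft_model` and `base_model` (hypothetical interfaces).
def speculative_step(prompt, draft_model, base_model, k=4):
    # 1. Drafter proposes k cheap candidate tokens (an MTP head would
    #    produce them in a single pass; we loop for clarity).
    draft = list(prompt)
    for _ in range(k):
        draft.append(draft_model(draft))

    # 2. Base model scores the whole draft in ONE forward pass.
    #    verified[i] is its prediction after prompt + draft tokens 0..i-1,
    #    so it has k + 1 entries: k checks plus one bonus prediction.
    verified = base_model(draft)

    # 3. Accept the longest matching prefix; correct the first mismatch.
    accepted = []
    for i in range(k):
        proposed = draft[len(prompt) + i]
        if proposed == verified[i]:
            accepted.append(proposed)      # drafter was right: keep it
        else:
            accepted.append(verified[i])   # base model's correction
            return accepted                # discard the rest of the draft
    accepted.append(verified[k])           # all k accepted: free bonus token
    return accepted
```

Every step emits at least one base-model-approved token, which is why the output matches standard decoding exactly; the speedup comes from the single verification pass amortizing the cost of several tokens at once.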
Growing Ecosystem
Qwen3.5+, DeepSeek V3, and GLM4.5+ already support MTP. Once MTP support lands in llama.cpp, local deployment will become broadly accessible. The LocalLLaMA community is tracking which MTP-capable models will be worth testing first once weights and tooling align.
Related Articles
Google DeepMind’s April 2, 2026 X thread introduced Gemma 4 as a new open model family built for reasoning and agentic workflows. Google says the lineup spans E2B, E4B, 26B MoE, and 31B Dense, and adds native function calling, structured JSON output, and longer context windows.
Google's AI Edge team said on April 2, 2026 that Gemma 4 is bringing multi-step agentic workflows to phones, desktops, and edge hardware under an Apache 2.0 license. The launch combines open models, Agent Skills, and LiteRT-LM deployment tooling.
LocalLLaMA treated this less as a speed chart and more as a question of completion quality on a messy real-world prompt. On the same MacBook Pro M5 Max, Qwen 3.6 27B produced more text faster, but Gemma 4 31B finished the game logic in far fewer tokens.