
Gemini Nano on Pixel gets 50% faster token generation with frozen MTP

Original: Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction

LLM Jun 27, 2026 By Insights AI 2 min read

The practical limit for phone-based AI is often token-by-token latency, not just model quality. On June 26, 2026, Google Research published an architecture that speeds up Gemini Nano v3 on Pixel 9 and Pixel 10 devices by retrofitting Multi-Token Prediction (MTP) onto already deployed, frozen models. The goal is faster on-device inference without shipping a separate, memory-heavy drafter for each task.

The method targets a familiar bottleneck in autoregressive generation. Standard speculative decoding asks a smaller draft model to propose several future tokens, then has the larger model verify them in parallel. That can help, but on a phone the separate drafter competes for RAM and lacks direct access to the main model’s internal state. Google’s MTP design attaches a lightweight Transformer head to the final layers of Gemini Nano instead.

The frozen backbone matters because it keeps the optimization narrow. Google freezes the already trained Gemini Nano v3 weights and trains only the MTP head to predict future tokens. During verification, incorrect drafts are discarded, so the final output remains bit-for-bit identical to what the main model would generate on its own. That makes the change an efficiency update rather than a behavioral model update, preserving the base model’s existing capabilities and safety alignment.
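
The draft-and-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Google's implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for the frozen main model and the draft head, decoding is greedy, and the parallel verification pass is simulated with an ordinary loop. It shows why discarding wrong drafts leaves the output unchanged while cutting the number of main-model passes.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for the frozen main model (deterministic greedy rule)."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[Token]) -> Token:
    """Hypothetical stand-in for a drafter: matches the target except on some contexts."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50


def greedy_decode(next_fn: NextTokenFn, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_fn(out))
    return out


def speculative_decode(
    target: NextTokenFn, draft: NextTokenFn, prompt: List[Token], n_new: int, k: int = 3
) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # Step 1: propose k future tokens with the cheap drafter.
        drafts: List[Token] = []
        for _ in range(k):
            drafts.append(draft(out + drafts))
        # Step 2: one verification pass (a real system scores all positions in parallel).
        passes += 1
        accepted: List[Token] = []
        for token in drafts:
            expected = target(out + accepted)
            if token != expected:
                accepted.append(expected)  # discard the wrong draft, keep the target's token
                break
            accepted.append(token)
        else:
            accepted.append(target(out + accepted))  # all drafts accepted: bonus token
        out.extend(accepted)
    return out[:goal], passes


prompt = [1, 2, 3]
tokens, passes = speculative_decode(target_next, draft_next, prompt, n_new=20)
print(tokens == greedy_decode(target_next, prompt, 20), passes)
```

Because every emitted token is either a draft the target agreed with or the target's own choice, the sequence matches plain greedy decoding; only the number of verification passes changes.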

Google also redesigned the memory path for mobile constraints. The MTP head cross-attends directly to the main model’s frozen KV cache, so it can use context already computed by the backbone instead of building its own duplicate history. This zero-copy design removes extra drafter prefill latency and saves 130 MB per instance compared with a standalone drafter, according to Google.
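
To make the zero-copy idea concrete, here is a toy sketch in which a draft head holds a reference to the backbone's cache rather than a copy of it. Everything here is an assumption for illustration: `KVCache`, `SharedCacheHead`, and `cross_attend` are hypothetical names, the vectors are tiny hand-written lists, and the attention is a single-head scaled dot product, not the actual Gemini Nano layout.

```python
import math
from typing import List

Vector = List[float]


class KVCache:
    """Toy cache: one key vector and one value vector per processed token."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)


def cross_attend(query: Vector, cache: KVCache) -> Vector:
    """Single-head scaled dot-product attention of one query over a non-empty cache."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in cache.keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [
        sum(w * value[i] for w, value in zip(weights, cache.values)) / total
        for i in range(dim)
    ]


class SharedCacheHead:
    """Hypothetical draft head that reads the backbone cache by reference (no copy)."""

    def __init__(self, backbone_cache: KVCache) -> None:
        self.cache = backbone_cache  # same object, so no duplicate history and no prefill

    def context_vector(self, query: Vector) -> Vector:
        return cross_attend(query, self.cache)


backbone_cache = KVCache()
backbone_cache.append([1.0, 0.0], [2.0, 3.0])  # backbone processes token 1
head = SharedCacheHead(backbone_cache)
first = head.context_vector([1.0, 0.0])

backbone_cache.append([0.0, 1.0], [4.0, 5.0])  # backbone processes token 2
second = head.context_vector([1.0, 0.0])  # the head sees it immediately
print(head.cache is backbone_cache, first, second)
```

A standalone drafter would instead have to run its own prefill over the prompt and store a second cache, which is the duplicated work and memory the shared design avoids.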

The reported production impact is sizable. In Pixel 9 experiments, MTP delivered token-generation speedups of 50% or more depending on the task. In workloads such as AI Notification Summaries and Proofread, the system correctly predicts an average of nearly two additional tokens per inference pass. Fewer verification passes also mean less time waking heavy processors, which can reduce energy use. The primary source is Google Research’s MTP write-up.
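
The relationship between accepted drafts and end-to-end speedup can be approximated with a simple back-of-the-envelope model. This is a simplification we are assuming, not a formula from Google's write-up: each pass emits one token plus the accepted drafts, and costs one baseline pass plus some relative overhead for drafting and verification. The function name and the overhead values below are hypothetical, chosen only to illustrate the arithmetic.

```python
def estimated_speedup(extra_tokens_per_pass: float, relative_overhead: float) -> float:
    """Rough throughput ratio versus one-token-per-pass decoding.

    extra_tokens_per_pass: average number of accepted draft tokens per pass.
    relative_overhead: extra cost per pass as a fraction of a baseline pass
                       (hypothetical input, not a figure reported by Google).
    """
    if extra_tokens_per_pass < 0 or relative_overhead < 0:
        raise ValueError("inputs must be non-negative")
    tokens_per_pass = 1.0 + extra_tokens_per_pass
    cost_per_pass = 1.0 + relative_overhead
    return tokens_per_pass / cost_per_pass


# Illustrative inputs only: two accepted drafts per pass at different overheads.
for overhead in (0.0, 0.5, 1.0):
    print(overhead, estimated_speedup(2.0, overhead))
```

Under this model, accepting about two extra tokens per pass caps the gain at roughly 3x, and any per-pass overhead pulls the realized speedup below that ceiling, which is one way to read a reported figure of 50% or more alongside nearly two accepted drafts per pass.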
