Gemini Nano on Pixel gets 50% faster token generation with frozen MTP
Original: Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction
The practical limit for phone-based AI is often token-by-token latency, not just model quality. On June 26, 2026, Google Research published an architecture that speeds up Gemini Nano v3 on Pixel 9 and Pixel 10 devices by retrofitting Multi-Token Prediction (MTP) onto already deployed, frozen models. The goal is faster on-device inference without shipping a separate memory-heavy drafter for each task.
The method targets a familiar bottleneck in autoregressive generation. Standard speculative decoding asks a smaller draft model to propose several future tokens, then has the larger model verify them in parallel. That can help, but on a phone the separate drafter competes for RAM and lacks direct access to the main model’s internal state. Google’s MTP design attaches a lightweight Transformer head to the final layers of Gemini Nano instead.
The frozen backbone matters because it keeps the change narrow. Google freezes the already trained Gemini Nano v3 weights and trains only the MTP head to predict future tokens. During verification, any drafted token the main model disagrees with is discarded, so the final output remains bit-for-bit identical to what the main model would have generated on its own. That makes the change an efficiency update rather than a behavioral model update, preserving the base model’s existing capabilities and safety alignment.
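The draft-then-verify loop and its lossless property can be illustrated with a toy sketch. This is not Google’s implementation: `main_next` and `draft_next` are hypothetical stand-ins for the backbone and the MTP head, decoding is assumed to be greedy, and the verification step loops over positions where a real system would score them in one batched forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def greedy_decode(main_next: NextFn, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one main-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(main_next(seq))
    return seq[len(prompt):]


def speculative_decode(
    main_next: NextFn, draft_next: NextFn, prompt: List[Token], n_new: int, k: int = 3
) -> Tuple[List[Token], int]:
    """Draft k tokens per pass and keep only what the main model agrees with."""
    seq = list(prompt)
    passes = 0
    while len(seq) - len(prompt) < n_new:
        # 1. The cheap drafter proposes k future tokens.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. One verification pass. Every token kept here is the main
        #    model's own choice, so wrong drafts can never leak through.
        passes += 1
        accepted: List[Token] = []
        for i in range(k + 1):
            target = main_next(seq + accepted)
            accepted.append(target)
            # Stop at the first disagreement, or after the bonus token
            # that follows k correct drafts.
            if i == k or draft[i] != target:
                break
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], passes


# Hypothetical toy models: the drafter is wrong at every fourth position.
def main_next(ctx: List[Token]) -> Token:
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    guess = main_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


baseline = greedy_decode(main_next, [1, 2, 3], 20)
fast, passes = speculative_decode(main_next, draft_next, [1, 2, 3], 20)
```

Because every kept token comes from the main model, `fast` matches `baseline` exactly, and the only thing that changes is how many verification passes are needed. Sampling-based decoding would need a different acceptance rule than the exact-match check assumed here.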
Google also redesigned the memory path for mobile constraints. The MTP head cross-attends directly to the main model’s frozen KV cache, so it can use context already computed by the backbone instead of building its own duplicate history. This zero-copy design removes extra drafter prefill latency and saves 130 MB per instance compared with a standalone drafter, according to Google.
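As a rough illustration of what reading the backbone’s cache in place means, the sketch below runs single-head scaled dot-product attention over keys and values that were stored once by the main model. The names `KVCache` and `cross_attend` are hypothetical, the vectors are tiny Python lists, and the actual MTP head is a multi-head Transformer module whose details the article does not give.

```python
import math
from typing import List

Vector = List[float]


class KVCache:
    """Keys and values the backbone already computed, one pair per context token."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def append(self, key: Vector, value: Vector) -> None:
        self.keys.append(key)
        self.values.append(value)


def cross_attend(query: Vector, cache: KVCache) -> Vector:
    """Attend from a draft-head query to the shared cache without copying it."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cache.keys]
    # Numerically stable softmax over the context positions.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [
        sum(w * value[d] for w, value in zip(weights, cache.values)) / total
        for d in range(dim)
    ]


# The backbone fills the cache once while processing the prompt.
cache = KVCache()
cache.append([1.0, 0.0], [2.0, 0.0])
cache.append([1.0, 0.0], [0.0, 4.0])

# The draft head only supplies a query; it never builds its own history.
context = cross_attend([0.5, 0.5], cache)
```

The point of the design is visible in the signature: the draft head brings a query and reuses the existing keys and values, so there is no second cache to allocate or prefill.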
The reported production impact is sizable. In Pixel 9 experiments, MTP delivered token-generation speedups of 50% or more depending on the task. In workloads such as AI Notification Summaries and Proofread, the system correctly predicts an average of nearly two additional tokens per inference pass. Fewer verification passes also mean less time waking heavy processors, which can reduce energy use. The primary source is Google Research’s MTP write-up.
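A back-of-envelope model helps relate accepted tokens per pass to wall-clock speedup. It assumes each verification pass emits one token from the main model plus the accepted drafts, and it introduces a hypothetical `relative_pass_cost` (the cost of one draft-and-verify pass relative to one ordinary decoding step). The article does not report that cost, so the values below are illustrative only.

```python
def tokens_per_pass(extra_accepted: float) -> float:
    """One main-model token plus the accepted drafted tokens."""
    return 1.0 + extra_accepted


def speedup(extra_accepted: float, relative_pass_cost: float) -> float:
    """Tokens gained per pass divided by how much more each pass costs."""
    if relative_pass_cost <= 0:
        raise ValueError("relative_pass_cost must be positive")
    return tokens_per_pass(extra_accepted) / relative_pass_cost


# Illustrative numbers, not measurements: two extra tokens per pass,
# with each pass assumed to cost twice an ordinary decoding step.
example = speedup(extra_accepted=2.0, relative_pass_cost=2.0)
```

Under those assumed inputs the model yields a 1.5x rate, which shows why the acceptance count alone does not determine the speedup: drafting and verifying several positions at once also has a cost.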