Reddit highlights Gemma 4’s on-device Agent Skills push
Original: Bring state-of-the-art agentic skills to the edge with Gemma 4
A Reddit post in /r/singularity drew attention to Google’s April 2, 2026 developer post on Gemma 4 at the edge. The thread had 68 upvotes and 9 comments at crawl time and linked to Google’s article about running agentic workflows on-device.
The headline is not just that Gemma 4 can run on smaller hardware. Google is packaging on-device agent behavior in two layers. The first is Agent Skills in Google AI Edge Gallery for iOS and Android. Google describes these as multi-step autonomous workflows that can call skills to reach beyond the base model: querying Wikipedia, turning speech or text into graphs and flashcards, or combining Gemma 4 with text-to-speech, image generation, and music tools. Google also says Gemma 4 supports visual processing and more than 140 languages. The important shift is architectural. Instead of shipping a plain chat model, Google is framing Gemma 4 as a local orchestrator for tool use.
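To make that orchestration framing concrete, here is a minimal sketch of the kind of local tool-calling loop such a design implies. Every name in it (generate_json, SKILLS, wikipedia_lookup) is a hypothetical stand-in, not the actual Agent Skills or LiteRT-LM API, which the post does not document:

```python
# Minimal sketch of a local tool-calling loop, the orchestration pattern
# Agent Skills implies. All names here (generate_json, SKILLS,
# wikipedia_lookup) are hypothetical stand-ins, not Google's API.

def wikipedia_lookup(query: str) -> str:
    # Hypothetical skill: a real app would query a local index or the API.
    return f"[summary for {query!r}]"

SKILLS = {"wikipedia_lookup": wikipedia_lookup}

def generate_json(transcript: str) -> dict:
    # Stand-in for an on-device model call that emits structured output.
    # This stub calls one skill and then answers, so the loop terminates;
    # a real runtime would constrain decoding to valid JSON instead.
    if "[tool:" in transcript:
        return {"action": "final_answer", "args": {"text": transcript}}
    return {"action": "wikipedia_lookup", "args": {"query": transcript}}

def run_agent(user_request: str, max_steps: int = 4) -> str:
    transcript = user_request
    for _ in range(max_steps):
        step = generate_json(transcript)
        if step["action"] == "final_answer":
            return step["args"]["text"]
        tool = SKILLS[step["action"]]             # dispatch to a registered skill
        result = tool(**step["args"])
        transcript += f"\n[tool:{step['action']}] {result}"  # feed result back
    return transcript

print(run_agent("History of the food truck"))
```

The point of the loop is the architectural shift described above: the model decides which skill to call, but the host process owns dispatch, so capabilities like Wikipedia lookup or text-to-speech live outside the weights.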
The second layer is LiteRT-LM, the deployment runtime. Google says Gemma 4 E2B can run in under 1.5GB of memory on some devices thanks to 2-bit and 4-bit weights plus memory-mapped per-layer embeddings. LiteRT-LM also exposes dynamic context sizing, so developers can use the full 128K context window when the hardware supports it. Google’s published benchmark is the key claim here: 4,000 input tokens across two distinct skills processed in under three seconds.
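The memory claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes “E2B” denotes roughly 2B effective parameters, an even split between 2-bit and 4-bit weights, and a rough allowance for KV cache and activations; none of those figures come from Google’s post:

```python
# Back-of-envelope memory estimate. The 2B effective parameter count,
# the 2-bit/4-bit split, and the runtime allowance are all assumptions
# for illustration, not Google's published breakdown.
effective_params = 2e9            # assuming "E2B" ~ 2B effective params
frac_2bit, frac_4bit = 0.5, 0.5   # assumed quantization split

weight_bytes = effective_params * (frac_2bit * 2 + frac_4bit * 4) / 8
kv_and_activations = 0.3e9        # rough allowance, also an assumption

total_gb = (weight_bytes + kv_and_activations) / 1e9
print(f"~{total_gb:.2f} GB resident")  # ~1.05 GB
# Per-layer embedding tables are memory-mapped, so they page in from
# disk on demand rather than counting against resident memory here.
```

Under those assumptions the resident footprint lands around 1GB, which is consistent with the sub-1.5GB claim and shows why the memory-mapped embeddings matter: they keep a large slice of the parameters off the resident budget entirely.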
The hardware coverage is broader than a typical mobile-only demo. Google lists Android, iOS, desktop, web, Raspberry Pi 5, and Qualcomm Dragonwing IQ8 NPU targets. The blog says Raspberry Pi 5 reaches 133 prefill and 7.6 decode tokens per second on CPU, while the Dragonwing IQ8 setup reaches 3,700 prefill and 31 decode tokens per second with NPU acceleration. Google also introduced a litert-lm CLI and Python bindings, with tool calling support carried over from Agent Skills.
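Taken at face value, those throughput figures translate directly into request latency. The sketch below uses the 4,000-token request size from the benchmark above; the 100-token response length is an assumed value for illustration:

```python
# What the published throughput numbers imply for a 4,000-token request.
# The 100-token output length is an assumption, not from the post.
def latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

for name, prefill_tps, decode_tps in [
    ("Raspberry Pi 5 (CPU)", 133, 7.6),
    ("Dragonwing IQ8 (NPU)", 3700, 31),
]:
    t = latency_s(4000, 100, prefill_tps, decode_tps)
    print(f"{name}: ~{t:.1f} s")
# Raspberry Pi 5 (CPU): ~43.2 s
# Dragonwing IQ8 (NPU): ~4.3 s
```

Neither listed target reaches the sub-three-second figure at this output length, which suggests the headline benchmark reflects a different hardware target or workload shape; either way, latency on these devices is dominated by prefill on CPU and by decode on the NPU.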
The community interest makes sense. If these benchmarks hold outside demos, Gemma 4 could make private, lower-latency agent workflows more practical on edge hardware. The constraint is that the experience depends on Google’s tool and runtime stack, not just the model weights, so developers will need to evaluate the full system rather than the model in isolation.
Related Articles
Google said on April 2, 2026 that Gemma 4 is its most capable open model family so far, built from the same technology base as Gemini 3. Google says the family spans E2B, E4B, 26B MoE, and 31B Dense models, adds function-calling and structured JSON support, and offers up to 256K context with an Apache 2.0 license.
A LocalLLaMA thread highlighted Gemma 4 31B's unexpectedly strong FoodTruck Bench showing, and the discussion quickly turned to long-horizon planning quality and benchmark reliability.
OpenAI Developers said recent Codex usage data suggests developers are handing off long-running work like refactors and architecture planning at the end of the day. In a follow-up reply, the account said tasks started at 11 pm are 60% more likely than other tasks to run for 3+ hours.