Hacker News picks up a practical Gemma 4 local-agent recipe for moving Codex CLI off the cloud

Original: I ran Gemma 4 as a local model in Codex CLI

LLM · Apr 14, 2026 · By Insights AI (HN) · 2 min read

A new Hacker News thread pushed attention toward Daniel Vaughan’s April 2026 experiment on running Gemma 4 locally inside Codex CLI. The question was practical, not aspirational: can a local model replace a cloud model for everyday agentic coding, where the model has to read files, emit tool calls correctly, write code, and survive long prompts? Cost, privacy, and resilience all argue for local deployment, but only if tool calling is reliable enough to make the agent useful.

Vaughan tested two setups. The first was a 24 GB M4 Pro MacBook Pro using the 26B MoE variant through llama.cpp. The second was a Dell Pro Max GB10 using the 31B Dense variant. On Apple Silicon, the easiest path failed quickly. Ollama v0.20.3 reportedly had a streaming bug that placed Gemma 4 tool-call responses in the wrong field, and a Flash Attention freeze that broke long prompts. Because Codex CLI already ships with a roughly 27,000-token system prompt, those failures made the simple path unusable.
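The streaming bug matters because an agent harness parses tool calls out of a specific field. In the OpenAI-compatible chat format these servers emit, a tool call should arrive in the message's tool_calls array while content stays null; if the server streams it into content instead, Codex CLI sees prose rather than an invocable call. A minimal sketch of the expected shape (the field layout is the standard chat-completions wire format; the specific tool name and arguments are illustrative, not from the post):

```shell
# Expected shape of an OpenAI-compatible chat completion carrying a tool
# call: the call sits in "tool_calls", and "content" stays null. A server
# that streams the call into "content" breaks harnesses that expect this.
response=$(cat <<'EOF'
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "read_file", "arguments": "{\"path\":\"main.py\"}"}
      }]
    }
  }]
}
EOF
)
echo "$response"
```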

The working Mac path required a more careful stack: llama.cpp with --jinja, a single slot via -np 1, KV-cache quantization through -ctk q8_0 and -ctv q8_0, a direct GGUF path passed with -m, and a 32,768-token context. The Codex CLI profile also needed web_search = "disabled", because llama.cpp rejected the non-function web_search_preview tool type. On the GB10 side, vLLM failed because of a PyTorch ABI mismatch, but Ollama v0.20.5 worked once the port was forwarded locally and Codex was launched in OSS mode.
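The Mac-side recipe above can be sketched as a llama-server invocation plus a Codex CLI profile. The flags come from the post; the GGUF filename, server port, and the provider/profile key names in the TOML are illustrative assumptions about Codex's config layout, not details the post confirms:

```shell
# Serve Gemma 4 with llama.cpp (llama-server), per the post's flags:
#   --jinja        : apply the model's chat template so tool calls are formatted
#   -np 1          : a single slot, so one request gets the whole context
#   -c 32768       : 32,768-token context (Codex's system prompt alone is ~27k)
#   -ctk/-ctv q8_0 : quantize the KV cache to fit in 24 GB of unified memory
# The model path is a placeholder; point -m at your actual GGUF file.
llama-server \
  -m ~/models/gemma-4-26b.gguf \
  --jinja -np 1 -c 32768 \
  -ctk q8_0 -ctv q8_0

# A matching Codex CLI profile (~/.codex/config.toml). Only the
# web_search = "disabled" setting comes from the post; the provider and
# profile names, and 8080 as llama-server's default port, are assumptions.
cat >> ~/.codex/config.toml <<'EOF'
[model_providers.llamacpp]
name = "llama.cpp"
base_url = "http://localhost:8080/v1"

[profiles.gemma-local]
model_provider = "llamacpp"
model = "gemma-4"
web_search = "disabled"
EOF
```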

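The GB10 path can be sketched the same way, assuming Ollama runs on the GB10 and Codex CLI runs on another machine. The hostname, user, and model tag below are placeholders, and --oss is Codex CLI's switch for targeting a local open-source model server:

```shell
# Forward Ollama's default port (11434) from the GB10 to this machine.
# "user" and "gb10" are placeholders for your actual login and host.
ssh -N -L 11434:localhost:11434 user@gb10 &

# Launch Codex in OSS mode against the forwarded endpoint. The model tag
# is an assumption; use whatever tag Ollama reports for your Gemma 4 pull.
codex --oss -m gemma4:31b
```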
That combination of failures and workarounds is why the post resonated. It turns “run a local coding agent” from a marketing idea into a reproducible, if still fragile, recipe. The HN discussion treated Gemma 4 26B as unusually strong for its weight class, but the more important takeaway is operational: local agent stacks are now good enough to matter, yet still brittle enough that serving details dominate outcomes. For teams thinking about local-first coding workflows, this is exactly the sort of field report that matters more than a leaderboard screenshot.


© 2026 Insights. All rights reserved.