Hacker News picks up a practical Gemma 4 local-agent recipe for moving Codex CLI off the cloud
Original: I ran Gemma 4 as a local model in Codex CLI
A new Hacker News thread pushed attention toward Daniel Vaughan’s April 2026 experiment on running Gemma 4 locally inside Codex CLI. The question was practical, not aspirational: can a local model replace a cloud model for everyday agentic coding, where the model has to read files, emit tool calls correctly, write code, and survive long prompts? Cost, privacy, and resilience all argue for local deployment, but only if tool calling is reliable enough to make the agent useful.
Vaughan tested two setups. The first was a 24 GB M4 Pro MacBook Pro using the 26B MoE variant through llama.cpp. The second was a Dell Pro Max GB10 using the 31B Dense variant. On Apple Silicon, the easiest path failed quickly. Ollama v0.20.3 reportedly had a streaming bug that placed Gemma 4 tool-call responses in the wrong field, and a Flash Attention freeze that broke long prompts. Because Codex CLI already ships with a roughly 27,000-token system prompt, those failures made the simple path unusable.
The working Mac path required a more careful stack: llama.cpp with --jinja, a single slot via -np 1, KV-cache quantization through -ctk q8_0 and -ctv q8_0, a direct GGUF path passed with -m, and a 32,768-token context. The Codex CLI profile also needed web_search = "disabled", because llama.cpp rejected the non-function web_search_preview tool type. On the GB10 side, vLLM failed because of a PyTorch ABI mismatch, but Ollama v0.20.5 worked once the port was forwarded locally and Codex was launched in OSS mode.
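Assembled from the flags named in the post, the working Mac invocation looks roughly like the sketch below. The GGUF filename, port, and quantization level are illustrative, not from the original; only the flags themselves (`--jinja`, `-np 1`, `-ctk`/`-ctv q8_0`, `-m`, and the 32,768-token context) come from the write-up:

```shell
# llama.cpp server sketch: single slot, q8_0 KV cache, jinja chat
# template, 32,768-token context. Model path and port are illustrative.
llama-server \
  -m ./gemma-4-26b-moe.Q4_K_M.gguf \
  --jinja \
  -np 1 \
  -ctk q8_0 -ctv q8_0 \
  -c 32768 \
  --port 8080
```

The matching Codex CLI profile would then point its model provider at the local endpoint (for example `http://localhost:8080/v1`) with `web_search = "disabled"`, since, as noted above, llama.cpp rejects the non-function `web_search_preview` tool type.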
That combination of failures and workarounds is why the post resonated. It turns “run a local coding agent” from a marketing idea into a reproducible, if still fragile, recipe. The HN discussion treated Gemma 4 26B as unusually strong for its weight class, but the more important takeaway is operational: local agent stacks are now good enough to matter, yet still brittle enough that serving details dominate outcomes. For teams thinking about local-first coding workflows, this is exactly the sort of field report that matters more than a leaderboard screenshot.
Related Articles
A LocalLLaMA post argues that recent llama.cpp fixes justify refreshed Gemma 4 GGUF downloads, especially for users relying on local inference pipelines.
A high-scoring LocalLLaMA post argued that merging llama.cpp PR #21534 finally cleared the known Gemma 4 issues in current master. The community focus was not just the fix itself, but the operational details around tokenizer correctness, chat templates, memory flags, and the warning to avoid CUDA 13.2.
A detailed `r/LocalLLaMA` benchmark reports that pairing `Gemma 4 31B` with `Gemma 4 E2B` as a draft model in `llama.cpp` lifted average throughput from `57.17 t/s` to `73.73 t/s`.
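For readers who want to reproduce that pairing, llama.cpp exposes speculative decoding by passing a second, smaller model as the draft. A minimal sketch, assuming locally downloaded GGUF files with these illustrative names (the filenames and `--draft-max` value are assumptions, not from the benchmark post):

```shell
# Speculative decoding sketch: 31B target model, E2B draft model.
# -md/--model-draft names the draft; --draft-max caps drafted tokens
# per step. Filenames and values are illustrative.
llama-server \
  -m ./gemma-4-31b.Q4_K_M.gguf \
  -md ./gemma-4-e2b.Q8_0.gguf \
  --draft-max 16 \
  -c 32768
```

Throughput gains from a draft model depend heavily on how often the target accepts the draft's tokens, so the reported jump from 57.17 t/s to 73.73 t/s should be expected to vary with workload.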