LocalLLaMA User Says Gemma 4 26B A3B Finally Makes Local Tool Calling Feel Stable
Original: Gemma 4 26b A3B is mindblowingly good , if configured right View original →
A popular LocalLLaMA post is getting attention not because it offers a polished benchmark sheet, but because it sounds like a practitioner report from someone trying to make local agents actually usable day to day. The author says they spent several days testing models and quants on an RTX 3090 through LM Studio and kept running into the kinds of failures local-model users care about most: broken tool calling, infinite loops, and prompt-caching slowdowns once conversations became large.
The claim is that Gemma 4 26B A3B behaved differently once configured carefully. According to the post, flash attention plus q4-style quants allowed the model to hold up at long context, prompt caching worked reliably in the stack they were using, and function calling stopped feeling fragile. The poster says they preferred an Unsloth q3k_m quant with temperature 1 and top-k 40, alongside a custom system prompt tailored for their workflow.
The most concrete details are hardware and workflow specific. The writer reports around 80 to 110 tokens per second, says their 24 GB RTX 3090 could push toward the model’s maximum 260k context, and describes using the setup with OpenCode for about six hours while exploring and explaining a 2.7 GB repository. They also note that VRAM requirements are still heavy, and that a 16 GB card may be workable for some tasks but is less attractive for agentic or tool-calling use where a large working context matters.
What makes the post notable
- It is about stability and workflow fit, not just leaderboard positioning.
- The runtime stack and quantization choices appear almost as important as the base model itself.
- The strongest claim is practical: local repo navigation and tool use felt reliable enough to keep using.
This is still a community report, not a controlled evaluation, so the numbers should be read as anecdotal and configuration dependent. Even so, the response to the post shows where local-LLM expectations are moving: people want models that can survive long sessions, call tools correctly, and reason over messy real-world repositories on hardware they already own.
Related Articles
A detailed `r/LocalLLaMA` benchmark reports that pairing `Gemma 4 31B` with `Gemma 4 E2B` as a draft model in `llama.cpp` lifted average throughput from `57.17 t/s` to `73.73 t/s`.
LocalLLaMA liked this because it was not another vague 'model feels worse' post. The thread isolated a concrete failure mode: nullable JSON Schema shapes were collapsing into empty type fields, and a small Jinja fix made Gemma 4's tool calling behave normally again.
DeepSWE reframes coding-agent evaluation with 113 original tasks across 91 repositories. Its first board gives GPT-5.5 a 70.0% pass@1 score, versus 54.2% for Claude Opus 4.7.
Comments (0)
No comments yet. Be the first to comment!