Local tool calling hit LocalLLaMA’s reality check: model, quant, or harness?
Original: Are you guys actually using local tool calling or is it a collective prank?
An r/LocalLLaMA thread asked whether local tool calling is real or a collective prank, and the question landed because many users have hit the same failure mode. The poster described running Open WebUI with Terminal in Docker, with models served through LM Studio, then listed Qwen3.5 27B/35B, Gemma4 26B, Qwen3.6 35B and GPT-OSS 20B as models that struggled to create a simple file reliably.
What The Community Blamed First
The most useful replies did not stop at “local models are bad.” Several users pointed at Open WebUI as the weak link and said OpenCode, Cline in VS Code, llama.cpp or LM Studio’s own runtime had produced better results. One reply said Open WebUI is fine for chat but weaker for newer models that depend on native tool-call fields and separate reasoning fields. Another said OpenCode had been working well for coding-oriented local tool use.
The Debug Checklist
The thread produced a practical set of variables to check: avoid very aggressive quants when testing tool use, confirm native tool calling is enabled, check whether the harness returns reasoning in the API field the client expects, and make sure the tool schema matches the tool-call format the model was trained on. Users also noted that asynchronous shell commands can confuse some wrappers even when the same model behaves better in a coding-specific agent.
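Several of these checks can be run against a single raw response before blaming the model. The sketch below is not from the thread. It assumes the server returns an OpenAI-style chat completion (`choices[0].message.tool_calls`, with `function.arguments` as a JSON string). The function name `diagnose_tool_response` and the `<tool_call>` and `<think>` markers it looks for are illustrative assumptions, since leaked formats vary by model and chat template.

```python
import json


def diagnose_tool_response(response: dict, declared_tools: list[dict]) -> list[str]:
    """Return human-readable problems found in one assistant turn.

    Assumes an OpenAI-style chat completion dict; adjust field names
    if your server or wrapper uses a different layout.
    """
    problems = []
    message = response["choices"][0]["message"]
    content = message.get("content") or ""
    tool_calls = message.get("tool_calls") or []
    declared = {t["function"]["name"]: t["function"] for t in declared_tools}

    if not tool_calls:
        # Heuristic only: these markers are assumptions, not a standard.
        if "<tool_call>" in content or '"arguments"' in content:
            problems.append("tool call appears in content instead of tool_calls")
        else:
            problems.append("no tool call emitted")

    # Reasoning text left in content suggests the harness did not split it
    # into a separate field (marker is an assumption; models differ).
    if "<think>" in content:
        problems.append("reasoning leaked into content")

    for call in tool_calls:
        fn = call.get("function") or {}
        name = fn.get("name")
        if name not in declared:
            problems.append(f"unknown tool: {name}")
            continue
        try:
            args = json.loads(fn.get("arguments") or "{}")
        except json.JSONDecodeError:
            problems.append(f"{name}: arguments are not valid JSON")
            continue
        if not isinstance(args, dict):
            problems.append(f"{name}: arguments are not a JSON object")
            continue
        required = declared[name].get("parameters", {}).get("required", [])
        missing = [key for key in required if key not in args]
        if missing:
            problems.append(f"{name}: missing required arguments {missing}")

    return problems
```

An empty list means the turn looks structurally sound. Anything else points at the wrapper, the chat template or the schema rather than at model quality alone.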
Why It Matters
Local agents are often discussed as a model leaderboard problem, but this thread shows the stack is the product. A strong Qwen or Gemma run can still fail if the UI wrapper mishandles tool-call JSON, strips reasoning incorrectly, or keeps the model in an execution loop. The operational lesson is to log the full setup: model, quant, server, runtime, wrapper, tool mode and task. Without that, “local tool calling works” and “local tool calling is broken” are both too vague to be useful.
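One lightweight way to make reports comparable is to attach a structured record of the stack to every test run. The sketch below is a hypothetical format, not something proposed in the thread. The class name `ToolCallRun`, its field names and the example values (including the quant label and the outcome) are placeholders for illustration, not reported results.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolCallRun:
    """One tool-calling attempt, described by the variables named above."""
    model: str
    quant: str
    server: str
    runtime: str
    wrapper: str
    tool_mode: str   # e.g. "native" or "prompt-based"
    task: str
    succeeded: bool


def to_log_line(run: ToolCallRun) -> str:
    """Serialize a run as one JSON line with stable key order."""
    return json.dumps(asdict(run), sort_keys=True)


# Placeholder values for illustration only.
example = ToolCallRun(
    model="Qwen3.6 35B",
    quant="Q4_K_M",
    server="LM Studio",
    runtime="llama.cpp",
    wrapper="Open WebUI",
    tool_mode="native",
    task="create a simple file",
    succeeded=False,
)
print(to_log_line(example))
```

Appending one such line per attempt to a file makes it possible to see whether failures track the wrapper, the quant or the model, instead of comparing anecdotes.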
Source: r/LocalLLaMA discussion.