Local tool calling hit LocalLLaMA’s reality check: model, quant, or harness?
Original: Are you guys actually using local tool calling or is it a collective prank?
Community Spark
A r/LocalLLaMA thread asked whether local tool calling is real or a collective prank, and the question landed because many users have hit the same failure mode. The poster described Open WebUI with a Terminal tool in Docker and models served through LM Studio, then listed Qwen3.5 27B/35B, Gemma4 26B, Qwen3.6 35B and GPT-OSS 20B as models that struggled even to create a simple file reliably.
What The Community Blamed First
The most useful replies did not stop at “local models are bad.” Several users pointed at Open WebUI as the weak link and said OpenCode, Cline in VS Code, llama.cpp, or LM Studio’s own runtime had produced better results. One reply said Open WebUI is fine for chat but weaker with newer models that depend on native tool-call fields and separate reasoning fields. Another said OpenCode had been working well for coding-oriented local tool use.
The Debug Checklist
The thread produced a practical set of variables: avoid very aggressive quants when testing tool use, confirm native tool calling is enabled, check whether the harness returns reasoning in the expected API field, and make sure the tool schema matches what the model has learned. Users also noted that asynchronous shell commands can confuse some wrappers even when the same model behaves better in a coding-specific agent.
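One checklist item above, checking whether the harness returns the tool call in the expected API field, is easy to probe directly. The sketch below is a minimal illustration assuming an OpenAI-compatible message shape: it prefers the native `tool_calls` field and falls back to scanning `content` for bare JSON, a degradation some wrappers exhibit. The field names follow the chat-completions convention; the fallback logic is illustrative, not any specific wrapper’s behavior.

```python
import json


def extract_tool_calls(message: dict) -> list[dict]:
    """Return parsed tool calls from an OpenAI-style chat message.

    Prefers the native `tool_calls` field; falls back to scanning
    `content` for a bare JSON object, which some wrappers emit instead
    of populating the structured field.
    """
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        calls.append({
            "name": fn.get("name"),
            "arguments": json.loads(fn.get("arguments") or "{}"),
        })
    if calls:
        return calls
    # Fallback: the call leaked into plain content as raw JSON.
    content = (message.get("content") or "").strip()
    if content.startswith("{"):
        try:
            obj = json.loads(content)
            if "name" in obj:
                calls.append({
                    "name": obj["name"],
                    "arguments": obj.get("arguments", {}),
                })
        except json.JSONDecodeError:
            pass
    return calls


# Healthy case: the harness populated the native field.
native = {"tool_calls": [{"function": {"name": "write_file",
                                       "arguments": '{"path": "hello.txt"}'}}]}
# Degraded case: the same call arrived as JSON inside `content`.
leaked = {"content": '{"name": "write_file", "arguments": {"path": "hello.txt"}}'}
```

If both cases parse to the same call, the model is doing its job and the wrapper is where the call gets lost; if only the leaked form appears, native tool calling is likely disabled or unsupported in that harness.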
Why It Matters
Local agents are often discussed as a model leaderboard problem, but this thread shows the stack is the product. A strong Qwen or Gemma run can still fail if the UI wrapper mishandles tool-call JSON, strips reasoning incorrectly, or traps the model in an execution loop. The operational lesson is to log the full setup: model, quant, server, runtime, wrapper, tool mode, and task. Without that record, “local tool calling works” and “local tool calling is broken” are both too vague to be useful.
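Logging the full setup can be as simple as serializing one record per test run so results from different stacks are comparable. This is a minimal sketch; the field values in the example (model name, quant, result string) are hypothetical placeholders, not results from the thread.

```python
import datetime
import json
import platform


def run_record(**fields) -> str:
    """Serialize one tool-calling test run as JSON for later comparison."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "host": platform.platform(),  # captures OS/arch alongside the stack
        **fields,
    }
    return json.dumps(record, indent=2)


# Hypothetical example values for illustration:
print(run_record(
    model="Qwen3.5-27B",
    quant="Q5_K_M",
    server="LM Studio",
    wrapper="Open WebUI",
    tool_mode="native",
    task="create a simple file",
    result="fail: call emitted in content instead of tool_calls",
))
```

A folder of such records makes it possible to say which variable actually changed between a working run and a broken one, instead of arguing from memory.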
Source: r/LocalLLaMA discussion.
Related Articles
r/LocalLLaMA liked this comparison because it replaces reputation and anecdote with a more explicit distribution-based yardstick. The post ranks community Qwen3.5-9B GGUF quants by mean KLD versus a BF16 baseline, with Q8_0 variants leading on fidelity and several IQ4/Q5 options standing out on size-to-drift trade-offs.
LocalLLaMA reacted because the post attacks a very real pain point for running large MoE models on limited VRAM. The author tested a llama.cpp fork that tracks recently routed experts and keeps the hot ones in VRAM for Qwen3.5-122B-A10B, reporting 26.8% faster token generation than layer-based offload at a similar 22GB VRAM budget.