Local models are crossing from hobby setup into coding workflow
Original: Running local models is good now View original →
The renewed interest in local LLMs is not about running a model for novelty. The practical question is whether a developer can put a local model into a real coding workflow without spending more time babysitting it than using it.
Vicki Boykis argues that the answer is starting to become yes for bounded tasks. On a 2022 M2 Mac with 64 GB of RAM, she has tested Mistral 7B, Gemma 3, OpenAI OSS-20B, Qwen 3 MoE, Qwen 2.5 Coder, and several local inference stacks including llama.cpp, Ollama, llamafiles, LM Studio, and llama-cpp-python. Her current setup uses Pi as the agent harness and LM Studio as the local inference server.
The strongest claim is carefully scoped: recent Gemma 4 releases have made local agentic coding feel roughly 75 percent as capable and fast as frontier models for her use. The examples are practical rather than theatrical: refactoring a notebook into modules, tightening Python type hints, writing unit tests, proofreading posts, and bootstrapping a small recommendation-model repository.
The HN discussion added useful friction to that optimism. Commenters pointed out that dense models such as Qwen 27B and larger Gemma variants can be smarter but slow, while MoE models can be faster but more error-prone. Quantization came up repeatedly because many users run 4-bit models to fit local hardware, then hit weaker tool calling or lower reliability. Others argued that local models still lag badly when a task is ambiguous or needs the judgment of a frontier model.
The most convincing pattern is hybrid use. A frontier model can plan or handle ambiguous work, while a local model takes small edits, summaries, code search, documentation questions, or well-specified implementation steps. That split lowers recurring API cost and keeps more code on the user’s machine.
The broader shift is in tooling. LM Studio, llama.cpp, Ollama, Pi, and related harnesses make it easier to inspect prompts, tokens, context windows, quantization choices, and model behavior directly. Local models have not made cloud models obsolete. They have become good enough that developers can decide which parts of the workflow deserve privacy, low marginal cost, and local control.
Related Articles
A high-engagement r/LocalLLaMA thread reports strong early results for Qwen3.5-35B-A3B in local agentic coding workflows. The original poster cites 100+ tokens/sec on a single RTX 3090 setup, while comments show mixed reproducibility and emphasize tooling, quantization, and prompt pipeline differences.
LocalLLaMA users reacted strongly to a small but practical vLLM nightly change. The new Qwen3+ streaming parser is aimed at mid-turn stops and streaming tool-call failures that can break Qwen3.6 agent loops.
LocalLLaMA treated this less as a speed chart and more as a question about completion quality under a messy real prompt. On the same MacBook Pro M5 Max, Qwen 3.6 27B wrote more and faster, but Gemma 4 31B finished the game logic with far fewer tokens.