LocalLLaMA Tests Qwen3.5-35B-A3B for Agentic Coding, Reports Triple-Digit Token Speeds
Original post: "Qwen3.5-35B-A3B is a gamechanger for agentic coding"
What the Community Post Claimed
A top LocalLLaMA thread reported strong local coding performance from Qwen3.5-35B-A3B. The author described running llama.cpp on a headless Linux box with a single RTX 3090, using an MXFP4 model build and a long-context configuration, while citing roughly 22 GB of VRAM usage.
The poster shared concrete launch settings and claimed two practical outcomes: sustained throughput above 100 tokens per second and successful completion of a personal coding evaluation task that had historically taken human candidates several hours. They also described a quick recreation task in an agentic workflow, positioning the model as unusually strong for local open-weight coding use.
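A long-context claim at ~22 GB of VRAM can be sanity-checked with a back-of-envelope KV-cache estimate, since at large context lengths the cache, not the weights, often dominates the budget. The sketch below uses illustrative dimensions (48 layers, 8 KV heads, head dimension 128) that are assumptions, not Qwen3.5-35B-A3B's actual config, and assumes an unquantized FP16 KV cache:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size: K and V tensors for every layer,
    each n_kv_heads * head_dim values per token of context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical dimensions for a ~35B MoE model at a 64K context.
size_gib = kv_cache_bytes(48, 8, 128, 65536) / 2**30
print(f"{size_gib:.1f} GiB")  # 12.0 GiB under these assumed dimensions
```

Numbers like these explain why quantized KV caches and careful context budgeting matter so much for single-GPU setups.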
Why the Thread Drew Attention
- It combined reproducible setup details with claimed real task outcomes
- It focused on local hardware economics rather than cloud API performance
- It framed results around agent tool usage, not only static benchmark scores
Commenters contributed a wider range of evidence. Some reported similarly high throughput on newer consumer and workstation GPUs. Others saw weaker tool-use behavior despite good code-reading quality. Several practitioners highlighted that agent results depend heavily on surrounding system choices: quantization format, framework implementation, number of tools in the schema, and context-management strategy.
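The "number of tools in the schema" point is concrete: every tool definition is serialized into the prompt, so a large schema consumes context and can degrade tool-selection accuracy. A minimal sketch of an OpenAI-style tools payload of the kind llama.cpp's OpenAI-compatible server accepts; the tool names `read_file` and `run_tests` are hypothetical examples, not from the thread:

```python
def make_tool(name: str, description: str, params: dict) -> dict:
    """Build one OpenAI-style function-tool entry."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }

tools = [
    make_tool("read_file", "Read a file from the workspace",
              {"path": {"type": "string"}}),
    make_tool("run_tests", "Run the project's test suite",
              {"target": {"type": "string"}}),
]
# Each entry here is injected into the model's context; trimming unused
# tools is one of the cheapest ways to improve agent reliability.
```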
How to Read These Results
This is still community evidence, not a controlled benchmark paper. But it is useful evidence because the thread exposes conditions under which local coding models either perform surprisingly well or degrade quickly. The practical message is not simply “this model is fastest,” but that end-to-end agent design now determines whether local LLMs can replace portions of API-first coding loops.
For teams evaluating local deployment, this thread is a reminder to test entire pipelines: model + quant + runtime + tool schema + workload. Qwen3.5-35B-A3B appears capable of strong coding output in tuned environments, yet variance across real setups remains high enough that production decisions should be validated with internal workloads before broad rollout.
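One lightweight way to act on that advice is a small end-to-end smoke suite run against whatever model + quant + runtime combination is under evaluation, measuring both task pass rate and throughput in the same run. A minimal sketch, where `generate` is a placeholder for any client returning generated text plus a completion-token count:

```python
import time

def run_smoke_suite(generate, tasks):
    """Run (prompt, checker) tasks against a generate(prompt) -> (text, n_tokens)
    callable; return task pass rate and aggregate tokens per second."""
    passed, tokens = 0, 0
    start = time.perf_counter()
    for prompt, checker in tasks:
        text, n_tokens = generate(prompt)
        tokens += n_tokens
        if checker(text):
            passed += 1
    elapsed = time.perf_counter() - start
    return passed / len(tasks), tokens / elapsed

# Usage with a stand-in generator; swap in a real llama.cpp client.
fake = lambda prompt: ("def add(a, b):\n    return a + b", 12)
pass_rate, tps = run_smoke_suite(
    fake, [("write add()", lambda t: "return a + b" in t)])
```

Running the same suite across quant formats and runtimes makes the variance the thread describes directly measurable on internal workloads.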
Source thread: r/LocalLLaMA discussion
Related model page: Hugging Face - Qwen3.5-35B-A3B
Related Articles
LocalLLaMA reacted because the --fit flag challenged the old rule of thumb that anything outside VRAM means painfully slow inference.
A March 2026 r/LocalLLaMA post with 126 points and 45 comments highlighted a practical guide for running Qwen3.5-27B through llama.cpp and wiring it into OpenCode. The post stands out because it covers the operational details that usually break local coding setups: quant choice, chat-template fixes, VRAM budgeting, Tailscale networking, and tool-calling behavior.
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.