LocalLLaMA Tests Qwen3.5-35B-A3B for Agentic Coding, Reports Triple-Digit Token Speeds
Original: "Qwen3.5-35B-A3B is a gamechanger for agentic coding"
What the Community Post Claimed
A top LocalLLaMA thread reported strong local coding performance from Qwen3.5-35B-A3B. The author described running llama.cpp on a headless Linux box with a single RTX 3090, using an MXFP4 model build and a long-context configuration, while citing roughly 22 GB of VRAM usage.
The poster shared concrete launch settings and claimed two practical outcomes: sustained throughput above 100 tokens per second and successful completion of a personal coding evaluation task that had historically taken human candidates several hours. They also described a quick recreation task in an agentic workflow, positioning the model as unusually strong for local open-weight coding use.
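The thread's exact launch flags are not reproduced here, but a setup like the one described, a single 24 GB GPU with full offload and a long context window, would typically look something like the following llama.cpp invocation. The model filename, context length, and port are illustrative assumptions, not the poster's values:

```shell
# Hypothetical llama-server launch approximating the described setup.
# Model filename, context size, and port are placeholders, not the poster's flags.
# -ngl 99 offloads all layers to the GPU; -c sets the context window size.
./llama-server \
  -m ./Qwen3.5-35B-A3B-MXFP4.gguf \
  -ngl 99 \
  -c 32768 \
  --port 8080
```

Actual VRAM usage at a given context length depends on KV-cache settings and the runtime version, so the ~22 GB figure should be treated as one data point, not a guarantee.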
Why the Thread Drew Attention
- It combined reproducible setup details with claimed real task outcomes
- It focused on local hardware economics rather than cloud API performance
- It framed results around agent tool usage, not only static benchmark scores
Commenters added a wider range of evidence. Some reported similarly high throughput on newer consumer and workstation GPUs. Others saw weaker tool-use behavior despite good code-reading quality. Several practitioners stressed that agent results depend heavily on the surrounding system: quantization format, framework implementation, number of tools in the schema, and context-management strategy.
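One low-effort way to check whether a given stack reproduces the claimed triple-digit decode speed is to read the llama.cpp server's own timing report rather than eyeballing terminal output. A sketch, assuming a llama-server instance on localhost:8080 and jq installed; the prompt and port are placeholders:

```shell
# Query the llama.cpp server's /completion endpoint and print its
# self-reported decode rate (tokens generated per second).
curl -s http://localhost:8080/completion \
  -d '{"prompt": "Write a quicksort in Python.", "n_predict": 256}' \
  | jq '.timings.predicted_per_second'
```

To make the number meaningful for agent work, measure under realistic conditions, with the full tool schema in the prompt and a long context, since each of the factors commenters listed can move it substantially.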
How to Read These Results
This is still community evidence, not a controlled benchmark paper. But it is useful evidence because the thread exposes conditions under which local coding models either perform surprisingly well or degrade quickly. The practical message is not simply “this model is fastest,” but that end-to-end agent design now determines whether local LLMs can replace portions of API-first coding loops.
For teams evaluating local deployment, this thread is a reminder to test entire pipelines: model + quant + runtime + tool schema + workload. Qwen3.5-35B-A3B appears capable of strong coding output in tuned environments, yet variance across real setups remains high enough that production decisions should be validated with internal workloads before broad rollout.
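For the quantization and runtime axes of that pipeline specifically, llama.cpp ships a benchmarking tool that makes a sweep cheap. A sketch comparing several quant builds; the filenames are assumptions, so substitute whatever GGUF variants you actually have:

```shell
# Sweep raw throughput across quant formats with llama-bench:
# -p = prompt tokens to process, -n = tokens to generate.
# Filenames are illustrative; use your own GGUF builds.
for q in Q4_K_M Q8_0 MXFP4; do
  ./llama-bench -m "Qwen3.5-35B-A3B-$q.gguf" -p 512 -n 128
done
```

Note that this isolates throughput only; tool-use quality and agent reliability still require task-level evaluation on internal workloads.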
Source thread: r/LocalLLaMA discussion
Related model page: Hugging Face - Qwen3.5-35B-A3B
Related Articles
A high-scoring LocalLLaMA post says Qwen 3.5 9B on a 16GB M1 Pro handled memory recall and basic tool calling well enough for real agent work, even though creative reasoning still trailed frontier models.
A Hacker News post surfaced Unsloth's Qwen3.5 local guide, which lays out memory targets, reasoning-mode controls, and llama.cpp commands for running 27B and 35B-A3B models on local hardware.
A well-received PSA on r/LocalLLaMA argues that convenience layers such as Ollama and LM Studio can change model behavior enough to distort evaluation. The more durable lesson from the thread is reproducibility: hold templates, stop tokens, sampling, runtime versions, and quantization constant before judging a model.