r/LocalLLaMA Tries to Standardize Practical Qwen3.5 Presets
Original: Qwen3.5 Best Parameters Collection View original →
On March 20, 2026, a r/LocalLLaMA thread titled "Qwen3.5 Best Parameters Collection" reached 123 points and 47 comments. The timing matters because Qwen3.5 had been out for a few weeks: enough time for quantizations, runtimes, and sampler settings to settle, but still early enough that many users were comparing notes rather than following a stable consensus. The original post asked for working presets by use case and shared one starting configuration for Qwen3.5-35B-A3B on llama.cpp v8400, using temp 0.7, top-p 0.8, top-k 20, presence penalty 1.5, repeat penalty 1.0, and a reasoning budget of 1000 for general chat.
What the thread actually surfaced
- Many commenters said the safest baseline is still the official Qwen recommendations in model cards rather than Reddit folklore.
- Several users shared different presets for different jobs: thinking coding, thinking general, instruct creative writing, and instruct coding.
- Reasoning budgets became a major tuning axis, with examples ranging from 4096 to 16384 depending on document length and tolerance for long chains of thought.
- For tool-calling work, some users reported better results in non-thinking mode with tighter repeat penalties, arguing that long reasoning traces slowed the system without improving outcomes.
That pattern is more interesting than any single parameter list. The LocalLLaMA community is treating inference policy as a first-class layer of model performance. The same checkpoint can feel verbose, unstable, or highly capable depending on whether it is asked to code, chat, call tools, or parse a long document. In other words, the argument is shifting from "Which model wins?" to "What operating profile makes this model useful?"
Why the thread matters
Open-weight ecosystems usually go through the same maturity curve. First the attention is on raw benchmark strength. Then it moves to quant quality, runtime support, and context length. After that, users discover that default sampler settings hide a large part of real-world performance. This thread sits squarely in that third phase. It does not produce one universal preset, but it does show a community converging on a more disciplined approach: start from official settings, then branch by task type and reasoning budget instead of chasing a single magic configuration.
That is useful for anyone evaluating local LLM stacks on consumer GPUs. A model that "thinks too much" in general chat may still be the right choice for coding or document analysis if the sampler and reasoning budget are adjusted correctly. The thread is less a leaderboard update than a sign that Qwen3.5 is entering the phase where operating practice matters almost as much as weights.
Sources: r/LocalLLaMA discussion · Unsloth Qwen3.5 documentation
Related Articles
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.
LocalLLaMA got animated because the post promised something people can feel immediately: less reasoning drag. A user claims a small GBNF constraint cut Qwen3.6 token burn hard enough to speed up long tasks without wrecking benchmark scores.
A LocalLLaMA user has shared a detailed guide for running Qwen 3.6 27B with Multi-Token Prediction support in llama.cpp, achieving 2.5x inference speedup and 262k context on 48GB of memory.
Comments (0)
No comments yet. Be the first to comment!