r/LocalLLaMA Tries to Standardize Practical Qwen3.5 Presets
Original: Qwen3.5 Best Parameters Collection
On March 20, 2026, an r/LocalLLaMA thread titled "Qwen3.5 Best Parameters Collection" reached 123 points and 47 comments. The timing matters because Qwen3.5 had been out for a few weeks: enough time for quantizations, runtimes, and sampler settings to settle, but still early enough that many users were comparing notes rather than following a stable consensus. The original post asked for working presets by use case and shared one starting configuration for Qwen3.5-35B-A3B on llama.cpp v8400, using temp 0.7, top-p 0.8, top-k 20, presence penalty 1.5, repeat penalty 1.0, and a reasoning budget of 1000 for general chat.
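To make that starting configuration concrete, here is a minimal sketch of how such a preset might be sent to a local llama.cpp server through its OpenAI-compatible endpoint. The URL, model name, and the assumption that this build accepts `top_k` and `repeat_penalty` as extra request fields are illustrative, not details from the thread; the reasoning budget is typically set at the server or chat-template level rather than per request, so it is omitted here.

```python
import requests

# Sketch only: the thread's starting preset, applied per-request against a
# hypothetical local llama.cpp server. Endpoint, model name, and the
# non-standard fields (top_k, repeat_penalty) are assumptions for illustration.
PRESET = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,             # llama.cpp extension field, not part of the OpenAI schema
    "presence_penalty": 1.5,
    "repeat_penalty": 1.0,   # llama.cpp extension field
}

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # hypothetical local server address
    json={
        "model": "qwen3.5-35b-a3b",                # placeholder model identifier
        "messages": [{"role": "user", "content": "Give me a one-paragraph project summary."}],
        **PRESET,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```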
What the thread actually surfaced
- Many commenters said the safest baseline is still the official Qwen recommendations in model cards rather than Reddit folklore.
- Several users shared different presets for different jobs: coding and general chat with thinking enabled, and creative writing and coding in instruct (non-thinking) mode.
- Reasoning budgets became a major tuning axis, with examples ranging from 4096 to 16384 depending on document length and tolerance for long chains of thought.
- For tool-calling work, some users reported better results in non-thinking mode with tighter repeat penalties, arguing that long reasoning traces slowed the system without improving outcomes.
That pattern is more interesting than any single parameter list. The LocalLLaMA community is treating inference policy as a first-class layer of model performance. The same checkpoint can feel verbose, unstable, or highly capable depending on whether it is asked to code, chat, call tools, or parse a long document. In other words, the argument is shifting from "Which model wins?" to "What operating profile makes this model useful?"
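As a sketch of what an "operating profile" looks like in practice, the snippet below keys sampler settings and reasoning budgets by task. All numbers are illustrative composites of what commenters described (thinking modes with larger budgets, non-thinking tool calling with a tighter repeat penalty); they are not an endorsed preset for any particular Qwen3.5 build, and `reasoning_budget: 0` is used here only as a stand-in for disabling thinking.

```python
# Illustrative sketch of "inference policy as a layer": one checkpoint, several
# operating profiles. Values are examples in the spirit of the thread, not a
# verified recommendation.
PROFILES = {
    "thinking_coding":   {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "reasoning_budget": 16384},
    "thinking_general":  {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "reasoning_budget": 4096},
    "instruct_creative": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "reasoning_budget": 0},
    "tool_calling":      {"temperature": 0.7, "top_p": 0.8,  "top_k": 20,
                          "repeat_penalty": 1.05,            # tighter repeats, non-thinking mode
                          "reasoning_budget": 0},
}

def sampler_args(task: str) -> dict:
    """Pick the operating profile for a task; fall back to the general thinking baseline."""
    return PROFILES.get(task, PROFILES["thinking_general"])
```

The point of the structure, rather than the specific numbers, is that the selection happens per task at request time instead of being baked into a single global default.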
Why the thread matters
Open-weight ecosystems usually go through the same maturity curve. First the attention is on raw benchmark strength. Then it moves to quant quality, runtime support, and context length. After that, users discover that default sampler settings hide a large part of real-world performance. This thread sits squarely in that third phase. It does not produce one universal preset, but it does show a community converging on a more disciplined approach: start from official settings, then branch by task type and reasoning budget instead of chasing a single magic configuration.
That is useful for anyone evaluating local LLM stacks on consumer GPUs. A model that "thinks too much" in general chat may still be the right choice for coding or document analysis if the sampler and reasoning budget are adjusted correctly. The thread is less a leaderboard update than a sign that Qwen3.5 is entering the phase where operating practice matters almost as much as weights.
Sources: r/LocalLLaMA discussion · Unsloth Qwen3.5 documentation
Related Articles
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
An r/LocalLLaMA field report showed how a very specific local inference workload was tuned for throughput. The author reported about 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.
A LocalLLaMA thread reported a large prompt-processing speedup on Qwen3.5-27B by lowering llama.cpp `--ubatch-size` to 64 on an RX 9070 XT. The interesting part is not a universal magic number, but the reminder that prompt ingestion and token generation can respond very differently to `n_ubatch` tuning.