A GBNF tweak that slashed Qwen3.6 token churn gave LocalLLaMA a rare practical win
Original: GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B
LocalLLaMA reacted to this post because it attacked a pain point everyone recognizes: reasoning drag on local models that are otherwise good enough to keep using. The author says they tweaked a GBNF grammar for Qwen3.6 35B-A3B and 27B inside llama.cpp, with the goal of reducing reasoning-token churn on long tasks. That is already a familiar complaint in the community. What made the thread move was the size of the claimed improvement.
The test setup was concrete enough to hold attention: RTX 5090, Fedora 43, llama.cpp mainline from April 24, and side-by-side runs on a simple greeting prompt, a constraint puzzle, and a private Rust/Next.js benchmark suite with 60 tasks. For Qwen3.6 27B, the post claims puzzle tokens fell from 40,101 to 7,376, puzzle time from 13m36s to 2m27s, and benchmark time from 29m54s to 22m20s while the benchmark score stayed at 4620. For Qwen3.6 35B-A3B, the headline numbers were even louder: puzzle time from 2m32s to 12s, benchmark time from 33m52s to 11m04s, and benchmark score up from 4620 to 4740.
- The post frames the tweak as a grammar change, not a new model release
- The strongest claimed gain is less reasoning-token waste, especially on simple or long-horizon tasks
- The benchmark is self-reported and the private Rust/Next.js suite is not public
- Community discussion immediately asked whether this is a real quality gain or mostly a way to suppress unnecessary chain-of-thought sprawl
That last caveat is exactly why the thread worked. One of the first questions asked whether this is basically just a fancier way of turning thinking off. Others wanted a step-by-step explanation of how to apply the grammar and what downsides show up in practice. A supportive reply pointed out that GBNF used to be common in the structured-output era and wondered why it disappeared from everyday local-LLM tuning. In other words, people were not celebrating a leaderboard screenshot. They were trying to figure out whether an old control surface still has real leverage on modern reasoning models.
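The post does not publish the grammar itself, so as a rough illustration of the general idea only, a minimal GBNF sketch that caps how much the model can spend inside its think block might look like this. The rule names, the `<think>` tag assumption, and the 2000-character budget are all invented for illustration, not taken from the post:

```gbnf
# Hypothetical sketch, not the author's grammar.
# Allow an optional, bounded reasoning block, then require a plain answer.
root   ::= think? answer
# Cap the think block at 2000 characters (budget chosen arbitrarily here).
think  ::= "<think>" [^<]{0,2000} "</think>" "\n"
# The visible answer: any non-empty text without further tags.
answer ::= [^<]+
```

In llama.cpp, a file like this is passed to llama-cli via `--grammar-file`, which constrains sampling to strings the grammar accepts. Whether this resembles the author's actual tweak, and how it behaves on tasks that genuinely need long reasoning, would have to be tested per rig.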
The wider appeal is obvious. Local model users do not always need a smarter base model. Sometimes they need the current model to stop wasting time and tokens on ceremonial thinking. If this grammar tweak holds up across other rigs, it is interesting not because it is magical, but because it is cheap, local, and immediately testable. Source link: r/LocalLLaMA thread.