A GBNF tweak that slashed Qwen3.6 token churn gave LocalLLaMA a rare practical win

Original: GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B

LLM · Apr 29, 2026 · By Insights AI (Reddit) · 2 min read

LocalLLaMA reacted to this post because it attacked a pain point everyone recognizes: reasoning drag on local models that are otherwise good enough to keep using. The author says they tweaked a GBNF grammar for Qwen3.6 35B-A3B and 27B inside llama.cpp, with the goal of reducing reasoning-token churn on long tasks. That is already a familiar complaint in the community. What made the thread move was the size of the claimed improvement.
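
The post summary does not reproduce the grammar itself, so the mechanism is worth spelling out for readers who have never touched GBNF: llama.cpp can constrain sampling so that generated text must match a grammar, which makes it possible to bound or outright suppress a model's `<think>` block. The sketch below is a deliberately crude illustration of that mechanism, not the author's actual grammar, which was not shared:

```gbnf
# Illustrative sketch only; NOT the grammar from the post, which was not shared.
# Forces an empty <think> block, then lets the answer run unconstrained
# (crudely approximated here as "any text without a '<' character").
root   ::= "<think>\n\n</think>\n\n" answer
answer ::= [^<]*
```

A real grammar would presumably be less blunt than forcing the think block empty, but even this toy version shows why the thread immediately asked whether the gains come from better reasoning or from reasoning that simply never gets emitted.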

The test setup was concrete enough to hold attention: RTX 5090, Fedora 43, llama.cpp mainline from April 24, and side-by-side runs on a simple greeting prompt, a constraint puzzle, and a private Rust/Next.js benchmark suite with 60 tasks. For Qwen3.6 27B, the post claims puzzle tokens fell from 40,101 to 7,376, puzzle time from 13m36s to 2m27s, and benchmark time from 29m54s to 22m20s while the benchmark score stayed at 4620. For Qwen3.6 35B-A3B, the headline numbers were even louder: puzzle time from 2m32s to 12s, benchmark time from 33m52s to 11m04s, and benchmark score from 4620 to 4740.

  • The post frames the tweak as a grammar change, not a new model release
  • The strongest claimed gain is less reasoning-token waste, especially on simple or long-horizon tasks
  • The benchmark is self-reported and the private Rust/Next.js suite is not public
  • Community discussion immediately asked whether this is a real quality gain or mostly a way to suppress unnecessary chain-of-thought sprawl

That last caveat is exactly why the thread worked. One of the first questions was whether this is basically just turning thought off in a fancier way. Others wanted a step-by-step explanation of how to apply the grammar and what downsides show up in practice. A supportive reply pointed out that GBNF used to be common in the structured-output era and wondered why it disappeared from everyday local-LLM tuning. In other words, people were not celebrating a leaderboard screenshot. They were trying to figure out whether an old control surface still has real leverage on modern reasoning models.
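
For anyone stuck on the "how do I actually apply it" question, the mechanics are simple: llama.cpp's CLI takes a grammar file directly, and llama-server accepts a grammar string per request via the "grammar" field in the completion payload. The model path, grammar file name, and prompt below are placeholders, not the author's setup:

```bash
# Apply a GBNF grammar file when running llama.cpp's CLI.
# Model path, grammar file, and prompt are placeholders, not the author's setup.
./llama-cli -m qwen3.6-27b-q4_k_m.gguf \
  --grammar-file reduce-think.gbnf \
  -p "Solve the constraint puzzle: ..."
```

Because the grammar rides along per run (or per request on the server), A/B testing it against the model's default reasoning behavior is cheap, which is where the thread's unanswered question about downsides would actually get settled.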

The wider appeal is obvious. Local model users do not always need a smarter base model. Sometimes they need the current model to stop wasting time and tokens on ceremonial thinking. If this grammar tweak holds up across other rigs, it is interesting not because it is magical, but because it is cheap, local, and immediately testable. Source link: r/LocalLLaMA thread.
