r/LocalLLaMA Tracks llama.cpp's New Reasoning Budget Controls
Original: Llama.cpp now with a true reasoning budget!
Why LocalLLaMA cared
Local reasoning models are powerful, but they also waste time and tokens on simple questions. That is why the r/LocalLLaMA thread on a new llama.cpp commit got immediate traction. The change adds a real reasoning-budget sampler, new parser and chat handling, and explicit start/end tag support so llama.cpp can count tokens inside a reasoning block and terminate it when the configured budget is exhausted.
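The core mechanic can be sketched in a few lines: watch for the reasoning start tag, count tokens until the end tag, and force the block closed once the budget is spent. This is an illustrative Python sketch, not the actual llama.cpp C++ implementation; the tag strings and the `ReasoningBudgetSampler` class are assumptions.

```python
# Hypothetical sketch of a reasoning-budget sampler: count tokens inside
# the thinking block and emit the end tag once the budget is exhausted.
# Tag strings below are assumed; real models vary.
THINK_START = "<think>"
THINK_END = "</think>"

class ReasoningBudgetSampler:
    def __init__(self, budget: int):
        self.budget = budget       # max tokens allowed inside the block
        self.in_reasoning = False
        self.spent = 0

    def next_token(self, proposed: str) -> str:
        """Pass through the model's proposed token, unless the budget
        is exhausted, in which case force the closing tag instead."""
        if proposed == THINK_START:
            self.in_reasoning = True
            self.spent = 0
            return proposed
        if proposed == THINK_END:
            self.in_reasoning = False
            return proposed
        if self.in_reasoning:
            self.spent += 1
            if self.spent >= self.budget:
                # Budget spent: terminate the reasoning block here.
                self.in_reasoning = False
                return THINK_END
        return proposed
```

A stream of `["<think>", "a", "b", "c", "answer"]` with a budget of 3 would come out as `["<think>", "a", "b", "</think>", "answer"]`: the third in-block token is replaced by the forced close.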
The post is valuable because it includes failure data rather than just a feature announcement. The author says a hard budget hurt Qwen3 9B on HumanEval: the full reasoning setup scored 94%, the non-reasoning version scored 88%, and a forced cutoff collapsed to 78%. To reduce that damage, the patch also introduces --reasoning-budget-message, which inserts a handoff line right before the end of thinking. With a budget of 1000 tokens and a transition message, the reported HumanEval score recovered to 89%.
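The handoff idea can be sketched as a small extension of the same mechanic: instead of cutting mid-thought, splice a transition line in before the closing tag. Only the `--reasoning-budget-message` flag name comes from the post; the function shape and the default message below are invented for illustration.

```python
# Hedged sketch of the handoff behavior: truncate the reasoning stream
# at the budget, then append a transition message before closing the
# block, so the model pivots to answering rather than stopping abruptly.
THINK_END = "</think>"

def close_with_handoff(budget: int, reasoning_tokens: list[str],
                       message: str = "Time is up; answering with what I have.") -> list[str]:
    """Cap a reasoning stream at `budget` tokens; if truncated, insert
    the handoff message before the end-of-thinking tag."""
    if len(reasoning_tokens) <= budget:
        return reasoning_tokens + [THINK_END]
    return reasoning_tokens[:budget] + [message, THINK_END]
```

Per the reported numbers, this kind of soft landing is what recovered HumanEval from 78% (hard cutoff) to 89% (budget plus transition message).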
What the thread surfaced
Comments quickly moved from "nice feature" to control theory for local inference. People suggested gradually biasing the closing reasoning token instead of forcing a hard stop, pointed out naming differences between CLI and HTTP fields, and highlighted the practical win for home setups where a model can spend 80 seconds thinking through a trivial prompt. The broad consensus was that local users need more than an on/off switch. They need a way to trade latency, energy use, and answer quality in a controlled way.
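The "gradual bias" suggestion from the comments can be sketched as an additive logit bias on the end-of-thinking token that ramps up as the budget is consumed, nudging the model to close the block on its own rather than being cut off. Everything here is illustrative: the quadratic ramp over the second half of the budget and the `max_bias` value are arbitrary choices, not anything merged into llama.cpp.

```python
# Hypothetical soft-stop schedule: zero bias early, then a smooth ramp
# on the end-of-thinking token's logit as the budget runs out.
def end_tag_bias(spent: int, budget: int, max_bias: float = 10.0) -> float:
    """Additive logit bias for the closing reasoning token.

    Returns 0 for the first half of the budget, then rises
    quadratically to max_bias at (and beyond) the budget."""
    frac = spent / budget
    if frac < 0.5:
        return 0.0
    ramp = min((frac - 0.5) / 0.5, 1.0)  # 0 -> 1 over the second half
    return max_bias * ramp ** 2
```

A schedule like this trades a hard guarantee for graceful degradation: the block usually ends near the budget, but the model keeps some say in where the cut lands.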
That makes this release more important than a small parser tweak. It is a sign that local inference stacks are starting to expose the same operational controls that hosted reasoning APIs already need. For people running llama.cpp on MacBooks, desktops, or small servers, reasoning budget is not just a benchmark knob. It is a usability feature.
Related Articles
LocalLLaMA upvoted a llama.cpp merge because it is immediately testable, but the useful caveat was clear: speedups depend heavily on prompt repetition and draft acceptance rates.
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
A well-received LocalLLaMA post spotlighted a llama.cpp experiment that prefetches weights while layers are offloaded to CPU memory, aiming to recover prompt-processing speed for dense and smaller MoE models at longer contexts.