r/LocalLLaMA Tracks llama.cpp's New Reasoning Budget Controls
Original: Llama.cpp now with a true reasoning budget!
Why LocalLLaMA cared
Local reasoning models are powerful, but they also waste time and tokens on simple questions. That is why the r/LocalLLaMA thread on a new llama.cpp commit got immediate traction. The change adds a real reasoning-budget sampler, new parser and chat handling, and explicit start/end tag support so llama.cpp can count tokens inside a reasoning block and terminate it when the configured budget is exhausted.
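The core budgeting idea can be sketched in a few lines. This is a post-hoc illustration, not llama.cpp's actual sampler (which is C++ and operates during generation); the tag strings and function name are illustrative, and real models use model-specific reasoning tokens.

```python
# Sketch of a reasoning budget: count tokens inside the thinking block
# and force the block closed once the budget is spent.
# Tag strings are illustrative; models define their own reasoning tokens.
THINK_START = "<think>"
THINK_END = "</think>"

def apply_reasoning_budget(tokens, budget):
    """Truncate the reasoning block to at most `budget` tokens."""
    out, in_think, spent, truncated = [], False, 0, False
    for tok in tokens:
        if tok == THINK_START:
            in_think, spent, truncated = True, 0, False
            out.append(tok)
        elif tok == THINK_END:
            if not truncated:          # already closed if we truncated
                out.append(tok)
            in_think = truncated = False
        elif in_think:
            if truncated:
                continue               # drop reasoning past the budget
            if spent >= budget:
                out.append(THINK_END)  # budget exhausted: force-close
                truncated = True
            else:
                spent += 1
                out.append(tok)
        else:
            out.append(tok)
    return out
```

During live inference the same bookkeeping would instead trigger the sampler to emit the end-of-thinking token, rather than trimming an already-generated sequence.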
The post is valuable because it includes failure data rather than just a feature announcement. The author reports that a hard budget hurt Qwen3 9B on HumanEval: the full reasoning setup scored 94%, the non-reasoning version scored 88%, and a forced cutoff dropped the score to 78%. To reduce that damage, the patch also introduces --reasoning-budget-message, which inserts a handoff line right before the end of thinking so the model can pivot to its answer instead of being cut off mid-thought. With a budget of 1000 tokens and a transition message, the reported HumanEval score recovered to 89%.
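Put together, the recovered-score configuration would look roughly like this. The invocation is a sketch based on the post's description: --reasoning-budget-message is named in the post, but the budget flag's exact name, and whether these are server or CLI options, may differ in the actual commit.

```shell
# Hypothetical invocation; flag names follow the post's description
# and may not match the merged change exactly.
llama-server -m qwen3-9b.gguf \
  --reasoning-budget 1000 \
  --reasoning-budget-message "Thinking time is up; state the final answer now."
```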
What the thread surfaced
Comments quickly moved from "nice feature" to control theory for local inference. People suggested gradually biasing the closing reasoning token instead of forcing a hard stop, pointed out naming differences between CLI and HTTP fields, and highlighted the practical win for home setups where a model can spend 80 seconds thinking through a trivial prompt. The broad consensus was that local users need more than an on/off switch. They need a way to trade latency, energy use, and answer quality in a controlled way.
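The soft-ramp suggestion from the comments can be sketched as a schedule for the logit bias applied to the closing reasoning token: zero pressure early on, then a growing bias as spending approaches the budget, so the model is nudged to wrap up rather than chopped off. The function name and parameters below are illustrative, not part of the patch.

```python
# Sketch of a gradual end-of-thinking bias (a commenter's suggestion,
# not the merged behavior). The bias is added to the logit of the
# closing reasoning token before sampling.
def end_think_bias(spent, budget, ramp_frac=0.25, max_bias=8.0):
    """Bias for the end-of-thinking token after `spent` reasoning tokens."""
    # No pressure during the first (1 - ramp_frac) of the budget.
    ramp_start = budget * (1.0 - ramp_frac)
    if spent <= ramp_start:
        return 0.0
    # Linear ramp from 0 at ramp_start to max_bias at the budget,
    # clamped if the model keeps thinking past the budget.
    frac = (spent - ramp_start) / (budget - ramp_start)
    return min(max_bias, frac * max_bias)
```

Compared with a hard stop, this keeps short reasoning untouched and only makes early closure progressively more likely, which is one plausible way to avoid the score collapse the author measured.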
That makes this release more important than a small parser tweak. It is a sign that local inference stacks are starting to expose the same operational controls that hosted reasoning APIs already need. For people running llama.cpp on MacBooks, desktops, or small servers, reasoning budget is not just a benchmark knob. It is a usability feature.
Related Articles
A LocalLLaMA thread reported a large prompt-processing speedup on Qwen3.5-27B by lowering llama.cpp `--ubatch-size` to 64 on an RX 9070 XT. The interesting part is not a universal magic number, but the reminder that prompt ingestion and token generation can respond very differently to `n_ubatch` tuning.
A r/LocalLLaMA thread is drawing attention to `llama.cpp` pull request #19504, which adds a `GATED_DELTA_NET` op for Qwen3Next-style models. Reddit users reported better token-generation speed after updating, while the PR itself includes early CPU/CUDA benchmark data.
A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.