r/LocalLLaMA Tracks llama.cpp's New Reasoning Budget Controls

Original post: "Llama.cpp now with a true reasoning budget!"

LLM · Mar 12, 2026 · By Insights AI (Reddit)

Why LocalLLaMA cared

Local reasoning models are powerful, but they also waste time and tokens on simple questions. That is why the r/LocalLLaMA thread on a new llama.cpp commit got immediate traction. The change adds a real reasoning-budget sampler, new parser and chat handling, and explicit start/end tag support so llama.cpp can count tokens inside a reasoning block and terminate it when the configured budget is exhausted.
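The budget mechanism described above can be sketched in a few lines. This is a simplified illustration of the idea, not llama.cpp's actual implementation; the tag strings and class name are assumptions for the sketch.

```python
# Simplified sketch of a reasoning-budget sampler: count tokens emitted
# inside a <think>...</think> block and force the closing tag once the
# configured budget is exhausted. Tag names are illustrative, not
# llama.cpp's real internals.

THINK_START = "<think>"
THINK_END = "</think>"

class ReasoningBudgetSampler:
    def __init__(self, budget: int):
        self.budget = budget      # max tokens allowed inside the block
        self.in_reasoning = False
        self.used = 0

    def process(self, token: str) -> str:
        """Return the token to actually emit, enforcing the budget."""
        if token == THINK_START:
            self.in_reasoning = True
            self.used = 0
            return token
        if token == THINK_END:
            self.in_reasoning = False
            return token
        if self.in_reasoning:
            self.used += 1
            if self.used >= self.budget:
                # Budget exhausted: terminate the reasoning block early
                # by substituting the closing tag for the sampled token.
                self.in_reasoning = False
                return THINK_END
        return token
```

In the real patch this substitution happens at the logits/sampling level, so the model then continues generating conditioned on the forced closing tag rather than on its truncated thought.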

The post is valuable because it includes failure data rather than just a feature announcement. The author reports that a hard budget hurt Qwen3 9B on HumanEval: the full reasoning setup scored 94%, the non-reasoning version 88%, and a forced cutoff collapsed to 78%. To reduce that damage, the patch also introduces --reasoning-budget-message, which inserts a handoff line immediately before the forced end of the thinking block. With a budget of 1,000 tokens and a transition message, the reported HumanEval score recovered to 89%.
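The handoff behaviour behind --reasoning-budget-message can be sketched as a simple truncate-and-append step; the tag and default message below are illustrative assumptions, not the flag's actual default.

```python
# Sketch of the handoff idea: when the reasoning block exceeds its budget,
# inject a transition line just before the forced </think> so the model
# pivots to answering instead of being cut off mid-thought. The tag and
# message strings are assumptions for illustration.

THINK_END = "</think>"
DEFAULT_HANDOFF = "Time is up; I'll answer with what I have so far."

def close_reasoning(reasoning_tokens: list[str], budget: int,
                    handoff: str = DEFAULT_HANDOFF) -> list[str]:
    """Truncate a reasoning block to `budget` tokens, appending a
    handoff line before the closing tag only if truncation occurred."""
    if len(reasoning_tokens) <= budget:
        return reasoning_tokens + [THINK_END]
    return reasoning_tokens[:budget] + [handoff, THINK_END]
```

The reported 78% → 89% recovery suggests that *how* a budget is enforced matters as much as the budget itself: the transition line gives the model an in-context cue to summarize rather than an abrupt cutoff.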

What the thread surfaced

Comments quickly moved from "nice feature" to control theory for local inference. People suggested gradually biasing the closing reasoning token instead of forcing a hard stop, pointed out naming differences between the CLI flags and the HTTP API fields, and highlighted the practical win for home setups, where a model can otherwise spend 80 seconds thinking through a trivial prompt. The broad consensus was that local users need more than an on/off switch: they need a way to trade latency, energy use, and answer quality in a controlled way.
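The "soft stop" suggestion from the comments could look something like the function below: ramp up a logit bias on the end-of-thinking token as the budget nears, so the model tends to wrap up on its own. The ramp shape and scale are assumptions, not anything llama.cpp currently implements.

```python
# Sketch of gradually biasing the closing reasoning token instead of
# hard-stopping. The bias would be added to the </think> token's logit
# before sampling. Ramp shape and max_bias value are assumptions.

def end_tag_bias(tokens_used: int, budget: int, max_bias: float = 8.0) -> float:
    """Return a logit bias for the closing reasoning token.

    Zero for the first half of the budget, then ramps quadratically to
    `max_bias` at the budget and keeps growing past it.
    """
    frac = tokens_used / budget
    if frac <= 0.5:
        return 0.0
    ramp = (frac - 0.5) / 0.5   # 0.0 at half budget, 1.0 at the budget
    return max_bias * ramp * ramp
```

The appeal of this shape is that the model keeps full freedom early on, feels increasing pressure to conclude as the budget approaches, and is only ever forced out asymptotically rather than mid-token, which may avoid the score collapse seen with a hard cutoff.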

That makes this release more important than a small parser tweak. It is a sign that local inference stacks are starting to expose the same operational controls that hosted reasoning APIs already need. For people running llama.cpp on MacBooks, desktops, or small servers, reasoning budget is not just a benchmark knob. It is a usability feature.

Commit | Reddit discussion



