LocalLLaMA shares a llama.cpp tuning tip: smaller n_ubatch unlocked much faster Qwen 27B prompt processing
Original: (Llama.cpp) In case people are struggling with prompt processing on larger models like Qwen 27B, here's what helped me out
A practical LocalLLaMA thread this weekend focused on a very specific llama.cpp knob that many users set once and forget: --ubatch-size. The post came from a user running Qwen3.5-27B Q3_K_S through the ROCm build on Windows 11 with an AMD RX 9070 XT. Their headline claim was simple: reducing n_ubatch to 64 made prompt processing fast enough to feel usable for Claude Code-style workflows, while higher settings had been dragging badly.
The useful part of the post is the benchmark table. Using llama-bench with -b 8192, the user compared n_ubatch values of 4, 8, 64, and 128. On their machine, prompt-processing throughput (pp512) rose from roughly 59.5 tokens/s at n_ubatch = 4 and 83.3 tokens/s at 8 to about 582.4 tokens/s at 64, then collapsed to roughly 14.7 tokens/s at 128. Token-generation throughput (tg128) stayed almost flat at 26.8 to 27.1 tokens/s across all four settings. In other words, this was not a general inference speedup; it was almost entirely a prompt-ingestion effect.
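Reproducing that kind of sweep on your own hardware is straightforward. A minimal sketch, printed as a dry run so the commands are visible before anything executes (remove the `echo` to run for real); the model filename is a placeholder, and the -p 512 / -n 128 values mirror the thread's pp512/tg128 columns:

```shell
# Sweep candidate n_ubatch values at a fixed logical batch size (-b 8192),
# matching the thread's llama-bench setup. -ub is the short form of
# --ubatch-size; -p and -n set prompt and generation token counts.
MODEL=qwen3.5-27b-q3_k_s.gguf   # placeholder path
for ub in 4 8 64 128 256 512; do
  echo ./llama-bench -m "$MODEL" -b 8192 -ub "$ub" -p 512 -n 128
done
```

Comparing the pp512 column across runs is what exposes a cliff like the one the poster hit between 64 and 128.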
There is also a helpful conceptual anchor in llama.cpp itself. In the project’s GitHub discussions, n_batch is described as the logical batch size for prompt processing, while n_ubatch is the physical batch size actually used per compute pass; the defaults in common.h are n_batch = 2048 and n_ubatch = 512. That means the Reddit finding should not be read as “64 is always best.” It is better read as a reminder that the physical compute batch can interact sharply with a specific GPU, backend, model quantization, and prompt-heavy workload.
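The logical/physical split has a simple arithmetic consequence: a prompt is consumed in logical chunks of n_batch, and each chunk is executed in micro-batches of n_ubatch, so shrinking n_ubatch multiplies the number of compute passes. A sketch of that accounting, based on the batch semantics as described above (not on llama.cpp's actual scheduler code):

```shell
# Compute passes to ingest a prompt = sum over logical chunks of
# ceil(chunk_size / n_ubatch).
passes() {
  local prompt=$1 n_batch=$2 n_ubatch=$3 total=0 logical
  while [ "$prompt" -gt 0 ]; do
    logical=$(( prompt < n_batch ? prompt : n_batch ))
    total=$(( total + (logical + n_ubatch - 1) / n_ubatch ))
    prompt=$(( prompt - logical ))
  done
  echo "$total"
}
passes 8192 2048 512   # defaults (n_batch=2048, n_ubatch=512): 16 passes
passes 8192 8192 64    # the thread's setup (-b 8192 -ub 64): 128 passes
```

More passes of smaller kernels is normally a cost, which is what makes the poster's result interesting: on their GPU the small-micro-batch kernels were evidently so much faster per token that 128 passes beat 16.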
That caution matters. The original poster explicitly says they are not sure whether the sweet spot is tied to the RX 9070 XT’s cache behavior or to some other hardware-specific condition. Still, the result is useful because it separates two things that users often blur together: faster prompt processing and faster token generation are not the same optimization problem. If your local workflow spends more time ingesting long contexts than sampling output, n_ubatch deserves direct profiling.
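Once profiling does turn up a sweet spot, it carries over to serving, since llama-server accepts the same -b/-ub flags as llama-bench. A dry-run sketch (remove the `echo` to launch); the model path and the -c 16384 context size are placeholders, not values from the thread:

```shell
# Carry a profiled n_ubatch into a serving session. -ub sets the physical
# compute batch; -b the logical batch; -c the context window.
echo llama-server -m qwen3.5-27b-q3_k_s.gguf -c 16384 -b 8192 -ub 64
```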
That is probably why the thread picked up. It is not a grand theory of inference. It is a concrete, reproducible tuning anecdote with numbers, a visible failure mode, and a setting that many local-model users may never have profiled carefully. For builders running larger Qwen checkpoints in llama.cpp, that is enough to make it high-signal.
Community source: r/LocalLLaMA thread
Referenced docs: llama.cpp discussion on batch vs ubatch, current defaults in common.h
Related Articles
A r/LocalLLaMA thread is drawing attention to `llama.cpp` pull request #19504, which adds a `GATED_DELTA_NET` op for Qwen3Next-style models. Reddit users reported better token-generation speed after updating, while the PR itself includes early CPU/CUDA benchmark data.
A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.
A new llama.cpp change turns <code>--reasoning-budget</code> into a real sampler-side limit instead of a template stub. The LocalLLaMA thread focused on the tradeoff between cutting long think loops and preserving answer quality, especially for local Qwen 3.5 deployments.