LocalLLaMA shares a llama.cpp tuning tip: smaller n_ubatch unlocked much faster Qwen 27B prompt processing
Original: (Llama.cpp) In case people are struggling with prompt processing on larger models like Qwen 27B, here's what helped me out
A practical LocalLLaMA thread this weekend focused on a very specific llama.cpp knob that many users set once and forget: --ubatch-size. The post came from a user running Qwen3.5-27B Q3_K_S through the ROCm build on Windows 11 with an AMD RX 9070 XT. Their headline claim was simple: reducing n_ubatch to 64 made prompt processing fast enough to feel usable for Claude Code-style workflows, while higher settings had been dragging badly.
The useful part of the post is the benchmark table. Using llama-bench with -b 8192, the user compared n_ubatch values of 4, 8, 64, and 128. On their machine, prompt-processing throughput for pp512 rose from roughly 59.5 tokens/s at 4 and 83.3 tokens/s at 8 to about 582.4 tokens/s at 64, then collapsed to roughly 14.7 tokens/s at 128. Token generation throughput for tg128 stayed almost flat around 26.8 to 27.1 tokens/s. In other words, this was not a general inference boost; it was mostly a prompt-ingestion effect.
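The shape of those numbers is worth dwelling on: prompt processing swings by well over an order of magnitude across n_ubatch values while generation barely moves. A small sketch using only the figures reported in the thread makes the ratios concrete:

```python
# Prompt-processing throughput (tokens/s) for pp512, as reported in the
# thread: llama-bench with -b 8192, RX 9070 XT, ROCm build, Windows 11.
# (tg128 was roughly flat at 26.8-27.1 tokens/s across all four settings.)
pp512 = {4: 59.5, 8: 83.3, 64: 582.4, 128: 14.7}

best = max(pp512, key=pp512.get)
print(f"best n_ubatch for prompt processing: {best}")
print(f"speedup of 64 over 128: {pp512[64] / pp512[128]:.0f}x")
print(f"speedup of 64 over 8:   {pp512[64] / pp512[8]:.1f}x")
```

On these numbers, 64 is roughly 40x faster than 128 and 7x faster than 8 for prompt ingestion, which is why the poster frames it as the difference between unusable and usable for long-context workflows.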
There is also a helpful conceptual anchor in llama.cpp itself. In the project’s GitHub discussion, n_batch is described as the logical batch size for prompt processing, while n_ubatch is the physical batch size used for computation; the current defaults in common.h are n_batch = 2048 and n_ubatch = 512. That means the Reddit finding should not be read as “64 is always best.” It is better read as a reminder that the physical compute batch can interact sharply with a specific GPU, backend, model quantization, and prompt-heavy workload.
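One way to picture the logical/physical distinction (a conceptual sketch, not llama.cpp's actual scheduler code): the prompt is walked in logical chunks of up to n_batch tokens, and each of those chunks is evaluated in physical compute batches of up to n_ubatch tokens.

```python
def ubatch_chunks(n_prompt_tokens: int, n_batch: int = 2048, n_ubatch: int = 512):
    """Yield (start, end) spans of the physical compute batches used to
    process a prompt, mirroring the logical/physical split described in
    the llama.cpp discussion. Purely illustrative."""
    for batch_start in range(0, n_prompt_tokens, n_batch):
        batch_end = min(batch_start + n_batch, n_prompt_tokens)
        for ub_start in range(batch_start, batch_end, n_ubatch):
            yield ub_start, min(ub_start + n_ubatch, batch_end)

# A 5000-token prompt with the defaults: three logical batches of up to
# 2048 tokens, split into ten physical ubatches of up to 512 tokens.
spans = list(ubatch_chunks(5000))
print(len(spans))
```

Shrinking n_ubatch means more, smaller compute launches per prompt; whether that helps or hurts depends entirely on how the backend and hardware handle each launch, which is exactly the interaction the thread stumbled into.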
That caution matters. The original poster explicitly says they are not sure whether the sweet spot is tied to the RX 9070 XT’s cache behavior or to some other hardware-specific condition. Still, the result is useful because it separates two things that users often blur together: faster prompt processing and faster token generation are not the same optimization problem. If your local workflow spends more time ingesting long contexts than sampling output, n_ubatch deserves direct profiling.
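Profiling it is cheap. llama-bench accepts comma-separated value lists for most of its flags, expanding them into one test per combination, so a single invocation can sweep several n_ubatch values. A sketch of building such a command (the model path is a placeholder):

```python
# Construct a llama-bench sweep over n_ubatch values. llama-bench expands
# comma-separated flag values into one test per combination; the model
# path below is a placeholder, not the thread's actual file.
ubatch_values = [8, 16, 32, 64, 128, 256, 512]
cmd = [
    "./llama-bench",
    "-m", "models/qwen-27b-q3_k_s.gguf",  # placeholder path
    "-b", "8192",
    "-ub", ",".join(str(v) for v in ubatch_values),
    "-p", "512",   # pp512 prompt-processing test
    "-n", "128",   # tg128 token-generation test
]
print(" ".join(cmd))
```

A sweep like this, run once per model and quantization you actually use, answers the question directly instead of inheriting someone else's sweet spot.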
That is probably why the thread picked up. It is not a grand theory of inference. It is a concrete, reproducible tuning anecdote with numbers, a visible failure mode, and a setting that many local-model users may never have profiled carefully. For builders running larger Qwen checkpoints in llama.cpp, that is enough to make it high-signal.
Community source: r/LocalLLaMA thread
Referenced docs: llama.cpp discussion on batch vs ubatch, current defaults in common.h