LocalLLaMA shares a llama.cpp tuning tip: smaller n_ubatch unlocked much faster Qwen 27B prompt processing

A practical LocalLLaMA thread this weekend focused on a very specific llama.cpp knob that many users set once and forget: --ubatch-size, which sets n_ubatch. The post came from a user running Qwen3.5-27B Q3_K_S through the ROCm build on Windows 11 with an AMD RX 9070 XT. Their headline claim was simple: reducing n_ubatch to 64 made prompt processing fast enough to feel usable for Claude Code-style workflows, while larger values had been far slower.

The useful part of the post is the benchmark table. Using llama-bench with -b 8192, the user compared n_ubatch values of 4, 8, 64, and 128. On their machine, prompt-processing throughput for pp512 rose from roughly 59.5 tokens/s at 4 and 83.3 tokens/s at 8 to about 582.4 tokens/s at 64, then collapsed to roughly 14.7 tokens/s at 128. Token generation throughput for tg128 stayed almost flat around 26.8 to 27.1 tokens/s. In other words, this was not a general inference boost; it was mostly a prompt-ingestion effect.
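To put those figures in proportion, the short Python sketch below works out the relative speedups and the time needed to ingest a prompt at each setting. The throughput values are the rounded numbers quoted above. The 20,000-token prompt length is an illustrative assumption, not something from the thread, and the calculation assumes throughput stays constant as the prompt grows, which real runs may not honor.

```python
# Prompt-processing throughput (tokens/s) per n_ubatch.
# These are the rounded pp512 figures quoted from the thread.
PP512_TPS = {4: 59.5, 8: 83.3, 64: 582.4, 128: 14.7}


def speedup(tps, fast, slow):
    """Ratio of prompt-processing throughput between two n_ubatch settings."""
    return tps[fast] / tps[slow]


def ingest_seconds(prompt_tokens, tokens_per_second):
    """Time to process a prompt, assuming throughput is constant with length."""
    return prompt_tokens / tokens_per_second


if __name__ == "__main__":
    PROMPT_TOKENS = 20_000  # hypothetical long coding-agent context
    for ub, tps in sorted(PP512_TPS.items()):
        secs = ingest_seconds(PROMPT_TOKENS, tps)
        print(f"n_ubatch={ub:>3}: {tps:6.1f} t/s, {secs:7.1f} s for {PROMPT_TOKENS} tokens")
    print(f"64 vs 4:   {speedup(PP512_TPS, 64, 4):.1f}x")
    print(f"64 vs 128: {speedup(PP512_TPS, 64, 128):.1f}x")
```

Under those assumptions, the gap between 64 and 128 is the difference between waiting about half a minute and waiting more than twenty minutes for the same hypothetical prompt.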

There is also a helpful conceptual anchor in llama.cpp itself. In the project’s GitHub discussion and current common.h defaults, n_batch is described as the logical batch size for prompt processing, while n_ubatch is the physical batch size used for computation. The current defaults in common.h are n_batch = 2048 and n_ubatch = 512, so the poster’s best value of 64 sits well below the default. Even so, the Reddit finding should not be read as “64 is always best.” It is better read as a reminder that the physical compute batch can interact sharply with a specific GPU, backend, model quantization, and prompt-heavy workload.
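The logical versus physical distinction can be pictured with a small Python model. This is a conceptual sketch only, not llama.cpp's actual scheduling code: the function names are hypothetical, and it assumes that a prompt is cut into logical batches of at most n_batch tokens, that each logical batch is computed in micro-batches of at most n_ubatch tokens, and that the micro-batch size is capped at the logical batch size.

```python
def split_prompt(n_tokens, n_batch=2048, n_ubatch=512):
    """Return a list of logical batches, each a list of micro-batch sizes.

    Conceptual model only: logical batches hold at most n_batch tokens and
    are computed in physical micro-batches of at most n_ubatch tokens.
    """
    if n_tokens < 0 or n_batch <= 0 or n_ubatch <= 0:
        raise ValueError("n_tokens must be >= 0 and batch sizes must be > 0")
    ubatch = min(n_ubatch, n_batch)  # assumption: micro-batch capped at n_batch
    plan = []
    remaining = n_tokens
    while remaining > 0:
        logical = min(remaining, n_batch)
        sizes = []
        left = logical
        while left > 0:
            step = min(left, ubatch)
            sizes.append(step)
            left -= step
        plan.append(sizes)
        remaining -= logical
    return plan


def compute_passes(n_tokens, n_batch=2048, n_ubatch=512):
    """Total number of micro-batch compute passes for one prompt."""
    return sum(len(sizes) for sizes in split_prompt(n_tokens, n_batch, n_ubatch))


if __name__ == "__main__":
    # The thread's benchmark shape: a 512-token prompt with -b 8192.
    for ub in (4, 8, 64, 128):
        print(f"n_ubatch={ub:>3}: {compute_passes(512, 8192, ub)} passes")
```

In this model the 512-token benchmark prompt fits in one logical batch, so n_ubatch alone sets the number of passes. The setting of 128 needs fewer passes than 64 yet was far slower on the poster's machine, which suggests that pass count by itself does not explain the result.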

That caution matters. The original poster explicitly says they are not sure whether the sweet spot is tied to the RX 9070 XT’s cache behavior or to some other hardware-specific condition. Still, the result is useful because it separates two things that users often blur together: faster prompt processing and faster token generation are not the same optimization problem. If your local workflow spends more time ingesting long contexts than sampling output, n_ubatch deserves direct profiling.
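One way to do that profiling is to let llama-bench sweep several n_ubatch values in a single run. The Python sketch below only builds the command line and picks the best prompt-processing result from JSON output; it does not run anything. The model path is a placeholder, and the JSON field names (n_ubatch, n_prompt, n_gen, avg_ts) are an assumption about llama-bench's output that should be checked against the local build before relying on it.

```python
import json
import shlex


def build_sweep_command(model_path, ubatch_values, n_batch=8192,
                        n_prompt=512, n_gen=128, binary="llama-bench"):
    """Build a llama-bench argv list that sweeps n_ubatch in one invocation."""
    return [
        binary,
        "-m", model_path,
        "-p", str(n_prompt),   # prompt-processing test size (pp512 by default here)
        "-n", str(n_gen),      # token-generation test size (tg128 by default here)
        "-b", str(n_batch),
        "-ub", ",".join(str(v) for v in ubatch_values),
        "-o", "json",
    ]


def best_ubatch(results_json, n_prompt=512):
    """Return (n_ubatch, tokens_per_second) for the fastest prompt-processing row.

    Assumes each JSON record carries n_ubatch, n_prompt, n_gen, and avg_ts,
    and that prompt-processing rows have n_gen == 0.
    """
    rows = json.loads(results_json)
    pp_rows = [r for r in rows
               if r.get("n_prompt") == n_prompt and r.get("n_gen") == 0]
    if not pp_rows:
        raise ValueError("no prompt-processing rows found in results")
    best = max(pp_rows, key=lambda r: r["avg_ts"])
    return best["n_ubatch"], best["avg_ts"]


if __name__ == "__main__":
    # "model.gguf" is a placeholder path, not a file named in the thread.
    cmd = build_sweep_command("model.gguf", [4, 8, 64, 128])
    print(shlex.join(cmd))
```

Because the token-generation rows are filtered out, the helper ranks settings on prompt ingestion alone, which matches the distinction drawn above. A workflow dominated by output sampling would want to compare the generation rows instead.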

That is probably why the thread gained traction. It is not a grand theory of inference. It is a concrete, reproducible tuning anecdote with numbers, a visible failure mode, and a setting that many local-model users may never have profiled carefully. For builders running larger Qwen checkpoints in llama.cpp, that is enough to make it high-signal.

Community source: r/LocalLLaMA thread
Referenced docs: llama.cpp discussion on batch vs ubatch, current defaults in common.h
