A LocalLLaMA thread reported a large prompt-processing speedup on Qwen3.5-27B by lowering llama.cpp `--ubatch-size` to 64 on an RX 9070 XT. The interesting part is not a universal magic number, but the reminder that prompt ingestion and token generation can respond very differently to `n_ubatch` tuning.
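For anyone wanting to check this on their own hardware, `llama-bench` in llama.cpp can sweep several micro-batch sizes in one run and reports prompt processing (pp) and token generation (tg) separately. A minimal sketch; the model path, prompt length, and size list below are placeholders, not the thread's exact setup:

```shell
# Sweep micro-batch sizes; one pp and one tg row is reported per -ub value.
# model.gguf is a placeholder path.
llama-bench -m model.gguf -ub 64,128,256,512 -p 2048 -n 128
```

Comparing the pp rows across `-ub` values shows whether prompt ingestion on a given GPU actually benefits from a smaller micro-batch, independently of any change in generation speed.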
#performance
A community developer reported over 100 t/s per-request decode speed and 585 t/s aggregate throughput across 8 simultaneous requests running Qwen3.5 27B on a dual RTX 3090 setup with NVLink, using vLLM with tensor parallelism and MTP (multi-token prediction) optimization.
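A minimal sketch of such a launch, assuming vLLM's OpenAI-compatible `vllm serve` entry point; the model identifier is a placeholder, and the MTP/speculative-decoding flags are omitted here because they vary across vLLM versions:

```shell
# Hypothetical two-GPU launch; Qwen/Qwen3.5-27B is a placeholder model id.
# --tensor-parallel-size 2 splits each layer's weights across both 3090s,
# with the resulting all-reduce traffic carried over NVLink.
vllm serve Qwen/Qwen3.5-27B \
  --tensor-parallel-size 2 \
  --max-num-seqs 8   # cap concurrency at the tested batch of 8 requests
```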
Ollama 0.17, released February 22, introduces a new native inference engine replacing llama.cpp server mode, delivering up to 40% faster prompt processing and 18% faster token generation on NVIDIA GPUs, plus improved multi-GPU tensor parallelism and AMD RDNA 4 support.
A researcher dramatically improved the coding performance of 15 LLMs with a single change: redesigning the edit tool rather than the model. Grok Code Fast's success rate jumped roughly 10x, from 6.7% to 68.3%.
Digital Foundry testing confirms that new DRM added to Resident Evil 4 Remake's PC version negatively affects CPU performance.