LocalLLaMA Benchmarks Gemma 4 31B at 256K Context on One RTX 5090

An r/LocalLLaMA benchmark post drew attention because it tackled a question local-model users keep running into: how far can Gemma 4 context length be pushed on a single consumer GPU if KV cache compression is aggressive enough? The author reports running gemma-4-31B-it-UD-Q4_K_XL at a full 256K context on one RTX 5090 by using TurboQuant KV cache compression in a custom llama.cpp fork.

The setup is specific and unusually transparent. The post lists an RTX 5090 with 32GB VRAM, a Ryzen 9 9950X3D, 64GB DDR5, and a Windows 11 build based on TheTom/llama-cpp-turboquant merged with recent Gemma 4 support. The KV cache uses the turbo3 mode, described as roughly 4.5x compression versus f16. The author reports that VRAM usage at 262K context (262,144 tokens, the same 256K figure counted in units of 1,024) reached 27.7GB, leaving about 4.3GB of headroom on the card.
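A rough sizing sketch shows why the KV cache becomes the limiting factor at this context length. Only the 4.5x ratio and the 262,144-token context come from the post. The layer count, KV head count, and head dimension are placeholder values chosen for illustration, not Gemma 4's actual architecture, and the sketch treats every layer as holding the full context, which overstates the cost for a model that uses sliding-window attention on some layers.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bits_per_element):
    """Dense KV cache size: one K and one V vector per layer per token."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return elements * bits_per_element / 8


F16_BITS = 16
COMPRESSION = 4.5                      # turbo3 ratio versus f16, as reported
TURBO3_BITS = F16_BITS / COMPRESSION   # effective bits per cached element

CONTEXT_TOKENS = 256 * 1024            # 262,144 tokens

# HYPOTHETICAL architecture values, for illustration only.
N_LAYERS = 60
N_KV_HEADS = 8
HEAD_DIM = 128

GIB = 2 ** 30
f16_size = kv_cache_bytes(CONTEXT_TOKENS, N_LAYERS, N_KV_HEADS, HEAD_DIM, F16_BITS)
turbo3_size = kv_cache_bytes(CONTEXT_TOKENS, N_LAYERS, N_KV_HEADS, HEAD_DIM, TURBO3_BITS)

print(f"effective bits per element: {TURBO3_BITS:.2f}")
print(f"f16 cache:    {f16_size / GIB:.1f} GiB")
print(f"turbo3 cache: {turbo3_size / GIB:.1f} GiB")
```

Whatever the real architecture numbers are, the cache scales linearly with context length, so a 4.5x reduction in bytes per element translates directly into a 4.5x smaller cache at any given context length.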

Prompt processing reportedly measured 3,362.71 tokens/s at 4K context and 899.55 tokens/s at 262K context.
Token generation was reported at 61.51 tokens/s.
The author says 256K context would be impractical on 32GB VRAM without the compressed KV cache.
The post also documents Windows/MSVC build fixes needed to get the fork building correctly for Gemma 4.
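The reported throughput figures can be turned into a rough bracket on how long a full-context prompt takes to ingest. This is a back-of-the-envelope sketch, not a measurement from the post: it is not stated whether the 262K figure is an average over the whole prompt or the speed at that depth, so the two rates are treated here only as optimistic and pessimistic bounds.

```python
CONTEXT_TOKENS = 256 * 1024   # 262,144 tokens
PP_AT_4K = 3362.71            # reported prompt-processing tokens/s at 4K context
PP_AT_262K = 899.55           # reported prompt-processing tokens/s at 262K context
TG_RATE = 61.51               # reported generation tokens/s


def seconds_for(n_tokens, tokens_per_second):
    """Time to process n_tokens at a constant rate."""
    return n_tokens / tokens_per_second


best_case = seconds_for(CONTEXT_TOKENS, PP_AT_4K)     # if the fast rate held throughout
worst_case = seconds_for(CONTEXT_TOKENS, PP_AT_262K)  # if the slow rate applied throughout
slowdown = PP_AT_4K / PP_AT_262K

print(f"full-context ingest: between {best_case:.0f}s and {worst_case:.0f}s")
print(f"prompt-processing slowdown from 4K to 262K: {slowdown:.2f}x")
print(f"1,000 generated tokens: {seconds_for(1000, TG_RATE):.1f}s")
```

Under these assumptions, filling the whole context window takes somewhere between roughly one and five minutes, before any tokens are generated.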

The post is valuable because it mixes benchmark data with engineering caveats. The author explicitly notes thermal throttling at 575W, ties the performance curve to quadratic attention costs during prompt processing, and distinguishes that from generation speed, which they describe as memory-bandwidth bound. There is also a low-level debugging note about a std::transform issue with GGUF bool arrays in Release builds that affected Gemma 4's sliding-window attention pattern.
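The quadratic prompt-processing cost can be illustrated with a toy model: if each new prompt token attends to every token before it, the per-token cost grows roughly linearly with depth, and the total grows quadratically. The sketch below fits that linear per-token cost through the two reported data points. It assumes, without confirmation from the post, that the figures are speeds at depths of 4,096 and 262,144 tokens, and it covers prompt processing only, not the bandwidth-bound generation phase.

```python
def fit_linear_cost(n1, tps1, n2, tps2):
    """Fit seconds-per-token t(n) = a + b*n through two (depth, tokens/s) points."""
    t1, t2 = 1.0 / tps1, 1.0 / tps2
    b = (t2 - t1) / (n2 - n1)   # extra seconds per token for each token of depth
    a = t1 - b * n1             # depth-independent cost per token
    return a, b


def total_prefill_seconds(n_tokens, a, b):
    """Sum of t(n) for n = 0 .. n_tokens-1; the b term is quadratic in n_tokens."""
    return a * n_tokens + b * n_tokens * (n_tokens - 1) / 2


# Depths are ASSUMED to be exact powers of two; the rates are as reported.
a, b = fit_linear_cost(4096, 3362.71, 262144, 899.55)

for n in (4096, 65536, 262144):
    print(f"{n:>7} tokens: ~{total_prefill_seconds(n, a, b):.1f}s under the toy model")
```

In this model, doubling the prompt length more than doubles the total time, which is the behavior the author attributes to attention cost. A real run will deviate from it because of batching, kernel efficiency, and the thermal throttling the post mentions.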

Commenters were interested but skeptical in the right way. The top replies asked how badly quality might degrade under heavy KV quantization and whether the model can still reliably retrieve information from a context filled to 256K tokens, which is the real test for this class of optimization. That makes the thread useful beyond bragging rights: it is an example of the local LLM community pushing from "it fits" toward "it still works," while sharing enough config and failure details for others to reproduce or challenge the result.
