LocalLLaMA Benchmarks Gemma 4 31B at 256K Context on One RTX 5090
Original: Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark
A r/LocalLLaMA benchmark post drew attention because it tackled a question local-model users keep running into: how far Gemma 4's context length can be pushed on a single consumer GPU if KV cache compression is aggressive enough. The author reports running gemma-4-31B-it-UD-Q4_K_XL at a full 256K context on one RTX 5090 by using TurboQuant KV cache compression in a custom llama.cpp fork.
The setup is specific and unusually transparent. The post lists an RTX 5090 with 32GB VRAM, a Ryzen 9 9950X3D, 64GB DDR5, and a Windows 11 build based on TheTom/llama-cpp-turboquant merged with recent Gemma 4 support. The KV cache uses the turbo3 mode, described as roughly 4.5x compression versus f16. The author reports that VRAM usage at 262K context reached 27.7GB, leaving about 4.3GB of headroom on the card.
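The reported 27.7GB at 262K context is plausible on back-of-envelope grounds. As a sketch, KV cache size scales as layers × KV heads × head dimension × context × bytes per element; the config numbers below are illustrative placeholders, not the real Gemma 4 31B architecture, and the 4.5x factor is simply the post's claimed compression ratio:

```python
# Back-of-envelope KV cache sizing. The layer/head/dim values are
# placeholders for illustration, NOT the actual Gemma 4 31B config.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per layer per token.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem

ctx = 262_144
f16 = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                     ctx_tokens=ctx, bytes_per_elem=2)  # f16 = 2 bytes/elem
turbo = f16 / 4.5  # the post's claimed ~4.5x compression vs f16

print(f"f16:   {f16 / 2**30:.1f} GiB")    # f16:   48.0 GiB
print(f"turbo: {turbo / 2**30:.1f} GiB")  # turbo: 10.7 GiB
```

Under these assumed dimensions, an f16 cache alone would overflow a 32GB card at 262K tokens, while a ~4.5x-compressed cache leaves room for Q4 weights plus activations, which is consistent with the reported total.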
- Prompt processing reportedly measured 3,362.71 tokens/s at 4K context and 899.55 tokens/s at 262K context.
- Token generation was reported at 61.51 tokens/s.
- The author says 256K context would be impractical on 32GB VRAM without the compressed KV cache.
- The post also documents Windows/MSVC build fixes needed to get the fork building correctly for Gemma 4.
The post is valuable because it mixes benchmark data with engineering caveats. The author explicitly notes thermal throttling at 575W, ties the performance curve to quadratic attention costs during prompt processing, and distinguishes that from generation speed, which they describe as memory-bandwidth bound. There is also a low-level debugging note about a std::transform issue with GGUF bool arrays in Release builds that affected Gemma 4's sliding-window attention pattern.
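The bandwidth-bound framing can be sanity-checked with a rough ceiling estimate: each generated token must stream the active weights (and, at long context, the whole KV cache) through VRAM. The figures below are assumptions for illustration, not numbers from the post:

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound workload.
# All three figures are assumptions, not measurements from the post.
bandwidth_gb_s = 1792  # RTX 5090 spec-sheet memory bandwidth (assumed)
weights_gb = 17.5      # rough size of a 31B model at ~4.5 bits/weight
kv_gb = 10.7           # assumed compressed KV cache read per token at 262K

ceiling_tok_s = bandwidth_gb_s / (weights_gb + kv_gb)
print(f"upper bound: {ceiling_tok_s:.0f} tok/s")
```

Under these assumptions the ceiling lands in the low 60s of tokens per second, the same ballpark as the reported 61.51 tok/s, which is what one would expect if generation is indeed limited by memory traffic rather than compute.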
Commenters were interested but skeptical in the right way. The top replies asked how badly quality might degrade under heavy KV quantization and whether the model can still reliably retrieve long-context information after 256K tokens, which is the real test for this class of optimization. That makes the thread useful beyond bragging rights: it is an example of the local LLM community pushing from "it fits" toward "it still works," while sharing enough config and failure details for others to reproduce or challenge the result.
Related Articles
A LocalLLaMA thread spotlighted ggerganov's attn-rot work for llama.cpp, a simple rotation-based approach to improve KV cache quantization without introducing new formats. The appeal is that quality appears to improve sharply at low precision while throughput stays in roughly the same band.
r/LocalLLaMA is highlighting the merge of llama.cpp PR #21038, which applies a simple Hadamard-based rotation to Q, K, and V in attention as a lightweight path toward TurboQuant-like gains. The appeal is that it improves low-bit cache behavior without introducing a brand-new quantization format.
A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.