QVAC TurboQuant attacks local LLMs’ KV-cache memory wall

The hard limit for local LLMs is not only whether the model weights fit. Long conversations, codebases, and documents fill the KV cache, and that runtime memory often becomes the real ceiling. QVAC SDK 0.12.0 takes direct aim at that second wall by adding TurboQuant as an opt-in feature.

TurboQuant is a KV-cache quantization algorithm from Google Research, published at ICLR 2026. QVAC says its implementation compresses 16-bit KV cache values to roughly 3 bits while preserving accuracy across long-context benchmarks including LongBench, ZeroSCROLLS, RULER, L-Eval, and NIAH. The practical detail matters: it works on standard transformer models loaded as GGUF, without retraining, calibration, or fine-tuning.

The numbers explain why local-AI developers are watching it. QVAC says Qwen3.5-4B at 262K tokens stores about 8GB of KV data at 16-bit precision. Its SDK 0.12.0 estimates show an RTX 5060 8GB moving from roughly 120K tokens of context to the full 262K with TurboQuant, and an RTX 5070 12GB moving from about 250K to 262K. Larger systems such as an RTX 5090 32GB or AMD Strix Halo 128GB already reach the full context in the example, but they still spend less memory on the cache.
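
As a rough sanity check on those figures, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not QVAC's estimator: it assumes the cache shrinks in direct proportion to bits per value, that "262K" means 262,144 tokens, and that GB means decimal gigabytes. It ignores quantization metadata, model weights, and activations, so it will not reproduce the per-GPU context estimates above. The function names are illustrative only.

```python
GB = 1e9  # assumption: decimal gigabytes; the article does not say GB vs GiB


def scaled_kv_size_gb(kv_gb_at_source: float, target_bits: float,
                      source_bits: float = 16.0) -> float:
    """Scale a KV-cache size linearly with bits per stored value.

    Ignores any per-block metadata a real quantizer may add.
    """
    return kv_gb_at_source * target_bits / source_bits


def kv_bytes_per_token(kv_gb: float, tokens: int) -> float:
    """Average KV-cache bytes consumed by one token of context."""
    return kv_gb * GB / tokens


if __name__ == "__main__":
    tokens = 262_144      # assumption: "262K" read as 2**18 tokens
    kv_16bit_gb = 8.0     # figure QVAC quotes for Qwen3.5-4B at 16-bit
    kv_3bit_gb = scaled_kv_size_gb(kv_16bit_gb, target_bits=3)

    print(f"16-bit cache: {kv_16bit_gb:.2f} GB, "
          f"{kv_bytes_per_token(kv_16bit_gb, tokens):.0f} bytes/token")
    print(f"~3-bit cache: {kv_3bit_gb:.2f} GB, "
          f"{kv_bytes_per_token(kv_3bit_gb, tokens):.0f} bytes/token")
    print(f"Compression ratio: {16 / 3:.2f}x")
```

Under these assumptions the 8GB cache drops to about 1.5GB, which shows why a full-length context that cannot sit alongside the weights on an 8GB card at 16-bit precision becomes plausible at roughly 3 bits.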

The release is not universal yet. QVAC says TurboQuant currently supports AMD and NVIDIA GPUs, with iOS, Android, and Apple Silicon support still pending. That keeps the near-term story grounded: this is less about every phone suddenly running huge assistants and more about local coding assistants, long-document analysis, and on-prem inference becoming feasible on cheaper hardware. The broader stake is clear, though. Long context has been a cloud feature because clouds had the memory. KV-cache compression starts to narrow that gap.
