QVAC SDK 0.12.0 adds TurboQuant as an opt-in KV-cache compression feature for local LLMs. The company says it can cut runtime context memory by up to 5x and put 262K-token 4B-model contexts within reach of 8GB consumer GPUs.
#turboquant
RSS FeedA LocalLLaMA post claiming a patched llama.cpp could run Qwen 3.5-9B on a MacBook Air M4 with 16 GB memory and a 20,000-token context passed 1,159 upvotes and 193 comments in this April 4, 2026 crawl, making TurboQuant a live local-inference discussion rather than just a research headline.
r/LocalLLaMA is highlighting the merge of llama.cpp PR #21038, which applies a simple Hadamard-based rotation to Q, K, and V in attention as a lightweight path toward TurboQuant-like gains. The appeal is that it improves low-bit cache behavior without introducing a brand-new quantization format.
A high-scoring r/LocalLLaMA post explains TurboQuant not as a polar-coordinates trick but as random rotation before quantization. The linked arXiv paper claims near-optimal distortion rates, a residual QJL stage for inner products, and quality-neutral KV cache quantization at 3.5 bits per channel.
A LocalLLaMA self-post shared an open-source TurboQuant implementation for llama.cpp that skips value dequantization when attention weights are negligible. The author reports a 22.8% decode gain at 32K context on Qwen3.5-35B-A3B over Apple M5 Max, with unchanged perplexity and better needle-in-a-haystack retrieval.
Hacker News picked up Google Research's TurboQuant because it promises 3-bit KV-cache compression without fine-tuning while targeting both vector search and long-context inference.