#turboquant

LLM Jun 2, 2026 1 min read

QVAC TurboQuant attacks local LLMs’ KV-cache memory wall

QVAC SDK 0.12.0 adds TurboQuant as an opt-in KV-cache compression feature for local LLMs. The company says the feature can cut runtime context memory by up to 5x and put 262K-token contexts for 4B models within reach of 8 GB consumer GPUs.

#qvac #turboquant #local-ai
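
The memory claim in the summary above comes down to simple arithmetic on KV-cache size. The sketch below is a back-of-the-envelope estimate only: the layer count, KV-head count, and head dimension are hypothetical values for a generic 4B-class model, not the configuration QVAC measured, and the 3.2 bits per value is chosen simply because 16 / 3.2 reproduces the "up to 5x" figure. It ignores quantization metadata such as scales.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Estimate KV-cache size: one K and one V tensor per layer."""
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8

# Hypothetical 4B-class model dimensions (assumptions, not from the article).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
TOKENS = 262_144  # the "262K-token" context from the summary

baseline = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bits_per_value=16)
compressed = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, bits_per_value=3.2)
ratio = baseline / compressed

GIB = 2 ** 30
print(f"16-bit cache:  {baseline / GIB:.1f} GiB")
print(f"3.2-bit cache: {compressed / GIB:.1f} GiB")
print(f"ratio:         {ratio:.1f}x")
```

Under these assumed dimensions the cache alone is far larger than 8 GB at 16 bits, which is why the compression ratio, rather than the model weights, decides whether a long context fits. Model weights and activations still need their own share of memory.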

LLM Reddit Apr 3, 2026 2 min read

LocalLLaMA Treats TurboQuant-on-Mac as a Real Consumer-Hardware Signal

A LocalLLaMA post claiming that a patched llama.cpp could run Qwen 3.5-9B on a MacBook Air M4 with 16 GB of memory and a 20,000-token context had passed 1,159 upvotes and 193 comments as of an April 4, 2026 crawl, making TurboQuant a live local-inference discussion rather than just a research headline.

#turboquant #qwen #llama-cpp

LLM Reddit Apr 2, 2026 2 min read

Reddit tracks attn-rot landing in llama.cpp as a low-cost quantization upgrade

r/LocalLLaMA is highlighting the merge of llama.cpp PR #21038, which applies a simple Hadamard-based rotation to Q, K, and V in attention as a lightweight path toward TurboQuant-like gains. The appeal is that it improves low-bit cache behavior without introducing a brand-new quantization format.

#llama.cpp #turboquant #kv-cache
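
A short sketch of why a Hadamard rotation of Q, K, and V can be applied without changing attention results, so that any benefit shows up only once the cache is quantized. This is an illustration in NumPy, not the code from the llama.cpp PR; the helper names and tensor sizes are assumptions made for the example.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix, scaled to be orthonormal."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
head_dim, n_tokens = 64, 16
q = rng.normal(size=(1, head_dim))
k = rng.normal(size=(n_tokens, head_dim))
v = rng.normal(size=(n_tokens, head_dim))
H = hadamard(head_dim)

# Rotating both Q and K leaves the scores unchanged: (qH)(kH)^T = q k^T.
scores_plain = q @ k.T
scores_rot = (q @ H) @ (k @ H).T

# A rotated V is undone by multiplying the attention output by H^T.
out_plain = softmax(scores_plain / np.sqrt(head_dim)) @ v
out_rot = (softmax(scores_rot / np.sqrt(head_dim)) @ (v @ H)) @ H.T
```

Because the rotation is exact in full precision, the cached K and V can be stored in their rotated form, where values tend to be more evenly spread and existing low-bit formats lose less information.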

LLM Reddit Mar 29, 2026 2 min read

r/LocalLLaMA compresses TurboQuant into one idea: rotate first, quantize second

A high-scoring r/LocalLLaMA post explains TurboQuant not as a polar-coordinates trick but as random rotation before quantization. The linked arXiv paper claims near-optimal distortion rates, a residual QJL stage for inner products, and quality-neutral KV cache quantization at 3.5 bits per channel.

#turboquant #quantization #kv-cache
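
The "rotate first, quantize second" idea can be shown with a toy example. The sketch below is not the paper's algorithm: it swaps in a plain uniform absmax quantizer and omits the residual QJL stage, and the vector size, bit width, and outlier value are arbitrary choices for illustration. It only shows that a random orthogonal rotation preserves inner products and spreads a single outlier across all coordinates before quantization.

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix; columns stay orthonormal

def quantize_dequantize(x, bits=4):
    """Uniform symmetric quantizer with a single absmax scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
dim = 128
x = rng.normal(size=dim)
x[0] = 20.0  # one outlier channel inflates the quantization step
y = rng.normal(size=dim)
R = random_rotation(dim, rng)

# Quantize directly, with the outlier setting the scale for every coordinate.
err_plain = np.linalg.norm(x - quantize_dequantize(x))

# Rotate, quantize, then rotate back to compare in the original space.
x_back = R.T @ quantize_dequantize(R @ x)
err_rot = np.linalg.norm(x - x_back)

print(f"error without rotation: {err_plain:.2f}")
print(f"error with rotation:    {err_rot:.2f}")
```

In this setup the outlier forces a coarse step that rounds most unrotated coordinates to zero, while the rotated vector has no dominant coordinate and quantizes more evenly.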

LLM Reddit Mar 27, 2026 2 min read

LocalLLaMA Highlights a Sparse V Dequant Trick for TurboQuant in llama.cpp

A LocalLLaMA self-post shared an open-source TurboQuant implementation for llama.cpp that skips value dequantization when attention weights are negligible. The author reports a 22.8% decode gain at 32K context with Qwen3.5-35B-A3B on an Apple M5 Max, with unchanged perplexity and better needle-in-a-haystack retrieval.

#llm-inference #kv-cache #llama-cpp
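
The sparse V dequantization idea in the summary above can be sketched as follows. This is a NumPy illustration of the general technique, not the author's llama.cpp implementation: the 8-bit per-row quantizer, the `1e-6` threshold, and all function names are assumptions made for the example.

```python
import numpy as np

def quantize_rows(v, bits=8):
    """Per-row symmetric quantization: int8 codes plus one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(v).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0
    return np.round(v / scale).astype(np.int8), scale

def attend_dense(weights, codes, scale):
    """Dequantize every V row, then take the weighted sum."""
    return weights @ (codes.astype(np.float64) * scale)

def attend_sparse(weights, codes, scale, threshold=1e-6):
    """Dequantize only the V rows whose attention weight exceeds the threshold."""
    keep = weights > threshold
    v_kept = codes[keep].astype(np.float64) * scale[keep]
    return weights[keep] @ v_kept, keep

rng = np.random.default_rng(0)
n_tokens, head_dim = 4096, 64
scores = rng.normal(scale=4.0, size=n_tokens)  # peaky scores, as in long contexts
weights = np.exp(scores - scores.max())
weights /= weights.sum()

codes, scale = quantize_rows(rng.normal(size=(n_tokens, head_dim)))
dense = attend_dense(weights, codes, scale)
sparse, keep = attend_sparse(weights, codes, scale)

# Worst-case error: skipped weight mass times the largest dequantized value.
skipped_mass = weights[~keep].sum()
bound = skipped_mass * np.abs(codes.astype(np.float64) * scale).max()
print(f"rows dequantized: {keep.sum()} of {n_tokens}")
```

The design choice is that the error from skipped rows is bounded by their total attention weight, so a small threshold trades a provably tiny output change for less dequantization work per decoded token.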