r/LocalLLaMA tracks TurboQuant on MLX as KV cache compression nears FP16 speed

Original: TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed)

LLM · Mar 28, 2026 · By Insights AI (Reddit) · 2 min read

Why r/LocalLLaMA cared

The r/LocalLLaMA post that gained traction on March 28, 2026 was not just another paper link. It was an implementation diary from a developer who pushed TurboQuant-style KV cache compression into MLX with custom Metal kernels, then published code and an upstream PR for mlx-lm. That distinction matters to this community because research claims about long-context efficiency only become useful once they survive a real local inference stack on Apple Silicon.

The Reddit post reports a strong headline result on Qwen2.5-32B running on an M4 Pro 48GB: 4.6x KV cache compression, 0.98x FP16 speed, and no observed quality drop, with 16K context memory falling from 4.2GB to 897MB. The accompanying Medium writeup says the biggest gains came from engineering rather than theory alone: fused Metal quantize/dequantize kernels, an incremental decode buffer that avoids reprocessing the full cache every step, and moving bit extraction into GPU code instead of Python. That optimization path reportedly moved the system from 0.28x FP16 speed to near parity.
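The memory claim is easy to sanity-check, and the compression idea can be sketched in miniature. Below is a minimal numpy sketch, not the TurboQuant algorithm or the author's fused Metal kernels: the model shape constants (64 layers, 8 KV heads via GQA, head_dim 128 for a Qwen2.5-32B-style config) and the group size and bit-width are assumptions for illustration only.

```python
import numpy as np

# Rough fp16 KV-cache arithmetic for an assumed Qwen2.5-32B-style config:
# 64 layers, 8 KV heads (GQA), head_dim 128, 2 bytes per element, K and V.
bytes_per_token = 2 * 64 * 8 * 128 * 2
fp16_gib = bytes_per_token * 16384 / 2**30  # ~4 GiB at 16K context,
                                            # in the ballpark of the post's 4.2GB

def quantize_groups(x, bits=4, group=64):
    """Group-wise affine quantization of a flat cache slice (illustrative)."""
    x = x.astype(np.float32).reshape(-1, group)
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale[scale == 0] = 1.0  # guard all-constant groups
    codes = np.round((x - lo) / scale).astype(np.uint8)  # 4-bit codes
    return codes, scale, lo

def dequantize_groups(codes, scale, lo):
    return (codes.astype(np.float32) * scale + lo).ravel()

rng = np.random.default_rng(0)
kv_slice = rng.standard_normal(4096).astype(np.float16)
codes, scale, lo = quantize_groups(kv_slice)
max_err = np.abs(dequantize_groups(codes, scale, lo)
                 - kv_slice.astype(np.float32)).max()
```

In a real kernel the quantize/dequantize round-trip above is exactly what gets fused into the attention computation on the GPU; doing the bit packing and unpacking in Python per decode step is what the post says held the early version at 0.28x FP16 speed.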

Where the caveats are

The underlying TurboQuant paper is real and technically interesting. It uses randomized rotation plus quantization to reduce distortion while compressing vectors, and the paper reports near quality-neutral KV cache quantization around 3.5 bits per channel. But the shipping question is more complicated. The repository README shows more conservative numbers on a 7B model in layer-adaptive mode, with 1.9x to 2.4x compression and speed below FP16. That does not invalidate the Reddit result; it shows that model size, layer sensitivity, and implementation details still matter a lot.
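The rotation trick behind the paper can also be illustrated in a few lines. This is a toy sketch, not TurboQuant itself: the paper uses fast structured randomized rotations, while a dense orthogonal matrix from a QR decomposition is just the simplest stand-in, and the outlier setup is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 128

# Random orthogonal rotation via QR of a Gaussian matrix (stand-in for the
# paper's structured rotations, which are much cheaper to apply).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def quantize(x, bits=4):
    """Symmetric per-tensor quantization (illustrative)."""
    scale = np.abs(x).max() / (2**(bits - 1) - 1)
    return np.round(x / scale), scale

# One outlier channel forces a coarse quantization scale, crushing
# resolution for every other channel.
v = rng.standard_normal(d)
v[0] = 20.0

codes, s = quantize(v)
mse_plain = np.mean((codes * s - v) ** 2)

codes_r, s_r = quantize(Q @ v)    # rotate, then quantize
v_hat = Q.T @ (codes_r * s_r)     # dequantize, then rotate back
mse_rotated = np.mean((v_hat - v) ** 2)
```

Because the rotation smears the outlier across all channels, the quantizer's scale shrinks and reconstruction error drops substantially at the same bit-width, which is the intuition behind the paper's near quality-neutral result at roughly 3.5 bits per channel.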

That nuance is exactly why the post resonated. LocalLLaMA users do not just want a clever paper, they want a believable path to longer context on consumer hardware without turning decode speed into molasses. The next thing to watch is whether the mlx-lm PR lands cleanly and whether broader perplexity and needle-in-a-haystack tests back up the headline numbers. If they do, TurboQuant on MLX could become one of the more practical Apple Silicon upgrades for local LLM inference in 2026.


Related Articles

LLM · Reddit · 5d ago · 2 min read

A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains on prompt processing rather than raw generation alone. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.

LLM · Reddit · 1d ago · 2 min read

A LocalLLaMA self-post shared an open-source TurboQuant implementation for llama.cpp that skips value dequantization when attention weights are negligible. The author reports a 22.8% decode gain at 32K context on Qwen3.5-35B-A3B over Apple M5 Max, with unchanged perplexity and better needle-in-a-haystack retrieval.


© 2026 Insights. All rights reserved.