r/LocalLLaMA tracks TurboQuant on MLX as KV cache compression nears FP16 speed

Original: TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed)

Mar 28, 2026, by Insights AI (Reddit)

Why r/LocalLLaMA cared

The r/LocalLLaMA post that gained traction on March 28, 2026 was not just another paper link. It was an implementation diary from a developer who pushed TurboQuant-style KV cache compression into MLX with custom Metal kernels, then published code and an upstream PR for mlx-lm. That distinction matters to this community because research claims about long-context efficiency only become useful once they survive a real local inference stack on Apple Silicon.

The Reddit post reports a strong headline result for Qwen2.5-32B running on an M4 Pro with 48 GB of memory: 4.6x KV cache compression, 0.98x FP16 speed, and no observed quality drop, with KV cache memory at 16K context falling from 4.2 GB to 897 MB. The accompanying Medium writeup says the biggest gains came from engineering rather than theory alone: fused Metal quantize/dequantize kernels, an incremental decode buffer that avoids reprocessing the full cache every step, and moving bit extraction into GPU code instead of Python. That optimization path reportedly moved the system from 0.28x FP16 speed to near parity.
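As a rough sanity check on those memory figures, the FP16 cache size can be estimated from the model's attention shape. The sketch below assumes the shape from the public Qwen2.5-32B config (64 layers, 8 KV heads, head dimension 128); those values and the helper names are assumptions for illustration and do not come from the post.

```python
# Assumed Qwen2.5-32B attention shape (from the public model config, not the post).
LAYERS = 64
KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache keys plus values for `tokens` positions."""
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def bits_per_value(compression_ratio, baseline_bits=16):
    """Average bits per cached value implied by a ratio against FP16."""
    return baseline_bits / compression_ratio


fp16_bytes = kv_cache_bytes(16 * 1024, LAYERS, KV_HEADS, HEAD_DIM, 2)
print(f"FP16 KV cache at 16K tokens: {fp16_bytes / 2**30:.2f} GiB")
print(f"Average bits per value at 4.6x: {bits_per_value(4.6):.2f}")
```

Under these assumptions the FP16 cache comes to about 4 GiB, which is in the same range as the reported 4.2 GB, and a 4.6x ratio works out to roughly 3.5 bits per stored value on average.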

Where the caveats are

The underlying TurboQuant paper is real and technically interesting. It applies a random rotation to each vector before quantizing it, which spreads the vector's energy more evenly across coordinates so that low-bit quantization introduces less distortion, and it reports near quality-neutral KV cache quantization at around 3.5 bits per channel. But the shipping question is more complicated. The repository README shows more conservative numbers on a 7B model in layer-adaptive mode, with 1.9x to 2.4x compression and speed below FP16. That does not invalidate the Reddit result; it shows that model size, layer sensitivity, and implementation details still matter a lot.
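To illustrate why rotating before quantizing helps, here is a minimal pure-Python sketch. It is not the paper's algorithm or the repository's code: a randomized Hadamard transform stands in for the random rotation, a plain absmax uniform quantizer stands in for the paper's quantizer, and the test vector with a single outlier channel is made up for the demonstration.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def rotate(vec, signs):
    """Random sign flips followed by a Hadamard transform (an orthogonal map)."""
    return fwht([s * x for s, x in zip(signs, vec)])


def unrotate(vec, signs):
    """Inverse of rotate: the orthonormal Hadamard transform is its own inverse."""
    return [s * x for s, x in zip(signs, fwht(vec))]


def quantize_roundtrip(vec, bits):
    """Uniform absmax quantization to signed `bits`-bit levels, then dequantization."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in vec)
    if peak == 0:
        return list(vec)
    step = peak / levels
    return [round(x / step) * step for x in vec]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
dim = 128
vec = [rng.gauss(0, 1) for _ in range(dim)]
vec[7] = 40.0  # hypothetical outlier channel that dominates the absmax scale

signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]

plain_err = mse(vec, quantize_roundtrip(vec, 4))
rotated_err = mse(vec, unrotate(quantize_roundtrip(rotate(vec, signs), 4), signs))

print(f"4-bit error without rotation: {plain_err:.4f}")
print(f"4-bit error with rotation:    {rotated_err:.4f}")
```

Without rotation, the outlier sets the quantization step and most ordinary channels collapse toward zero. After rotation, the outlier's energy is spread over all 128 coordinates, the step shrinks, and the reconstruction error drops. Real KV vectors and the paper's quantizer will behave differently in detail.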

That nuance is exactly why the post resonated. LocalLLaMA users do not just want a clever paper; they want a believable path to longer context on consumer hardware without turning decode speed into molasses. The next thing to watch is whether the mlx-lm PR lands cleanly and whether broader perplexity and needle-in-a-haystack tests back up the headline numbers. If they do, TurboQuant on MLX could become one of the more practical Apple Silicon upgrades for local LLM inference in 2026.
