r/LocalLLaMA tracks TurboQuant on MLX as KV cache compression nears FP16 speed
Original: TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed)
Why r/LocalLLaMA cared
The r/LocalLLaMA post that gained traction on March 28, 2026 was not just another paper link. It was an implementation diary from a developer who pushed TurboQuant-style KV cache compression into MLX with custom Metal kernels, then published code and an upstream PR for mlx-lm. That distinction matters to this community because research claims about long-context efficiency only become useful once they survive a real local inference stack on Apple Silicon.
The Reddit post reports a strong headline result on Qwen2.5-32B running on an M4 Pro 48GB: 4.6x KV cache compression, 0.98x FP16 speed, and no observed quality drop, with 16K context memory falling from 4.2GB to 897MB. The accompanying Medium writeup says the biggest gains came from engineering rather than theory alone: fused Metal quantize/dequantize kernels, an incremental decode buffer that avoids reprocessing the full cache every step, and moving bit extraction into GPU code instead of Python. That optimization path reportedly moved the system from 0.28x FP16 speed to near parity.
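The memory figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is not from the post: it assumes the commonly published Qwen2.5-32B attention shape (64 layers, 8 KV heads, head dimension 128) and reads "16K" as 16,384 tokens. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Size of an uncompressed KV cache: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value

# Assumed model shape (not stated in the Reddit post).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 64, 8, 128
CONTEXT_LEN = 16 * 1024          # assumption: "16K" means 16,384 tokens
COMPRESSION_RATIO = 4.6          # headline figure from the post

fp16_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT_LEN)
compressed_bytes = fp16_bytes / COMPRESSION_RATIO

fp16_gib = fp16_bytes / 2**30
compressed_mib = compressed_bytes / 2**20
effective_bits = 16 / COMPRESSION_RATIO   # average bits stored per cached value

print(f"FP16 cache:       {fp16_gib:.2f} GiB")
print(f"Compressed cache: {compressed_mib:.0f} MiB")
print(f"Effective bits per value: {effective_bits:.2f}")
```

Under those assumptions the FP16 cache comes to about 4 GiB and the compressed cache to roughly 890 MiB, which is in the same range as the 4.2GB and 897MB the post reports. The implied average of about 3.5 bits per value also lines up with the bit width the TurboQuant paper targets.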
Where the caveats are
The underlying TurboQuant paper is real and technically interesting. It applies a randomized rotation to each vector before quantizing it, which reduces distortion at a given bit width, and it reports near-quality-neutral KV cache quantization at around 3.5 bits per channel. The shipping question is more complicated. The repository README shows more conservative numbers on a 7B model in layer-adaptive mode: 1.9x to 2.4x compression, with speed below FP16. That does not invalidate the Reddit result; it shows that model size, layer sensitivity, and implementation details still have a large effect on the outcome.
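To illustrate why rotating before quantizing helps, here is a minimal sketch in plain Python. It is not the paper's algorithm or the repository's code: it swaps in a randomized Hadamard transform (random sign flips followed by a Walsh-Hadamard transform) as the rotation and a simple min-max uniform quantizer, and all function names are hypothetical. The point is only that a single outlier channel stretches the quantization range, while a rotation spreads that energy across all coordinates.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def rotate(x, signs):
    """Randomized rotation: flip signs, then apply the orthonormal Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, x)])

def unrotate(y, signs):
    """Inverse rotation: the normalized Hadamard transform is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(y))]

def quantize(x, bits):
    """Min-max uniform quantizer (an assumption, chosen for simplicity)."""
    levels = 2 ** bits - 1
    lo, hi = min(x), max(x)
    scale = (hi - lo) / levels or 1.0
    codes = [round((v - lo) / scale) for v in x]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]

def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

rng = random.Random(0)
DIM, BITS = 128, 4
x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
x[5] = 40.0                                   # one outlier channel
signs = [rng.choice((-1.0, 1.0)) for _ in range(DIM)]

# Quantize directly vs. quantize in the rotated basis and rotate back.
plain = dequantize(*quantize(x, BITS))
codes, lo, scale = quantize(rotate(x, signs), BITS)
rotated = unrotate(dequantize(codes, lo, scale), signs)

err_plain = mse(x, plain)
err_rotated = mse(x, rotated)
print(f"MSE without rotation: {err_plain:.4f}")
print(f"MSE with rotation:    {err_rotated:.4f}")
```

Because the rotation is orthonormal, the error measured after rotating back equals the error in the rotated basis, so the comparison is like for like. A real KV cache implementation would also need to pack the codes into bits and fuse these steps into GPU kernels, which is where the post says most of the speed was recovered.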
That nuance is exactly why the post resonated. LocalLLaMA users do not just want a clever paper; they want a believable path to longer context on consumer hardware without turning decode speed into molasses. The next thing to watch is whether the mlx-lm PR lands cleanly and whether broader perplexity and needle-in-a-haystack tests back up the headline numbers. If they do, TurboQuant on MLX could become one of the more practical Apple Silicon upgrades for local LLM inference in 2026.