The r/LocalLLaMA thread picked up the RotorQuant report because it targets one specific bottleneck in LLM inference head-on: KV cache compression. Scrya presents RotorQuant as a redesign of Google's TurboQuant. The core idea is to replace multiplication by a d x d random orthogonal matrix with a rotor sandwich product computed using Clifford rotors in Cl(3,0).
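To make the rotor sandwich product concrete, here is a minimal NumPy sketch. It is not the RotorQuant implementation. It assumes the vector is split into independent 3-D blocks (zero-padded to a multiple of 3) and that each rotor is stored as a scalar plus three bivector coefficients, written here in the equivalent unit-quaternion form. The function names `rotor_from_axis_angle`, `rotor_sandwich`, and `rotate_blockwise` are hypothetical and chosen only for illustration.

```python
import numpy as np

def rotor_from_axis_angle(axis, angle):
    """Build a unit rotor (scalar part, 3 bivector coefficients).

    Assumption: the rotor is parameterized like a unit quaternion,
    which is the standard way to represent a Cl(3,0) rotor numerically.
    """
    axis = np.asarray(axis, dtype=np.float64)
    axis = axis / np.linalg.norm(axis)
    return np.cos(angle / 2.0), np.sin(angle / 2.0) * axis

def rotor_sandwich(scalar, bivec, v):
    """Apply the sandwich product R v R~ to one 3-D vector.

    Uses the expanded form v' = v + 2 s (b x v) + 2 b x (b x v),
    which is valid when scalar**2 + |bivec|**2 == 1.
    """
    t = np.cross(bivec, v)
    return v + 2.0 * scalar * t + 2.0 * np.cross(bivec, t)

def rotate_blockwise(x, rotors):
    """Rotate a d-dimensional vector block by block (illustrative only).

    The vector is zero-padded to a multiple of 3, so the result has the
    padded length. One rotor is needed per 3-D block.
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % 3
    blocks = np.pad(x, (0, pad)).reshape(-1, 3)
    if len(rotors) < len(blocks):
        raise ValueError("need one rotor per 3-D block")
    out = [rotor_sandwich(s, b, blk) for (s, b), blk in zip(rotors, blocks)]
    return np.stack(out).reshape(-1)
```

Each block costs a handful of cross products instead of a row of a dense matrix, which is where the savings come from. The rotation is still orthogonal, so vector norms are preserved, but it only mixes coordinates within a block rather than across all d dimensions.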

The basis for the speed claim is also fairly clear. The Scrya page states that at d=128 the dense matrix path requires 16,384 multiply-adds, whereas the rotor approach needs only about 100. The same page claims fused-kernel speedups of 10-19x on NVIDIA CUDA and 9-31x on Apple Metal, and says the parameter count drops from 16,399 to 372, which is 44x fewer parameters. On real KV cache data from Qwen2.5-3B-Instruct, it reports attention fidelity of 0.990 versus 0.991, nearly on par with TurboQuant.
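The quoted figures can be sanity-checked with a few lines of arithmetic. The sketch below only reproduces numbers stated on the Scrya page. The value of 100 multiply-adds is the page's approximate figure, not something derived here, and the variable names are illustrative.

```python
# Figures quoted from the Scrya page (d = 128).
d = 128
dense_macs = d * d        # dense d x d matrix-vector product
rotor_macs = 100          # approximate figure quoted by the page
dense_params = 16_399     # parameter count quoted for the dense path
rotor_params = 372        # parameter count quoted for the rotor path

mac_ratio = dense_macs / rotor_macs
param_ratio = dense_params / rotor_params

print(f"dense multiply-adds: {dense_macs}")
print(f"multiply-add ratio:  {mac_ratio:.1f}x")
print(f"parameter ratio:     {param_ratio:.1f}x")
```

The dense count matches 128 x 128 = 16,384, and the parameter ratio rounds to the quoted 44x. The multiply-add ratio is roughly 164x, well above the reported 10-19x and 9-31x speedups. That gap is not surprising, since wall-clock time also depends on memory traffic and kernel overhead, not only on operation count.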

Why LocalLLaMA reacted

LocalLLaMA readers are sensitive to this kind of topic because local inference bottlenecks often come from memory movement and kernel efficiency rather than from model size itself. The Reddit post emphasized that the fused kernel cuts the memory round-trips of a BLAS GEMM and does more of the work in registers. It also reported 9/9 on needle-in-a-haystack at every tested bit-width, and claimed that with QJL correction, retrieval quality on a real model is comparable to the TurboQuant baseline and can be better on some top-1 or top-5 retrieval measures.

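The thread did not accept this uncritically, though. The post itself notes that synthetic MSE on random unit vectors is higher, and commenters asked whether this is truly a theoretical drop-in replacement for TurboQuant or an engineering trade that only works well on real-world distributions. That caveat matters. RotorQuant's value is not that it eliminates the tradeoff. Its value is that it tests whether the mathematical cost of a global rotation can be swapped for a much cheaper structured operation while keeping real-model attention fidelity at a practically useful level.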

That is why the post matters. For LocalLLaMA, the important question is less about benchmark marketing and more about whether KV cache compression can be fast enough on consumer NVIDIA cards and Apple Silicon. If the reported speedups hold up outside the project page, RotorQuant suggests that kernel design and algebraic structure, not just the quantizer itself, could substantially change LLM efficiency going forward.

#rotorquant

RotorQuant catches LocalLLaMA's attention: rewriting KV cache compression with Clifford rotors