Mamba-3 draws LocalLLaMA's attention: a state space model designed around inference efficiency

A LocalLLaMA discussion on March 18, 2026 brought Together AI's Mamba-3 announcement back into view. At the time of collection, the Reddit post had 159 upvotes and 21 comments. The original is a research post published on March 17, 2026, co-written by researchers from Carnegie Mellon University, Princeton, Cartesia AI, and Together AI. The core message is clear: this generation of state space model was redesigned with inference efficiency, rather than training speed, as the first priority.

According to the blog, Mamba-3 changes the core recurrence in three ways. First, it uses a more expressive recurrence derived from exponential-trapezoidal discretization. Second, it introduces complex-valued state tracking to broaden expressivity. Third, it adds a MIMO variant that handles multiple SSMs in parallel, raising accuracy without greatly increasing decode latency. On top of that, it adds BCNorm or QKNorm-style stabilization and removes the short causal convolution used in earlier generations, bringing the overall architecture closer to a modern language model stack.
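To give a feel for why complex-valued state can add expressivity, here is a toy sketch. It is not the Mamba-3 recurrence and does not come from the blog; the function name `parity_via_rotation` and the single-scalar state are assumptions made purely for illustration. A complex state can rotate by an input-dependent angle, so one scalar is enough to track the parity of a bit stream, which a state that only decays by a positive real factor cannot do because it never changes sign.

```python
import cmath


def parity_via_rotation(bits):
    """Toy example (hypothetical, not Mamba-3): track parity with one complex state.

    The state starts at 1 and is multiplied by exp(i * pi * x) for each input
    bit x. A 0 bit leaves the state unchanged; a 1 bit rotates it by pi,
    which flips its sign. The sign of the real part encodes the parity.
    """
    h = 1 + 0j  # fixed-size state: a single complex number
    for x in bits:
        h *= cmath.exp(1j * cmath.pi * x)  # input-dependent rotation
    return 0 if h.real > 0 else 1


if __name__ == "__main__":
    print(parity_via_rotation([1, 0, 1, 1]))
```

The state stays one complex number no matter how long the input is, which is the property that keeps decode cost per token constant in recurrent models of this kind.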

Why the community took notice

The authors say that at the 1.5B scale, Mamba-3 SISO has lower prefill+decode latency than Mamba-2, Gated DeltaNet, and Llama-3.2-1B at every tested sequence length.
The MIMO variant is presented as a way to raise accuracy without increasing decode latency.
The team open-sourced kernels built with Triton, TileLang, and CuTe DSL.
The paper's framing itself targets inference-heavy workloads such as RLVR rollouts, coding, math, and agent workflows.

This framing explains why LocalLLaMA users were interested. Over the past year, the open model community has focused not only on pretraining throughput but also on serving cost, token latency, and local deployment trade-offs. The Mamba-3 authors likewise see post-training, coding, math rollouts, and agent systems all rapidly increasing inference demand. In that environment, a linear architecture that can push the quality-efficiency frontier forward matters in its own right, even if it does not fully replace the Transformer.

Trade-offs remain, of course. The blog acknowledges that pure linear models are still weaker than Transformers on retrieval-heavy tasks, because they must compress history into a fixed-size state instead of storing it in a growing KV cache. The authors therefore predict that hybrid models mixing linear layers with self-attention are the more promising long-term direction. This is what makes the Reddit post more than a simple "new model release": it shows a concrete architectural bet on where open LLM inference goes next.
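The memory side of this trade-off can be sketched with simple arithmetic. The two helper functions and every configuration value below are hypothetical and are not taken from the Mamba-3 paper or any real model; they only show that a KV cache grows linearly with sequence length while a recurrent state does not depend on it.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size: one key and one value vector per token, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def ssm_state_bytes(n_layers, n_heads, head_dim, state_dim, bytes_per_elem=2):
    """Approximate recurrent state size: independent of sequence length."""
    return n_layers * n_heads * head_dim * state_dim * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    state = ssm_state_bytes(n_layers=16, n_heads=32, head_dim=64, state_dim=128)
    for seq_len in (1_024, 16_384, 131_072):
        kv = kv_cache_bytes(seq_len, n_layers=16, n_kv_heads=8, head_dim=64)
        print(f"seq_len={seq_len}: kv_cache={kv / 2**20:.1f} MiB, "
              f"ssm_state={state / 2**20:.1f} MiB")
```

The fixed-size state is what keeps decode memory flat, and it is also why exact retrieval of arbitrary earlier tokens is harder: everything seen so far has to fit in that one buffer.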

Sources: Together AI Mamba-3 blog, r/LocalLLaMA discussion, Mamba-3 paper
