Mamba-3 draws attention on LocalLLaMA: a state space model designed around inference efficiency

A LocalLLaMA discussion pushed interest in Mamba-3 higher on March 18, 2026. At the time of this crawl, the Reddit post had collected 159 upvotes and 21 comments. It is based on a research post published on March 17, 2026, co-authored by researchers from Carnegie Mellon University, Princeton, Cartesia AI, and Together AI. The goal is clear: rebuild the state space model around inference efficiency rather than training speed.

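According to the blog, Mamba-3 overhauls the core recurrence in three ways. The first is a more expressive recurrence based on exponential-trapezoidal discretization. The second is complex-valued state tracking, which broadens what the state can represent. The third is a MIMO variant that handles multiple SSMs in parallel while adding almost no decode latency. In addition, it drops the short causal convolution used in earlier Mamba generations and adds BCNorm- or QKNorm-style stabilization, bringing the overall architecture closer to a modern language model stack.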
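
To give a feel for why the choice of discretization matters, here is a minimal textbook sketch, not the Mamba-3 recurrence itself. It assumes a scalar linear system h'(t) = a*h(t) + b*x(t), keeps the exact decay factor exp(a*dt), and compares two ways of approximating the input integral over one step: using only the right endpoint, or averaging both endpoints (the trapezoidal rule). The function name `simulate` and every parameter value are hypothetical choices made for illustration.

```python
import math

def simulate(a, b, dt, xs, rule):
    """Discretize h'(t) = a*h(t) + b*x(t) with exact decay exp(a*dt).

    rule="endpoint":  input integral approximated by its right endpoint.
    rule="trapezoid": input integral approximated by averaging both endpoints.
    Returns the state after each step, starting from h(0) = 0.
    """
    decay = math.exp(a * dt)
    h = 0.0
    states = []
    for k in range(1, len(xs)):
        if rule == "endpoint":
            drive = dt * b * xs[k]
        elif rule == "trapezoid":
            # The left endpoint is decayed across the step before averaging.
            drive = 0.5 * dt * (decay * b * xs[k - 1] + b * xs[k])
        else:
            raise ValueError(f"unknown rule: {rule}")
        h = decay * h + drive
        states.append(h)
    return states

# Hypothetical toy setup: constant input, stable decay, fairly coarse step.
a, b, dt, steps = -1.0, 1.0, 0.5, 8
xs = [1.0] * (steps + 1)

# Closed-form solution for constant x = 1 and h(0) = 0 (valid for a < 0).
exact = [(b / -a) * (1.0 - math.exp(a * k * dt)) for k in range(1, steps + 1)]

def max_error(states):
    return max(abs(s - e) for s, e in zip(states, exact))

err_endpoint = max_error(simulate(a, b, dt, xs, "endpoint"))
err_trapezoid = max_error(simulate(a, b, dt, xs, "trapezoid"))
print(f"endpoint rule max error:  {err_endpoint:.4f}")
print(f"trapezoid rule max error: {err_trapezoid:.4f}")
```

In this toy setting the trapezoidal update tracks the exact solution more closely at the same step size. The cost is that each step now depends on the previous input as well as the current one. How Mamba-3 parameterizes its own update is described in the paper and is not reproduced here.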

Why the community took notice

The authors state that at the 1.5B scale, Mamba-3 SISO beats Mamba-2, Gated DeltaNet, and Llama-3.2-1B on prefill+decode latency at every tested sequence length.
The MIMO variant is presented as a way to raise accuracy without increasing decode latency.
The team open-sourced kernels built with Triton, TileLang, and CuTe DSL.
The problem setting itself suits inference-heavy workloads such as RLVR rollouts and agent workflows.

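This framing goes a long way toward explaining why LocalLLaMA users responded. Over the past year, the open model community has focused not only on pretraining throughput but also on the trade-offs among serving cost, token latency, and local deployment. The Mamba-3 authors also state explicitly that post-training, coding, math rollouts, and agent systems are rapidly driving up inference demand. In that environment, a linear architecture that pushes the quality-efficiency frontier forward is meaningful even if it does not fully replace the Transformer.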

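Trade-offs remain, of course. The blog acknowledges that pure linear models still trail Transformers on retrieval-heavy tasks, because they compress history into a fixed-size state rather than a growing KV cache. For that reason, the authors predict that hybrid models mixing linear layers with self-attention are the strong long-term candidates. That implication is what gives this Reddit post more value than a simple "new model release": it lays out a concrete architectural bet on where open LLM inference is heading next.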
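
The memory side of that trade-off can be sketched with back-of-the-envelope arithmetic. The snippet below counts stored elements for a growing KV cache and for a fixed-size recurrent state. All layer counts and dimensions are hypothetical round numbers, not the configuration of Mamba-3, Llama-3.2-1B, or any other real model, and the two helper functions are illustrative names only.

```python
def kv_cache_elements(seq_len, n_layers, n_kv_heads, head_dim):
    # Attention keeps one key and one value vector per token, per KV head,
    # per layer, so the cache grows linearly with sequence length.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len

def fixed_state_elements(n_layers, n_heads, head_dim, state_dim):
    # A linear recurrent layer keeps one (head_dim x state_dim) state per head,
    # no matter how many tokens have been processed.
    return n_layers * n_heads * head_dim * state_dim

# Hypothetical dimensions chosen only to make the scaling visible.
N_LAYERS, N_KV_HEADS, N_HEADS, HEAD_DIM, STATE_DIM = 16, 8, 32, 64, 128

state = fixed_state_elements(N_LAYERS, N_HEADS, HEAD_DIM, STATE_DIM)
for seq_len in (1_024, 16_384, 131_072):
    cache = kv_cache_elements(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM)
    print(f"seq_len={seq_len:>7}: kv_cache={cache:>13,} elements, "
          f"recurrent_state={state:,} elements")
```

The recurrent state stays constant while the cache scales with context length, which is where the decode-time savings come from. It is also why exact recall of arbitrary earlier tokens is harder: everything the model retains has to fit in that fixed budget.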

Sources: Together AI Mamba-3 blog, r/LocalLLaMA discussion, Mamba-3 paper
