Hacker News Flags Mamba-3 as an Inference-First State Space Model Push
Mamba-3 is positioned as an inference-first SSM release
On March 19, 2026, Together AI published Mamba-3 with collaborators from Carnegie Mellon University, Princeton University, and Cartesia AI. The core pitch is clear: Mamba-3 is a new state space model architecture designed around inference efficiency rather than training speed. That framing resonated on Hacker News because current AI deployment pressure comes less from one-time pretraining and more from post-training, RL with verifiable rewards, and agentic workflows that keep models decoding for long periods.
The design changes are more than a minor kernel refresh. Together AI says Mamba-3 improves on Mamba-2 in three ways: a more expressive recurrence built from an exponential-trapezoidal discretization scheme, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that improves quality without adding decode latency. The release also removes the short causal convolution used in earlier Mamba layers and adds newer components such as QKNorm and RoPE-style handling for the complex-valued state.
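To see what the trapezoidal half of that discretization scheme buys, it helps to write out the standard trapezoidal (bilinear) rule for a single linear SSM channel. This is a generic textbook sketch under the usual continuous-time SSM setup, not the exact Mamba-3 formulation:

```latex
% Continuous-time SSM channel: h'(t) = a\,h(t) + b\,x(t).
% Trapezoidal rule over a step of size \Delta averages the derivative
% at both endpoints:
h_t = h_{t-1} + \frac{\Delta}{2}\Big[\big(a\,h_{t-1} + b\,x_{t-1}\big)
      + \big(a\,h_t + b\,x_t\big)\Big]
% Solving for h_t gives the discrete recurrence
h_t = \frac{1 + \Delta a/2}{1 - \Delta a/2}\,h_{t-1}
      + \frac{\Delta\,b}{2\,(1 - \Delta a/2)}\,\big(x_{t-1} + x_t\big)
```

Two properties of this sketch are suggestive: the decay factor $(1+\Delta a/2)/(1-\Delta a/2)$ is a rational approximation of $e^{\Delta a}$ (hence schemes that combine it with an exponential term), and each step mixes $x_{t-1}$ with $x_t$, so the recurrence itself already blends adjacent inputs in a way a short causal convolution otherwise would.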
Why the benchmarks mattered to HN readers
The benchmark claim that likely pulled Hacker News readers in is the latency table at the 1.5B scale. Together AI reports that Mamba-3 SISO beats Mamba-2, Gated DeltaNet, and a Transformer baseline based on Llama-3.2-1B plus vLLM on prefill plus decode latency across all tested sequence lengths from 512 to 16,384. At sequence length 16,384, the published numbers are 140.61 seconds for Mamba-3 SISO, 149.02 for Mamba-2, 145.87 for Gated DeltaNet, and 976.50 for the Transformer baseline.
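Taking the published 16,384-token numbers at face value, the relative gaps are simple arithmetic (the figures below are the ones reported in the article; the script only normalizes them against Mamba-3 SISO):

```python
# Published prefill+decode latencies (seconds) at sequence length 16,384,
# 1.5B scale, as reported by Together AI.
latencies = {
    "Mamba-3 SISO": 140.61,
    "Mamba-2": 149.02,
    "Gated DeltaNet": 145.87,
    "Transformer (Llama-3.2-1B + vLLM)": 976.50,
}

baseline = latencies["Mamba-3 SISO"]
for name, secs in latencies.items():
    print(f"{name}: {secs:.2f}s ({secs / baseline:.2f}x Mamba-3 SISO)")
```

The spread is telling: the three linear-style models land within about 6% of each other, while the Transformer baseline is roughly 6.9x slower at this length.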
That does not mean the Transformer story is over. Together AI also says pure Transformers still do better on retrieval-heavy tasks, while linear models continue to live with a fixed-size state that cannot preserve history in the same way as a KV cache. The more realistic takeaway is that hybrid designs may matter most: linear layers for cheaper memory and decode behavior, paired with self-attention where exact retrieval matters.
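The fixed-state versus KV-cache trade-off can be made concrete with a back-of-the-envelope memory model. Every dimension below (layer count, KV heads, head size, state width) is hypothetical, chosen only to show the two scaling shapes; none of it is taken from an actual Mamba-3 or Llama config:

```python
BYTES_FP16 = 2

# Hypothetical Transformer dims (illustrative only).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 16, 8, 64
# Hypothetical linear-SSM dims (illustrative only).
D_INNER, D_STATE = 2048, 128

def kv_cache_bytes(seq_len: int) -> int:
    """KV cache grows linearly with sequence length (K and V per layer)."""
    return 2 * N_LAYERS * seq_len * N_KV_HEADS * HEAD_DIM * BYTES_FP16

def ssm_state_bytes(seq_len: int) -> int:
    """A linear SSM carries a fixed-size state regardless of sequence length."""
    return N_LAYERS * D_INNER * D_STATE * BYTES_FP16

for L in (512, 4096, 16384):
    print(f"L={L:>6}: KV cache {kv_cache_bytes(L) / 2**20:8.1f} MiB | "
          f"SSM state {ssm_state_bytes(L) / 2**20:6.1f} MiB")
```

Under these made-up dimensions, the KV cache reaches 512 MiB at 16,384 tokens while the SSM state sits at a constant 8 MiB, which is the memory side of the "cheaper decode behavior" claim. The flip side, as the article notes, is that a fixed-size state cannot preserve history losslessly the way a KV cache does, which is exactly where the hybrid argument comes from.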
Open kernels are part of the real announcement
Another reason the post landed well on Hacker News is that it did not stop at architecture claims. Together AI open-sourced the kernels and described a mixed implementation stack built on Triton, TileLang, and CuTe DSL. That matters for practitioners because inference improvements only change deployment economics when the kernels are actually available, not just described in a paper. In that sense, Mamba-3 reads less like a speculative architecture note and more like a push to make inference-first linear models practical in real systems.
Source: Together AI. Hacker News discussion: item 47419391.
Related Articles
A LocalLLaMA thread on March 18, 2026 pushed fresh attention toward Mamba-3, a new state space model release from researchers at Carnegie Mellon University, Princeton, Cartesia AI, and Together AI. The project shifts its design goal from training speed to inference efficiency and claims prefill+decode latency wins over Mamba-2, Gated DeltaNet, and Llama-3.2-1B at the 1.5B scale.
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A r/LocalLLaMA field report showed how a very specific local inference workload was tuned for throughput. The author reported about 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.