Hacker News Flags Mamba-3 as an Inference-First State Space Model Push
Mamba-3 is positioned as an inference-first SSM release
On March 19, 2026, Together AI published Mamba-3 with collaborators from Carnegie Mellon University, Princeton University, and Cartesia AI. The core pitch is clear: Mamba-3 is a new state space model architecture designed around inference efficiency rather than training speed. That framing resonated on Hacker News because current AI deployment pressure comes less from one-time pretraining and more from post-training, RL with verifiable rewards, and agentic workflows that keep models decoding for long periods.
The design changes are more than a minor kernel refresh. Together AI says Mamba-3 improves on Mamba-2 in three ways: a more expressive recurrence built from an exponential-trapezoidal discretization scheme, complex-valued state tracking, and a MIMO (multi-input, multi-output) variant that improves quality without adding decode latency. The release also removes the short causal convolution used in earlier Mamba layers and adds newer components such as QKNorm and RoPE-style handling for the complex-valued state.
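To see what the trapezoidal half of that discretization scheme buys, it helps to write out the standard trapezoidal (bilinear) rule for a single linear SSM channel. This is a generic textbook sketch under the usual continuous-time SSM setup, not the exact Mamba-3 formulation:

```latex
% Continuous-time SSM channel: h'(t) = a\,h(t) + b\,x(t).
% Trapezoidal rule over a step of size \Delta averages the derivative
% at both endpoints:
h_t = h_{t-1} + \frac{\Delta}{2}\Big[\big(a\,h_{t-1} + b\,x_{t-1}\big)
      + \big(a\,h_t + b\,x_t\big)\Big]
% Solving for h_t gives the discrete recurrence
h_t = \frac{1 + \Delta a/2}{1 - \Delta a/2}\,h_{t-1}
      + \frac{\Delta\,b}{2\,(1 - \Delta a/2)}\,\big(x_{t-1} + x_t\big)
```

Two properties of this sketch are suggestive: the decay factor $(1+\Delta a/2)/(1-\Delta a/2)$ is a rational approximation of $e^{\Delta a}$ (hence schemes that combine it with an exponential term), and each step mixes $x_{t-1}$ with $x_t$, so the recurrence itself already blends adjacent inputs in a way a short causal convolution otherwise would.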
Why the benchmarks mattered to HN readers
The benchmark claim that likely pulled Hacker News readers in is the latency table at the 1.5B scale. Together AI reports that Mamba-3 SISO beats Mamba-2, Gated DeltaNet, and a Transformer baseline based on Llama-3.2-1B plus vLLM on prefill plus decode latency across all tested sequence lengths from 512 to 16,384. At sequence length 16,384, the published numbers are 140.61 seconds for Mamba-3 SISO, 149.02 for Mamba-2, 145.87 for Gated DeltaNet, and 976.50 for the Transformer baseline.
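Taking the published 16,384-token numbers at face value, the relative gaps are simple arithmetic (the figures below are the ones reported in the article; the script only normalizes them against Mamba-3 SISO):

```python
# Published prefill+decode latencies (seconds) at sequence length 16,384,
# 1.5B scale, as reported by Together AI.
latencies = {
    "Mamba-3 SISO": 140.61,
    "Mamba-2": 149.02,
    "Gated DeltaNet": 145.87,
    "Transformer (Llama-3.2-1B + vLLM)": 976.50,
}

baseline = latencies["Mamba-3 SISO"]
for name, secs in latencies.items():
    print(f"{name}: {secs:.2f}s ({secs / baseline:.2f}x Mamba-3 SISO)")
```

The spread is telling: the three linear-style models land within about 6% of each other, while the Transformer baseline is roughly 6.9x slower at this length.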
That does not mean the Transformer story is over. Together AI also says pure Transformers still do better on retrieval-heavy tasks, while linear models continue to live with a fixed-size state that cannot preserve history in the same way as a KV cache. The more realistic takeaway is that hybrid designs may matter most: linear layers for cheaper memory and decode behavior, paired with self-attention where exact retrieval matters.
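The fixed-state versus KV-cache trade-off can be made concrete with a back-of-the-envelope memory model. Every dimension below (layer count, KV heads, head size, state width) is hypothetical, chosen only to show the two scaling shapes; none of it is taken from an actual Mamba-3 or Llama config:

```python
BYTES_FP16 = 2

# Hypothetical Transformer dims (illustrative only).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 16, 8, 64
# Hypothetical linear-SSM dims (illustrative only).
D_INNER, D_STATE = 2048, 128

def kv_cache_bytes(seq_len: int) -> int:
    """KV cache grows linearly with sequence length (K and V per layer)."""
    return 2 * N_LAYERS * seq_len * N_KV_HEADS * HEAD_DIM * BYTES_FP16

def ssm_state_bytes(seq_len: int) -> int:
    """A linear SSM carries a fixed-size state regardless of sequence length."""
    return N_LAYERS * D_INNER * D_STATE * BYTES_FP16

for L in (512, 4096, 16384):
    print(f"L={L:>6}: KV cache {kv_cache_bytes(L) / 2**20:8.1f} MiB | "
          f"SSM state {ssm_state_bytes(L) / 2**20:6.1f} MiB")
```

Under these made-up dimensions, the KV cache reaches 512 MiB at 16,384 tokens while the SSM state sits at a constant 8 MiB, which is the memory side of the "cheaper decode behavior" claim. The flip side, as the article notes, is that a fixed-size state cannot preserve history losslessly the way a KV cache does, which is exactly where the hybrid argument comes from.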
Open kernels are part of the real announcement
Another reason the post landed well on Hacker News is that it did not stop at architecture claims. Together AI open-sourced the kernels and described a mixed implementation stack built on Triton, TileLang, and CuTe DSL. That matters for practitioners because inference improvements only change deployment economics when the kernels are actually available, not just described in a paper. In that sense, Mamba-3 reads less like a speculative architecture note and more like a push to make inference-first linear models practical in real systems.
Source: Together AI. Hacker News discussion: item 47419391.
Related Articles
A LocalLLaMA thread on March 18, 2026 pushed fresh attention toward Mamba-3, a new state space model release from researchers at Carnegie Mellon University, Princeton, Cartesia AI, and Together AI. The project shifts its design goal from training speed to inference efficiency and claims prefill+decode latency wins over Mamba-2, Gated DeltaNet, and Llama-3.2-1B at the 1.5B scale.
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A r/LocalLLaMA field report showed how a very specific local inference workload was tuned for throughput. The author reported about 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.