NVIDIA launches Nemotron 3 Super for multi-agent AI workloads
Original: Introducing NVIDIA Nemotron 3 Super 🎉 Open 120B-parameter (12B active) hybrid Mamba-Transformer MoE model. Native 1M-token context. Built for compute-efficient, high-accuracy multi-agent applications. Plus, fully open weights, datasets and recipes for easy customization and deployment. 🧵
What NVIDIA announced on X
On March 11, 2026, NVIDIA AI Developer introduced Nemotron 3 Super as an open 120B-parameter hybrid Mamba-Transformer MoE model with 12B active parameters at inference. The X post framed the release around a few practical claims that matter for agent builders: a native 1M-token context window, an architecture tuned for compute-efficient multi-agent work, and open weights, datasets, and recipes for customization and deployment.
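For teams that want a feel for how such a model would slot into an existing agent stack, the sketch below shows the usual call pattern against an OpenAI-compatible endpoint (for example, a self-hosted NIM or vLLM server). The base URL and model identifier are placeholders for illustration, not confirmed names; check NVIDIA's model card for the actual values.

```python
# Minimal sketch: querying Nemotron 3 Super through an OpenAI-compatible
# endpoint (e.g., a self-hosted NIM or vLLM server). The base_url and
# model name below are placeholders, not confirmed identifiers.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # assumed local inference server
    api_key="not-needed-for-local",        # local servers often ignore this
)

response = client.chat.completions.create(
    model="nvidia/nemotron-3-super",       # hypothetical model id
    messages=[
        {"role": "system", "content": "You are a planning agent."},
        {"role": "user", "content": "Summarize the open issues in this repo."},
    ],
    max_tokens=512,
    temperature=0.2,
)
print(response.choices[0].message.content)
```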
That positioning makes this more than a benchmark-oriented model refresh. NVIDIA is explicitly targeting the cost structure of long-running agent systems, where context grows quickly, intermediate reasoning piles up, and throughput becomes a gating factor for real-world deployment.
What the official NVIDIA blog adds
NVIDIA’s launch post says Nemotron 3 Super delivers up to 5x higher throughput and up to 2x higher accuracy than the previous Nemotron Super model. The company attributes that to a hybrid architecture that combines Mamba layers, transformer reasoning, sparse MoE activation, and multi-token prediction. NVIDIA also says the model is optimized for NVIDIA Blackwell, can run in NVFP4 precision, and is intended to reduce the “context explosion” and “thinking tax” that make multi-agent systems expensive and slow.
- NVIDIA says the model tops the Artificial Analysis rankings for efficiency and openness among similarly sized models.
- The post says Nemotron 3 Super powers NVIDIA AI-Q to No. 1 on DeepResearch Bench and DeepResearch Bench II.
- NVIDIA says it is releasing open weights under a permissive license along with methodology, more than 10 trillion tokens of pre- and post-training datasets, 15 reinforcement-learning training environments, and evaluation recipes.
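The NVFP4 and sparse-activation claims are easier to reason about with a back-of-envelope estimate. The sketch below compares weight memory for a 120B-parameter model at a few precisions and notes that per-token compute tracks the roughly 12B active parameters rather than the full 120B. It ignores quantization scale factors, activations, caches, and runtime overhead, so treat it as an order-of-magnitude illustration rather than a sizing guide.

```python
# Back-of-envelope weight memory for a 120B-parameter model at different
# precisions. Ignores quantization scale factors, activations, and caches.
TOTAL_PARAMS = 120e9

bytes_per_param = {
    "BF16": 2.0,
    "FP8": 1.0,
    "NVFP4": 0.5,   # 4-bit weights
}

for fmt, width in bytes_per_param.items():
    gib = TOTAL_PARAMS * width / 2**30
    print(f"{fmt:>6}: ~{gib:,.0f} GiB of weights")

# Per-token compute scales with the ~12B active parameters, not all 120B:
# to first order, ~2 * 12e9 FLOPs per token for the forward pass. That gap
# is the core of the sparse-MoE efficiency argument.
```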
Why this matters for AI teams
The most important operational detail is the combination of long context and sparse activation. A 1M-token window is only useful if teams can actually afford to keep large workflow state in memory. By pairing that window with 12B active parameters, NVIDIA is arguing that developers do not have to choose as sharply between long-horizon context and practical inference cost.
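To see why a 1M-token window is normally the expensive part, consider the KV cache of a pure-transformer model, which grows linearly with context length. The numbers below (layer count, KV heads, head dimension) are illustrative placeholders, not Nemotron 3 Super's published configuration; the point is that Mamba layers keep a fixed-size recurrent state per sequence instead of a per-token cache, so a hybrid design avoids most of this growth.

```python
# Rough KV-cache size for a dense-attention model at long context.
# All architectural numbers below are illustrative, not Nemotron's specs.
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_val=2):
    # K and V tensors per attention layer, per token, at 2 bytes each (BF16)
    total = 2 * tokens * layers * kv_heads * head_dim * bytes_per_val
    return total / 2**30

ctx = 1_000_000
print(f"Attention-only, 48 layers:   ~{kv_cache_gib(ctx, 48, 8, 128):.0f} GiB")
print(f"Hybrid, ~8 attention layers: ~{kv_cache_gib(ctx, 8, 8, 128):.0f} GiB")
# Mamba/SSM layers hold a constant-size state per sequence, so replacing
# most attention layers shrinks the per-token cache dramatically.
```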
The launch is also a supply-side signal for the open-model ecosystem. If the open weights, training recipes, and evaluation artifacts are genuinely usable, teams building coding agents, research agents, and retrieval-heavy enterprise systems get a model family that is easier to audit, fine-tune, and benchmark against proprietary alternatives. The real test will be whether production users see the claimed speed and reasoning gains on their own agent stacks, but the release is clearly aimed at that exact workload class.
Sources: NVIDIA AI Developer X post, NVIDIA launch blog
Related Articles
Microsoft says Fireworks AI is now part of Microsoft Foundry, bringing high-performance, low-latency open-model inference to Azure. The launch emphasizes day-zero access to leading open models, custom-model deployment, and enterprise controls in one place.
A high-engagement r/LocalLLaMA thread tracked the MiniMax-M2.5 release on Hugging Face. The model card emphasizes agentic coding/search benchmarks, runtime speedups, and aggressive cost positioning.
In a February 12, 2026 post, NVIDIA said major inference providers are reducing token costs with open-source frontier models on Blackwell. The article includes partner-reported gains across healthcare, gaming, and enterprise support workloads.