r/LocalLLaMA Reacts to NVIDIA's Open Nemotron 3 Super Release
Original: Nemotron 3 Super Released
Why the release landed on LocalLLaMA
NVIDIA positioned Nemotron 3 Super as a model for agentic reasoning rather than a generic frontier demo. The official blog describes it as a 120B-total, 12B-active-parameter hybrid Mamba-Transformer MoE built for software development and cybersecurity triage. NVIDIA also highlights a native 1M-token context window, more than 5x the throughput of the previous Nemotron Super, and fully open weights, datasets, and recipes. The framing is explicit: this is supposed to reduce the "thinking tax" that makes multi-agent systems slow and expensive.
What made the LocalLLaMA thread move was not just the headline size. Commenters immediately zeroed in on deployability. People shared Hugging Face links for BF16 and NVFP4 weights, pointed to early GGUF conversions, and compared hardware floors for 64GB-class machines. That is the signature LocalLLaMA response: less interest in press language, more interest in whether the model can actually run.
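To see why the thread anchored on 64GB-class machines, a back-of-the-envelope estimate helps. The sketch below is illustrative only: the bits-per-weight figures are rough assumptions about quantization overheads, not measured sizes of any published Nemotron 3 Super build.

```python
# Rough memory floor for a 120B-total-parameter MoE at different
# quantization levels. Bits-per-weight values are ballpark assumptions
# (format overheads vary), not measured Nemotron 3 Super numbers.

TOTAL_PARAMS = 120e9  # every expert must be resident, even at 12B active

def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes."""
    return params * bits_per_weight / 8 / 1e9

for label, bpw in [
    ("BF16", 16.0),
    ("NVFP4, ~4.5 bpw with scales (assumed)", 4.5),
    ("4-bit GGUF, ~4.8 bpw (assumed)", 4.8),
]:
    print(f"{label}: ~{weights_gb(TOTAL_PARAMS, bpw):.0f} GB before KV cache")
```

Under these assumptions, even aggressive 4-bit quantization lands near 70GB of weights, because all 120B parameters stay resident in a MoE; the 12B active figure buys throughput, not a smaller memory floor. That is exactly the tension the hardware-floor comments were probing.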
What stands out technically
NVIDIA says the model combines Mamba layers for sequence efficiency with Transformer layers for precision reasoning. It pairs that architecture with Blackwell-oriented NVFP4 pretraining and reinforcement-learning post-training across 21 environment configurations and more than 1.2 million environment rollouts. The openness claim matters more than usual for a model in this size class: open weights, datasets, and recipes give the community room to quantize, adapt, and inspect the release instead of treating it as a sealed checkpoint.
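The sequence-efficiency argument is easiest to see in per-token decode cost. The toy arithmetic below is not NVIDIA's published math; the hidden dimension and SSM state size are hypothetical, and routing and implementation constants are ignored. It only shows why constant-time recurrent layers matter as context grows toward 1M tokens.

```python
# Toy comparison of per-token decode cost: attention over a growing KV
# cache vs. a constant-time SSM state update. HIDDEN and STATE are
# assumed values for illustration, not Nemotron 3 Super's actual config.

HIDDEN = 8192   # assumed hidden dimension
STATE = 128     # assumed SSM state size per channel

def attn_decode_flops(ctx_len: int, hidden: int) -> float:
    """Attention decode: each new token attends over the whole KV cache."""
    return 2.0 * ctx_len * hidden  # QK^T plus weighted sum, per layer (rough)

def ssm_decode_flops(hidden: int, state: int) -> float:
    """SSM decode: recurrent state update, independent of context length."""
    return 2.0 * hidden * state

for ctx in (8_192, 131_072, 1_000_000):
    ratio = attn_decode_flops(ctx, HIDDEN) / ssm_decode_flops(HIDDEN, STATE)
    print(f"ctx={ctx:>9,}: attention/SSM per-token cost ratio ~ {ratio:,.0f}x")
```

Under these assumptions the gap scales linearly with context, reaching thousands of times at 1M tokens, which is the plain reading of why a hybrid stack interleaves Mamba layers with attention rather than going attention-only.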
The near-term question is toolchain support. Several Reddit comments noted that mainline llama.cpp support was still catching up, while Unsloth branches and GGUF builds were filling the gap. So the interesting part of this story is not only NVIDIA's architecture pitch. It is whether the open release can move quickly enough into the community stack to become a usable local reasoning option rather than just another model card people admire from a distance.
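For readers wondering what "moving into the community stack" looks like in practice, here is a minimal sketch of running a local GGUF build with llama-cpp-python. The model filename is a placeholder, not a real artifact, and this assumes conversion support has actually landed in the build you install, which the thread notes was still catching up in mainline.

```python
# Minimal sketch of running a community GGUF build locally with
# llama-cpp-python (pip install llama-cpp-python). The model path is
# hypothetical; real Nemotron 3 Super GGUFs may need a patched build.

from llama_cpp import Llama

llm = Llama(
    model_path="nemotron-3-super-q4_k_m.gguf",  # placeholder filename
    n_ctx=32768,       # far below the advertised 1M window; memory-bound
    n_gpu_layers=-1,   # offload all layers the GPU can hold
)

out = llm(
    "Summarize the tradeoffs of hybrid Mamba-Transformer MoE models.",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```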
Related Articles
A March 26, 2026 r/LocalLLaMA post linking NVIDIA's `gpt-oss-puzzle-88B` model card reached 284 points and 105 comments at crawl time. NVIDIA says the 88B MoE model uses its Puzzle post-training NAS pipeline to cut parameters and KV-cache costs while keeping reasoning accuracy at or above the parent model's.
Why it matters: post-training agentic models increasingly depends on reinforcement-learning throughput, not only inference speed. NVIDIA says NeMo RL's FP8 path speeds RL workloads by 1.48x on Qwen3-8B-Base while tracking BF16 accuracy.
HN read Kimi K2.6 as a test of whether open-weight coding agents can last through real engineering work. The 12-hour and 13-hour coding sessions drew attention, while commenters immediately pressed on speed, provider accuracy, and benchmark realism.