Skip to content
Decaying

Ollama brings NVIDIA’s Nemotron-Cascade-2 into local and agent workflows

Original: Nemotron-Cascade-2 is now available to run with Ollama. ollama run nemotron-cascade-2 To run it locally with OpenClaw: ollama launch openclaw --model nemotron-cascade-2 This model from NVIDIA delivers strong reasoning and agentic capabilities on par with models with up to 20x more parameters. View original →

Read in other languages: 한국어日本語
LLM Mar 21, 2026 By Insights AI 2 min read 65 views Source

What Ollama announced on X

On March 20, 2026, Ollama said Nemotron-Cascade-2 is now available to run through its local model runtime. The post gives the most direct use case immediately: developers can pull the model with ollama run nemotron-cascade-2 and wire it into agent workflows with commands such as ollama launch openclaw --model nemotron-cascade-2.

That matters because the announcement is not about a closed hosted endpoint. It is about making a large reasoning-oriented NVIDIA model easier to drop into local and semi-local development environments. Ollama’s own framing is aggressive: it says the model offers strong reasoning and agentic performance comparable to systems with far larger parameter counts.

What the official model page confirms

Ollama’s model page describes Nemotron-Cascade-2 as an open 30B MoE model from NVIDIA with 3B activated parameters. The page also says the model supports both thinking and instruct modes, which is important for teams that want one model for deeper reasoning passes as well as lower-latency task execution.

  • The model page marks it as a tools-capable model and exposes launch paths into OpenClaw, Codex, and Claude via Ollama’s launcher integrations.
  • It identifies the main downloadable variant as 30b.
  • The page also says Nemotron-Cascade-2-30B-A3B achieved gold medal performance on the 2025 International Mathematical Olympiad and the International Olympiad in Informatics.

In effect, Ollama is packaging a frontier-style reasoning model into a format that is easier to test in local developer loops, agent shells, and custom tooling stacks without depending on a separate proprietary inference surface.

Why this matters

The local model ecosystem is moving from small convenience models toward serious reasoning systems, and this release is a strong example of that shift. A 30B MoE model with only 3B activated parameters suggests a design optimized for capability without requiring the full runtime cost of a dense model at the same nominal size. That makes it more practical for experimentation and for agent workflows where many calls accumulate quickly.

It also reflects a second industry trend: model value increasingly depends on surrounding workflow support. Ollama is not only listing a model; it is showing how the model fits into tools developers already use for coding and agent orchestration. That shortens the distance between “interesting model release” and “something teams can actually evaluate in their own environment.”

Sources: Ollama X post · Ollama model page

Share: Long

Related Articles

LLM Hacker News Apr 16, 2026 2 min read

HN reacted because this was less about one wrapper and more about who gets credit and control in the local LLM stack. The Sleeping Robots post argues that Ollama won mindshare on top of llama.cpp while weakening trust through attribution, packaging, cloud routing, and model storage choices, while commenters pushed back that its UX still solved a real problem.

Comments (0)

No comments yet. Be the first to comment!

Leave a Comment