Ollama brings NVIDIA’s Nemotron-Cascade-2 into local and agent workflows
Original post:

> Nemotron-Cascade-2 is now available to run with Ollama.
>
> `ollama run nemotron-cascade-2`
>
> To run it locally with OpenClaw:
>
> `ollama launch openclaw --model nemotron-cascade-2`
>
> This model from NVIDIA delivers strong reasoning and agentic capabilities on par with models with up to 20x more parameters.

View original →
What Ollama announced on X
On March 20, 2026, Ollama said Nemotron-Cascade-2 is now available to run through its local model runtime. The post leads with the most direct use case: developers can pull and run the model with `ollama run nemotron-cascade-2` and wire it into agent workflows with commands such as `ollama launch openclaw --model nemotron-cascade-2`.
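Beyond the CLI, Ollama also serves models over a local REST API (by default on port 11434), which is usually the easier hook for custom agent code. The sketch below only builds the standard `/api/generate` request body rather than sending it; the model name comes from the announcement, and a running `ollama serve` instance would be needed to actually submit it.

```python
import json

# Default local endpoint exposed by `ollama serve`.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """Serialize a request body for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return json.dumps(payload).encode("utf-8")

body = build_generate_request(
    "nemotron-cascade-2",
    "Summarize mixture-of-experts routing in one sentence.",
)
print(json.loads(body)["model"])  # nemotron-cascade-2
```

To actually run it, POST that body to the URL above with any HTTP client (`curl`, `urllib.request`, etc.) while the Ollama daemon is running.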
That matters because the announcement is not about a closed hosted endpoint. It is about making a large reasoning-oriented NVIDIA model easy to drop into local and semi-local development environments. Ollama's own framing is bold: it says the model offers reasoning and agentic performance on par with models that have up to 20x more parameters.
What the official model page confirms
Ollama’s model page describes Nemotron-Cascade-2 as an open 30B MoE model from NVIDIA with 3B activated parameters. The page also says the model supports both thinking and instruct modes, which is important for teams that want one model for deeper reasoning passes as well as lower-latency task execution.
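One way to exploit the dual modes from application code is to keep a single request builder and flip a flag per call. The sketch below uses the `think` field that Ollama's chat API exposes for reasoning-capable models; whether Nemotron-Cascade-2 honors that exact toggle is an assumption here, and the rest is a plain chat payload.

```python
def build_chat_request(model: str, user_message: str, think: bool) -> dict:
    """Build an Ollama /api/chat payload, optionally requesting a reasoning pass.

    The `think` field mirrors the toggle Ollama uses for reasoning models;
    whether this model honors it is an assumption, not confirmed by the page.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }
    if think:
        payload["think"] = True  # deeper, slower reasoning pass
    return payload

deep = build_chat_request("nemotron-cascade-2", "Plan a refactor of this module.", think=True)
fast = build_chat_request("nemotron-cascade-2", "Rename variable x to count.", think=False)
print(deep.get("think"), fast.get("think"))  # True None
```

Routing heavyweight planning calls through the thinking mode and quick edits through the instruct mode is the pattern the dual-mode design is meant to enable.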
- The model page marks it as a tools-capable model and exposes launch paths into OpenClaw, Codex, and Claude via Ollama’s launcher integrations.
- It identifies the main downloadable variant as `30b`.
- The page also says Nemotron-Cascade-2-30B-A3B achieved gold medal performance on the 2025 International Mathematical Olympiad and the International Olympiad in Informatics.
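For teams sizing hardware for the `30b` variant, a back-of-envelope weight-memory estimate is worth doing before pulling tens of gigabytes. The sketch below uses the usual bytes-per-parameter figures for common quantization levels; it covers weights only (KV cache and runtime overhead come on top), and these are rough estimates, not figures from the model page.

```python
# Rough weight-only memory estimates for a 30B-parameter checkpoint.
TOTAL_PARAMS = 30e9

BYTES_PER_PARAM = {
    "fp16": 2.0,    # full half precision
    "q8_0": 1.0,    # ~8 bits per weight
    "q4_K_M": 0.5,  # ~4 bits per weight (ignores small per-block overhead)
}

def weight_gb(total_params: float, fmt: str) -> float:
    """Approximate weight memory in GB for a given quantization format."""
    return total_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_gb(TOTAL_PARAMS, fmt):.0f} GB")  # 60, 30, 15 GB
```

The spread between ~15 GB at 4-bit and ~60 GB at fp16 is exactly why quantized variants dominate local use.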
In effect, Ollama is packaging a frontier-style reasoning model into a format that is easier to test in local developer loops, agent shells, and custom tooling stacks without depending on a separate proprietary inference surface.
Why this matters
The local model ecosystem is moving from small convenience models toward serious reasoning systems, and this release is a strong example of that shift. A 30B MoE model with only 3B activated parameters means roughly a tenth of the weights participate in each forward pass, so it targets frontier-style capability without the full per-token compute cost of a dense model at the same nominal size. That makes it more practical for experimentation and for agent workflows where many calls accumulate quickly.
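The cost argument can be made concrete with the common rough rule that a decoder forward pass costs about 2 FLOPs per active parameter per token. Under that assumption (a general estimate, not a figure from NVIDIA), the MoE's per-token compute matches a 3B dense model, about a tenth of a dense 30B:

```python
def per_token_flops(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token: ~2 * active parameters."""
    return 2.0 * active_params

dense_30b = per_token_flops(30e9)  # all 30B weights used for every token
moe_a3b = per_token_flops(3e9)     # only the routed 3B experts are active

print(f"MoE/dense compute ratio: {moe_a3b / dense_30b:.1f}")  # 0.1
```

Note that memory is a separate axis: all 30B weights still need to be resident, so the savings show up in latency and throughput rather than footprint.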
It also reflects a second industry trend: model value increasingly depends on surrounding workflow support. Ollama is not only listing a model; it is showing how the model fits into tools developers already use for coding and agent orchestration. That shortens the distance between “interesting model release” and “something teams can actually evaluate in their own environment.”
Sources: Ollama X post · Ollama model page
Related Articles
A March 15, 2026 Hacker News post about GreenBoost reached 124 points and 25 comments. The open-source Linux project combines a kernel module and CUDA shim to tier model memory across VRAM, DDR4, and NVMe so larger local LLMs can run without changing inference apps.
Alibaba's Qwen team has released Qwen 3.5 Small, a new small dense model in their flagship open-source series. The announcement topped r/LocalLLaMA with over 1,000 upvotes, reflecting the local AI community's enthusiasm for capable small models.
In a February 12, 2026 post, NVIDIA said major inference providers are reducing token costs with open-source frontier models on Blackwell. The article includes partner-reported gains across healthcare, gaming, and enterprise support workloads.