A new r/LocalLLaMA thread argues that NVIDIA's Nemotron-Cascade-2-30B-A3B deserves more attention after quick local coding evals came in stronger than expected. The post is interesting because it lines up community measurements with NVIDIA's own push for a reasoning-oriented open MoE model that keeps activated parameters low.
A March 21, 2026 Hacker News discussion sent tinygrad's tinybox page back to the front page and put a shipping local AI workstation in front of builders looking beyond rented GPU time. The product pitch is notable because it pairs concrete specs with pricing that targets labs and startups trying to run bigger models on premises.
GitHub said AI coding agents can now invoke secret scanning through the GitHub MCP Server before a commit or pull request. The feature is in public preview for repositories with GitHub Secret Protection enabled.
Google updated Gemini across Docs, Sheets, Slides, and Drive to generate first drafts, build spreadsheets and presentations, and surface cited answers from Drive. The company also said Gemini in Sheets reached 70.48% on SpreadsheetBench.
Ollama said on March 18, 2026 that MiniMax-M2.7 was available through its cloud path and could be launched from Claude Code and OpenClaw. The Ollama library page describes the M2-series model as a coding- and productivity-focused system with strong results on SWE-Pro, VIBE-Pro, Terminal Bench 2, GDPval-AA, and Toolathon.
OpenAI said on March 5, 2026 that GPT-5.4 Thinking and GPT-5.4 Pro were rolling out in ChatGPT, while GPT-5.4 also became available in the API and Codex. OpenAI’s launch page positions GPT-5.4 as a unified frontier model for reasoning, coding, native computer use, and long-horizon agent workflows.
A Reddit thread in r/LocalLLaMA spotlighted mlx-lm PR #990, which uses Qwen3.5's built-in MTP head for native speculative decoding and reports 15.3 -> 23.3 tok/s (~1.5x throughput boost) with ~80.6% acceptance rate on Qwen3.5-27B 4-bit on an M4 Pro. The gain is meaningful, but so are the constraints around converted checkpoints, disabled batching, and untested MoE variants.
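For intuition on why the acceptance rate drives that throughput number, the sketch below is a self-contained toy version of greedy speculative decoding: a cheap draft head proposes a block, the full model verifies it, and the longest matching prefix is accepted. The stand-in target model and draft head are invented for illustration; this is not the PR's mlx-lm code.

```python
# Toy greedy speculative decoding: a cheap draft head proposes a short block of
# tokens, the "full model" verifies them, and the longest matching prefix is
# accepted. Acceptance rate is what turns drafting into a throughput win.
import random

VOCAB = list(range(100))

def target_next(context):
    """Stand-in for the full model's greedy next-token choice (deterministic)."""
    return (sum(context) * 31 + len(context)) % len(VOCAB)

def draft_block(context, k=4, noise=0.2):
    """Stand-in for an MTP-style draft head: usually agrees with the target,
    but occasionally guesses wrong (this controls the acceptance rate)."""
    out, ctx = [], list(context)
    for _ in range(k):
        tok = target_next(ctx)
        if random.random() < noise:
            tok = random.choice(VOCAB)
        out.append(tok)
        ctx.append(tok)
    return out

def speculative_step(context, k=4):
    """Verify a drafted block against the target model; return accepted tokens."""
    drafted = draft_block(context, k)
    accepted, ctx = [], list(context)
    for tok in drafted:
        expected = target_next(ctx)          # one verification per drafted position
        if tok == expected:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(expected)        # correct the first mismatch, stop early
            break
    else:
        accepted.append(target_next(ctx))    # bonus token when the whole block matches
    return accepted

if __name__ == "__main__":
    random.seed(0)
    context, accepted_total, drafted_total = [1, 2, 3], 0, 0
    for _ in range(200):
        step = speculative_step(context, k=4)
        context.extend(step)
        accepted_total += len(step) - 1      # last token always comes from the target model
        drafted_total += 4
    print(f"draft acceptance rate: {accepted_total / drafted_total:.1%}")
```

The higher the draft head's agreement with the target model, the more tokens each verification pass yields, which is the mechanism behind the roughly 1.5x speedup the PR reports.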
A Show HN repo claims that duplicating a few LLM layers can improve reasoning without training or weight changes. The underlying README, however, shows real tradeoffs, making this more convincing as capability steering than as a universal model upgrade.
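For readers who want to try the general idea, here is a minimal sketch of layer duplication on a Hugging Face causal LM. The repo's actual recipe may differ, and the model id below is just a small stand-in; nothing is trained and no weights are modified, a couple of middle decoder blocks are simply executed twice.

```python
# Minimal layer-duplication sketch on a Llama/Qwen-style causal LM: deep-copy a
# few middle decoder layers and splice them back in, no training, no weight edits.
# Model id is illustrative; the Show HN repo's exact method may differ.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"  # any small decoder-only model works for a test

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

layers = model.model.layers                                     # nn.ModuleList of decoder blocks
dup_range = range(len(layers) // 2 - 1, len(layers) // 2 + 1)   # two middle layers

new_layers = []
for i, layer in enumerate(layers):
    new_layers.append(layer)
    if i in dup_range:
        new_layers.append(copy.deepcopy(layer))                 # same weights, run twice

for i, layer in enumerate(new_layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = i                           # keep KV-cache indexing consistent

model.model.layers = torch.nn.ModuleList(new_layers)
model.config.num_hidden_layers = len(new_layers)

prompt = "A train travels 60 km/h for 2.5 hours. How far does it go? Think step by step."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```

Expect the tradeoffs the README admits to: more compute per token and behavior shifts that help some reasoning prompts while hurting others.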
Ollama said on March 20, 2026 that NVIDIA’s Nemotron-Cascade-2 can now run through its local model stack. The official model page positions it as an open 30B MoE model with 3B activated parameters, thinking and instruct modes, and built-in paths into agent tools such as OpenClaw, Codex, and Claude.
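A quick way to poke at such a build once it is pulled locally is the ollama Python client. The model tag below is an assumption, so check `ollama list` or the official model page for the real name.

```python
# Minimal sketch of chatting with a locally served Nemotron-Cascade-2 build via
# the ollama Python client. The tag below is a guess, not the confirmed name.
import ollama

MODEL = "nemotron-cascade-2:30b"  # assumed tag; substitute whatever `ollama pull` gave you

resp = ollama.chat(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
)
print(resp["message"]["content"])
```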
OpenAI outlines how GPT-5.4 can produce stronger frontends with tighter constraints and real content
OpenAI said on March 20, 2026 that better GPT-5.4 frontend work starts with explicit constraints, visual references, and real content instead of vague prompts. The linked OpenAI Developers guide turns that idea into a practical playbook for shipping more polished web interfaces.
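As a rough illustration of what a constraint-heavy request looks like in practice, the sketch below sends one through the OpenAI Python SDK's Responses API. The prompt wording is invented rather than taken from the guide, and the model id is assumed from the announcement.

```python
# Sketch of a constraint-heavy frontend prompt: explicit stack, layout limits, a
# fixed palette, and real copy instead of "make it look nice". The prompt text is
# an illustration, not the OpenAI Developers guide's own example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = """Build a pricing page as a single React + TypeScript component using Tailwind.

Constraints:
- Three tiers side by side on desktop, stacked on mobile (max-w-6xl, 24px gutters).
- Use only the palette #0F172A, #38BDF8, #F8FAFC; no gradients.
- Highlight the middle tier with a 1px #38BDF8 border and a "Most popular" badge.

Real content (use verbatim, do not invent copy):
- Starter, $0/mo: "1 project, community support"
- Team, $29/mo: "Unlimited projects, SSO, priority support"
- Enterprise, custom pricing: "Dedicated infra, SLA, onboarding"
"""

resp = client.responses.create(model="gpt-5.4", input=prompt)
print(resp.output_text)
```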
A merged Hugging Face Transformers PR surfaced on r/LocalLLaMA shows Mistral 4 as a hybrid instruct/reasoning model with 128 experts, 4 active experts, 6.5B activated parameters per token, 256k context, and Apache 2.0 licensing.
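The 128-expert, 4-active split is standard top-k MoE routing. The toy PyTorch sketch below shows the pattern with made-up hidden sizes, not Mistral 4's real architecture, to make the activated-parameter idea concrete.

```python
# Toy top-k expert routing matching the PR's 128-expert / 4-active spec. Hidden
# sizes are invented for illustration; only the routing pattern is the point.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EXPERTS, TOP_K, D_MODEL, D_FF = 128, 4, 256, 512

class TopKMoE(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(D_MODEL, NUM_EXPERTS, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(D_MODEL, D_FF), nn.SiLU(), nn.Linear(D_FF, D_MODEL))
            for _ in range(NUM_EXPERTS)
        )

    def forward(self, x):                       # x: (tokens, D_MODEL)
        logits = self.router(x)                 # (tokens, NUM_EXPERTS)
        weights, idx = torch.topk(logits, TOP_K, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the 4 chosen experts
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):             # naive per-token dispatch, for clarity
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

tokens = torch.randn(8, D_MODEL)
print(TopKMoE()(tokens).shape)                  # torch.Size([8, 256])
# Only 4 of 128 expert MLPs run per token, which is how a large total parameter
# count can coexist with a much smaller activated figure (6.5B per token here).
```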
IBM Granite on March 20, 2026 released Mellea 0.4.0 and three Granite Libraries built around Granite 4.0 Micro. The release is aimed at teams that want more structured, schema-safe, and safety-aware agentic RAG pipelines instead of depending on prompt-only orchestration.
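"Schema-safe" in this context usually means validating each step's output against a declared schema before downstream code touches it. The sketch below shows that pattern with plain pydantic; it is not Mellea's API, only an illustration of the style of pipeline the release targets.

```python
# Generic illustration of a schema-safe RAG step: accept a model's JSON output
# only if it validates, otherwise send it back for repair. Plain pydantic, not
# Mellea or the Granite Libraries; the schema fields are invented for the example.
from pydantic import BaseModel, Field, ValidationError

class GroundedAnswer(BaseModel):
    answer: str
    source_ids: list[str] = Field(min_length=1)   # must cite at least one retrieved chunk
    confidence: float = Field(ge=0.0, le=1.0)

def parse_step_output(raw_json: str) -> GroundedAnswer | None:
    """Return a validated answer object, or None if the output violates the schema."""
    try:
        return GroundedAnswer.model_validate_json(raw_json)
    except ValidationError as err:
        print(f"schema violation, send back for repair: {err.error_count()} error(s)")
        return None

good = parse_step_output('{"answer": "Granite 4.0 Micro is a small LM.", "source_ids": ["doc-12"], "confidence": 0.8}')
bad = parse_step_output('{"answer": "No idea.", "source_ids": [], "confidence": 1.7}')
print(good, bad)
```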