A new llama.cpp change turns <code>--reasoning-budget</code> into a real sampler-side limit instead of a template stub. The LocalLLaMA thread focused on the tradeoff between cutting long think loops and preserving answer quality, especially for local Qwen 3.5 deployments.
LLM
RSS FeedA Show HN post for agent-browser-protocol argues that many browser-agent failures are harness failures caused by stale state. HN discussion centered on the project's freeze-after-action design, Chromium maintenance cost, and the reported 90.5% Online Mind2Web score with Opus 4.6.
METR's March 10, 2026 note argues that about half of test-passing SWE-bench Verified PRs from recent agents would still be rejected by maintainers. HN treated it as a warning that benchmark wins do not yet measure scope control, code quality, or repo fit.
Anthropic says Claude for Excel and Claude for PowerPoint now share conversation context across open files, reducing the need to restate data or instructions between spreadsheets and decks. The company also added skills inside the add-ins and expanded deployment through Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
OpenAI Developers published a March 11, 2026 engineering write-up explaining how the Responses API uses a hosted computer environment for long-running agent workflows. The post centers on shell execution, hosted containers, controlled network access, reusable skills, and native compaction for context management.
NVIDIA AI Developer introduced Nemotron 3 Super on March 11, 2026 as an open 120B-parameter hybrid MoE model with 12B active parameters and a native 1M-token context window. NVIDIA says the model targets agentic workloads with up to 5x higher throughput than the previous Nemotron Super model.
A well-received r/LocalLLaMA experiment described tinyforge: Qwen 3.5 0.8B running on a MacBook Air, trained on 13 self-generated repair pairs from a test-feedback loop, with a reported holdout jump from 16/50 to 28/50.
A high-scoring r/MachineLearning post resurfaced David Noel Ng's long-form write-up, centering on the claim that duplicating a seven-layer middle block in Qwen2-72B, without changing weights, was enough to reach the top of the open leaderboard.
Hacker News pushed Microsoft's bitnet.cpp back into view, treating it less as a new 100B checkpoint and more as an infrastructure play for 1.58-bit inference and lower-power local LLM deployment.
Google AI Developers says Gemini Embedding 2 is now in preview via the Gemini API and Vertex AI. Google describes it as its first fully multimodal embedding model on the Gemini architecture and its most capable embedding model so far.
Microsoft says Fireworks AI is now part of Microsoft Foundry, bringing high-performance, low-latency open-model inference to Azure. The launch emphasizes day-zero access to leading open models, custom-model deployment, and enterprise controls in one place.
A Launch HN thread pulled RunAnywhere’s MetalRT and RCLI into focus, centering attention on a low-latency STT-LLM-TTS stack that runs on Apple Silicon without cloud APIs.