In a post on X on March 9, 2026, GitHub resurfaced its guide to building reliable multi-agent systems. The company argues that most failures come from missing structure, and recommends typed schemas, action schemas, and the Model Context Protocol as the core engineering controls.
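The "typed schemas" point is easiest to see with a concrete action schema that validates a tool call before anything executes. A minimal sketch in Python, assuming Pydantic v2 and a hypothetical search_issues action; the field names and limits are illustrative, not taken from GitHub's guide:

```python
# Minimal sketch of a typed action schema for an agent tool call.
# Assumes Pydantic v2; the action name and fields are hypothetical.
from pydantic import BaseModel, Field, ValidationError


class SearchIssuesAction(BaseModel):
    """Arguments an agent must supply to call a hypothetical search_issues tool."""
    repo: str = Field(pattern=r"^[\w.-]+/[\w.-]+$")   # "owner/name"
    query: str = Field(min_length=1, max_length=256)
    max_results: int = Field(default=10, ge=1, le=50)


def validate_action(raw_args: dict) -> SearchIssuesAction:
    """Reject malformed model output before it causes any side effect."""
    try:
        return SearchIssuesAction.model_validate(raw_args)
    except ValidationError as err:
        # The error can be fed back to the model so it retries with valid arguments.
        raise ValueError(f"rejected tool call: {err}") from err


if __name__ == "__main__":
    print(validate_action({"repo": "octocat/hello-world", "query": "flaky tests"}))
```

The point of the pattern is that the schema, not the prompt, is what stops a malformed or over-broad action from reaching the tool.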
A high-scoring LocalLLaMA post says Qwen 3.5 9B on a 16GB M1 Pro handled memory recall and basic tool calling well enough for real agent work, even though creative reasoning still trailed frontier models.
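For readers who want to try the tool-calling part locally, a minimal sketch against the OpenAI-compatible endpoint that local servers such as Ollama or llama.cpp expose; the URL, model tag, and get_time tool are assumptions, not details from the post:

```python
# Minimal local tool-calling sketch against an OpenAI-compatible local server.
# The base_url, model tag, and tool definition below are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_time",
        "description": "Return the current local time as an ISO-8601 string.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5:9b",  # placeholder tag; use whatever your local server reports
    messages=[{"role": "user", "content": "What time is it?"}],
    tools=tools,
)

# Assumes the model chose to call the tool; a robust client would check first.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments or "{}"))
```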
A widely discussed HN thread argues that the viral '$5,000 per Claude Code user' number likely reflects retail API-equivalent usage rather than Anthropic's actual serving cost.
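The thread's core point is arithmetic: heavy usage priced at retail API rates is not the same as Anthropic's marginal cost to serve it. A back-of-the-envelope sketch with entirely hypothetical numbers (neither the token volume nor the cost ratio comes from the thread or from Anthropic):

```python
# Back-of-the-envelope: retail API-equivalent spend vs. hypothetical serving cost.
# Every number below is an illustrative assumption, not a reported figure.

tokens_per_month = 500_000_000      # hypothetical heavy Claude Code user
retail_price_per_mtok = 10.0        # hypothetical blended retail $/1M tokens
serving_cost_fraction = 0.25        # hypothetical: serving costs a quarter of retail

retail_equivalent = tokens_per_month / 1_000_000 * retail_price_per_mtok
estimated_serving_cost = retail_equivalent * serving_cost_fraction

print(f"retail API-equivalent: ${retail_equivalent:,.0f}/month")
print(f"hypothetical serving cost: ${estimated_serving_cost:,.0f}/month")
```

Under these made-up assumptions the same usage produces a headline retail-equivalent figure several times larger than the serving estimate, which is the gap the thread is arguing about.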
Google said on February 24, 2026 that it is rolling out a new agent step in Opal for all users. The feature lets Opal choose the right tools and models for a goal, adds Memory across sessions, and pushes the product from static workflow wiring toward more interactive, agentic behavior.
GitHub said on February 26, 2026 that Claude by Anthropic and OpenAI Codex are now available as coding agents for Copilot Business and Copilot Pro customers. The release brings multi-agent choice into github.com, GitHub Mobile, and VS Code without requiring an extra subscription.
GitHub said on March 5, 2026 that Copilot code review now runs on an agentic tool-calling architecture and is generally available for Copilot Pro, Pro+, Business, and Enterprise. The update is designed to pull wider repository context into reviews so comments are higher-signal and less noisy.
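The "agentic tool-calling" framing is easiest to picture as a review loop in which the model can request more repository context before it comments. A minimal sketch using the OpenAI Python client and two hypothetical tools (read_file, list_directory); this illustrates the pattern, not GitHub's implementation:

```python
# Sketch of an agentic code-review loop: the model may call repository tools
# to pull wider context before producing review comments.
# Tools, prompts, and model name are illustrative, not GitHub's internals.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()
REPO_ROOT = Path(".")

TOOLS = [
    {"type": "function", "function": {
        "name": "read_file",
        "description": "Read a file from the repository.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "list_directory",
        "description": "List entries in a repository directory.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
]

def call_tool(name: str, args: dict) -> str:
    path = (REPO_ROOT / args["path"]).resolve()
    if name == "read_file":
        return path.read_text()[:8000]          # cap context per file
    return "\n".join(p.name for p in path.iterdir())

def review(diff: str, model: str = "gpt-4.1") -> str:
    messages = [
        {"role": "system", "content": "Review this diff. Fetch extra repository context with tools before commenting."},
        {"role": "user", "content": diff},
    ]
    while True:
        resp = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content                   # final review comments
        messages.append(msg)
        for tc in msg.tool_calls:
            result = call_tool(tc.function.name, json.loads(tc.function.arguments))
            messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
```

The design point is that context gathering is driven by the model's own tool calls rather than a fixed diff-only prompt, which is what lets comments draw on files outside the change.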
A popular r/LocalLLaMA thread points to karpathy/autoresearch, a small open-source setup where an agent edits one training file, runs 5-minute experiments, and iterates toward lower validation bits per byte.
Hacker News highlighted SWE-CI, an arXiv benchmark that evaluates whether LLM agents can sustain repository quality across CI-driven iterations, not just land a single passing patch.
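As a rough mental model of "sustaining repository quality across CI-driven iterations," a hypothetical harness that re-runs the test suite after every agent patch and fails if any iteration regresses; this sketches the idea, not SWE-CI's actual protocol:

```python
# Hypothetical harness: CI must stay green across every iteration of agent
# changes, not just the final one. Illustrates the idea, not SWE-CI itself.
import subprocess

def ci_passes() -> bool:
    """Run the project's test suite; a zero exit status means green CI."""
    return subprocess.run(["pytest", "-q"], capture_output=True).returncode == 0

def evaluate_agent(apply_patch, n_iterations: int = 5) -> bool:
    """apply_patch(i) applies the agent's i-th change to the working tree."""
    for i in range(n_iterations):
        apply_patch(i)
        if not ci_passes():
            print(f"iteration {i}: CI red, sustained-quality check failed")
            return False
        print(f"iteration {i}: CI green")
    return True
```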
Anthropic said on X that Claude Opus 4.6 showed cases of benchmark recognition during BrowseComp evaluation. The engineering write-up turns that into a broader warning about eval integrity in web-enabled model testing.
Shared in LocalLLaMA, autoresearch is a minimal framework where an agent edits PyTorch training code, runs fixed five-minute experiments, and keeps changes that improve validation bits-per-byte.
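The core loop is an accept-if-better search over a single training script. A minimal sketch, assuming a hypothetical train.py that enforces its own five-minute budget and prints a final "val_bpb=<float>" line, plus a stand-in for the agent's edit step; none of this is the repository's actual code:

```python
# Sketch of an autoresearch-style loop: edit the training file, run a short
# fixed-budget experiment, keep the change only if validation bpb improves.
# train.py, its output format, and propose_edit are illustrative assumptions.
import shutil
import subprocess

TRAIN_FILE = "train.py"   # assumed to stop itself after a five-minute budget


def train_and_eval() -> float:
    """Run the training script and parse validation bits-per-byte from its
    last stdout line (assumed format: 'val_bpb=<float>')."""
    out = subprocess.run(["python", TRAIN_FILE], capture_output=True, text=True)
    return float(out.stdout.strip().splitlines()[-1].split("=")[1])


def propose_edit() -> None:
    """Stand-in for the agent rewriting TRAIN_FILE (an LLM edit in the real setup)."""
    with open(TRAIN_FILE, "a") as f:
        f.write("\n# agent edit placeholder\n")


best_bpb = train_and_eval()
for step in range(20):
    shutil.copy(TRAIN_FILE, TRAIN_FILE + ".bak")
    propose_edit()
    bpb = train_and_eval()
    if bpb < best_bpb:                                  # lower bpb is better
        best_bpb = bpb                                  # keep the edit
    else:
        shutil.copy(TRAIN_FILE + ".bak", TRAIN_FILE)    # revert the edit
    print(f"step {step}: val bpb {bpb:.4f} (best {best_bpb:.4f})")
```

The fixed short budget is what makes the search practical: every proposed edit gets the same cheap evaluation, and only improvements survive.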
Agent Safehouse is an open-source macOS hardening layer that uses sandbox-exec to confine local coding agents to explicitly approved paths instead of inheriting a developer account’s full access.
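The sandbox-exec mechanism is worth seeing concretely. A minimal sketch that launches a process under a deny-by-default profile with one approved project directory; the profile and the command are illustrative (a working policy needs more allowances), not Agent Safehouse's actual rules:

```python
# Sketch: confine a local coding agent with macOS sandbox-exec using a
# deny-by-default profile that only approves one project directory.
# The profile below is illustrative, not Agent Safehouse's actual policy,
# and a real agent will need additional allowances to run at all.
import subprocess

PROJECT = "/Users/dev/projects/my-repo"   # the one explicitly approved path

PROFILE = f"""
(version 1)
(deny default)
(allow process-exec)
(allow process-fork)
(allow file-read* (subpath "/usr/lib") (subpath "/System"))
(allow file-read* file-write* (subpath "{PROJECT}"))
"""

# Hypothetical agent command; substitute whatever agent binary you run locally.
subprocess.run(["sandbox-exec", "-p", PROFILE, "echo", "agent would run here"])
```

The key property is that the profile denies everything by default, so the agent's reach is whatever the allow rules grant rather than the developer account's full access.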
OpenAI released proof attempts for all 10 First Proof problems and said expert feedback suggests at least five may be correct. The company positioned the result as a test of long-horizon reasoning beyond standard benchmarks.