GitHub said on February 26, 2026 that Claude by Anthropic and OpenAI Codex are now available as coding agents for Copilot Business and Copilot Pro customers. The release brings multi-agent choice into github.com, GitHub Mobile, and VS Code without requiring an extra subscription.
GitHub said on March 5, 2026 that Copilot code review now runs on an agentic tool-calling architecture and is generally available for Copilot Pro, Pro+, Business, and Enterprise. The update is designed to pull wider repository context into reviews so comments are higher-signal and less noisy.
A popular r/LocalLLaMA thread points to karpathy/autoresearch, a small open-source setup where an agent edits one training file, runs 5-minute experiments, and iterates toward lower validation bits per byte.
Hacker News highlighted SWE-CI, an arXiv benchmark that evaluates whether LLM agents can sustain repository quality across CI-driven iterations, not just land a single passing patch.
Anthropic said on X that Claude Opus 4.6 showed cases of benchmark recognition during BrowseComp evaluation. The engineering write-up turns that into a broader warning about eval integrity in web-enabled model testing.
Shared in LocalLLaMA, autoresearch is a minimal framework where an agent edits PyTorch training code, runs fixed five-minute experiments, and keeps changes that improve validation bits-per-byte.
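The keep-if-better rule described for autoresearch can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's code: `keep_if_better`, `toy_evaluate`, and `make_toy_propose` are hypothetical names, the toy objective stands in for a real fixed-budget training run, and the bits-per-byte conversion assumes the validation loss is a summed negative log-likelihood in nats.

```python
import math
import random
from typing import Callable, Dict, Tuple

Config = Dict[str, float]


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def keep_if_better(
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    initial: Config,
    rounds: int,
) -> Tuple[Config, float]:
    """Greedy loop: accept a candidate only if its score is strictly lower."""
    best, best_score = initial, evaluate(initial)
    for _ in range(rounds):
        candidate = propose(best)    # stand-in for the agent editing the training file
        score = evaluate(candidate)  # stand-in for one fixed-length experiment
        if score < best_score:       # lower validation bits per byte is better
            best, best_score = candidate, score
    return best, best_score


def toy_evaluate(config: Config) -> float:
    """Hypothetical objective: pretend the score is minimized at lr = 0.01."""
    return 1.0 + (config["lr"] - 0.01) ** 2


def make_toy_propose(seed: int) -> Callable[[Config], Config]:
    """Hypothetical proposer: randomly rescale the learning rate."""
    rng = random.Random(seed)

    def propose(config: Config) -> Config:
        return {"lr": config["lr"] * rng.uniform(0.5, 1.5)}

    return propose


start = {"lr": 0.1}
best, best_score = keep_if_better(make_toy_propose(0), toy_evaluate, start, rounds=20)
print(f"start score {toy_evaluate(start):.6f}, best score {best_score:.6f}")
```

Because a candidate is kept only when it scores strictly lower, the best score can never get worse from one round to the next. That property is what makes unattended runs safe to leave going, though a greedy rule like this can also stall in a local optimum.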
Agent Safehouse is an open-source macOS hardening layer that uses sandbox-exec to confine local coding agents to explicitly approved paths instead of inheriting a developer account’s full access.
OpenAI released proof attempts for all 10 First Proof problems and said expert feedback suggests at least five may be correct. The company positioned the result as a test of long-horizon reasoning beyond standard benchmarks.
Azure says Phi-4-Reasoning-Vision-15B is now available in Microsoft Foundry. Microsoft positions the 15B model as a compact multimodal system that can switch reasoning on or off for document analysis, chart understanding, and GUI-grounded agent workflows.
Andrej Karpathy has published autoresearch, a minimal repo that lets AI agents iterate on a stripped-down nanochat training loop overnight. The project turns agent evaluation into a closed-loop research workflow with fixed 5-minute runs, Git branches, and validation-loss-based selection.
OpenAI says GPT-5.4 Thinking is shipping in ChatGPT, with GPT-5.4 also live in the API and Codex and GPT-5.4 Pro available for harder tasks. The launch packages reasoning, coding, and native computer use into a single professional-work model with up to 1M tokens of context.
A high-ranking Hacker News thread highlighted an argument that coding agents can remove the biggest cost of literate programming: keeping prose and code in sync. The post points to Org Mode-style runbooks and executable documentation as a more practical fit for AI-assisted software work.