OpenAI is pushing harder into agentic work, not just chat. On the company's own evals, GPT-5.5 reaches 82.7% on Terminal-Bench 2.0, beats GPT-5.4 by 7.6 points, and uses fewer tokens in Codex.
#agentic-coding
RSS FeedOpenAI is pitching GPT-5.5 as more than a routine model refresh. With 82.7% on Terminal-Bench 2.0, 58.6% on SWE-Bench Pro, and a claim that it keeps GPT-5.4-level latency, the company is resetting expectations for long-running coding agents.
HN treated GPT-5.5 less like another model launch and more like a test of whether AI can actually carry messy computer tasks to completion. The discussion kept drifting from benchmarks to rollout timing, API access, and whether the gains show up in real coding work.
A large Hacker News thread turned a Claude Code quota complaint into a deeper argument about how prompt caching, background sessions, and auto-compacts behave inside 1M-context agent workflows. The GitHub issue author published April 9, 2026 usage logs, and the discussion quickly shifted from “limits feel worse” to cache accounting and quota transparency.
Hacker News picked up Z.ai's GLM-5.1 as a model aimed less at one-shot wins and more at sustained agentic work. Z.ai reports 58.4 on SWE-Bench Pro, 42.7 on NL2Repo, 66.5 on Terminal Bench 2.0, and long-horizon runs that keep improving through hundreds of iterations and thousands of tool calls.
Cursor has published a technical report for Composer 2, outlining a two-stage recipe of continued pretraining and large-scale reinforcement learning for agentic software engineering. The company says the model reaches 61.3 on CursorBench, 61.7 on Terminal-Bench, and 73.7 on SWE-bench Multilingual while keeping pricing at $0.50/M input and $2.50/M output tokens.
In a March 29, 2026 X post, OpenAI Developers introduced Codex Security, a research preview aimed at identifying, validating, and remediating software vulnerabilities. The launch extends AI coding assistance into application security workflows.
Anthropic said on March 30, 2026 that computer use is now available in Claude Code in research preview for Pro and Max plans. Claude Code docs say the feature lets Claude open apps, click through UI flows, and see the screen on macOS from the CLI, targeting native app testing, visual debugging, and other GUI-only tasks.
OpenAI said on March 6, 2026 that Codex Security is entering research preview for ChatGPT Pro, Enterprise, Business, and Edu users in Codex web. The company says the application-security agent uses project-specific threat models, contextual validation, and patch proposals, and in beta scanned more than 1.2 million commits.
Google says Gemini CLI now includes a read-only Plan mode that analyzes requests, codebases, and dependencies before any edits happen. The update also adds an ask_user tool and read-only MCP access so teams can clarify requirements and pull in outside context without risking accidental changes.
OpenAIDevs said on March 16, 2026 that subagents are now available in Codex. The feature lets developers keep the main context clean, split work across specialized agents, and steer individual threads as they run, while the official docs already describe PR review and CSV batch fan-out patterns.
GitHub used X on March 15, 2026 to spotlight the Copilot CLI `/fleet` command for routine maintenance work. GitHub’s official Copilot CLI materials now describe `/fleet` as a parallel sub-agent workflow that converges multiple runs into one decision-ready result.