Opus 4.8 beats GPT-5.5 by 121 points on GDPval-AA agent benchmark
Claude Opus 4.8’s early benchmark story is moving beyond generic launch claims into a concrete agent-work comparison. Artificial Analysis says Opus 4.8 scored 1890 on GDPval-AA at launch with the max effort setting, putting it 121 points ahead of the next-best model, GPT-5.5 xhigh. The account also translated that head-to-head comparison into an estimated 67% win rate on the GDPval task set.
The tweet’s core benchmark claim was “1890 on GDPval-AA” and “+121 points ahead.”
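The 121-point gap and the 67% win rate are consistent with each other if GDPval-AA scores sit on an Elo-style logistic scale. That is an assumption here, since the source does not state the formula Artificial Analysis used. A minimal sketch of the arithmetic, with the conventional 400-point scale factor as an assumed parameter:

```python
def expected_win_rate(score_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-scored model.

    Assumes an Elo-style logistic curve; the 400-point scale is the
    conventional Elo default, not a value confirmed by the source.
    """
    return 1.0 / (1.0 + 10.0 ** (-score_gap / scale))


if __name__ == "__main__":
    gap = 121  # point lead reported for Opus 4.8 over GPT-5.5 xhigh
    print(f"Expected win rate at +{gap}: {expected_win_rate(gap):.0%}")
```

Under that assumption a 121-point lead rounds to 67%, which matches the figure in the tweet. A different scale factor would give a different percentage for the same gap.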
The source tweet says Anthropic gave Artificial Analysis early access before public release, and that the broader Artificial Analysis Intelligence Index is still being completed. The account is known for standardized LLM comparisons across quality, speed, and price. Its post matters because it frames Opus 4.8 against GPT-5.5 rather than only against earlier Claude models.
The result lines up with Anthropic’s own Opus 4.8 release notes. Anthropic says the model improves judgment, flags uncertainty more often, and is roughly four times less likely than Opus 4.7 to let flaws in its own code pass without comment. The release also adds dynamic workflows in Claude Code, effort control in claude.ai, and a Messages API change that lets developers update system entries inside the messages array during a task.
GDPval-AA is notable because it targets real-world agentic work rather than narrow coding patches. That makes the score useful for teams evaluating research, analysis, and long-running tool workflows. The next thing to watch is whether the full Index shows the same advantage once latency, token use, and failure modes are included. For production teams, the practical test is still local: run Opus 4.8, GPT-5.5, and Opus 4.7 on the same internal tasks with the same budget before changing defaults.
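One way to structure that local test is a small harness that feeds every model the same task list and the same token budget, then reports pass rate and token use side by side. The sketch below is illustrative only: `RunResult`, `Runner`, `compare_models`, and the stub runners are hypothetical names, and a real runner would wrap whatever client and grading logic a team already uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunResult:
    passed: bool       # did the output meet the team's own acceptance check
    tokens_used: int   # total tokens consumed by the run


# Hypothetical runner signature: (task, token_budget) -> RunResult
Runner = Callable[[str, int], RunResult]


def compare_models(
    runners: Dict[str, Runner], tasks: List[str], token_budget: int
) -> Dict[str, Dict[str, float]]:
    """Run every model on the same tasks under the same token budget."""
    if not tasks:
        raise ValueError("tasks must not be empty")
    summary: Dict[str, Dict[str, float]] = {}
    for name, run in runners.items():
        results = [run(task, token_budget) for task in tasks]
        # A run that exceeds the shared budget counts as a failure.
        passed = sum(
            1 for r in results if r.passed and r.tokens_used <= token_budget
        )
        summary[name] = {
            "pass_rate": passed / len(tasks),
            "mean_tokens": sum(r.tokens_used for r in results) / len(tasks),
        }
    return summary


if __name__ == "__main__":
    # Stub runners stand in for real model calls; their behavior is made up.
    def candidate(task: str, budget: int) -> RunResult:
        return RunResult(passed=True, tokens_used=budget // 2)

    def baseline(task: str, budget: int) -> RunResult:
        return RunResult(passed=len(task) % 2 == 0, tokens_used=budget // 4)

    tasks = ["summarize report", "triage tickets", "draft memo"]
    runners = {"candidate": candidate, "baseline": baseline}
    for model, stats in compare_models(runners, tasks, 10_000).items():
        print(model, stats)
```

Counting over-budget runs as failures keeps the comparison honest when one model only succeeds by spending more. Teams that care about latency or specific failure modes could add those fields to `RunResult` in the same way.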