Opus 4.8 beats GPT-5.5 by 121 points on GDPval-AA agent benchmark

LLM · May 29, 2026 · By Insights AI (Twitter) · 1 min read
Claude Opus 4.8’s early benchmark story is moving beyond generic launch claims into a concrete agent-work comparison. Artificial Analysis says Opus 4.8 scored 1890 on GDPval-AA at launch with the max effort setting, putting it 121 points ahead of the next-best model, GPT-5.5 xhigh. The account also translated that gap into an estimated 67% head-to-head win rate on the GDPval task set.
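The 67% figure is consistent with reading the 121-point gap on an Elo-style scale. The sketch below assumes GDPval-AA scores follow the standard Elo logistic curve with a 400-point scale; the source does not state how the conversion was done, so the function name and the `scale` parameter are illustrative only.

```python
def elo_win_probability(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated side.

    Assumes the standard Elo logistic curve; `scale` = 400 is the
    conventional value, not something the source confirms for GDPval-AA.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# 121 is the reported gap between Opus 4.8 (max) and GPT-5.5 xhigh.
win_rate = elo_win_probability(121)
print(f"{win_rate:.1%}")  # prints 66.7%
```

Under that assumption, a 121-point lead rounds to the 67% win rate quoted in the tweet.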

The tweet’s core benchmark claim was “1890 on GDPval-AA” and “+121 points ahead.”

The source tweet says Anthropic gave Artificial Analysis early access before public release, and that the broader Artificial Analysis Intelligence Index is still being completed. The account is known for standardized LLM comparisons across quality, speed, and price, so its post matters because it frames Opus 4.8 against GPT-5.5 rather than only against earlier Claude models.

The result lines up with Anthropic’s own Opus 4.8 release notes. Anthropic says the model improves judgment, flags uncertainty more often, and is roughly four times less likely than Opus 4.7 to let flaws in its own code pass without comment. The release also adds dynamic workflows in Claude Code, effort control in claude.ai, and a Messages API change that lets developers update system entries inside the messages array during a task.

GDPval-AA is notable because it targets real-world agentic work rather than narrow code-patching tasks. That makes the score useful for teams evaluating research, analysis, and long-running tool workflows. The next thing to watch is whether the full Index shows the same advantage after latency, token use, and failure modes are included. For production teams, the practical test is still local: run Opus 4.8, GPT-5.5, and Opus 4.7 on the same internal tasks with the same budget before changing defaults.
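A local comparison of that kind can be sketched as a small harness that runs every model on the same tasks under the same token budget. Everything here is hypothetical: the `Runner` signature, the stub runners, the `EXPECTED` answers, and the exact-match grader are placeholders to be replaced by real vendor clients and an internal grading rubric. The stub scores are not benchmark results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Result:
    output: str
    tokens_used: int


# A runner takes (task prompt, token budget) and returns a Result.
# Real runners would wrap each vendor's client; these are hypothetical.
Runner = Callable[[str, int], Result]


def compare(
    runners: Dict[str, Runner],
    tasks: List[str],
    grade: Callable[[str, str], float],
    token_budget: int,
) -> Dict[str, float]:
    """Return the mean score per model over the same tasks and budget."""
    totals = {name: 0.0 for name in runners}
    for task in tasks:
        for name, run in runners.items():
            result = run(task, token_budget)
            # Over-budget runs add nothing, so overspending cannot win.
            if result.tokens_used <= token_budget:
                totals[name] += grade(task, result.output)
    return {name: total / len(tasks) for name, total in totals.items()}


def make_stub(answer: str, tokens: int) -> Runner:
    """Placeholder runner that always returns the same answer and cost."""
    def run(task: str, budget: int) -> Result:
        return Result(output=answer, tokens_used=tokens)
    return run


# Placeholder tasks and expected answers (assumed, for illustration).
EXPECTED = {"task-1": "42", "task-2": "42"}


def exact_match(task: str, output: str) -> float:
    return 1.0 if output == EXPECTED[task] else 0.0


runners = {
    "model_a": make_stub("42", 800),   # within budget
    "model_b": make_stub("42", 5000),  # correct but over budget
}
scores = compare(runners, list(EXPECTED), exact_match, token_budget=1000)
print(scores)  # prints {'model_a': 1.0, 'model_b': 0.0}
```

Scoring over-budget runs as zero is one design choice among several; a team could instead truncate the run or report cost alongside quality.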
