Opus 4.8 reaches ARC-AGI-3 SOTA with 1.5% score and ~$10K run
On ARC-AGI-3, even a state-of-the-art result still arrives as a tiny percentage. ARC Prize posted on June 1, 2026 that Anthropic's Opus 4.8 is the new leader on the benchmark, and it reported the cost of the run alongside the score. The key line was “Score: 1.5%, ~$10K.” That matters because ARC-style tasks are designed to punish shallow pattern matching and reward adaptation to unfamiliar rules.
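As a rough illustration of why the cost is part of the signal, the two reported figures can be combined into a dollars-per-percentage-point number. This is a back-of-the-envelope sketch, not an official ARC Prize metric: the helper name `cost_per_point` is hypothetical, and the cost is only approximate because the post gives it as "~$10K".

```python
def cost_per_point(score_percent: float, cost_usd: float) -> float:
    """Dollars spent per percentage point of benchmark score.

    Hypothetical helper for illustration; not an official ARC Prize metric.
    """
    if score_percent <= 0:
        raise ValueError("score_percent must be positive")
    return cost_usd / score_percent


# Figures as reported in the ARC Prize post; the cost is approximate ("~$10K").
reported_score_percent = 1.5
reported_cost_usd = 10_000

usd_per_point = cost_per_point(reported_score_percent, reported_cost_usd)
print(f"~${usd_per_point:,.0f} per percentage point")  # ~$6,667 per percentage point
```

A single ratio like this says nothing about how cost scales with score, so it is best read as a way to compare runs at similar score levels rather than as a forecast.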
“objects & systems, not pictures”
ARC Prize is the official account for the benchmark effort co-founded by François Chollet and Mike Knoop, so its posts are usually closer to primary benchmark notes than general commentary. The tweet’s analysis says Opus 4.8 read the environment one abstraction level above Opus 4.7, treating it as objects and systems rather than pictures. It also says the model solved early levels but still committed to a wrong sub-goal, which is the sort of failure that matters in agentic settings: a model can look more capable while spending a large budget pursuing the wrong plan.
The result is useful precisely because it is not a clean victory lap. A 1.5% score underlines how far current systems remain from robust generalization on ARC-AGI-3. At the same time, the jump in abstraction hints at a real capability change. Many mainstream benchmarks measure coding, math, or knowledge recall. ARC-AGI-style tasks press on the ability to infer a rule from sparse examples, build a representation of the environment, and adapt without being trained directly on that distribution.
What to watch next is whether the result is reproducible under cost-normalized settings. If a model needs a ~$10K run to reach 1.5%, the ranking is partly about search and inference budget, not just model weights. The next round of submissions should clarify whether Opus 4.8 has a durable reasoning edge or whether other frontier models can match it once their scaffolds and budgets are tuned.

Source: ARC Prize on X