
Opus 4.8 reaches ARC-AGI-3 SOTA with 1.5% score and ~$10K run


LLM · Jun 3, 2026 · By Insights AI (Twitter) · 2 min read

On ARC-AGI-3, even a state-of-the-art result is still a tiny percentage. ARC Prize posted on June 1, 2026 that Anthropic's Opus 4.8 is the new leader on the benchmark, and it reported the cost of the run alongside the score. The key number was “Score: 1.5%, ~$10K.” That matters because ARC-style tasks are designed to punish shallow pattern matching and reward adaptation to unfamiliar rules.

“objects & systems, not pictures”

ARC Prize is the official account for the benchmark effort co-founded by François Chollet and Mike Knoop, so its posts are usually closer to primary benchmark notes than general commentary. The tweet’s analysis says Opus 4.8 read the environment one abstraction level above Opus 4.7, treating it as objects and systems rather than pictures. It also says the model solved early levels but still committed to a wrong sub-goal, which is the sort of failure that matters in agentic settings: a model can look more capable while spending a large budget pursuing the wrong plan.

The result is useful precisely because it is not a clean victory lap. A 1.5% score underlines how far current systems remain from robust generalization on ARC-AGI-3. At the same time, the jump in abstraction hints at a real capability change. Many mainstream benchmarks measure coding, math, or knowledge recall. ARC-AGI-style tasks press on the ability to infer a rule from sparse examples, build a representation of the environment, and adapt without being trained directly on that distribution.

What to watch next is whether the result is reproducible under cost-normalized settings. If a model needs a ~$10K run to reach 1.5%, the ranking is partly about search and inference budget, not just model weights. The next round of submissions should clarify whether Opus 4.8 has a durable reasoning edge or whether other frontier models can match it once their scaffolds and budgets are tuned. Source: ARC Prize on X
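To make the cost-normalized comparison concrete, here is a minimal sketch that divides run cost by score. The helper name `cost_per_point` and the dollars-per-percentage-point metric are illustrative assumptions, not a metric ARC Prize publishes; the only inputs taken from the post are the approximate figures 1.5% and ~$10K.

```python
def cost_per_point(cost_usd: float, score_pct: float) -> float:
    """Dollars spent per percentage point of benchmark score.

    Hypothetical helper for illustration; not an official ARC Prize metric.
    """
    if score_pct <= 0:
        raise ValueError("score_pct must be positive")
    return cost_usd / score_pct


# Approximate figures from the ARC Prize post: ~$10K run, 1.5% score.
opus_48 = cost_per_point(10_000, 1.5)
print(f"${opus_48:,.0f} per percentage point")  # -> $6,667 per percentage point
```

A ratio like this only supports a ranking when competing submissions report cost measured the same way, which is why reproducibility under comparable budgets is the open question.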
