Opus 4.8 reaches ARC-AGI-3 SOTA with 1.5% score and ~$10K run

On ARC-AGI-3, even a state-of-the-art result is still a tiny percentage. ARC Prize posted on June 1, 2026 that Anthropic's Opus 4.8 is the new leader on the benchmark, and it reported the cost of the run alongside the score: “Score: 1.5%, ~$10K.” The low score matters because ARC-style tasks are designed to punish shallow pattern matching and reward adaptation to unfamiliar rules.

“objects & systems, not pictures”

ARC Prize is the official account for the benchmark effort co-founded by François Chollet and Mike Knoop, so its posts are usually closer to primary benchmark notes than general commentary. The tweet’s analysis says Opus 4.8 read the environment one abstraction level above Opus 4.7, treating it as objects and systems rather than pictures. It also says the model solved early levels but still committed to a wrong sub-goal, which is the sort of failure that matters in agentic settings: a model can look more capable while spending a large budget pursuing the wrong plan.

The result is useful precisely because it is not a clean victory lap. A 1.5% score underlines how far current systems remain from robust generalization on ARC-AGI-3. At the same time, the jump in abstraction hints at a real capability change. Many mainstream benchmarks measure coding, math, or knowledge recall. ARC-AGI-style tasks press on the ability to infer a rule from sparse examples, build a representation of the environment, and adapt without being trained directly on that distribution.

What to watch next is whether the result is reproducible under cost-normalized settings. If a model needs a ~$10K run to reach 1.5%, the ranking is partly about search and inference budget, not just model weights. The next round of submissions should clarify whether Opus 4.8 has a durable reasoning edge or whether other frontier models can match it once their scaffolds and budgets are tuned.

Source: ARC Prize on X

Related Articles

Claude value profiles diverge across 300K chats, models and languages

Claude Opus 4.6 Hits 14.5-Hour Mark on METR's Software Task Benchmark

OpenAI: High-Difficulty ChatGPT Reasoning Interactions Rose 4x in 16 Months
