GPT-5.5 and Claude Opus 4.7 Both Score Under 1% on ARC-AGI-3
Original: ARC-AGI-3 Update (GPT-5.5 High and Opus 4.7)
ARC-AGI-3 Is No Joke
Even the most capable frontier models barely register on ARC-AGI-3. The latest community-shared results put GPT-5.5 High at 0.43% and Claude Opus 4.7 at 0.18% — effectively near-zero performance on a benchmark that most humans handle with ease.
What Is ARC-AGI-3?
The Abstraction and Reasoning Corpus (ARC-AGI) tests abstract pattern recognition and reasoning: tasks that are trivial for humans yet remain very hard for LLMs. Version 3 is significantly harder than its predecessors, shifting from the static grid puzzles of earlier versions toward interactive, game-like environments. As one r/singularity commenter put it: "If AI can't play games a 3-year-old could play, something is wrong with current models."
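For readers unfamiliar with the format, earlier ARC-AGI versions present each task as a handful of input/output grid pairs, and a prediction only counts if the output grid matches exactly. The sketch below is a toy illustration of that setup, not real benchmark data: the example task, the "mirror each row" rule, and the scoring function are all invented for demonstration, and ARC-AGI-3's interactive format works differently.

```python
# Illustrative sketch only: a toy task in the style of earlier ARC-AGI
# versions (train/test grid pairs scored by exact match). The grids,
# the hidden rule, and the scorer are hypothetical, not benchmark data.

from typing import Callable, List

Grid = List[List[int]]  # small grids of integer color codes

# A toy task: every training pair demonstrates the same hidden rule
# (here, mirror the grid left-to-right); the solver must infer it.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [
        {"input": [[0, 4], [5, 0]], "output": [[4, 0], [0, 5]]},
    ],
}

def candidate_solver(grid: Grid) -> Grid:
    """A hand-written guess at the hidden rule: flip each row."""
    return [list(reversed(row)) for row in grid]

def score(task: dict, solver: Callable[[Grid], Grid]) -> float:
    """Exact-match accuracy over the test pairs (all-or-nothing per grid)."""
    pairs = task["test"]
    correct = sum(solver(p["input"]) == p["output"] for p in pairs)
    return correct / len(pairs)

if __name__ == "__main__":
    print(f"toy task score: {score(task, candidate_solver):.0%}")  # 100% here
```

The all-or-nothing scoring is part of why frontier models land near zero: partially recognizing a pattern earns nothing unless the full output is reproduced.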
A Surprising Regression
Notably, Claude Opus 4.7 scored lower than Opus 4.6 on this benchmark, reigniting debate about whether newer models always improve across all dimensions. This suggests current training approaches may not be advancing genuine abstract reasoning, even as they improve on many other metrics.
The Road to 80%
The community is asking: how many months until a model cracks 80%? ARC-AGI-3 is increasingly being treated as a meaningful signal for true AGI progress — and the current results suggest that signal is still far from lighting up.