GPT-5.5 jumps 3 points clear on Artificial Analysis, but cost rises 20%
Original: Artificial Analysis said GPT-5.5 moved 3 points ahead on its Intelligence Index while raising benchmark cost by about 20%.
What the tweet revealed
Artificial Analysis summarized the first wave of GPT-5.5 testing with a blunt headline: OpenAI’s new model tops the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie with Anthropic and Google. That matters because the post is not a generic reaction thread. It says Artificial Analysis received pre-release access and evaluated all five effort levels: xhigh, high, medium, low, and non-reasoning.
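For readers unfamiliar with effort levels: below is a minimal, hypothetical sketch of how per-request effort selection typically looks with the OpenAI SDK's reasoning_effort parameter. The model name "gpt-5.5" and the "xhigh" value come from the tweet and are assumptions here, not confirmed API values; "non-reasoning" would presumably be a separate serving mode rather than an effort setting.

```python
# Hypothetical sketch of sweeping the effort levels AA says it tested.
# Assumptions (not confirmed by the tweet): "gpt-5.5" as the model id,
# and "xhigh" as an accepted reasoning_effort value.
from openai import OpenAI

client = OpenAI()

for effort in ["xhigh", "high", "medium", "low"]:
    response = client.chat.completions.create(
        model="gpt-5.5",          # assumed model identifier
        reasoning_effort=effort,  # assumed to accept all four levels
        messages=[{"role": "user", "content": "Outline the tradeoffs of higher reasoning effort."}],
    )
    print(effort, response.choices[0].message.content)
```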
The Artificial Analysis account typically publishes independent model evaluations, leaderboard movements, and cost breakdowns across frontier systems. This thread fits that pattern exactly. It tries to answer three questions at once: did GPT-5.5 move the quality frontier, on which workloads, and at what operational cost?
What the benchmark thread actually claimed
The post says GPT-5.5 xhigh leads five headline evaluations, including Terminal-Bench Hard, GDPval-AA, and APEX-Agents-AA. It also gives unusually concrete tradeoff numbers: per-token pricing doubled relative to GPT-5.4, to $5 per 1M input tokens and $30 per 1M output tokens, but a roughly 40% reduction in token use softened the blow, so the net cost of running the full Intelligence Index rose by about 20% rather than 100%.
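Those three figures are mutually consistent. A quick back-of-envelope check, using only the numbers in the thread:

```python
# Sanity check of the claimed net cost change: prices doubled (2x) while
# token usage to run the Intelligence Index fell by roughly 40%.
price_multiplier = 2.0       # GPT-5.4 -> GPT-5.5 per-token pricing
token_multiplier = 1 - 0.40  # ~40% fewer tokens consumed

net_cost_multiplier = price_multiplier * token_multiplier  # 2.0 * 0.6 = 1.2
print(f"Net cost change: {net_cost_multiplier - 1:+.0%}")  # -> +20%
```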
The thread adds more nuance than the “number one” framing suggests. GPT-5.5 xhigh is listed at 57% accuracy on AA-Omniscience, the highest on that benchmark, but with an 86% hallucination rate, compared with 36% for Claude Opus 4.7 max and 50% for Gemini 3.1 Pro Preview. Another line says GPT-5.5 medium matches Claude Opus 4.7 max on the Intelligence Index at roughly one quarter of the cost, around $1,200 versus $4,800.
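The tweet does not define how the hallucination rate is computed. A minimal sketch, assuming the common definition (the share of non-correct responses where the model answers anyway instead of declining), shows how a model can lead on accuracy yet hallucinate in most of the cases it gets wrong. The 6% decline figure below is back-solved for illustration, not a number from the thread.

```python
# Illustrative only. Assumed definition (not stated in the tweet):
# hallucination rate = wrong answers / (wrong answers + declined answers),
# i.e. how often the model guesses when it does not know the answer.
def hallucination_rate(correct: float, declined: float) -> float:
    """Share of non-correct responses where the model answered anyway."""
    wrong = 1.0 - correct - declined
    return wrong / (wrong + declined)

# GPT-5.5 xhigh: 57% correct. Under this definition, an 86% hallucination
# rate implies it declined on only ~6% of questions (0.37 / 0.43 ~= 0.86),
# so top accuracy and a high hallucination rate can coexist.
print(f"{hallucination_rate(correct=0.57, declined=0.06):.0%}")  # ~86%
```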
What to watch next
The important follow-on question is reproducibility. Artificial Analysis is a credible benchmarking shop, but this is still one organization’s early read. Watch for public benchmark replications, methodology write-ups for GDPval-AA and APEX-Agents-AA, and whether other labs see the same pattern: a clear lift in capability, but with a lingering hallucination tradeoff and only partial relief on cost.
Source: Artificial Analysis thread on X
Related Articles
IBM Research’s VAKRA moves agent evaluation from static Q&A into executable tool environments. With 8,000+ locally hosted APIs across 62 domains and 3-7 step reasoning chains, the benchmark finds a gap between surface tool use and reliable enterprise agents.
LocalLLaMA did not just vent about weaker models; the thread turned the feeling into questions about provider routing, quantization, peak-time behavior, and how to prove a silent downgrade. The evidence is not settled, but the anxiety is real.
The r/singularity thread did not just react to Opus 4.7 scoring 41.0% where Opus 4.6 scored 94.7%. The interesting part was the community trying to separate real capability loss from refusal behavior, routing, and benchmark interpretation.