GPT-5.5 jumps 3 points clear on Artificial Analysis, but cost rises 20%
Original: Artificial Analysis said GPT-5.5 moved 3 points ahead on its Intelligence Index while raising benchmark cost by about 20%.
What the tweet revealed
Artificial Analysis summarized the first wave of GPT-5.5 testing with a blunt headline: OpenAI’s new model tops the Artificial Analysis Intelligence Index by 3 points, breaking a three-way tie with Anthropic and Google. That matters because the post is not a generic reaction thread. It says Artificial Analysis received pre-release access and evaluated all five effort levels: xhigh, high, medium, low, and non-reasoning.
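For readers unfamiliar with effort levels: below is a minimal, hypothetical sketch of how per-request effort selection typically looks with the OpenAI SDK's reasoning_effort parameter. The model name "gpt-5.5" and the "xhigh" value come from the tweet and are assumptions here, not confirmed API values; "non-reasoning" would presumably be a separate serving mode rather than an effort setting.

```python
# Hypothetical sketch of sweeping the effort levels AA says it tested.
# Assumptions (not confirmed by the tweet): "gpt-5.5" as the model id,
# and "xhigh" as an accepted reasoning_effort value.
from openai import OpenAI

client = OpenAI()

for effort in ["xhigh", "high", "medium", "low"]:
    response = client.chat.completions.create(
        model="gpt-5.5",          # assumed model identifier
        reasoning_effort=effort,  # assumed to accept all four levels
        messages=[{"role": "user", "content": "Outline the tradeoffs of higher reasoning effort."}],
    )
    print(effort, response.choices[0].message.content)
```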
The Artificial Analysis account typically publishes independent model evaluations, leaderboard movements, and cost breakdowns across frontier systems. This thread fits that pattern exactly. It tries to answer three questions at once: did GPT-5.5 move the quality frontier, on which workloads, and at what operational cost?
What the benchmark thread actually claimed
The post says GPT-5.5 xhigh leads five headline evaluations, including Terminal-Bench Hard, GDPval-AA, and APEX-Agents-AA. It also gives unusually concrete tradeoff numbers: per-token pricing doubled relative to GPT-5.4, to $5 per 1M input tokens and $30 per 1M output tokens, but a roughly 40% reduction in token use softened the blow, so the net cost of running the full Intelligence Index rose by about 20% rather than 100%.
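Those three figures are mutually consistent. A quick back-of-envelope check, using only the numbers in the thread:

```python
# Sanity check of the claimed net cost change: prices doubled (2x) while
# token usage to run the Intelligence Index fell by roughly 40%.
price_multiplier = 2.0       # GPT-5.4 -> GPT-5.5 per-token pricing
token_multiplier = 1 - 0.40  # ~40% fewer tokens consumed

net_cost_multiplier = price_multiplier * token_multiplier  # 2.0 * 0.6 = 1.2
print(f"Net cost change: {net_cost_multiplier - 1:+.0%}")  # -> +20%
```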
The thread adds more nuance than the “number one” framing suggests. GPT-5.5 xhigh is listed at 57% accuracy on AA-Omniscience, the highest on that benchmark, but with an 86% hallucination rate, compared with 36% for Claude Opus 4.7 max and 50% for Gemini 3.1 Pro Preview. Another line says GPT-5.5 medium matches Claude Opus 4.7 max on the Intelligence Index at roughly one quarter of the cost, around $1,200 versus $4,800.
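The tweet does not define how the hallucination rate is computed. A minimal sketch, assuming the common definition (the share of non-correct responses where the model answers anyway instead of declining), shows how a model can lead on accuracy yet hallucinate in most of the cases it gets wrong. The 6% decline figure below is back-solved for illustration, not a number from the thread.

```python
# Illustrative only. Assumed definition (not stated in the tweet):
# hallucination rate = wrong answers / (wrong answers + declined answers),
# i.e. how often the model guesses when it does not know the answer.
def hallucination_rate(correct: float, declined: float) -> float:
    """Share of non-correct responses where the model answered anyway."""
    wrong = 1.0 - correct - declined
    return wrong / (wrong + declined)

# GPT-5.5 xhigh: 57% correct. Under this definition, an 86% hallucination
# rate implies it declined on only ~6% of questions (0.37 / 0.43 ~= 0.86),
# so top accuracy and a high hallucination rate can coexist.
print(f"{hallucination_rate(correct=0.57, declined=0.06):.0%}")  # ~86%
```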
What to watch next
The important follow-on question is reproducibility. Artificial Analysis is a credible benchmarking shop, but this is still one organization’s early read. Watch for public benchmark replications, methodology write-ups for GDPval-AA and APEX-Agents-AA, and whether other labs see the same pattern: a clear lift in capability, but with a lingering hallucination tradeoff and only partial relief on cost.
Source: Artificial Analysis thread on X
Related Articles
IBM Research’s VAKRA moves agent evaluation from static Q&A into executable tool environments. With 8,000+ locally hosted APIs across 62 domains and 3-7 step reasoning chains, the benchmark finds a gap between surface tool use and reliable enterprise agents.
LocalLLaMA did not just vent about weaker models; the thread turned the feeling into questions about provider routing, quantization, peak-time behavior, and how to prove a silent downgrade. The evidence is not settled, but the anxiety is real.
The r/singularity thread did not just react to Opus 4.7 scoring 41.0% where Opus 4.6 scored 94.7%. The interesting part was the community trying to separate real capability loss from refusal behavior, routing, and benchmark interpretation.