Arena puts GPT-5.5 at #2 in search and +50 in Code Arena
Arena.ai’s April 27 X post is one of the first broad external scorecards for GPT-5.5 after OpenAI shipped the model on April 23. That matters because launch threads usually tell you what a lab wants the model to be known for. Community evaluation tells you where it actually lands when people compare it against rivals across different tasks.
“Code Arena: #9, a strong +50pt jump over GPT-5.4 … Search Arena: #2 … Expert Arena: #5.”
The Arena account, formerly LMArena, regularly posts community-driven benchmark updates across text, search, vision, and coding. This thread is valuable precisely because it is not built around a single vanity metric. The breakdown is mixed but informative: GPT-5.5 ranks #6 in Document Arena, #7 in Text Arena, #3 in Math, #8 in Instruction Following, #5 in Vision, and #2 in Search. That profile suggests a model that improved broadly, not one that simply swept every leaderboard on arrival.
The coding result is the easiest number to misread. A #9 rank does not sound impressive on its own, but the thread says GPT-5.5 gained 50 points over GPT-5.4 in Code Arena, which measures agentic web-development tasks. In other words, the model appears meaningfully stronger than its predecessor even though it still trails the very top tier. The same thread also points to a #5 finish in Expert Arena, which matters more to users with hard professional prompts than to those judging casual chatbot feel.
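For a sense of scale on that +50: Arena leaderboards are fit with a Bradley-Terry model and reported on an Elo-style scale. Assuming the familiar Elo convention (base 10, divisor 400), a rating gap converts to an expected head-to-head win rate. The sketch below is back-of-the-envelope intuition under that assumption, not Arena's published methodology.

```python
# Illustrative only: converts an Elo-style rating gap into an expected
# pairwise win rate, assuming the common base-10 / 400-point convention.

def expected_win_rate(rating_gap: float) -> float:
    """Probability the higher-rated model wins a head-to-head vote."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

if __name__ == "__main__":
    gap = 50.0  # GPT-5.5 vs GPT-5.4 in Code Arena, per the thread
    print(f"+{gap:.0f} points -> {expected_win_rate(gap):.1%} expected win rate")
    # Prints roughly 57.1%: a real but modest head-to-head edge,
    # consistent with "meaningfully stronger, still not top tier."
```

Read that way, a 50-point jump means GPT-5.5 would be preferred over GPT-5.4 in roughly 57% of pairwise coding votes, which squares with the thread's framing of clear but not sweeping improvement.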
What to watch next is whether these placements hold once more samples arrive and whether higher-reasoning configurations move the coding rank upward. The current takeaway is not “GPT-5.5 won everything.” It is that OpenAI’s new model looks more balanced than the launch hype alone could prove, with especially clear movement in coding and search.