Cursor puts GPT-5.5 atop CursorBench at 72.8% and halves price
The headline from Cursor’s latest X post is not just model availability. It is that GPT-5.5 entered the product with both a concrete benchmark claim and a temporary price cut attached. Cursor says GPT-5.5 is now available in the editor, currently ranks first on CursorBench at 72.8%, and is being sold at 50% off through May 2. In a market where many coding-model updates arrive as vague “feels better” claims, that combination is unusually specific.
“It’s currently the top model on CursorBench at 72.8%.”
That sentence comes directly from Cursor’s source tweet. A matching forum thread added the pricing details and clarified the promotion window after users spotted inconsistent dates in the UI. According to Cursor staff, list pricing is $5.00 per million input tokens, $0.50 for cached input, and $30.00 for output; the temporary discount cuts those to $2.50, $0.25, and $15.00 respectively through the end of May 2. That matters because output-token cost is often what makes frontier coding models hard to use at scale.
The more interesting context is CursorBench itself. In Cursor’s March research post, “How we compare model quality in Cursor,” the company says CursorBench is built from real engineering sessions rather than public repository issues. It argues that the suite tracks actual developer outcomes better than public benchmarks, uses agentic grading, and now covers larger multi-file, tool-using tasks. Cursor also says the current CursorBench-3 task scope has roughly doubled from the initial version and creates more separation among frontier models than saturated public evals.
That does not make 72.8% a neutral industry crown. CursorBench is still an internal benchmark run by the company that sells the product. But it does make the number more relevant than a generic leaderboard screenshot, because the benchmark is explicitly trying to mirror the kinds of underspecified, multi-step tasks developers give coding agents every day. For product users, that is often the right question: not which model wins in the abstract, but which one gets more real work over the line inside the tool they already use.
The cursor_ai account usually mixes release notes, agent features, and evaluation methodology, and this post follows that pattern closely. What to watch next is whether independent usage reports match the 72.8% claim, whether GPT-5.5 keeps its lead as other coding agents update, and whether the economics still make sense after the discount ends on May 2. The primary sources are the tweet, Cursor’s forum post, and the CursorBench methodology note.
Related Articles
OpenAI is pushing harder into agentic work, not just chat. On the company's own evals, GPT-5.5 reaches 82.7% on Terminal-Bench 2.0, beats GPT-5.4 by 7.6 points, and uses fewer tokens in Codex.
Cursor said on March 26, 2026 that real-time reinforcement learning lets it ship improved Composer 2 checkpoints every five hours. Cursor’s March 27 technical report says the model combines continued pretraining on Kimi K2.5 with large-scale RL in realistic Cursor sessions, scores 61.3 on CursorBench, and runs on an asynchronous multi-region RL stack with large sandbox fleets.
LocalLLaMA cared about this eval post because it mixed leaderboard data with lived coding-agent pain: Opus 4.7 scored well, but the author says it felt worse in real use.