A harder coding benchmark widens the model spread

Coding-agent benchmarks are most useful when they expose real engineering failure modes, not just leaderboard polish. On May 26, 2026, Serena Ge, Datacurve’s co-founder and CEO, announced on X that the company had released DeepSWE, writing that “DeepSWE shows where they actually diverge.”

The benchmark itself is substantial: 113 original tasks drawn from 91 repositories across 5 programming languages. Datacurve says the tasks are written from scratch rather than adapted from existing pull requests or commits, which is meant to reduce contamination from model pretraining data. The benchmark also uses shallow clones, so agents cannot inspect repository history to recover hidden reference solutions.
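To make the shallow-clone point concrete, here is a minimal Python sketch of how a harness might prepare a task repository. It is an illustration under stated assumptions, not DeepSWE's actual tooling: the helper names, the default depth of 1, and any URL passed in are hypothetical, and the source does not say which clone options the benchmark uses. The only external pieces relied on are the standard `git clone --depth` and `git rev-list --count` commands.

```python
import subprocess
from typing import List


def shallow_clone_argv(repo_url: str, dest: str, depth: int = 1) -> List[str]:
    """Build the argv for a shallow clone (hypothetical helper).

    With --depth 1 only the tip commit is fetched, so earlier commits
    (and any solution that lives in them) are not present on disk.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return ["git", "clone", "--depth", str(depth), repo_url, dest]


def visible_commit_count(repo_dir: str) -> int:
    """Count the commits reachable from HEAD in an existing checkout."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--count", "HEAD"],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())
```

A harness could run the argv from `shallow_clone_argv` with `subprocess.run(..., check=True)` and then confirm that `visible_commit_count` reports a single commit before handing the checkout to an agent.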

The difficulty profile is the main story. The DeepSWE artifacts show prompts averaging 2,158 characters, less than half of SWE-bench Pro’s 4,614-character average, while the expected patches are much larger: 668.1 lines on average versus 120.3 lines for SWE-bench Pro. That patch-size ratio is the source of the 5.5x comparison circulating around the benchmark: agents get less instruction and must produce far more code. The evaluation also relies on hand-written behavioral verifiers rather than inherited tests alone, so the score is intended to track product behavior visible through public APIs.
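The arithmetic behind those comparisons can be checked in a few lines of Python. The four input figures come from the paragraph above; the variable names and the "patch lines per prompt character" figure are illustrative only and are not metrics published by DeepSWE.

```python
# Averages quoted in the text; names are illustrative.
DEEPSWE_PROMPT_CHARS = 2158
SWEBENCH_PRO_PROMPT_CHARS = 4614
DEEPSWE_PATCH_LINES = 668.1
SWEBENCH_PRO_PATCH_LINES = 120.3


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


# DeepSWE prompt length relative to SWE-bench Pro (below 0.5 means "less than half").
prompt_ratio = ratio(DEEPSWE_PROMPT_CHARS, SWEBENCH_PRO_PROMPT_CHARS)

# DeepSWE patch size relative to SWE-bench Pro (the "5.5x" figure).
patch_ratio = ratio(DEEPSWE_PATCH_LINES, SWEBENCH_PRO_PATCH_LINES)

# Illustrative only: expected patch lines per prompt character on each benchmark.
deepswe_density = ratio(DEEPSWE_PATCH_LINES, DEEPSWE_PROMPT_CHARS)
pro_density = ratio(SWEBENCH_PRO_PATCH_LINES, SWEBENCH_PRO_PROMPT_CHARS)

print(f"prompt ratio: {prompt_ratio:.2f}")  # 0.47
print(f"patch ratio:  {patch_ratio:.2f}")   # 5.55
print(f"density gap:  {ratio(deepswe_density, pro_density):.1f}x")
```

The patch ratio works out to about 5.55, so the circulating 5.5x figure is a truncation rather than a rounding of the underlying averages.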

The first leaderboard separates models more sharply than many public boards. GPT-5.5 leads at 70.0% pass@1, GPT-5.4 follows at 55.5%, and Claude Opus 4.7 lands at 54.2%. The cost and trajectory metrics add another layer: GPT-5.5’s median passing run costs about $5.76 and takes 75 steps, while Claude Opus 4.7’s median passing run costs about $15.95 and takes 191 steps.
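To put the leaderboard and cost figures side by side, the short Python sketch below computes the score gaps and compares the two median passing runs. The numbers are the ones quoted above; the class and field names are hypothetical. One caveat applies: dividing a median cost by a median step count gives only a rough per-step figure, not the true median cost per step, which would need per-run data that the text does not provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedianPassingRun:
    """Median passing-run figures for one model (hypothetical container)."""
    model: str
    cost_usd: float
    steps: int

    @property
    def rough_cost_per_step(self) -> float:
        # Ratio of two medians: a rough indicator, not a true per-step median.
        return self.cost_usd / self.steps


# pass@1 scores in percent, as quoted in the text.
PASS_AT_1 = {"GPT-5.5": 70.0, "GPT-5.4": 55.5, "Claude Opus 4.7": 54.2}

gpt = MedianPassingRun("GPT-5.5", cost_usd=5.76, steps=75)
opus = MedianPassingRun("Claude Opus 4.7", cost_usd=15.95, steps=191)

# Gaps in percentage points between adjacent leaderboard positions.
gap_first_second = round(PASS_AT_1["GPT-5.5"] - PASS_AT_1["GPT-5.4"], 1)
gap_second_third = round(PASS_AT_1["GPT-5.4"] - PASS_AT_1["Claude Opus 4.7"], 1)

# How much more the Claude Opus 4.7 median passing run costs and takes.
cost_ratio = opus.cost_usd / gpt.cost_usd
step_ratio = opus.steps / gpt.steps

print(f"gaps: {gap_first_second} and {gap_second_third} points")
print(f"cost ratio: {cost_ratio:.2f}x, step ratio: {step_ratio:.2f}x")
for run in (gpt, opus):
    print(f"{run.model}: ~${run.rough_cost_per_step:.4f} per step")
```

On these figures the leader sits 14.5 points ahead of second place, while second and third are 1.3 points apart. The Claude Opus 4.7 median passing run costs about 2.8 times as much and takes about 2.5 times as many steps, so the rough per-step cost is similar and most of the cost gap tracks trajectory length.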

Ge’s account usually posts about Datacurve’s data work, benchmarks, and coding-agent evaluation, so this tweet is a primary source rather than a reaction thread. The next thing to watch is independent reproduction: whether outside researchers can validate the tasks, whether model providers challenge the methodology, and whether older SWE-bench-style boards respond by tightening contamination controls and verifier audits.

DeepSWE’s 113 tasks put GPT-5.5 at 70% and Claude Opus 4.7 at 54%