Hacker News Focuses on the Gap Between SWE-bench Passes and Mergeable Code

Why HN picked this up

The Hacker News thread traveled because it attacks a comfortable assumption: that passing SWE-bench Verified means a coding agent is close to writing code a repository will actually accept. METR's note argues that benchmark PRs can satisfy a test harness while still adding unnecessary abstraction, violating project conventions, or creating review burden that maintainers would never want to absorb.

METR says it asked 4 active maintainers from 3 repositories to review 296 AI-generated pull requests based on resolved SWE-bench Verified tasks. The headline result is a sharp gap: the maintainers' merge rate came in about 24.2 percentage points below the automated grader's pass rate, and the public description says roughly half of the test-passing PRs would still not be merged into main. The authors also suggest that progress on real mergeability may be slower than progress on raw benchmark scores.

What the discussion adds

HN commenters treated this as an operations problem, not just benchmark gossip. The repeated point was that tests mostly measure whether an agent produced some working patch, while real code review asks whether it solved only the intended problem, respected local style, stayed within scope, and left the codebase easier rather than harder to own. People called out scope creep, gratuitous layering, and weak sensitivity to repo-specific norms.

That is why this story matters for teams shipping coding agents. The lesson is not that SWE-bench is useless. The lesson is that benchmark results need a second evaluation layer: repo-specific review criteria, diff-size guardrails, and human sign-off on higher-blast-radius changes. The HN takeaway is simple: "tests passed" is becoming a floor, not a merge decision.
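
As a rough illustration of that second evaluation layer, here is a minimal Python sketch of a merge gate that treats a passing test run as the floor and then applies the three checks named above. Everything in it is an assumption made for illustration: the `PatchReport` fields, the 200-line threshold, and the path prefixes are hypothetical and come from neither METR's note nor the HN thread.

```python
from dataclasses import dataclass, field


@dataclass
class PatchReport:
    """Hypothetical summary of one agent-generated pull request."""
    tests_passed: bool
    lines_changed: int
    files_touched: list[str]
    # Repo-specific findings, e.g. from a linter or a maintainer checklist.
    convention_violations: list[str] = field(default_factory=list)


# Illustrative values only; a real team would tune these per repository.
MAX_LINES_CHANGED = 200
HIGH_BLAST_RADIUS_PREFIXES = ("migrations/", "auth/", "billing/")


def merge_gate(report: PatchReport) -> str:
    """Return 'reject', 'needs_human_review', or 'eligible'."""
    # "Tests passed" is the floor, not the merge decision.
    if not report.tests_passed:
        return "reject"
    # Repo-specific review criteria: any recorded violation blocks the merge.
    if report.convention_violations:
        return "reject"
    # Diff-size guardrail: oversized patches go to a person.
    if report.lines_changed > MAX_LINES_CHANGED:
        return "needs_human_review"
    # Human sign-off on higher-blast-radius paths.
    if any(p.startswith(HIGH_BLAST_RADIUS_PREFIXES) for p in report.files_touched):
        return "needs_human_review"
    return "eligible"


if __name__ == "__main__":
    small_fix = PatchReport(True, 12, ["src/parser.py"])
    risky_fix = PatchReport(True, 12, ["auth/session.py"])
    print(merge_gate(small_fix), merge_gate(risky_fix))
```

In this sketch "eligible" only means the patch cleared the automated layer; whether it then merges directly or still goes through ordinary code review is a team policy choice the gate does not make.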

Original note | Hacker News discussion

Related Articles

HN thinks the SWE-bench story is about contamination, not bragging rights

Clean code may not make coding agents pass more, but it makes them wander less

OpenAI says 30% of SWE-Bench Pro is broken and drops its recommendation