FrontierCode Asks Whether an AI Patch Would Actually Get Merged

Original: FrontierCode: An eval to measure whether you would actually merge the code

LLM · Jun 10, 2026 · By Insights AI (HN)

Cognition’s FrontierCode benchmark targets a question that many software teams already ask about AI-generated code: would this patch actually be merged? Instead of treating unit-test success as the finish line, the benchmark evaluates behavior, regression safety, test quality, scope discipline, style, and fit with the norms of the target codebase.

The benchmark was built with maintainers from 36 open-source repositories. Cognition says each task took more than 40 hours of expert effort and was manually reviewed by its researchers. FrontierCode has three subsets: Extended with 150 tasks, Main with the hardest 100, and Diamond with the hardest 50. On Diamond, Cognition reports that Claude Opus 4.8 scored 13.4%, GPT-5.5 scored 6.3%, and Gemini 3.1 Pro scored 4.7%.

The HN discussion focused on why this matters now. Coding agents can often produce patches that look plausible and pass a visible test suite, yet still fail a real review because they overreach, miss project conventions, add weak tests, or solve the wrong part of the problem. FrontierCode tries to encode maintainer judgment into the eval rather than relying only on test harnesses.

The original Cognition post argues that older coding benchmarks are prone to misclassification, especially false positives where a patch is rewarded despite being wrong or unmergeable. The HN thread turned that into a broader point: as AI-written code moves closer to production, the benchmark that matters is not “did it run once,” but “would a careful maintainer accept owning it.”
