FrontierCode Asks Whether an AI Patch Would Actually Get Merged
Original: FrontierCode: An eval to measure whether you would actually merge the code
Cognition’s FrontierCode benchmark targets a question that many software teams already ask about AI-generated code: would this patch actually be merged? Instead of treating unit-test success as the finish line, the benchmark evaluates behavior, regression safety, test quality, scope discipline, style, and fit with the norms of the target codebase.
The benchmark was built with maintainers from 36 open-source repositories. Cognition says each task took more than 40 hours of expert effort and was manually reviewed by its researchers. FrontierCode has three subsets: Extended with 150 tasks, Main with the hardest 100, and Diamond with the hardest 50. On Diamond, Cognition reports that Claude Opus 4.8 scored 13.4%, GPT-5.5 scored 6.3%, and Gemini 3.1 Pro scored 4.7%.
The Hacker News (HN) discussion focused on why this matters now. Coding agents can often produce patches that look plausible and pass a visible test suite, yet still fail a real review because they overreach, miss project conventions, add weak tests, or solve the wrong part of the problem. FrontierCode tries to encode maintainer judgment into the eval rather than relying only on test harnesses.
The original Cognition post argues that older coding benchmarks are prone to misclassification, especially false positives where a patch is rewarded despite being wrong or unmergeable. The HN thread turned that into a broader point: as AI-written code moves closer to production, the benchmark that matters is not “did it run once,” but “would a careful maintainer accept owning it.”
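To make the false-positive idea concrete, here is a minimal Python sketch. It is not FrontierCode's actual schema or scoring method: the `PatchReview` fields, the all-criteria merge rule, and the toy reviews are all assumptions made for illustration. It contrasts a test-only verdict with a maintainer-style verdict and measures how often the first accepts a patch the second would reject.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PatchReview:
    """Hypothetical review record; field names are illustrative assumptions."""
    tests_pass: bool           # visible test suite passes
    behavior_correct: bool     # solves the problem that was actually asked
    no_regressions: bool       # does not break existing behavior
    tests_adequate: bool       # added tests really exercise the change
    in_scope: bool             # no unrelated edits or overreach
    follows_conventions: bool  # matches the style and norms of the codebase


def passes_visible_tests(review: PatchReview) -> bool:
    """Verdict of a benchmark that only checks the test suite."""
    return review.tests_pass


def would_merge(review: PatchReview) -> bool:
    """Assumed maintainer-style verdict: every criterion must hold."""
    return all(getattr(review, f.name) for f in fields(review))


def false_positive_rate(reviews: list[PatchReview]) -> float:
    """Share of test-passing patches that the merge verdict would reject."""
    accepted = [r for r in reviews if passes_visible_tests(r)]
    if not accepted:
        return 0.0
    unmergeable = [r for r in accepted if not would_merge(r)]
    return len(unmergeable) / len(accepted)


# Toy reviews, invented purely to exercise the functions above.
clean = PatchReview(True, True, True, True, True, True)
overreach = PatchReview(True, True, True, True, False, True)
weak_tests = PatchReview(True, True, True, False, True, True)
failing = PatchReview(False, False, True, False, True, True)

toy_reviews = [clean, overreach, weak_tests, failing]
rate = false_positive_rate(toy_reviews)
```

In this toy set, three patches pass the visible tests but only one would be merged under the assumed rule, so a test-only benchmark would over-credit the other two. A real maintainer judgment is unlikely to reduce to a strict conjunction of booleans; the sketch only shows why "tests pass" and "mergeable" can diverge.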