FrontierCode Asks Whether an AI Patch Would Actually Get Merged

Cognition’s FrontierCode benchmark targets a question that many software teams already ask about AI-generated code: would this patch actually be merged? Instead of treating unit-test success as the finish line, the benchmark evaluates behavior, regression safety, test quality, scope discipline, style, and fit with the norms of the target codebase.

The benchmark was built with maintainers from 36 open-source repositories. Cognition says each task took more than 40 hours of expert effort and was manually reviewed by its researchers. FrontierCode has three subsets: Extended with 150 tasks, Main with the hardest 100, and Diamond with the hardest 50. On Diamond, Cognition reports that Claude Opus 4.8 scored 13.4%, GPT-5.5 scored 6.3%, and Gemini 3.1 Pro scored 4.7%.

The Hacker News (HN) discussion focused on why this matters now. Coding agents can often produce patches that look plausible and pass a visible test suite, yet still fail a real review because they overreach, miss project conventions, add weak tests, or solve the wrong part of the problem. FrontierCode tries to encode maintainer judgment into the evaluation itself rather than relying only on test harnesses.

The original Cognition post argues that older coding benchmarks are prone to misclassification, especially false positives where a patch is rewarded despite being wrong or unmergeable. The HN thread turned that into a broader point: as AI-written code moves closer to production, the benchmark that matters is not “did it run once,” but “would a careful maintainer accept owning it.”
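
To make the false-positive idea concrete, here is a minimal sketch of how a tests-only grader can over-credit patches compared with a stricter merge-readiness check. It is not FrontierCode's scoring method: the `PatchReview` record, its field names, and the sample data are all hypothetical, chosen only to mirror the criteria listed earlier in this article.

```python
from dataclasses import dataclass


@dataclass
class PatchReview:
    """Hypothetical review record; field names are illustrative only."""
    tests_pass: bool           # visible test suite is green
    no_regressions: bool       # existing behavior still works
    tests_meaningful: bool     # added tests actually exercise the change
    in_scope: bool             # patch touches only what the task required
    follows_conventions: bool  # style and project norms are respected


def passes_tests_only(review: PatchReview) -> bool:
    """Tests-only grading: a green suite counts as success."""
    return review.tests_pass


def is_mergeable(review: PatchReview) -> bool:
    """Stricter grading: every criterion must hold (an assumed rule)."""
    return all([
        review.tests_pass,
        review.no_regressions,
        review.tests_meaningful,
        review.in_scope,
        review.follows_conventions,
    ])


def false_positive_rate(reviews: list[PatchReview]) -> float:
    """Share of test-passing patches that a maintainer would still reject."""
    accepted = [r for r in reviews if passes_tests_only(r)]
    if not accepted:
        return 0.0
    rejected_anyway = [r for r in accepted if not is_mergeable(r)]
    return len(rejected_anyway) / len(accepted)


# Made-up sample: four patches, three of which pass the visible tests.
reviews = [
    PatchReview(True, True, True, True, True),     # genuinely mergeable
    PatchReview(True, True, True, False, True),    # overreaches in scope
    PatchReview(True, True, False, True, True),    # adds weak tests
    PatchReview(False, True, True, True, True),    # fails the test suite
]

rate = false_positive_rate(reviews)
print(f"Tests-only false-positive rate: {rate:.2f}")
```

In this toy sample, two of the three test-passing patches would still be turned down in review, which is the kind of gap the Cognition post describes.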
