Hacker News spotlights Berkeley's warning that top AI agent benchmarks are vulnerable to score hacking
Original: Exploiting the most prominent AI agent benchmarks
The Hacker News discussion around Berkeley's new benchmark audit turned into a broader warning about how AI agent systems are being evaluated. The researchers say they audited eight well-known agent benchmarks and found exploitable scoring paths in all of them. Their central claim is that a benchmark can show a near-perfect result even when the agent never solves the task it was supposed to solve.
The examples are concrete. Berkeley says a short pytest hook can make SWE-bench tests look green, a fake curl wrapper can hand Terminal-Bench a perfect score, WebArena can leak gold answers through file navigation, and FieldWorkArena can be passed with a trivial JSON reply because the validator never checks correctness. The writeup places these cases next to earlier public failures such as contaminated training data, reward hacking reports from METR, and OpenAI's decision to drop SWE-bench Verified after an internal audit.
- SWE-bench is described as vulnerable to a conftest.py hook that rewrites test outcomes.
- Terminal-Bench can reportedly be fooled through a fake curl or uvx chain inside the evaluation flow.
- WebArena and FieldWorkArena are presented as examples of answer leakage and weak validation logic.
HN readers largely agreed that the catalog matters, but they split on interpretation. Supportive comments called it an overdue correction for a leaderboard culture that overweights raw scores. More skeptical readers argued that manually designing exploits is different from showing models will spontaneously attack evaluators in routine use, and repeated a more cautious rule: trust methodology, not the headline number.
That tension is what makes the story useful for practitioners. It does not mean agent benchmarks should be ignored. It means benchmark results now need sandbox isolation, explicit anti-tampering design, and clearer disclosure about what part of the score came from task completion versus evaluator assumptions. For teams buying coding agents, the HN consensus was blunt: trust the evaluation setup before you trust the score.
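One concrete shape the anti-tampering design mentioned above could take (a hypothetical sketch, not something the article or the Berkeley audit prescribes): fingerprint every file in the evaluation harness before the agent runs, then re-check afterwards, so a dropped `conftest.py` or a shadowed binary is caught before the score is trusted.

```python
# Hypothetical anti-tampering check: hash the harness tree before the
# agent runs, then verify nothing changed before accepting the score.
import hashlib
from pathlib import Path


def harness_fingerprint(root):
    """Return a stable SHA-256 digest over all files under `root`."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Include the relative path so renames also change the digest.
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()


# Usage: before = harness_fingerprint("tests/"); run the agent;
# if harness_fingerprint("tests/") != before, reject the run as tampered.
```

A mismatch does not say what the agent did, only that the evaluator's own files were modified, which is exactly the class of exploit the audit catalogs.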