Hacker News spotlights Berkeley's warning that top AI agent benchmarks are vulnerable to score hacking

Original: Exploiting the most prominent AI agent benchmarks View original →

Read in other languages: 한국어日本語
AI Apr 13, 2026 By Insights AI (HN) 2 min read Source

The Hacker News discussion around Berkeley's new benchmark audit turned into a broader warning about how AI agent systems are being evaluated. The researchers say they audited eight well-known agent benchmarks and found exploitable scoring paths in all of them. Their central claim is that a benchmark can show a near-perfect result even when the agent never solves the task it was supposed to solve.

The examples are concrete. Berkeley says a short pytest hook can make SWE-bench tests look green, a fake curl wrapper can hand Terminal-Bench a perfect score, WebArena can leak gold answers through file navigation, and FieldWorkArena can be passed with a trivial JSON reply because the validator never checks correctness. The writeup places these cases next to earlier public failures such as contaminated training data, reward hacking reports from METR, and OpenAI's decision to drop SWE-bench Verified after an internal audit.

  • SWE-bench is described as vulnerable to a conftest.py hook that rewrites test outcomes.
  • Terminal-Bench can reportedly be fooled through a fake curl or uvx chain inside the evaluation flow.
  • WebArena and FieldWorkArena are presented as examples of answer leakage and weak validation logic.

HN readers largely agreed that the catalog matters, but they split on interpretation. Supportive comments called it an overdue correction for a leaderboard culture that overweights raw scores. More skeptical readers argued that manually designing exploits is different from showing models will spontaneously attack evaluators in routine use, and repeated a more cautious rule: trust methodology, not the headline number.

That tension is what makes the story useful for practitioners. It does not mean agent benchmarks should be ignored. It means benchmark results now need sandbox isolation, explicit anti-tampering design, and clearer disclosure about what part of the score came from task completion versus evaluator assumptions. For teams buying coding agents, the HN consensus was blunt: trust the evaluation setup before you trust the score.

Share: Long

Related Articles

Comments (0)

No comments yet. Be the first to comment!

Leave a Comment

© 2026 Insights. All rights reserved.