Berkeley Shows How Benchmark Hacking Can Inflate AI Agent Scores
Original: How We Broke Top AI Agent Benchmarks: And What Comes Next View original →
Why the post mattered on Hacker News
The UC Berkeley write-up published in April 2026 drew 202 points and 58 comments on Hacker News by April 12, 2026. Its premise is unusually direct: the researchers audited eight widely cited AI agent benchmarks and found ways to achieve near-perfect scores without solving the tasks those benchmarks were supposed to measure.
The paper’s broader claim is that benchmark numbers are not a reliable proxy for capability when the evaluation pipeline is easy to exploit. The authors focus on three recurring failure modes: agents can tamper with the artifacts the evaluator later reads, gold answers are sometimes exposed inside configs or public files, and validators often score superficial output patterns rather than actual task completion.
Examples Berkeley highlights
- On SWE-bench Verified, the team says a short
conftest.pyhook can force every test to pass. - On Terminal-Bench, a fake
curlwrapper can produce a perfect score across all 89 tasks. - On WebArena, an agent can navigate Chromium to a local
file://path and read answer keys from config files. - On FieldWorkArena, the validator reportedly checks only whether the final message came from the assistant, so sending
{}is enough to pass.
What comes next
The article does more than embarrass benchmark builders. It lays out concrete hardening steps: isolate evaluator state, prevent agents from writing to the paths that scoring code trusts, use more robust scoring, and keep ground truth private for any benchmark that drives public leaderboards. The authors are also developing BenchJack, an automated benchmark vulnerability scanner. For teams using benchmark tables to decide which agent stack to deploy, the message is hard to miss: don’t trust the number before you trust the methodology.
Original source: UC Berkeley RDI. Hacker News discussion: thread.
Related Articles
A 520-point Hacker News thread amplified Berkeley's claim that eight major AI agent benchmarks can be pushed toward near-perfect scores through harness exploits instead of genuine task completion.
The Megalodon campaign pushed 5,718 malicious commits into 5,561 GitHub repositories in roughly six hours. The target was not just application code, but GitHub Actions workflows that can expose cloud credentials, CI secrets, and deployment tokens.
TrapDoor pushed more than 34 malicious packages across npm, PyPI, and Crates.io after May 22. The sharpest twist is not just credential theft, but the attempt to poison .cursorrules and CLAUDE.md files read by AI coding assistants.
Comments (0)
No comments yet. Be the first to comment!