Benchmark audit finds 25.7% flawed tasks and shifts agent rankings

Reading an LLM or agent leaderboard now requires a second question: are the tasks sound enough to rank the models? A paper submitted to arXiv on May 25, 2026 and revised on May 26 introduces Auto Benchmark Audit, or ABA, an agentic framework for auditing individual benchmark tasks before their scores harden into market signals.

The authors evaluated 168 benchmarks across nine domains, including frontier LLM benchmarks and prior NeurIPS publications. ABA looks for hidden environment dependencies, missing specifications, brittle grading logic, ambiguous task design, and incorrect ground truths. The paper reports critical issues in more than 25.7% of the evaluated tasks.

The practical effect is large enough to matter. After filtering problematic tasks, average performance rose by 9.9% on SWE-bench Verified and 9.6% on Terminal-Bench 2, and model rankings shifted. The ranking shift suggests that benchmark noise is not evenly distributed across models: a flawed task can penalize one model, reward another, or measure setup luck rather than the capability the benchmark claims to test.
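To see how removing flawed tasks can reorder a leaderboard, consider a toy recalculation. The sketch below is illustrative only: the model names, task IDs, pass/fail results, and the set of flawed tasks are all hypothetical, and it does not reproduce ABA's method or the paper's data.

```python
from typing import Dict, FrozenSet, List

# Hypothetical per-task pass/fail results (not data from the paper).
RESULTS: Dict[str, Dict[str, bool]] = {
    "model_a": {"t1": True, "t2": True, "t3": True, "t4": False, "t5": False, "t6": False},
    "model_b": {"t1": True, "t2": False, "t3": True, "t4": False, "t5": True, "t6": True},
}

# Hypothetical set of tasks an audit flagged as flawed.
FLAWED: FrozenSet[str] = frozenset({"t5", "t6"})


def pass_rate(task_results: Dict[str, bool], exclude: FrozenSet[str] = frozenset()) -> float:
    """Fraction of tasks passed, ignoring any task listed in `exclude`."""
    kept = [passed for task, passed in task_results.items() if task not in exclude]
    if not kept:
        raise ValueError("no tasks left after filtering")
    return sum(kept) / len(kept)


def ranking(results: Dict[str, Dict[str, bool]], exclude: FrozenSet[str] = frozenset()) -> List[str]:
    """Model names sorted from highest to lowest pass rate."""
    return sorted(results, key=lambda m: pass_rate(results[m], exclude), reverse=True)


if __name__ == "__main__":
    print(ranking(RESULTS))          # ['model_b', 'model_a']
    print(ranking(RESULTS, FLAWED))  # ['model_a', 'model_b']
```

In this made-up case, model_b leads on the full task set only because it passes the two flagged tasks. Once those are excluded, model_a moves ahead, and the average pass rate across both models rises.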

Task soundness matters especially for agent benchmarks. Unlike short question-answer tests, agent tasks often depend on executable environments, file systems, package versions, terminal behavior, and tool-use instructions. Even expert-written tasks can contain assumptions that are obvious to the author but invisible to the model or the evaluator. As benchmarks grow more realistic, they also become harder to inspect by hand.

The authors say ABA's findings were validated through expert review and independent third-party signals, including upstream pull requests. They also release the auditing tool and task annotations so future benchmark builders can inspect and repair tasks rather than relying only on post-hoc leaderboard debate.

The implication is not that benchmarks are useless. It is that benchmarks need their own quality-control loop. Model scores increasingly guide procurement, research priorities, product claims, and investment narratives. When a few points can change who appears to lead, benchmark maintenance becomes infrastructure. The next credible leaderboard may need to publish not only model scores, but also defect rates, audit trails, and fixes for the tasks behind those scores.
