Qwen Team Confirms Serious Data Quality Problems in GPQA and HLE Benchmarks
Original: The Qwen team verified that there are serious problems with the data quality of the GPQA and HLE test sets.
Cracks in Widely Used AI Benchmarks
The GPQA and HLE (Humanity's Last Exam) benchmarks, widely used to evaluate AI model performance, contain serious data quality issues, a finding the Qwen research team has now confirmed in a published paper (arXiv:2602.13964v2).
How the Problem Was Found
The issue first surfaced about a month ago, when a researcher ran an experiment called "DeepSeek-Overclock", an attempt to push DeepSeek's reasoning capabilities to their limit. The optimized model kept failing the benchmarks, but logs revealed it was not hallucinating: it was deriving technically correct answers that simply contradicted the provided gold-standard labels.
The researcher wrote Python scripts to verify the math line by line from first principles and found that many of the datasets' answer labels were simply wrong. The Qwen team's paper has now formally confirmed these findings.
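Neither the paper nor the original posts include the verification scripts themselves, but the approach is straightforward to sketch. A minimal illustration, assuming a hypothetical item whose answer can be recomputed numerically; the record format and the sample question are invented for the example:

```python
# Minimal sketch of first-principles label checking, in the spirit of the
# researcher's scripts. The item below and its labels are hypothetical;
# the actual GPQA/HLE schemas and scripts are not reproduced here.
import math

def verify_label(compute, labeled_answer, rel_tol=1e-6):
    """Recompute the answer independently and compare it to the gold label."""
    derived = compute()
    return math.isclose(derived, labeled_answer, rel_tol=rel_tol), derived

# Hypothetical physics item: kinetic energy of a 2 kg mass moving at 3 m/s.
# First principles give 0.5 * m * v**2 = 9.0 J; suppose the dataset labels it 6.0 J.
ok, derived = verify_label(lambda: 0.5 * 2.0 * 3.0**2, labeled_answer=6.0)
print(f"label consistent: {ok}, derived answer: {derived} J")
# -> label consistent: False, derived answer: 9.0 J
```

Run over a whole test set, a check like this separates items where the model is wrong from items where the gold label itself is wrong.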
What's Wrong with the Data
The problems are multi-layered. OCR errors were introduced when the questions were created. Some gold-standard answers are straightforwardly incorrect. Some questions are fundamentally flawed, or structured in ways that make independent verification impossible. An analysis by FutureHouse found that only 51.3% of HLE questions are actually supported by research.
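To make these failure modes concrete, here is a hedged sketch of how flagged items might be triaged into the categories above. The field names, heuristics, and sample items are illustrative assumptions, not the Qwen team's or FutureHouse's actual methodology:

```python
# Illustrative triage of flagged benchmark items into the failure modes
# described above. Fields and heuristics are hypothetical.
from dataclasses import dataclass

@dataclass
class FlaggedItem:
    question: str
    gold_label: str
    derived_answer: str | None  # None if no independent derivation exists
    ocr_suspect: bool           # e.g. mangled symbols detected in the text

def triage(item: FlaggedItem) -> str:
    if item.ocr_suspect:
        return "ocr_corruption"    # question text damaged at creation time
    if item.derived_answer is None:
        return "unverifiable"      # no way to check the label independently
    if item.derived_answer != item.gold_label:
        return "wrong_gold_label"  # label contradicts a first-principles answer
    return "ok"

items = [
    FlaggedItem("Kinetic energy of 2 kg at 3 m/s?", "6 J", "9 J", ocr_suspect=False),
    FlaggedItem("Interpret the attached figure...", "B", None, ocr_suspect=False),
]
for it in items:
    print(triage(it))  # -> wrong_gold_label, unverifiable
```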
Implications for AI Evaluation
This finding raises fundamental questions about the reliability of current AI model benchmarking. If benchmark questions have wrong answers or are unverifiable, it becomes impossible to distinguish genuine capability improvements from models that have simply memorized the idiosyncrasies of flawed datasets. The AI community is increasingly calling for more rigorous validation processes before benchmark data is accepted as ground truth.
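What such a validation process could look like in practice is easy to sketch. A minimal example of an acceptance gate, where the checker interface (independent solvers, expert reviewers, or verified tool runs) and the agreement threshold are assumptions of this sketch, not an established standard:

```python
# Sketch of an acceptance gate: an item enters a benchmark only if enough
# independent checks reproduce its gold label.
from typing import Callable

Checker = Callable[[str], str]  # maps a question to an independently derived answer

def accept_item(question: str, gold_label: str,
                checkers: list[Checker], min_agree: int = 2) -> bool:
    """Admit the item only if at least `min_agree` checkers agree with the label."""
    agreements = sum(1 for check in checkers if check(question) == gold_label)
    return agreements >= min_agree
```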
Related Articles
A r/LocalLLaMA benchmark compared 21 local coding models on HumanEval+, speed, and memory, putting Qwen 3.6 35B-A3B on top while surfacing practical RAM and tok/s trade-offs.
r/LocalLLaMA pushed this past 900 points because it was not another score table. The hook was a local coding agent noticing and fixing its own canvas and wave-completion bugs.
r/LocalLLaMA pushed this post up because the “trust me bro” report had real operating conditions: 8-bit quantization, 64k context, OpenCode, and Android debugging.