Qwen Team Confirms Serious Data Quality Problems in GPQA and HLE Benchmarks
Original: The Qwen team verified that there are serious problems with the data quality of the GPQA and HLE test sets.
Cracks in Widely Used AI Benchmarks
The GPQA and HLE (Humanity's Last Exam) benchmarks, widely used to evaluate AI model performance, contain serious data quality issues, a finding the Qwen research team has now confirmed in a published paper (arXiv:2602.13964v2).
How the Problem Was Found
The issue first surfaced about a month ago, when a researcher ran an experiment called "DeepSeek-Overclock", an attempt to push DeepSeek's reasoning capabilities to their limit. The optimized model kept failing the benchmarks, but logs revealed it was not hallucinating: it was deriving technically correct answers that simply contradicted the provided gold-standard labels.
The researcher wrote Python scripts to verify the math line by line from first principles and found that many of the datasets' answer labels were simply wrong. The Qwen team's paper has now formally confirmed these findings.
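Neither the paper nor the original posts include the verification scripts themselves, but the approach is straightforward to sketch. A minimal illustration, assuming a hypothetical item whose answer can be recomputed numerically; the record format and the sample question are invented for the example:

```python
# Minimal sketch of first-principles label checking, in the spirit of the
# researcher's scripts. The item below and its labels are hypothetical;
# the actual GPQA/HLE schemas and scripts are not reproduced here.
import math

def verify_label(compute, labeled_answer, rel_tol=1e-6):
    """Recompute the answer independently and compare it to the gold label."""
    derived = compute()
    return math.isclose(derived, labeled_answer, rel_tol=rel_tol), derived

# Hypothetical physics item: kinetic energy of a 2 kg mass moving at 3 m/s.
# First principles give 0.5 * m * v**2 = 9.0 J; suppose the dataset labels it 6.0 J.
ok, derived = verify_label(lambda: 0.5 * 2.0 * 3.0**2, labeled_answer=6.0)
print(f"label consistent: {ok}, derived answer: {derived} J")
# -> label consistent: False, derived answer: 9.0 J
```

Run over a whole test set, a check like this separates items where the model is wrong from items where the gold label itself is wrong.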
What's Wrong with the Data
The problems are multi-layered. OCR errors were introduced when the questions were created. Some gold-standard answers are straightforwardly incorrect. Some questions are fundamentally flawed, or structured in ways that make independent verification impossible. An analysis by FutureHouse found that only 51.3% of HLE questions are actually supported by research.
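To make these failure modes concrete, here is a hedged sketch of how flagged items might be triaged into the categories above. The field names, heuristics, and sample items are illustrative assumptions, not the Qwen team's or FutureHouse's actual methodology:

```python
# Illustrative triage of flagged benchmark items into the failure modes
# described above. Fields and heuristics are hypothetical.
from dataclasses import dataclass

@dataclass
class FlaggedItem:
    question: str
    gold_label: str
    derived_answer: str | None  # None if no independent derivation exists
    ocr_suspect: bool           # e.g. mangled symbols detected in the text

def triage(item: FlaggedItem) -> str:
    if item.ocr_suspect:
        return "ocr_corruption"    # question text damaged at creation time
    if item.derived_answer is None:
        return "unverifiable"      # no way to check the label independently
    if item.derived_answer != item.gold_label:
        return "wrong_gold_label"  # label contradicts a first-principles answer
    return "ok"

items = [
    FlaggedItem("Kinetic energy of 2 kg at 3 m/s?", "6 J", "9 J", ocr_suspect=False),
    FlaggedItem("Interpret the attached figure...", "B", None, ocr_suspect=False),
]
for it in items:
    print(triage(it))  # -> wrong_gold_label, unverifiable
```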
Implications for AI Evaluation
This finding raises fundamental questions about the reliability of current AI model benchmarking. If benchmark questions have wrong answers or are unverifiable, it becomes impossible to distinguish genuine capability improvements from models that have simply memorized the idiosyncrasies of flawed datasets. The AI community is increasingly calling for more rigorous validation processes before benchmark data is accepted as ground truth.
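What such a validation process could look like in practice is easy to sketch. A minimal example of an acceptance gate, where the checker interface (independent solvers, expert reviewers, or verified tool runs) and the agreement threshold are assumptions of this sketch, not an established standard:

```python
# Sketch of an acceptance gate: an item enters a benchmark only if enough
# independent checks reproduce its gold label.
from typing import Callable

Checker = Callable[[str], str]  # maps a question to an independently derived answer

def accept_item(question: str, gold_label: str,
                checkers: list[Checker], min_agree: int = 2) -> bool:
    """Admit the item only if at least `min_agree` checkers agree with the label."""
    agreements = sum(1 for check in checkers if check(question) == gold_label)
    return agreements >= min_agree
```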
Related Articles
A r/LocalLLaMA benchmark compared 21 local coding models on HumanEval+, speed, and memory, putting Qwen 3.6 35B-A3B on top while surfacing practical RAM and tok/s trade-offs.
r/LocalLLaMA pushed this past 900 points because it was not another score table. The hook was a local coding agent noticing and fixing its own canvas and wave-completion bugs.
r/LocalLLaMA pushed this post up because the “trust me bro” report had real operating conditions: 8-bit quantization, 64k context, OpenCode, and Android debugging.