Qwen Team Confirms Serious Data Quality Problems in GPQA and HLE Benchmarks

Original: The Qwen team verified that there are serious problems with the data quality of the GPQA and HLE test sets.

LLM Feb 23, 2026 By Insights AI (Reddit)

Cracks in Widely Used AI Benchmarks

The GPQA and HLE (Humanity's Last Exam) benchmarks, widely used to evaluate AI model performance, contain serious data quality issues. The Qwen research team has now confirmed this in a published paper (arXiv: 2602.13964v2).

How the Problem Was Found

The issue was first surfaced about a month ago by a researcher running an experiment called "DeepSeek-Overclock", an attempt to push DeepSeek's reasoning capabilities to the absolute limit. The optimized model kept failing, but the logs showed it was not hallucinating. Instead, it was deriving technically correct answers that contradicted the provided gold-standard labels.

The researcher wrote Python scripts to verify the math line by line from first principles and found that, in many cases, the dataset's answer labels were wrong. The Qwen team's paper has now formally confirmed these findings.
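The researcher's actual scripts are not reproduced in this article. As a rough illustration of the idea only, the sketch below recomputes each answer independently with exact arithmetic and flags items whose recorded label disagrees. The `Item` structure, the `audit` helper, and both toy questions are hypothetical and are not drawn from GPQA or HLE.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, List


@dataclass
class Item:
    """A toy benchmark item (hypothetical structure, not the GPQA/HLE schema)."""
    item_id: str
    gold: Fraction                      # answer label shipped with the dataset
    recompute: Callable[[], Fraction]   # independent derivation of the answer


def audit(items: List[Item]) -> List[str]:
    """Return the ids of items whose recomputed answer differs from the gold label."""
    return [it.item_id for it in items if it.recompute() != it.gold]


def sum_one_to_hundred() -> Fraction:
    # Brute-force sum instead of trusting the closed form n(n+1)/2.
    return Fraction(sum(range(1, 101)))


def two_dice_sum_to_seven() -> Fraction:
    # Enumerate all 36 outcomes of two fair dice and count those summing to 7.
    hits = sum(1 for a in range(1, 7) for b in range(1, 7) if a + b == 7)
    return Fraction(hits, 36)


# Invented examples: the second gold label is deliberately wrong.
ITEMS = [
    Item("toy-1", Fraction(5050), sum_one_to_hundred),
    Item("toy-2", Fraction(1, 12), two_dice_sum_to_seven),
]

if __name__ == "__main__":
    print("Items with suspect labels:", audit(ITEMS))
```

Using `Fraction` rather than floats keeps the comparison exact, so a flagged item reflects a genuine disagreement with the label and not a rounding artifact. A flagged item still needs human review, because the independent derivation can itself be wrong.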

What's Wrong with the Data

The problems are multi-layered. OCR errors were introduced when the questions were created. Some reference answers are straightforwardly incorrect. An analysis by FutureHouse found that only 51.3% of HLE questions are actually supported by research. Some questions are fundamentally flawed or structured in ways that make verification impossible.

Implications for AI Evaluation

This finding raises fundamental questions about the reliability of current AI model benchmarking. If benchmark questions have wrong answers or are unverifiable, it becomes impossible to distinguish genuine capability improvements from models that have simply memorized the idiosyncrasies of flawed datasets. The AI community is increasingly calling for more rigorous validation processes before benchmark data is accepted as ground truth.
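To make the memorization concern concrete, here is a back-of-the-envelope sketch of how wrong labels can distort a leaderboard. It assumes a simple model that is not taken from the Qwen paper: a fraction of items is mislabeled, a model can only score on those items by reproducing the wrong label, and its accuracy on correctly labeled items is independent of which items are mislabeled. The function name, parameters, and numbers are all hypothetical.

```python
def measured_accuracy(true_accuracy: float,
                      label_error_rate: float,
                      agree_with_bad_label: float = 0.0) -> float:
    """Expected benchmark score when some gold labels are wrong.

    true_accuracy:        accuracy on correctly labeled items
    label_error_rate:     fraction of items whose gold label is wrong
    agree_with_bad_label: probability the model reproduces the wrong label
                          on a mislabeled item (for example via memorization)
    """
    clean_part = (1.0 - label_error_rate) * true_accuracy
    noisy_part = label_error_rate * agree_with_bad_label
    return clean_part + noisy_part


# Hypothetical numbers, chosen only to illustrate the effect.
ERROR_RATE = 0.2
honest = measured_accuracy(true_accuracy=0.9, label_error_rate=ERROR_RATE)
memorizer = measured_accuracy(true_accuracy=0.8, label_error_rate=ERROR_RATE,
                              agree_with_bad_label=0.9)

if __name__ == "__main__":
    print(f"honest model:     {honest:.2f}")
    print(f"memorizing model: {memorizer:.2f}")
```

Under these assumed numbers, the model that is weaker on correctly labeled items but has absorbed the bad labels scores higher (about 0.82 versus about 0.72). A model that never reproduces a wrong label cannot score above `1 - label_error_rate`.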


© 2026 Insights. All rights reserved.