Qwen Team Confirms Serious Data Quality Problems in GPQA and HLE Benchmarks
Original: The Qwen team verified that there are serious problems with the data quality of the GPQA and HLE test sets.
Cracks in Widely Used AI Benchmarks
The GPQA and HLE (Humanity's Last Exam) benchmarks, widely used to evaluate AI model performance, contain serious data quality issues — a fact now confirmed by the Qwen research team in a published paper (arXiv: 2602.13964v2).
How the Problem Was Found
The issue first surfaced about a month ago when a researcher ran an experiment called "DeepSeek-Overclock," an attempt to push DeepSeek's reasoning capabilities to the absolute limit. The optimized model kept failing, but its logs revealed it was not hallucinating. Instead, it was deriving technically correct answers that simply contradicted the provided gold-standard labels.
The researcher wrote Python scripts to verify the math line-by-line from first principles, finding that the dataset's answer labels were simply wrong in many cases. The Qwen team's paper has now formally confirmed these findings.
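The article does not reproduce those scripts, so the snippet below is only a minimal sketch of what such a label audit might look like: the question IDs, helper functions, and toy arithmetic are illustrative assumptions, not the actual GPQA or HLE verification code.

```python
# Illustrative sketch only: the researcher's real scripts are not published,
# so the dataset format, question IDs, and the per-question derivation below
# are assumptions made for demonstration.
from fractions import Fraction


def independent_answer(question_id: str) -> Fraction:
    """Recompute the answer from first principles for a known question.

    In a real audit this step would be a hand-written derivation per
    question; a single hard-coded toy example stands in for it here.
    """
    # Hypothetical question: "What is the sum of 1/3 and 1/6?"
    return Fraction(1, 3) + Fraction(1, 6)


def check_label(question_id: str, gold_label: str) -> bool:
    """Compare the dataset's gold label against the independent derivation."""
    derived = independent_answer(question_id)
    try:
        labeled = Fraction(gold_label)
    except ValueError:
        # Unparseable labels are flagged rather than silently skipped.
        return False
    return derived == labeled


if __name__ == "__main__":
    # A wrong gold label (2/3 instead of 1/2) is reported as a mismatch.
    for qid, label in [("q-001", "1/2"), ("q-002", "2/3")]:
        status = "OK" if check_label(qid, label) else "MISMATCH"
        print(f"{qid}: gold={label} -> {status}")
```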
What's Wrong with the Data
The problems are multi-layered. OCR errors were introduced when the questions were created. Some gold-standard answers are straightforwardly incorrect. An analysis by FutureHouse found that only 51.3% of HLE questions are actually supported by research. Some questions are fundamentally flawed or structured in ways that make verification impossible.
Implications for AI Evaluation
This finding raises fundamental questions about the reliability of current AI model benchmarking. If benchmark questions have wrong answers or are unverifiable, it becomes impossible to distinguish genuine capability improvements from models that have simply memorized the idiosyncrasies of flawed datasets. The AI community is increasingly calling for more rigorous validation processes before benchmark data is accepted as ground truth.
Related Articles
A high-scoring LocalLLaMA post benchmarked Qwen3.5-27B Q4 GGUF variants against BF16, separating “closest-to-baseline” choices from “best efficiency” picks for constrained VRAM setups.
Alibaba launched Qwen3.5, a 397B-parameter open-weight multimodal model supporting 201 languages. The company claims it outperforms GPT-5.2, Claude Opus 4.5, and Gemini 3 on benchmarks, while costing 60% less than its predecessor.
A widely-shared r/LocalLLaMA comparison of Qwen's smallest models across three generations (score: 681) reveals extraordinary efficiency gains. The Qwen 3.5 9B now outperforms the previous-generation 80B on several benchmarks, while the 2B handles video understanding better than many 7B models.