#hle - Insights

LLM Reddit Feb 23, 2026 1 min read

Qwen Team Confirms Serious Data Quality Problems in GPQA and HLE Benchmarks

The Qwen research team has published a paper confirming that the GPQA and HLE (Humanity's Last Exam) benchmark datasets contain serious quality issues, including OCR errors, incorrect gold-standard answers, and unverifiable questions. These flaws cast doubt on the reliability of current AI model evaluations that depend on the two benchmarks.

#qwen #benchmark #gpqa