The Qwen research team has officially confirmed through a published paper that GPQA and HLE (Humanity's Last Exam) benchmark datasets contain serious quality issues — including OCR errors, incorrect gold-standard answers, and unverifiable questions — casting doubt on the reliability of current AI model evaluations.
LLM Reddit Feb 23, 2026 1 min read
LLM Feb 22, 2026 1 min read
Alibaba launched Qwen 3.5 on February 16 under Apache 2.0, featuring 397B parameters with a sparse MoE architecture (17B active), 256K context, and native multimodal capabilities matching leading US proprietary models on key benchmarks.
LLM Reddit Feb 17, 2026 2 min read
An r/LocalLLaMA post on Qwen 3.5 gained 123 upvotes and linked directly to the public weights and model documentation. The linked model card confirms key specs: 397B total parameters, 17B activated, and a 262,144-token native context length.