#benchmarks
NIST's AI 800-3 gives evaluators a clearer statistical framework by separating benchmark accuracy from generalized accuracy and by introducing generalized linear mixed models for uncertainty estimation. The February 19, 2026 report argues that many current benchmark comparisons hide assumptions that can distort procurement, development, and policy decisions.
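The benchmark-vs-generalized distinction can be illustrated with a toy sketch (this is not the report's method; AI 800-3 uses generalized linear mixed models, while the simpler bootstrap below only conveys the core idea that a fixed question set is a sample, so the point estimate carries sampling uncertainty; all data here are synthetic):

```python
import random

# Synthetic per-item correctness for one model on a 200-question benchmark.
random.seed(0)
items = [1 if random.random() < 0.7 else 0 for _ in range(200)]

# "Benchmark accuracy": the point estimate on this fixed question set.
benchmark_acc = sum(items) / len(items)

# "Generalized accuracy": treat the questions as a sample from a larger
# population and bootstrap over items to get an uncertainty interval.
boots = []
for _ in range(2000):
    resample = [random.choice(items) for _ in items]
    boots.append(sum(resample) / len(resample))
boots.sort()
lo, hi = boots[49], boots[1949]  # ~95% percentile interval

print(f"benchmark accuracy: {benchmark_acc:.3f}")
print(f"95% interval for generalized accuracy: [{lo:.3f}, {hi:.3f}]")
```

Two models whose intervals overlap may not be meaningfully separated even when their leaderboard scores differ, which is the kind of hidden assumption the report targets.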
Google says Gemini in Google Sheets reached 70.48% on the full SpreadsheetBench benchmark, approaching human-expert performance. The company attributes the result to product-specific tuning plus stronger verbalization and coding behavior inside Sheets.
A fast-rising LocalLLaMA post resurfaced David Noel Ng's write-up on duplicating a seven-layer block inside Qwen2-72B, a no-training architecture tweak that reportedly lifted multiple Open LLM Leaderboard benchmarks.
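The idea behind the tweak can be sketched in a few lines (this is a toy, not Ng's actual code or Qwen2-72B's real layer indices): a "model" is an ordered list of layer functions, and a contiguous block is spliced back in so the duplicated layers reuse the original weights, with no training.

```python
def duplicate_block(layers, start, end):
    """Return a new layer list with layers[start:end] repeated in place."""
    return layers[:end] + layers[start:end] + layers[end:]

# Stand-in "layers": each just transforms a number (indices illustrative).
base = [lambda x, i=i: x + i for i in range(10)]  # 10 toy layers

deeper = duplicate_block(base, 3, 7)  # repeat layers 3..6 once, weights shared

def forward(layers, x):
    for layer in layers:
        x = layer(x)
    return x

print(len(deeper), forward(base, 0), forward(deeper, 0))  # 14 45 63
```

In a real transformer the same splice is done on the decoder's module list, and the surprise in the write-up is that re-running a mid-network block can help rather than hurt.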
Hacker News highlighted SWE-CI, an arXiv benchmark that evaluates whether LLM agents can sustain repository quality across CI-driven iterations, not just land a single passing patch.
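The protocol difference can be sketched as a miniature loop (all names here are invented stand-ins, not SWE-CI's API): the agent lands successive patches, and the metric is whether the full check suite stays green across every round, not just once.

```python
def run_ci(repo):
    """Stand-in CI: the feature must exist and behave correctly."""
    checks = [
        lambda r: "add" in r,
        lambda r: r.get("add", lambda a, b: None)(2, 3) == 5,
    ]
    return all(check(repo) for check in checks)

def agent_step(repo, round_num):
    # Stand-in "agent": lands a correct patch in round 0, keeps it intact after.
    if round_num == 0:
        repo["add"] = lambda a, b: a + b
    return repo

repo, history = {}, []
for round_num in range(5):
    repo = agent_step(repo, round_num)
    history.append(run_ci(repo))

sustained = all(history)  # quality held across all iterations, not just the last
print(history, sustained)
```

A single-patch benchmark would only check `history[-1]`; the CI-driven framing scores `all(history)`, penalizing agents that regress previously working code.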
OpenAI announced GPT-5.4 on March 5, 2026, adding a new general-purpose model and GPT-5.4 Pro with stronger computer use, tool search efficiency, and benchmark improvements over GPT-5.2.
A high-engagement r/LocalLLaMA thread reviewed Unsloth’s updated Qwen3.5-35B-A3B dynamic quantization release, including KLD/PPL data, tensor-level tradeoffs, and reproducibility artifacts.
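The two metrics in that release can be sketched with synthetic numbers (this is not Unsloth's pipeline, just the standard definitions): KL divergence measures how far the quantized model's next-token distribution drifts from the full-precision reference, and perplexity measures each model's fit on held-out text.

```python
import math

# Synthetic next-token distributions over a 4-token vocabulary.
full  = [0.70, 0.20, 0.05, 0.05]   # reference model's probabilities
quant = [0.65, 0.22, 0.07, 0.06]   # quantized model, slightly shifted

# KL(full || quant): per-position signal lost to quantization.
kld = sum(p * math.log(p / q) for p, q in zip(full, quant))

def perplexity(token_probs):
    """Perplexity from probabilities assigned to the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Probabilities each model gave the actual tokens of a held-out snippet.
ppl_full  = perplexity([0.70, 0.50, 0.80])
ppl_quant = perplexity([0.65, 0.48, 0.78])

print(f"KLD {kld:.4f}  PPL full {ppl_full:.3f}  quant {ppl_quant:.3f}")
```

Dynamic quantization schemes keep sensitive tensors at higher precision precisely to hold the per-tensor KLD contribution, and hence the PPL gap, small.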
OpenAI announced GPT-5 on August 7, 2025, for both ChatGPT and the API. Launch highlights include a reported 45% hallucination reduction vs GPT-4o and major benchmark gains such as 44.6% on HealthBench Hard.
A high-ranking r/singularity post shared Google’s Gemini 3 Deep Think update. The announcement includes benchmark claims such as 48.4% on Humanity’s Last Exam (without tools), 84.6% on ARC-AGI-2, and Codeforces Elo 3455, plus Gemini API early access.
China's GLM-5 model achieves a score of 50 on the Intelligence Index, reportedly the top result among open-source large language models.