The weak point in model leaderboards may be the tasks, not only the models. A new arXiv paper reports critical issues in more than 25.7% of evaluated benchmark tasks and shows ranking shifts after filtering flawed items.
r/singularity loved the premise immediately: a 13B model trapped at a 1930 knowledge cutoff. The upvotes came from the mix of novelty and real research value, because Talkie is not just a gimmick chat partner but a clean lab for studying what models learn without the modern web.
LocalLLaMA’s reaction was almost resigned: of course the public benchmark got benchmaxxed. What mattered was seeing contamination and flawed tests laid out in numbers big enough that the old bragging rights no longer looked stable.
Anthropic put hard numbers behind Claude’s election safeguards. Opus 4.7 and Sonnet 4.6 responded appropriately 100% and 99.8% of the time in a 600-prompt election-policy test, and triggered web search 92% and 95% of the time on U.S. midterm-related queries.
A new arXiv paper shows why low average violation rates can make LLM judges look safer than they are. On SummEval, 33-67% of documents showed at least one directed 3-cycle, and prediction-set width tracked absolute error strongly.
HN upvoted the joke because it exposed a real discomfort: one vivid SVG prompt can make a small local model look better than a flagship model, but nobody agrees what that proves.
In a 1247-point Hacker News thread, AISLE argued that small open-weight models can recover much of Mythos-style exploit analysis when the context is tightly scoped, and the comments pushed back hard on the methodology.
A 520-point Hacker News thread amplified Berkeley's claim that eight major AI agent benchmarks can be pushed toward near-perfect scores through harness exploits instead of genuine task completion.
UC Berkeley researchers say eight major AI agent benchmarks can be driven to near-perfect scores without actually solving the underlying tasks. Their warning is straightforward: leaderboard numbers are only as trustworthy as the evaluation design behind them.
Google DeepMind says it has built a harmful manipulation evaluation toolkit from nine studies spanning more than 10,000 participants. The work argues that manipulation risk is domain-specific, with finance and health producing very different outcomes.
Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.
A post on r/MachineLearning argues that LoCoMo’s leaderboard is being treated with more confidence than its evaluation setup deserves. The audit claims the benchmark has a 6.4% ground-truth error rate and that its judge accepts intentionally wrong but topically adjacent answers far too often, turning attention from raw scores to benchmark reliability.