Articles

LLM May 27, 2026 2 min read

Benchmark audit finds 25.7% flawed tasks and shifts agent rankings

The weak point in model leaderboards may be the tasks, not only the models. A new arXiv paper reports critical issues in more than 25.7% of evaluated benchmark tasks and shows ranking shifts after filtering flawed items.

#benchmarks #swe-bench #agents

LLM Reddit Apr 28, 2026 3 min read

r/singularity Is Hooked on Talkie, a 13B Model Frozen in 1930

r/singularity loved the premise immediately: a 13B model trapped at a 1930 knowledge cutoff. The upvotes came from the mix of novelty and real research value, because Talkie is not just a gimmick chat partner but a clean lab for studying what models learn without the modern web.

#talkie #language-models #historical-data

LLM Reddit Apr 27, 2026 2 min read

LocalLLaMA Calls SWE-bench Verified “Benchmaxxed” as Benchmark Trust Cracks

LocalLLaMA’s reaction was almost resigned: of course the public benchmark got benchmaxxed. What mattered was seeing contamination and flawed tests laid out in numbers big enough that the old bragging rights no longer looked stable.

#swe-bench #benchmarks #contamination

LLM Apr 17, 2026 2 min read

LLM judges hide instability: 33-67% of documents break consistency

A new arXiv paper shows why low average violation rates can make LLM judges look safer than they are. On SummEval, 33-67% of documents showed at least one directed 3-cycle, and prediction-set width tracked absolute error strongly.

#llm #evaluation #benchmarks
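
For readers unfamiliar with the term, a directed 3-cycle can be illustrated with a small sketch. It assumes, purely for illustration, that a judge's pairwise verdicts for one document are stored as (winner, loser) pairs and that a cycle means A beats B, B beats C, and C beats A. The function names and data layout are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def has_directed_3_cycle(prefers):
    """Return True if the pairwise verdicts contain an A > B > C > A loop.

    prefers: set of (winner, loser) pairs for one document (assumed layout).
    """
    items = sorted({x for pair in prefers for x in pair})
    for a, b, c in combinations(items, 3):
        # Each unordered triple can be cyclic in two orientations.
        forward = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        backward = (b, a) in prefers and (c, b) in prefers and (a, c) in prefers
        if forward or backward:
            return True
    return False


def fraction_with_cycle(verdicts_by_doc):
    """Share of documents whose verdicts contain at least one 3-cycle."""
    flagged = sum(has_directed_3_cycle(v) for v in verdicts_by_doc.values())
    return flagged / len(verdicts_by_doc)


# Hypothetical verdicts over three summaries of two documents.
consistent = {("s1", "s2"), ("s2", "s3"), ("s1", "s3")}
cyclic = {("s1", "s2"), ("s2", "s3"), ("s3", "s1")}
docs = {"doc_a": consistent, "doc_b": cyclic}
```

Under this reading, a judge can have a low average violation rate across all triples while a large share of documents still contains at least one loop, which is the gap the teaser above describes.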

LLM Hacker News Apr 17, 2026 2 min read

Qwen3.6 pelican test turned HN into a benchmark argument

HN upvoted the joke because it exposed a real discomfort: one vivid SVG prompt can make a small local model look better than a flagship model, but nobody agrees what that proves.

#qwen #claude #local-llms

AI Hacker News Apr 13, 2026 2 min read

Hacker News debates whether small open models can already reproduce parts of Mythos-style AI security work

In a 1247-point Hacker News thread, AISLE argued that small open-weight models can recover much of Mythos-style exploit analysis when the context is tightly scoped, and the comments pushed back hard on the methodology.

#cybersecurity #open-models #llm

AI Hacker News Apr 13, 2026 2 min read

Hacker News spotlights Berkeley's warning that top AI agent benchmarks are vulnerable to score hacking

A 520-point Hacker News thread amplified Berkeley's claim that eight major AI agent benchmarks can be pushed toward near-perfect scores through harness exploits instead of genuine task completion.

#ai-agents #benchmarks #evaluation

AI Hacker News Apr 12, 2026 1 min read

Berkeley Shows How Benchmark Hacking Can Inflate AI Agent Scores

UC Berkeley researchers say eight major AI agent benchmarks can be driven to near-perfect scores without actually solving the underlying tasks. Their warning is straightforward: leaderboard numbers are only as trustworthy as the evaluation design behind them.

#benchmarks #ai-agents #evaluation

AI X/Twitter Mar 30, 2026 2 min read

Google DeepMind publishes a harmful manipulation evaluation toolkit built on nine studies with 10,000 participants

Google DeepMind says it has built a harmful manipulation evaluation toolkit from nine studies spanning more than 10,000 participants. The work argues that manipulation risk is domain-specific, with finance and health producing very different outcomes.

#google-deepmind #ai-safety #manipulation

LLM Reddit Mar 30, 2026 2 min read

r/MachineLearning Flags LoCoMo Errors and Weak Judge Reliability

Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.

#benchmarks #memory-systems #evaluation

AI X/Twitter Mar 26, 2026 2 min read

Google DeepMind releases a real-world toolkit to measure harmful AI manipulation

Google DeepMind said on March 26, 2026 that it is releasing research on how conversational AI might exploit emotions or manipulate people into harmful choices. The company says it built the first empirically validated toolkit to measure harmful AI manipulation, based on nine studies with more than 10,000 participants across the UK, the US, and India.

#google-deepmind #ai-safety #manipulation

LLM Mar 25, 2026 2 min read

Microsoft Research open-sources AgentRx to pinpoint where AI agents first fail

Microsoft Research has open-sourced AgentRx, a framework for pinpointing the first critical failure in long AI-agent trajectories. It ships with a 115-trajectory benchmark and reports gains in both failure localization and root-cause attribution.

#agents #debugging #opensource