Anthropic put hard numbers behind Claude’s election safeguards. Opus 4.7 and Sonnet 4.6 responded appropriately 100% and 99.8% of the time in a 600-prompt election-policy test, and triggered web search 92% and 95% of the time on U.S. midterm-related queries.
#evaluation
A new arXiv paper shows why low average violation rates can make LLM judges look safer than they are. On SummEval, 33-67% of documents showed at least one directed 3-cycle, and prediction-set width tracked absolute error closely.
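As a minimal sketch (my illustration, not the paper's code), a directed 3-cycle in a judge's pairwise preferences is the situation where each of three summaries beats exactly one of the other two, so no consistent ranking exists even when the average violation rate looks low:

```python
# Minimal sketch (not the paper's code): detect directed 3-cycles in an LLM
# judge's pairwise preferences over summaries of a single document. A cycle
# like s1 > s2, s2 > s3, s3 > s1 means the judge's rankings are internally
# inconsistent even if each individual comparison looks reasonable.
from itertools import combinations

def has_directed_3cycle(prefs: dict[tuple[str, str], str]) -> bool:
    """prefs maps an unordered pair (a, b), keyed with a < b, to the winner."""
    items = sorted({x for pair in prefs for x in pair})

    def winner(a: str, b: str) -> str:
        return prefs[(a, b)] if (a, b) in prefs else prefs[(b, a)]

    for a, b, c in combinations(items, 3):
        winners = {winner(a, b), winner(b, c), winner(c, a)}
        # A 3-cycle exists exactly when each item wins one of its two matchups.
        if winners == {a, b, c}:
            return True
    return False

# Example: the judge prefers s1 over s2 and s2 over s3, yet s3 over s1.
prefs = {("s1", "s2"): "s1", ("s2", "s3"): "s2", ("s1", "s3"): "s3"}
print(has_directed_3cycle(prefs))  # True
```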
Hacker News upvoted the joke because it exposed a real discomfort: a single vivid SVG prompt can make a small local model look better than a flagship model, but nobody agrees on what that proves.
In a 1,247-point Hacker News thread, AISLE argued that small open-weight models can recover much of Mythos-style exploit analysis when the context is tightly scoped, and the comments pushed back hard on the methodology.
A 520-point Hacker News thread amplified Berkeley's claim that eight major AI agent benchmarks can be pushed toward near-perfect scores through harness exploits instead of genuine task completion.
UC Berkeley researchers say eight major AI agent benchmarks can be driven to near-perfect scores without actually solving the underlying tasks. Their warning is straightforward: leaderboard numbers are only as trustworthy as the evaluation design behind them.
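One hedged way to read that warning in code (my illustration, not the Berkeley team's harness): run a do-nothing baseline through the benchmark's own scorer, and treat any score far above chance as evidence the harness, not the task, is being measured.

```python
# Illustration only (not the Berkeley team's code): a "null agent" sanity check
# for an agent benchmark harness. If an agent that ignores the task and submits
# a constant answer still scores well above chance, the leaderboard number is
# rewarding harness quirks rather than genuine task completion.
from typing import Callable, Iterable

def null_agent(task: dict) -> str:
    """Baseline that ignores the task entirely."""
    return ""

def audit_harness(tasks: Iterable[dict],
                  score_fn: Callable[[dict, str], float],
                  chance_level: float = 0.05) -> float:
    """Score the null agent with the benchmark's own scoring function."""
    tasks = list(tasks)
    mean_score = sum(score_fn(t, null_agent(t)) for t in tasks) / len(tasks)
    if mean_score > chance_level:
        print(f"WARNING: null agent scores {mean_score:.2f} "
              f"(chance ~{chance_level}); the harness is likely exploitable.")
    return mean_score
```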
Google DeepMind says it has built a toolkit for evaluating harmful AI manipulation, validated across nine studies with more than 10,000 participants. The work argues that manipulation risk is domain-specific, with finance and health producing very different outcomes.
Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.
A post on r/MachineLearning argues that LoCoMo’s leaderboard is being treated with more confidence than its evaluation setup deserves. The audit claims the benchmark has a 6.4% ground-truth error rate and that its judge accepts intentionally wrong but topically adjacent answers far too often, shifting attention from raw scores to benchmark reliability.
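A rough sketch of the kind of audit both items describe (hypothetical interface, not Penfield Labs' code): feed the benchmark's judge answers that are deliberately wrong but topically adjacent, and measure how often it accepts them.

```python
# Rough sketch (hypothetical interface, not Penfield Labs' audit code):
# measure an LLM judge's false-pass rate on answers that are intentionally
# wrong but topically adjacent to the question.
from typing import Callable

def judge_false_pass_rate(items: list[dict],
                          judge: Callable[[str, str, str], bool]) -> float:
    """items: [{"question": ..., "gold": ..., "wrong_but_adjacent": ...}, ...]
    judge(question, gold, candidate) returns True if it accepts the candidate."""
    passes = sum(
        judge(it["question"], it["gold"], it["wrong_but_adjacent"])
        for it in items
    )
    return passes / len(items)

# In the audit described above, a gpt-4o-mini judge accepted 62.81% of such
# distractor answers; a reliable judge should accept close to none of them.
```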
Google DeepMind said on March 26, 2026 that it is releasing research on how conversational AI might exploit emotions or manipulate people into harmful choices. The company says it built the first empirically validated toolkit to measure harmful AI manipulation, based on nine studies with more than 10,000 participants across the UK, the US, and India.
ARC Prize says ARC-AGI-3 is an interactive reasoning benchmark that measures planning, memory compression, and belief updating inside novel environments rather than static puzzle answers. Hacker News upvoted the launch because it gives agent builders a more behavior-first way to compare systems against humans.
Microsoft Research has open-sourced AgentRx, a framework for pinpointing the first critical failure in long AI-agent trajectories. It ships with a 115-trajectory benchmark and reports gains in both failure localization and root-cause attribution.
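As a rough mental model (not AgentRx's actual API), failure localization can be framed as finding the earliest step in a trajectory that a step-level checker flags as derailing the task:

```python
# Rough mental model only (not AgentRx's API): locate the first critical
# failure in an agent trajectory by scanning steps with a step-level checker.
from typing import Callable, Optional, Sequence

def first_critical_failure(trajectory: Sequence[dict],
                           is_faulty: Callable[[dict, Sequence[dict]], bool]
                           ) -> Optional[int]:
    """Return the index of the earliest step the checker flags, else None.
    is_faulty(step, prior_steps) could itself be an LLM call that judges
    whether this step derails the task given everything that came before."""
    for i, step in enumerate(trajectory):
        if is_faulty(step, trajectory[:i]):
            return i
    return None
```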