Microsoft Research has open-sourced AgentRx, a framework for pinpointing the first critical failure in long AI-agent trajectories. It ships with a 115-trajectory benchmark and reports gains in both failure localization and root-cause attribution.
#evaluation
A Reddit discussion around a new medical segmentation paper argues that using automated labels for both training and evaluation can hide age-related disparities, making performance on younger patients look better than it really is.
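A minimal sketch of the stratified check the thread implies: recompute the segmentation metric per age bracket against human-verified labels rather than relying on the pooled number. The arrays and brackets below are illustrative, not from the paper.

```python
# Minimal sketch: surface age-stratified gaps that a pooled metric can hide.
# `ages` and `dice_scores` are hypothetical per-patient values; in practice the
# scores should be computed against human-verified labels, not automated ones.
import numpy as np

ages = np.array([34, 41, 58, 67, 72, 79, 83, 29, 55, 88])
dice_scores = np.array([0.91, 0.90, 0.87, 0.82, 0.80, 0.78, 0.74, 0.92, 0.86, 0.71])

brackets = [(0, 50), (50, 70), (70, 120)]
print(f"pooled Dice: {dice_scores.mean():.3f}")
for lo, hi in brackets:
    mask = (ages >= lo) & (ages < hi)
    if mask.any():
        print(f"age {lo}-{hi}: Dice {dice_scores[mask].mean():.3f} (n={mask.sum()})")
```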
Google DeepMind said on March 17, 2026 that it has published a new cognitive-science framework for evaluating progress toward AGI and launched a Kaggle hackathon to turn that framework into practical benchmarks. The proposal defines 10 cognitive abilities, recommends comparison against human baselines, and puts $200,000 behind community-built evaluations.
OpenAI said on March 9, 2026 that it plans to acquire Promptfoo. The company said Promptfoo's technology will strengthen agentic security testing and evaluation inside OpenAI Frontier, while Promptfoo remains open source under its current license and existing customers continue to receive support.
A r/MachineLearning post argues that Meta’s COCONUT results may owe more to curriculum design and sequential processing than to the headline mechanism of recycling hidden states as latent thought tokens.
A Hacker News thread amplified a March 12 analysis arguing that LLM coding progress looks much weaker when measured by maintainer merge decisions rather than test-passing SWE-bench scores.
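A toy illustration of the thread's point, with made-up numbers: on the same set of model-authored PRs, test-pass rate and maintainer merge rate can diverge sharply.

```python
# Made-up PR outcomes, only to show the two metrics diverging on one sample.
prs = [
    {"id": 1, "tests_pass": True,  "merged": True},
    {"id": 2, "tests_pass": True,  "merged": False},  # passes tests, rejected in review
    {"id": 3, "tests_pass": True,  "merged": False},
    {"id": 4, "tests_pass": False, "merged": False},
    {"id": 5, "tests_pass": True,  "merged": True},
]

pass_rate = sum(p["tests_pass"] for p in prs) / len(prs)
merge_rate = sum(p["merged"] for p in prs) / len(prs)
print(f"test-pass rate: {pass_rate:.0%}, merge rate: {merge_rate:.0%}")  # 80% vs 40%
```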
A high-scoring discussion in r/MachineLearning asks what benchmarking papers are for when proprietary models change monthly and old versions disappear. The strongest replies argued that model rankings go stale fast, but the datasets and failure cases can remain useful as durable eval assets.
A new paper discussed in r/MachineLearning argues that unofficial model-access providers can quietly substitute models and distort both research and production results.
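One possible spot check, hedged and not the paper's method: fingerprint a model name with greedy completions on a fixed prompt battery, then compare two providers claiming to serve the same model. Endpoint URLs, keys, and prompts below are placeholders; note that temperature-0 outputs can still vary slightly across serving stacks, so low agreement is a red flag rather than proof.

```python
# Sketch of a substitution check via greedy-completion fingerprints.
# Assumes both providers expose an OpenAI-compatible API.
from openai import OpenAI

PROMPTS = [
    "List the first five prime numbers.",
    "Translate 'good morning' into French.",
    "What is 17 * 23?",
]

def fingerprint(base_url: str, api_key: str, model: str) -> list[str]:
    client = OpenAI(base_url=base_url, api_key=api_key)
    outs = []
    for p in PROMPTS:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": p}],
            temperature=0,          # greedy decoding for maximum determinism
            max_tokens=64,
        )
        outs.append(r.choices[0].message.content)
    return outs

ref = fingerprint("https://api.reference-provider.example/v1", "KEY_A", "some-model")
alt = fingerprint("https://api.reseller.example/v1", "KEY_B", "some-model")
agree = sum(a == b for a, b in zip(ref, alt)) / len(PROMPTS)
print(f"exact-match agreement: {agree:.0%}")  # low agreement suggests substitution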
OpenAI said it published a new Chain-of-Thought controllability evaluation suite and research paper. The company reports that GPT-5.4 Thinking showed limited ability to obscure its reasoning, supporting chain-of-thought monitoring as a practical safety mechanism.
In a January 21, 2026 engineering post, Anthropic explained how it repeatedly redesigned a take-home performance test as Claude models improved. The company describes how Opus 4 and Opus 4.5 changed the evaluation baseline and forced process-level updates.
A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.
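A minimal runnable sketch of the loop the post describes. The two helpers are hypothetical stand-ins (propose_edit for the model call, run_tests for the project's test suite), not the poster's actual harness; the key idea is running verification immediately after each edit and discarding regressions.

```python
# "Verify after every edit": test the working tree after each candidate patch,
# keep edits that reduce failures, drop edits that add failures.
import random

random.seed(0)

def propose_edit(history: list[str]) -> str:
    return f"patch-{len(history)}"          # stand-in for a model-proposed patch

def run_tests() -> int:
    return random.randint(0, 5)             # stand-in: number of failing tests

def solve(max_steps: int = 20) -> bool:
    history: list[str] = []
    best = run_tests()                      # baseline failure count
    for _ in range(max_steps):
        edit = propose_edit(history)
        # apply `edit` to the working tree here (omitted in this sketch)
        failures = run_tests()              # verify immediately after the edit
        if failures == 0:
            return True                     # green suite: task solved
        if failures > best:
            # regression: revert `edit` (revert omitted here) and try again
            continue
        best = failures                     # progress: keep the edit
        history.append(edit)
    return False

print(solve())
```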