Microsoft Research has open-sourced AgentRx, a framework for pinpointing the first critical failure in long AI-agent trajectories. It ships with a 115-trajectory benchmark and reports gains in both failure localization and root-cause attribution.
#evaluation
A Reddit discussion around a new medical segmentation paper argues that using automated labels for both training and evaluation can hide age-related disparities, making performance on younger patients look better than it really is.
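A minimal sketch of the stratified check the thread implies: recompute the segmentation metric per age bracket against human-verified labels rather than relying on the pooled number. The arrays and brackets below are illustrative, not from the paper.

```python
# Minimal sketch: surface age-stratified gaps that a pooled metric can hide.
# `ages` and `dice_scores` are hypothetical per-patient values; in practice the
# scores should be computed against human-verified labels, not automated ones.
import numpy as np

ages = np.array([34, 41, 58, 67, 72, 79, 83, 29, 55, 88])
dice_scores = np.array([0.91, 0.90, 0.87, 0.82, 0.80, 0.78, 0.74, 0.92, 0.86, 0.71])

brackets = [(0, 50), (50, 70), (70, 120)]
print(f"pooled Dice: {dice_scores.mean():.3f}")
for lo, hi in brackets:
    mask = (ages >= lo) & (ages < hi)
    if mask.any():
        print(f"age {lo}-{hi}: Dice {dice_scores[mask].mean():.3f} (n={mask.sum()})")
```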
Google DeepMind said on March 17, 2026 that it has published a new cognitive-science framework for evaluating progress toward AGI and launched a Kaggle hackathon to turn that framework into practical benchmarks. The proposal defines 10 cognitive abilities, recommends comparison against human baselines, and puts $200,000 behind community-built evaluations.
OpenAI said on March 9, 2026 that it plans to acquire Promptfoo. The company said Promptfoo's technology will strengthen agentic security testing and evaluation inside OpenAI Frontier, while Promptfoo remains open source under its current license and existing customers continue to receive support.
A r/MachineLearning post argues that Meta’s COCONUT results may owe more to curriculum design and sequential processing than to the headline mechanism of recycling hidden states as latent thought tokens.
A Hacker News thread amplified a March 12 analysis arguing that LLM coding progress looks much weaker when measured by maintainer merge decisions rather than test-passing SWE-bench scores.
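A toy illustration of the thread's point, with made-up numbers: on the same set of model-authored PRs, test-pass rate and maintainer merge rate can diverge sharply.

```python
# Made-up PR outcomes, only to show the two metrics diverging on one sample.
prs = [
    {"id": 1, "tests_pass": True,  "merged": True},
    {"id": 2, "tests_pass": True,  "merged": False},  # passes tests, rejected in review
    {"id": 3, "tests_pass": True,  "merged": False},
    {"id": 4, "tests_pass": False, "merged": False},
    {"id": 5, "tests_pass": True,  "merged": True},
]

pass_rate = sum(p["tests_pass"] for p in prs) / len(prs)
merge_rate = sum(p["merged"] for p in prs) / len(prs)
print(f"test-pass rate: {pass_rate:.0%}, merge rate: {merge_rate:.0%}")  # 80% vs 40%
```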
A high-scoring discussion in r/MachineLearning asks what benchmarking papers are for when proprietary models change monthly and old versions disappear. The strongest replies argued that model rankings go stale fast, but the datasets and failure cases can remain useful as durable eval assets.
A new paper discussed in r/MachineLearning argues that unofficial model-access providers can quietly substitute models and distort both research and production results.
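One possible spot check, hedged and not the paper's method: fingerprint a model name with greedy completions on a fixed prompt battery, then compare two providers claiming to serve the same model. Endpoint URLs, keys, and prompts below are placeholders; note that temperature-0 outputs can still vary slightly across serving stacks, so low agreement is a red flag rather than proof.

```python
# Sketch of a substitution check via greedy-completion fingerprints.
# Assumes both providers expose an OpenAI-compatible API.
from openai import OpenAI

PROMPTS = [
    "List the first five prime numbers.",
    "Translate 'good morning' into French.",
    "What is 17 * 23?",
]

def fingerprint(base_url: str, api_key: str, model: str) -> list[str]:
    client = OpenAI(base_url=base_url, api_key=api_key)
    outs = []
    for p in PROMPTS:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": p}],
            temperature=0,          # greedy decoding for maximum determinism
            max_tokens=64,
        )
        outs.append(r.choices[0].message.content)
    return outs

ref = fingerprint("https://api.reference-provider.example/v1", "KEY_A", "some-model")
alt = fingerprint("https://api.reseller.example/v1", "KEY_B", "some-model")
agree = sum(a == b for a, b in zip(ref, alt)) / len(PROMPTS)
print(f"exact-match agreement: {agree:.0%}")  # low agreement suggests substitution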
OpenAI said it published a new Chain-of-Thought controllability evaluation suite and research paper. The company reports that GPT-5.4 Thinking showed limited ability to obscure its reasoning, supporting chain-of-thought monitoring as a practical safety mechanism.
In a January 21, 2026 engineering post, Anthropic explained how it repeatedly redesigned a take-home performance test as Claude models improved. The company describes how Opus 4 and Opus 4.5 changed the evaluation baseline and forced process-level updates.
A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.
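A minimal runnable sketch of the loop the post describes. The two helpers are hypothetical stand-ins (propose_edit for the model call, run_tests for the project's test suite), not the poster's actual harness; the key idea is running verification immediately after each edit and discarding regressions.

```python
# "Verify after every edit": test the working tree after each candidate patch,
# keep edits that reduce failures, drop edits that add failures.
import random

random.seed(0)

def propose_edit(history: list[str]) -> str:
    return f"patch-{len(history)}"          # stand-in for a model-proposed patch

def run_tests() -> int:
    return random.randint(0, 5)             # stand-in: number of failing tests

def solve(max_steps: int = 20) -> bool:
    history: list[str] = []
    best = run_tests()                      # baseline failure count
    for _ in range(max_steps):
        edit = propose_edit(history)
        # apply `edit` to the working tree here (omitted in this sketch)
        failures = run_tests()              # verify immediately after the edit
        if failures == 0:
            return True                     # green suite: task solved
        if failures > best:
            # regression: revert `edit` (revert omitted here) and try again
            continue
        best = failures                     # progress: keep the edit
        history.append(edit)
    return False

print(solve())
```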