Claude Fable 5 has moved to the top of Artificial Analysis’s GDPval-AA benchmark with a 1932 score. The result puts Anthropic models in three of the top four slots and raises the bar for long-running agentic knowledge work.
Claude Fable 5 has moved to the top of Artificial Analysis’s GDPval-AA benchmark with a 1932 score. The result puts Anthropic models in three of the top four slots and raises the bar for long-running agentic knowledge work.
HN latched onto a practical shift in coding evals: correctness is no longer enough if the patch would fail human review.
NMR analysis is a slow chemistry bottleneck, and Anthropic says Opus 4.7 matched or beat specialist tools on parts of a 20-compound test. Its hydrogen NMR average error was about plus or minus 0.079 ppm.
ARC Prize put Anthropic Opus 4.8 at the top of ARC-AGI-3, but the score shows how hard the benchmark remains. The new mark is 1.5% at roughly $10K, with progress tied to object-and-system abstraction rather than image-level pattern matching.
Liquid AI's new LFM2.5 8B-A1B MoE model delivers 253 tokens/s on M5 Max, runs under 6GB memory on mobile, and achieves 18,500 output tokens/s on H100—all while outperforming similarly-sized dense models on key benchmarks.
Claude Opus 4.8 is showing its strongest early signal in agentic work, not only coding. Artificial Analysis says the model scored 1890 on GDPval-AA, 121 points ahead of GPT-5.5 xhigh.
DeepSWE reframes coding-agent evaluation with 113 original tasks across 91 repositories. Its first board gives GPT-5.5 a 70.0% pass@1 score, versus 54.2% for Claude Opus 4.7.
At Google I/O 2026 on May 19, Google unveiled Gemini 3.5 Flash—which outperforms Gemini 3.1 Pro across all benchmarks at 4× the speed and half the API cost—alongside Gemini Spark, a 24/7 personal AI agent that works in the background and can be reached directly via Gmail. Spark enters beta for Google AI Ultra subscribers in the US starting the week of May 26.
A practical benchmark from ModelRift tested six AI coding tools on parametric 3D Pantheon modeling, crowning Google Antigravity 2.0 as the best autonomous performer with a quality score of 4.5/5 — the only tool to include the interior coffered ceiling.
Google launched Gemini 3.5 Flash at I/O 2026 on May 19, making it generally available the same day. It outperforms Gemini 3.1 Pro on coding and agentic benchmarks while running 4x faster at 40% lower cost.
PwC will roll out Claude Code and Cowork across its global workforce, launching a 30,000-staff certification program and a joint Center of Excellence. Insurance underwriting cycles already cut from 10 weeks to 10 days.
A new DELEGATE-52 benchmark study finds that even frontier LLMs like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content during long delegated workflows, with errors compounding silently.