Anthropic pits Claude against 99 bio problems, clears 30% of expert stumpers

Original: New on the Science Blog: We gave Claude 99 problems analyzing real biological data and compared its performance against an expert panel. On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those—and most of the rest.

Sciences · Apr 30, 2026 · By Insights AI · 2 min read

Biology capability claims are usually vague. Anthropic put numbers on the table in a new X post, saying Claude was tested on 99 problems built around real biological data and benchmarked against an expert panel. The headline result was not just average performance: on the 23 problems where the experts were stumped, Anthropic says its most recent models solved roughly 30%, and solved most of the remaining 76 problems as well. That turns a fuzzy safety and capability debate into a concrete benchmark worth tracking.
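To make the implied counts concrete: roughly 30% of 23 is about 7 problems. The sketch below simply restates the post's headline numbers in code; reading "most of the rest" as "more than half of the other 76 problems" is an assumption, since Anthropic has not published a problem-level breakdown.

```python
# Back-of-envelope counts implied by Anthropic's headline numbers.
# Assumption: "the rest" means the 76 problems the experts did solve,
# and "most" means more than half; no problem-level data is public.

TOTAL_PROBLEMS = 99
EXPERT_STUMPED = 23                          # problems that stumped the panel
REMAINING = TOTAL_PROBLEMS - EXPERT_STUMPED  # 76 problems the experts handled

solved_stumpers = round(0.30 * EXPERT_STUMPED)  # "roughly 30%" -> ~7
min_solved_remaining = REMAINING // 2 + 1       # "most" -> at least 39

print(f"Expert-stumped problems solved: ~{solved_stumpers} of {EXPERT_STUMPED}")
print(f"Remaining problems solved: at least {min_solved_remaining} of {REMAINING}")
```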

“We gave Claude 99 problems analyzing real biological data… On 23 problems, the experts were stumped.”

The source post is here: Anthropic on X. Anthropic’s main X account usually carries safety, eval, and interpretability work rather than consumer feature hype, so the framing matters. A companion post links BioMysteryBench, which Anthropic describes as a new bioinformatics evaluation built around open-ended research problems. The linked research page is titled “Evaluating Claude’s bioinformatics research capabilities with BioMysteryBench,” making clear that the company wants this read as a capability measurement on realistic scientific tasks, not as a marketing claim.

What stands out is the choice to spotlight the hardest slice of the benchmark instead of only a blended average. Anthropic is effectively saying the interesting question is no longer whether models can handle routine biology prompts, but how often they can help on cases that block trained humans. That is a much tougher bar for frontier models, and it is the sort of evidence regulators, lab partners, and external evaluators will want when they ask where advanced systems are becoming useful in wet-lab-adjacent work.

The next thing to watch is external scrutiny. BioMysteryBench will matter far more if Anthropic or outside researchers publish model-by-model breakdowns, failure modes, and replication results over time. For now, the tweet gives the cleanest takeaway: Claude is no longer being framed only as a chat interface or coding copilot. Anthropic is making a direct claim that its latest models can contribute on a measurable slice of hard bioinformatics work.


Related Articles

Sciences · Mar 27, 2026 · 2 min read

Anthropic said on March 23, 2026 that not every long-horizon task benefits from splitting work across many agents, and pointed to a sequential setup for modeling the early universe. In the linked research post, Anthropic describes using Claude Opus 4.6 with persistent memory, orchestration patterns, and test oracles to implement a differentiable cosmological Boltzmann solver.

Sciences · Apr 14, 2026 · 2 min read

OpenAI says ChatGPT is already being used at research scale across science and mathematics. In its January 2026 report, the company says advanced science and math usage reached nearly 8.4 million weekly messages from roughly 1.3 million weekly users, with early evidence that GPT-5.2 is contributing to serious mathematical work.
