Anthropic says Claude solved 30% of biology tasks experts missed


Sciences · May 1, 2026 · By Insights AI

Why the tweet mattered immediately

Biology benchmarks usually reward memorization or narrow problem solving, not the messy workflow a researcher faces when handed raw datasets. Anthropic's main X account pushed a different claim on April 29: Claude was tested on 99 real bioinformatics problems, and on the 23 where human experts could not solve the task from scratch, the company's most recent models solved roughly 30% (about seven tasks).

"On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those."

The linked Science Blog post explains why that number is more than a leaderboard flourish. Anthropic says BioMysteryBench draws from raw or minimally processed DNA, RNA, proteomics, and metabolomics datasets, then asks questions with objective ground truth rather than soliciting open-ended scientific opinion. Up to five domain experts attempted each task. After quality control, 76 tasks were labeled human-solvable and 23 remained human-difficult. The post says current Claude generations are now roughly on par with human experts overall, and sometimes reach correct answers through different analytical routes.
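To make "objective ground truth on raw data" concrete: a BioMysteryBench-style question has a single checkable answer derivable from the dataset itself. Anthropic has not published its task format, so the following is a toy illustration only (the FASTA record and the GC-content question are our own example, not from the benchmark):

```python
def gc_content(fasta_text: str) -> float:
    """Fraction of G/C bases across all sequences in a FASTA string."""
    # Concatenate sequence lines, skipping '>' header lines.
    seq = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )
    if not seq:
        raise ValueError("no sequence data found")
    gc = sum(base in "GCgc" for base in seq)
    return gc / len(seq)

record = """>toy_contig
ATGCGC
GCTA"""
print(round(gc_content(record), 2))  # 0.6
```

The point is that correctness is mechanical: any analytical route that yields 0.6 is right, which is how a benchmark can credit a model and a human expert even when their workflows differ.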

Why this raises the bar for AI-in-science evals

The setup around the model is what makes the benchmark interesting. Anthropic says Claude runs inside a container with canonical bioinformatics tools, can install extra packages with pip or conda, and can access public databases such as NCBI and Ensembl. That is much closer to a working computational biology environment than a multiple-choice test. In several examples, Anthropic says human experts leaned on standard annotation tools while Claude recognized useful patterns or sequences through a different line of reasoning.
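Anthropic does not publish the container spec, but an environment in the spirit it describes — canonical tools preinstalled, pip and conda available for extras, public databases reachable over the network — might be provisioned along these lines (every tool and package name here is an illustrative assumption, not a confirmed detail of Anthropic's setup):

```shell
# Hypothetical sketch of a bioinformatics task container.
# Canonical command-line tools via conda (bioconda channel):
conda install -y -c bioconda samtools bcftools bwa   # alignment and variant calling
# Scripting and data-wrangling libraries via pip:
pip install biopython pandas
# Public references such as NCBI (ncbi.nlm.nih.gov) and Ensembl
# (ensembl.org) would then be queried over HTTPS during a task.
```

A setup like this matters because it shifts the evaluation from recall to workflow: the model must choose tools, install what is missing, and chain steps, which is closer to how a computational biologist actually works.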

The Anthropic account usually uses X to surface research that later feeds into product positioning, system cards, or broader safety claims, so this post reads like an early signal about where the company wants Claude to compete next. What to watch now is whether BioMysteryBench becomes a shared external yardstick across labs, and whether other model vendors publish comparable results on messy, tool-using biology tasks instead of cleaner academic benchmarks. Source: Anthropic source tweet · Anthropic research post


