Anthropic says Claude solved 30% of biology tasks experts missed


Sciences · May 1, 2026 · By Insights AI

Why the tweet mattered immediately

Biology benchmarks usually reward memorization or narrow problem solving, not the messy workflow a researcher faces when handed raw datasets. Anthropic's main X account pushed a different claim on April 29: Claude was tested on 99 real bioinformatics problems, and on the 23 where human experts could not solve the task from scratch, the company's most recent models solved roughly 30% (about seven tasks).

"On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those."

The linked Science Blog post explains why that number is more than a leaderboard flourish. Anthropic says BioMysteryBench draws from raw or minimally processed DNA, RNA, proteomics, and metabolomics datasets, then asks questions with objective ground truth rather than soliciting open-ended scientific opinion. Up to five domain experts attempted each task. After quality control, 76 tasks were labeled human-solvable and 23 remained human-difficult. The post says current Claude generations are now roughly on par with human experts overall, and sometimes reach correct answers through different analytical routes.
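To make "objective ground truth on raw data" concrete: a BioMysteryBench-style question has a single checkable answer derivable from the dataset itself. Anthropic has not published its task format, so the following is a toy illustration only (the FASTA record and the GC-content question are our own example, not from the benchmark):

```python
def gc_content(fasta_text: str) -> float:
    """Fraction of G/C bases across all sequences in a FASTA string."""
    # Concatenate sequence lines, skipping '>' header lines.
    seq = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )
    if not seq:
        raise ValueError("no sequence data found")
    gc = sum(base in "GCgc" for base in seq)
    return gc / len(seq)

record = """>toy_contig
ATGCGC
GCTA"""
print(round(gc_content(record), 2))  # 0.6
```

The point is that correctness is mechanical: any analytical route that yields 0.6 is right, which is how a benchmark can credit a model and a human expert even when their workflows differ.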

Why this raises the bar for AI-in-science evals

The setup around the model is what makes the benchmark interesting. Anthropic says Claude runs inside a container with canonical bioinformatics tools, can install extra packages with pip or conda, and can access public databases such as NCBI and Ensembl. That is much closer to a working computational biology environment than a multiple-choice test. In several examples, Anthropic says human experts leaned on standard annotation tools while Claude recognized useful patterns or sequences through a different line of reasoning.
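Anthropic does not publish the container spec, but an environment in the spirit it describes — canonical tools preinstalled, pip and conda available for extras, public databases reachable over the network — might be provisioned along these lines (every tool and package name here is an illustrative assumption, not a confirmed detail of Anthropic's setup):

```shell
# Hypothetical sketch of a bioinformatics task container.
# Canonical command-line tools via conda (bioconda channel):
conda install -y -c bioconda samtools bcftools bwa   # alignment and variant calling
# Scripting and data-wrangling libraries via pip:
pip install biopython pandas
# Public references such as NCBI (ncbi.nlm.nih.gov) and Ensembl
# (ensembl.org) would then be queried over HTTPS during a task.
```

A setup like this matters because it shifts the evaluation from recall to workflow: the model must choose tools, install what is missing, and chain steps, which is closer to how a computational biologist actually works.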

The Anthropic account usually uses X to surface research that later feeds into product positioning, system cards, or broader safety claims, so this post reads like an early signal about where the company wants Claude to compete next. What to watch now is whether BioMysteryBench becomes a shared external yardstick across labs, and whether other model vendors publish comparable results on messy, tool-using biology tasks instead of cleaner academic benchmarks. Source: Anthropic source tweet · Anthropic research post


