Anthropic says Claude solved 30% of biology tasks experts missed
Original tweet: "New on the Science Blog: We gave Claude 99 problems analyzing real biological data and compared its performance against an expert panel."
Why the tweet mattered immediately
Biology benchmarks usually reward memorization or narrow problem solving, not the messy workflow a researcher faces when handed raw datasets. Anthropic's main X account pushed a different claim on April 29: Claude was tested on 99 real bioinformatics problems, and on the 23 that human experts could not solve from scratch, the company's most recent models solved roughly 30% of them, about seven tasks.
"On 23 problems, the experts were stumped. Our most recent models solved roughly 30% of those."
The linked Science Blog post explains why that number is more than a leaderboard flourish. Anthropic says BioMysteryBench draws from raw or minimally processed DNA, RNA, proteomics and metabolomics datasets, then asks questions with objective ground truth rather than open-ended scientific opinion. Up to five domain experts attempted each task. After quality control, 76 tasks were labeled human-solvable and 23 remained human-difficult. The post says current Claude generations are now roughly on par with human experts overall, and sometimes reach correct answers through different analytical routes.
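Anthropic has not published BioMysteryBench's schema, but the structure the post describes maps onto a simple shape. The sketch below is illustrative only: the field names, the exact-match grading rule, and the solve_rate helper are assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch of the benchmark structure the post describes.
# Field names and the exact-match grading rule are assumptions;
# Anthropic has not published BioMysteryBench's schema.
from dataclasses import dataclass

@dataclass
class Task:
    question: str
    ground_truth: str      # objective answer, not open-ended opinion
    human_solvable: bool   # label assigned after expert attempts and QC

def solve_rate(tasks: list[Task], answers: dict[str, str]) -> float:
    """Fraction of tasks whose model answer matches the ground truth."""
    correct = sum(
        answers.get(t.question, "").strip().lower()
        == t.ground_truth.strip().lower()
        for t in tasks
    )
    return correct / len(tasks)

# Per the post: 99 tasks total, 76 labeled human-solvable after QC,
# 23 human-difficult. A ~30% rate on the hard slice is about 7 of 23.
hard_slice_hits = round(0.30 * 23)   # ~7 tasks
```

Real grading for numeric or sequence answers would need tolerance and normalization rules; exact match here is just the simplest stand-in for what "objective ground truth" implies.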
Why this raises the bar for AI-in-science evals
The setup around the model is what makes the benchmark interesting. Anthropic says Claude runs inside a container with canonical bioinformatics tools, can install extra packages with pip or conda, and can access public databases such as NCBI and Ensembl. That is much closer to a working computational biology environment than a multiple-choice test. In several examples, Anthropic says human experts leaned on standard annotation tools while Claude recognized useful patterns or sequences through a different line of reasoning.
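Anthropic does not publish the container's spec, so the snippet below is only a sketch of what "access to NCBI" typically looks like from such an environment, assuming Biopython is installed (e.g., pip install biopython); the accession number and email address are placeholders.

```python
# Sketch of programmatic NCBI access from a Python environment,
# assuming Biopython is installed (pip install biopython).
# The accession and email are placeholders, not from Anthropic's post.
from Bio import Entrez, SeqIO

Entrez.email = "researcher@example.org"  # NCBI asks for a contact address

# Fetch a public GenBank nucleotide record by accession.
handle = Entrez.efetch(db="nucleotide", id="NC_045512.2",
                       rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

print(record.id, len(record.seq), "bp")
for feature in record.features[:5]:
    print(feature.type, feature.location)
```

The same environment could just as well hit Ensembl's REST API over plain HTTPS; the point is that the agent works against live databases with a real package manager, not a frozen question bank.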
The Anthropic account usually uses X to surface research that later feeds into product positioning, system cards, or broader safety claims, so this post reads like an early signal about where the company wants Claude to compete next. What to watch now is whether BioMysteryBench becomes a shared external yardstick across labs, and whether other model vendors publish comparable results on messy, tool-using biology tasks instead of cleaner academic benchmarks.
Source: Anthropic source tweet · Anthropic research post
Related Articles
Anthropic said on March 23, 2026 that it is launching a Science Blog focused on how AI is changing research practice and scientific discovery. The new blog will publish feature stories, workflow guides, and field notes, while also highlighting Anthropic's broader AI-for-science programs.
Anthropic said on March 23, 2026 that not every long-horizon task benefits from splitting work across many agents, and pointed to a sequential setup for modeling the early universe. In the linked research post, Anthropic describes using Claude Opus 4.6 with persistent memory, orchestration patterns, and test oracles to implement a differentiable cosmological Boltzmann solver.