GeneBench-Pro turns biology-agent testing into 129 hard problems

Sciences · Jul 1, 2026 · By Insights AI (Twitter)

A harder test for biology agents

AI agents built for biology need a harder test than question answering. On June 30, 2026, OpenAI pointed to GeneBench-Pro as a benchmark for measuring whether agents can carry out realistic biology work. The tweet framed the benchmark directly:

“GeneBench-Pro: a benchmark for evaluating AI agents on real-world biology tasks.”

Indexed coverage of OpenAI’s post gives the scale and difficulty. GeneBench-Pro contains 129 computational-biology problems. The tasks are designed around work products rather than trivia: analysis, experimental reasoning, and interpretation, where reaching the answer may take several tool calls and a defensible report. Reported scores remain low even for frontier systems. GPT-5.6 Sol reaches 28.7% at the highest reasoning setting and 31.5% in Pro mode, while earlier GPT-5 versions were below 5% when the original GeneBench work began.
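To put those percentages in terms of problems, the arithmetic can be sketched as below. This is a rough illustration only. It assumes each of the 129 problems counts equally and is scored pass/fail, which the source does not confirm, and the function and variable names are made up for this example.

```python
# Back-of-the-envelope conversion of the reported scores into problem counts.
# Assumption (not confirmed by the source): all 129 problems are weighted
# equally and scored pass/fail, so score * total approximates problems solved.

TOTAL_PROBLEMS = 129

# Scores as reported in the coverage, in percent.
REPORTED_SCORES = {
    "GPT-5.6 Sol (highest reasoning setting)": 28.7,
    "GPT-5.6 Sol (Pro mode)": 31.5,
}


def implied_solved(score_pct: float, total: int = TOTAL_PROBLEMS) -> float:
    """Approximate number of problems a percentage score corresponds to."""
    return score_pct / 100 * total


def headroom_points(score_pct: float) -> float:
    """Percentage points remaining before a perfect score."""
    return 100 - score_pct


for name, score in REPORTED_SCORES.items():
    solved = implied_solved(score)
    print(f"{name}: ~{solved:.1f} of {TOTAL_PROBLEMS} problems, "
          f"{headroom_points(score):.1f} points of headroom")
```

Under that assumption, the Pro-mode score corresponds to roughly 41 of the 129 problems, with about 88 still unsolved. If the benchmark uses partial credit or averages over several runs, the counts would not map this cleanly.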

OpenAI’s account usually posts model releases, product changes, and research updates. This item is about evaluation instead. Biology is a domain where LLM agents can save time, but also one where weak intermediate reasoning can cause expensive errors. An agent that can search literature, run code, inspect biological data, and write a recommendation needs to be assessed on the path it takes, not only on its final sentence.

The concrete number to watch is the gap between 31.5% and saturation. A benchmark where the strongest model still misses most problems may be useful for tracking progress, but it also warns against treating biology agents as autonomous researchers today.

The next question is whether GeneBench-Pro becomes a shared yardstick outside OpenAI. Teams building drug discovery, genomics, and lab-automation agents will need to know whether the tasks are broad enough, whether the scoring can be reproduced, and whether benchmark gains map to fewer mistakes in real research pipelines.

Source: OpenAI source tweet · OpenAI GeneBench-Pro post
