OpenAI says 5 of 10 First Proof attempts may be correct after expert review

Original: Our First Proof submissions

By Insights AI · Mar 9, 2026

A harder test than standard math benchmarks

On February 20, 2026, OpenAI published its proof attempts for all 10 problems in First Proof, a research-level math challenge built to test whether AI systems can produce correct, checkable proofs on domain-specific problems. Unlike short-answer benchmark sets, these tasks require complete arguments that specialists can inspect line by line, which makes them a more demanding test of sustained reasoning and formal rigor.

OpenAI said the model was run on all 10 problems and that, based on expert feedback, its attempts on problems 4, 5, 6, 9, and 10 — five of the ten — have a high chance of being correct. The company also disclosed a correction: it initially thought its attempt on problem 2 was likely correct, but now believes it is incorrect after reviewing official commentary and additional community analysis.

What the company believes it learned

The post argues that frontier research challenges reveal capabilities that ordinary benchmarks can hide. OpenAI says tasks like First Proof test whether a model can sustain long chains of reasoning, choose useful abstractions, deal with ambiguous problem statements, and produce arguments that survive expert scrutiny. That is a much stricter requirement than selecting an answer from a small candidate set.

James R. Lee of OpenAI described the exercise as a preview of a model in training whose primary goal is greater rigor in thinking. According to the post, the model first solved problems 9 and 10, then improved enough during training to solve at least three more. OpenAI highlighted problems 4 and 6 as especially notable because they came from fields familiar to the research team and showed visible gains over just a few days of training.

Not a clean benchmark, but still a meaningful signal

OpenAI was explicit that this was not a perfectly controlled evaluation. The company said the work involved limited human supervision, occasional suggestions to retry promising strategies, clarifications after expert feedback, and some use of ChatGPT for verification, formatting, and style. For several problems, humans selected the best attempt from a small set of runs.

Those caveats matter, but they do not erase the significance of the result. OpenAI is essentially arguing that research-grade reasoning should be evaluated in environments where correctness is difficult to verify and where failure modes are informative. That is a useful shift away from leaderboard optimization toward more realistic stress tests for future models.

Why this matters for frontier model evaluation

The release also connects First Proof to OpenAI's broader reasoning agenda. The company pointed to its July 2025 gold medal-level score on the International Mathematical Olympiad and to later math, physics, and science collaboration experiments. The practical implication is that OpenAI wants public evaluation to move closer to expert workflows, where a model must persuade specialists rather than merely outperform benchmarks.

Source: OpenAI official research post.




© 2026 Insights. All rights reserved.