OpenAI says 5 of 10 First Proof attempts may be correct after expert review
A harder test than standard math benchmarks
On February 20, 2026, OpenAI published its proof attempts for all 10 problems in First Proof, a research-level math challenge built to test whether AI systems can produce correct, checkable proofs on domain-specific problems. Unlike short-answer benchmark sets, these tasks require complete arguments that specialists can inspect line by line, which makes them a more demanding test of sustained reasoning and formal rigor.
OpenAI said the model was run on all 10 problems and that, based on expert feedback, its attempts on problems 4, 5, 6, 9, and 10 have a high chance of being correct. The company also disclosed a correction: it initially judged its attempt on problem 2 likely correct, but now believes it is incorrect after reviewing official commentary and additional community analysis.
What the company believes it learned
The post argues that frontier research challenges reveal capabilities that ordinary benchmarks can hide. OpenAI says tasks like First Proof test whether a model can sustain long chains of reasoning, choose useful abstractions, deal with ambiguous problem statements, and produce arguments that survive expert scrutiny. That is a much stricter requirement than selecting an answer from a small candidate set.
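To make that contrast concrete, here is a minimal Python sketch, purely illustrative and not OpenAI's or the challenge's grading code. All names and the verdict scheme are assumptions. In answer-matching benchmarks, one string comparison settles the score; in proof-style evaluation, the verdict aggregates expert reviews of the full argument, and a single flawed step sinks the proof.

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    CORRECT = "correct"            # experts accept the full argument
    LIKELY_CORRECT = "likely"      # plausible, pending further review
    INCORRECT = "incorrect"        # a flaw was found in some step

def score_multiple_choice(model_answer: str, answer_key: str) -> bool:
    # Standard benchmark scoring: one string comparison per problem.
    return model_answer.strip() == answer_key.strip()

@dataclass
class ProofReview:
    problem_id: int
    verdict: Verdict
    flawed_step: str | None = None  # which step failed, if any

def score_proof(reviews: list[ProofReview]) -> Verdict:
    # A proof is only as strong as its weakest step: any reviewer who
    # finds a concrete flaw overrides more optimistic verdicts, and full
    # acceptance requires every reviewer to accept every step.
    if any(r.verdict is Verdict.INCORRECT for r in reviews):
        return Verdict.INCORRECT
    if all(r.verdict is Verdict.CORRECT for r in reviews):
        return Verdict.CORRECT
    return Verdict.LIKELY_CORRECT
```

The asymmetry is the point: a multiple-choice score cannot move from "likely correct" to "incorrect" on later scrutiny, whereas OpenAI's problem 2 attempt did exactly that.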
James R. Lee of OpenAI described the exercise as a preview of a model still in training whose primary design goal is more rigorous reasoning. According to the post, the model first solved problems 9 and 10, then improved enough during training to solve at least three more. OpenAI highlighted problems 6 and 4 as especially notable because they came from fields familiar to the research team and showed visible gains over just a few days.
Not a clean benchmark, but still a meaningful signal
OpenAI was explicit that this was not a perfectly controlled evaluation. The company said the work involved limited human supervision, occasional suggestions to retry promising strategies, clarifications after expert feedback, and some use of ChatGPT for verification, formatting, and style. For several problems, humans selected the best attempt from a small set of runs.
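The best-of-n step in particular is easy to picture. The sketch below uses hypothetical names throughout and is not OpenAI's pipeline; it shows why picking one attempt from several independent runs inflates the signal relative to average model behavior, since only the chosen attempt is ever graded.

```python
def sample_attempt(problem_id: int, run: int) -> str:
    """Stand-in for one independent model run producing a full proof attempt."""
    return f"attempt {run} on problem {problem_id}"

def human_promise_score(attempt: str) -> float:
    """Stand-in for a reviewer's quick judgment of how promising a draft looks."""
    return float(len(attempt))  # placeholder heuristic, not a real criterion

def select_best_of_n(problem_id: int, n: int = 4) -> str:
    attempts = [sample_attempt(problem_id, run) for run in range(n)]
    # Only the selected attempt is submitted for expert review, so the
    # headline result reflects selection pressure as well as raw model
    # capability -- the caveat OpenAI itself flags.
    return max(attempts, key=human_promise_score)

if __name__ == "__main__":
    print(select_best_of_n(problem_id=6))
```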
Those caveats matter, but they do not erase the significance of the result. OpenAI is essentially arguing that research-grade reasoning should be evaluated in environments where correctness is difficult to verify and where failure modes are informative. That is a useful shift away from leaderboard optimization toward more realistic stress tests for future models.
Why this matters for frontier model evaluation
The release also connects First Proof to OpenAI's broader reasoning agenda. The company pointed to its July 2025 gold-medal-level performance at the International Mathematical Olympiad and to later collaboration experiments in math, physics, and other sciences. The practical implication is that OpenAI wants public evaluation to move closer to expert workflows, where a model must persuade specialists rather than merely top benchmark leaderboards.
Source: OpenAI official research post.