OpenAI shares First Proof submissions for all 10 research-level math problems
What OpenAI Released
OpenAI said on February 20, 2026, that it is publishing its model’s submissions for First Proof, a set of research-level mathematics problems that the company says includes questions that took their original authors years to solve. OpenAI first shared the proof attempts on February 14, then updated its assessment after receiving expert feedback and community analysis.
According to OpenAI, its internal theorem-proving model generated attempts for all 10 problems in the set. The company now says the submissions for problems 4, 5, 6, 9, and 10 have a high chance of being correct, while several others are still being evaluated. OpenAI also said that one initially promising submission, the attempt for problem 2, now appears to be incorrect after comparison with official commentary and outside review.
Why First Proof Matters
OpenAI argues that standard math benchmarks do not fully capture the kinds of reasoning that matter for research-grade work. In its framing, First Proof tests long chains of reasoning, selecting the right abstractions, handling ambiguity, and producing arguments strong enough to survive expert scrutiny. That makes it a very different target from short benchmark questions that mostly reward narrow pattern matching or standard competition tactics.
The company places this release in the context of its July 2025 IMO result, when OpenAI said one of its general-purpose reasoning models reached gold-medal-level performance with a score of 35 out of 42, formalizing and proving IMO problems from natural-language statements. First Proof is presented as a harder next step, aimed at measuring whether model reasoning can move from elite competition settings toward open-ended mathematical research.
Why It Matters
The significance here is less about a single leaderboard and more about how frontier reasoning systems are being evaluated. Producing proof attempts that experts take seriously is a higher bar than answering many short-form questions correctly. OpenAI’s update does not claim all results are settled, but it does suggest theorem proving is becoming one of the clearest arenas for testing whether LLM-style systems can sustain long, structured reasoning under real scrutiny.
Source: OpenAI