OpenAI shares First Proof submissions for all 10 research-level math problems

Original: Our First Proof submissions

LLM · Mar 16, 2026 · By Insights AI · 2 min read

What OpenAI Released

OpenAI said on February 20, 2026, that it is publishing its model’s submissions for First Proof, a set of research-level mathematics problems that the company says includes questions that took their original authors years to solve. OpenAI first shared the proof attempts on February 14, then revised its assessment after expert feedback and community analysis.

According to OpenAI, its internal theorem-proving model generated attempts for all 10 problems in the set. The company now says the submissions for problems 4, 5, 6, 9, and 10 have a high chance of being correct, while several others are still under evaluation. OpenAI also said that one submission that initially looked promising, the attempt for problem 2, now appears to be incorrect after comparison with official commentary and outside review.

Why First Proof Matters

OpenAI argues that standard math benchmarks do not fully capture the kinds of reasoning that matter for research-grade work. In its framing, First Proof tests long chains of reasoning, selecting the right abstractions, handling ambiguity, and producing arguments strong enough to survive expert scrutiny. That makes it a very different target from short benchmark questions that mostly reward narrow pattern matching or standard competition tactics.

The company places this release in the context of its July 2025 IMO result, when OpenAI said one of its general-purpose reasoning models reached gold-medal-level performance with a score of 35 out of 42 while formalizing and proving IMO problems from natural-language statements. First Proof is presented as a harder next step, aimed at measuring whether model reasoning can move from elite competition settings toward open-ended mathematical research.

Why It Matters

The significance here is less about a single leaderboard and more about how frontier reasoning systems are being evaluated. Producing proof attempts that experts take seriously is a higher bar than answering many short-form questions correctly. OpenAI’s update does not claim all results are settled, but it does suggest theorem proving is becoming one of the clearest arenas for testing whether LLM-style systems can sustain long, structured reasoning under real scrutiny.

Source: OpenAI


© 2026 Insights. All rights reserved.