OpenAI’s February 13, 2026 research update outlines a practical use of language models for social science operations: deciding which studies should be replicated first. Instead of treating AI as a substitute for empirical work, the project frames models as a triage layer that helps researchers allocate scarce replication budgets more effectively.

According to the release, the team executed more than one million synthetic evaluations over papers from more than 160 political science journals. The model was asked to predict likely outcomes from each paper’s title and abstract, then those predictions were compared with observed sample-level findings. Cases where predictions diverged from observed outcomes were treated as high-value candidates for deeper replication review.

This is a meaningful shift in workflow design. In many fields, replication backlogs are constrained by labor and funding, not by a lack of potential targets. If disagreement can be used as a screening signal, labs can prioritize follow-up experiments where context shifts, sampling differences, or fragile effect sizes are most likely to matter.

OpenAI also shared an initial benchmark: a 30-study set drawn from major journals across 2018-2025. In the company’s report, GPT-5.2 in a zero-shot setting reached about 75% predictive accuracy on that benchmark. The important point is not that a model can “settle” contested findings, but that it may help teams quickly rank where additional human verification is likely to produce the most scientific value.

For institutions running large replication programs, this approach could reduce time-to-decision at the portfolio level. It could also improve transparency if model scores, disagreement thresholds, and final selection criteria are logged and published alongside replication outcomes.

Open questions remain. Cross-discipline transfer beyond political science must be tested carefully. Benchmark design can introduce publication and sampling bias if journal coverage is uneven. Most importantly, downstream impact must be measured: does model-guided triage actually increase replication yield per dollar and per researcher hour? Even with those caveats, the release signals a concrete direction for AI-assisted scientific governance rather than headline-only experimentation.

#replication

OpenAI Scales Social Science Replication Triage With 1M+ Synthetic Evaluations