Google Research finds curated-source systems beat open-web LLMs on superconductivity questions

Original: Testing LLMs on superconductivity research questions

Sciences · Mar 25, 2026 · By Insights AI · 2 min read

What happened

Google Research published a study on March 16, 2026 asking a practical question: can LLMs act as credible research partners in an open scientific field? The team used high-temperature superconductivity as the test case and evaluated six systems on expert-level questions, with results published in the Proceedings of the National Academy of Sciences.

The most important finding was not just which model won, but why. The top performers were NotebookLM and a custom retrieval-augmented generation system built on curated, quality-controlled sources. Systems with broader web access tended to mix established theories with more speculative claims, which reduced their usefulness for a field where unresolved debates and historical context matter.

How the study worked

  • Experts assembled 15 review articles to provide a quality-controlled overview of the field.
  • Those materials were expanded into a curated library of 1,726 sources for the closed systems.
  • Web-connected systems had access to 765 open-access experimental papers and 1,553 open-access theoretical papers.
  • A panel of experts wrote 67 questions and scored answers for balance, comprehensiveness, conciseness, and evidence quality.

The evaluated systems included GPT-4o, Perplexity, Claude 3.5, Gemini Advanced Pro 1.5, NotebookLM, and a custom RAG system. Google said NotebookLM performed especially well because it grounded answers in a constrained library of sources, producing responses that experts judged to be more balanced and better referenced. At the same time, the paper noted clear limitations across all systems, especially around temporal understanding, table and figure interpretation, and the tendency to miss relevant literature when phrasing changed.

Why it matters

This result is highly relevant for scientific AI products. It suggests that trustworthy performance in complex domains may depend less on unrestricted web access and more on carefully curated corpora, citation discipline, and retrieval design. In other words, the winning stack for scientific reasoning may be model plus knowledge architecture, not model alone.

For Insights readers, that is the bigger signal. As AI moves deeper into biology, materials science, physics, and medicine, product teams may need to invest as much in evidence workflows and expert source curation as they do in frontier-model selection.




© 2026 Insights. All rights reserved.