Google Research finds curated-source systems beat open-web LLMs on superconductivity questions
Original: Testing LLMs on superconductivity research questions
What happened
Google Research published a study on March 16, 2026 asking a practical question: can LLMs act as credible research partners in an open scientific field? The team used high-temperature superconductivity as the test case and evaluated six systems on expert-level questions, with results published in the Proceedings of the National Academy of Sciences.
The most important finding was not just which model won, but why. The top performers were NotebookLM and a custom retrieval-augmented generation system built on curated, quality-controlled sources. Systems with broader web access tended to mix established theories with more speculative claims, which reduced their usefulness for a field where unresolved debates and historical context matter.
How the study worked
- Experts assembled 15 review articles to provide a quality-controlled overview of the field.
- Those materials were expanded into 1,726 curated sources for the closed systems.
- Web-connected systems had access to 765 open-access experimental papers and 1,553 open-access theoretical papers.
- A panel of experts wrote 67 questions and scored answers for balance, comprehensiveness, conciseness, and evidence quality.
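To make the scoring step concrete, here is a minimal sketch of how per-criterion expert ratings could be averaged for each system. The four criteria come from the study description above; the numeric scale, the `Rating` record, the `mean_scores` helper, and the sample values are all assumptions made for illustration, not the paper's actual protocol or data.

```python
from dataclasses import dataclass
from statistics import mean

# The four criteria named in the study summary.
CRITERIA = ("balance", "comprehensiveness", "conciseness", "evidence_quality")


@dataclass(frozen=True)
class Rating:
    """One expert rating of one answer (hypothetical record layout)."""
    system: str
    question_id: int
    scores: dict  # criterion -> numeric score on an assumed scale


def mean_scores(ratings):
    """Average each criterion per system across all rated answers."""
    by_system = {}
    for r in ratings:
        by_system.setdefault(r.system, []).append(r.scores)
    return {
        system: {c: mean(row[c] for row in rows) for c in CRITERIA}
        for system, rows in by_system.items()
    }


# Made-up illustrative ratings, NOT data from the study.
ratings = [
    Rating("system_a", 1, {"balance": 2, "comprehensiveness": 1,
                           "conciseness": 2, "evidence_quality": 2}),
    Rating("system_a", 2, {"balance": 1, "comprehensiveness": 2,
                           "conciseness": 2, "evidence_quality": 1}),
    Rating("system_b", 1, {"balance": 0, "comprehensiveness": 1,
                           "conciseness": 2, "evidence_quality": 0}),
]
summary = mean_scores(ratings)
```

Keeping the criteria separate instead of collapsing them into one number preserves the distinction the study relies on: a system can be concise and still score poorly on balance or evidence quality.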
The evaluated systems included GPT-4o, Perplexity, Claude 3.5, Gemini Advanced Pro 1.5, NotebookLM, and a custom RAG system. Google said NotebookLM performed especially well because it grounded answers in a constrained library of sources, producing responses that experts judged to be more balanced and better referenced. At the same time, the paper noted clear limitations across all systems, especially around temporal understanding, table and figure interpretation, and the tendency to miss relevant literature when phrasing changed.
Why it matters
The result bears directly on how scientific AI products are built. It suggests that trustworthy performance in complex domains may depend less on unrestricted web access and more on carefully curated corpora, citation discipline, and retrieval design. In other words, the winning stack for scientific reasoning may be model plus knowledge architecture, not model alone.
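As a rough illustration of what "model plus knowledge architecture" can mean, the sketch below retrieves passages only from a fixed, curated library and builds a prompt that requires citations by source id. This is not how NotebookLM or the study's custom RAG system works internally; the token-overlap ranking, the `retrieve` and `build_prompt` helpers, and the placeholder library entries are all assumptions chosen to keep the example small.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, library, k=2):
    """Rank curated passages by shared-token count; return (source_id, passage) pairs."""
    q = Counter(tokens(question))
    scored = []
    for source_id, passage in library.items():
        overlap = sum((q & Counter(tokens(passage))).values())
        if overlap > 0:
            scored.append((overlap, source_id))
    # Highest overlap first; ties broken by source id for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(sid, library[sid]) for _, sid in scored[:k]]


def build_prompt(question, hits):
    """Build a citation-constrained prompt, or None if nothing was retrieved."""
    if not hits:
        return None  # nothing in the curated library: decline rather than guess
    context = "\n".join(f"[{sid}] {text}" for sid, text in hits)
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"{context}\n\nQuestion: {question}"
    )


# Placeholder library entries, NOT real source text.
library = {
    "review-01": "Placeholder passage about pairing symmetry in cuprate superconductors.",
    "review-02": "Placeholder passage about the pseudogap phase in cuprates.",
    "review-03": "Placeholder passage about unrelated measurement techniques.",
}
question = "What is known about the pseudogap phase?"
hits = retrieve(question, library)
prompt = build_prompt(question, hits)
```

The design choice that matters here is the closed library: the model only sees passages that experts have vetted, and every passage carries an id the answer can cite. A production system would replace the token-overlap ranking with a stronger retriever, which would also help with the phrasing sensitivity the paper reports.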
For Insights readers, that is the bigger signal. As AI moves deeper into biology, materials science, physics, and medicine, product teams may need to invest as much in evidence workflows and expert source curation as they do in frontier-model selection.