Google Research finds curated-source systems beat open-web LLMs on superconductivity questions
Original: Testing LLMs on superconductivity research questions
What happened
Google Research published a study on March 16, 2026 asking a practical question: can LLMs act as credible research partners in an open scientific field? The team used high-temperature superconductivity as the test case and evaluated six systems on expert-level questions, with results published in the Proceedings of the National Academy of Sciences.
The most important finding was not just which model won, but why. The top performers were NotebookLM and a custom retrieval-augmented generation system built on curated, quality-controlled sources. Systems with broader web access tended to mix established theories with more speculative claims, which reduced their usefulness for a field where unresolved debates and historical context matter.
How the study worked
- Experts assembled 15 review articles to provide a quality-controlled overview of the field.
- Those materials were expanded into roughly 1,726 curated sources for the closed systems.
- Web-connected systems had access to 765 open-access experimental papers and 1,553 open-access theoretical papers.
- A panel of experts wrote 67 questions and scored answers for balance, comprehensiveness, conciseness, and evidence quality.
The evaluated systems included GPT-4o, Perplexity, Claude 3.5, Gemini Advanced Pro 1.5, NotebookLM, and a custom RAG system. Google said NotebookLM performed especially well because it grounded answers in a constrained library of sources, producing responses that experts judged to be more balanced and better referenced. At the same time, the paper noted clear limitations across all systems, especially around temporal understanding, table and figure interpretation, and the tendency to miss relevant literature when phrasing changed.
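The grounding pattern credited here can be sketched in a few lines: retrieval is restricted to a fixed, curated library, and the answer quotes only retrieved sources, each tagged with a citation. This is a minimal illustration, not the study's actual system; the corpus contents, document IDs, and the simple IDF-weighted term-overlap scorer are all invented for the example.

```python
import math

# Toy stand-in for a curated library (the study used ~1,726 sources);
# these document IDs and texts are invented for illustration.
CORPUS = {
    "review-07": "The pseudogap phase in cuprate superconductors appears "
                 "above the critical temperature and its origin is debated.",
    "review-12": "Iron-based superconductors exhibit multiband pairing and "
                 "nematic order near the superconducting dome.",
    "review-15": "Nickelate superconductivity was reported in thin films "
                 "of infinite-layer nickelates.",
}

def tokenize(text):
    return [t.strip(".,?").lower() for t in text.split()]

DOC_TOKENS = {doc_id: set(tokenize(text)) for doc_id, text in CORPUS.items()}
N = len(CORPUS)

def idf(term):
    # Terms appearing in every curated source carry little retrieval signal.
    df = sum(1 for toks in DOC_TOKENS.values() if term in toks)
    return math.log((N + 1) / (df + 1)) + 1.0

def score(query, doc_id):
    return sum(idf(t) for t in tokenize(query) if t in DOC_TOKENS[doc_id])

def retrieve(query, k=1):
    ranked = sorted(CORPUS, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

def grounded_answer(query):
    # Only material from retrieved curated sources is surfaced, with citations.
    hits = retrieve(query)
    return " ".join(f"{CORPUS[d]} [{d}]" for d in hits)
```

For the query "What is the pseudogap in cuprate superconductors?", `grounded_answer` returns the matching review snippet tagged `[review-07]`. The design point mirrors the study's finding: constraining the retrievable universe to vetted sources is what keeps speculative open-web material out of the answer, independent of which model generates the final text.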
Why it matters
This result is highly relevant for scientific AI products. It suggests that trustworthy performance in complex domains may depend less on unrestricted web access and more on carefully curated corpora, citation discipline, and retrieval design. In other words, the winning stack for scientific reasoning may be model plus knowledge architecture, not model alone.
For Insights readers, that is the bigger signal. As AI moves deeper into biology, materials science, physics, and medicine, product teams may need to invest as much in evidence workflows and expert source curation as they do in frontier-model selection.
Related Articles
On Feb. 12, 2026, Google announced a major Gemini 3 Deep Think upgrade for science, research, and engineering. The new version is available in the Gemini app for Google AI Ultra subscribers and, for the first time, via early API access for researchers, engineers, and enterprises.
On March 12, 2026, Google introduced Groundsource, a Gemini-powered method for turning public reports into historical disaster data. The company says the system identified more than 2.6 million flood events across over 150 countries and now supports urban flash-flood forecasts up to 24 hours in advance.
On March 12, 2026, Google Research said it is expanding Flood Hub with urban flash-flood predictions that can give up to 24 hours of advance notice. The company says it trained the model with a Groundsource dataset built by using Gemini to extract past flood-event details from public news reports.