Google tests AMIE in real outpatient care and reports zero safety stops

Original: Exploring the feasibility of conversational diagnostic AI in a real-world clinical study

Sciences · Mar 27, 2026 · By Insights AI

From benchmark to clinic

On March 11, 2026, Google Research and Google DeepMind published a prospective real-world feasibility study of a conversational diagnostic AI system called AMIE. The work, conducted with Beth Israel Deaconess Medical Center, aimed to test whether a diagnostic assistant that had looked promising in simulated evaluations could operate safely and usefully in actual ambulatory primary care.

The study was pre-registered, IRB-approved, and conducted at a single center. One hundred adult patients completed an AMIE interaction before seeing a physician, and 98 later attended their scheduled appointment. Google says a human supervisor was available to halt the AI interaction according to four predefined safety criteria, but no safety stop was triggered during the study.

What the results show

Google reports that AMIE performed on par with primary care physicians on overall management-plan quality and on differential-diagnosis quality. Primary care physicians still outperformed AMIE on the practicality and cost-effectiveness of management plans, an important reminder that real care delivery includes operational judgment, not only diagnostic reasoning.

AMIE’s differential diagnosis included the physician’s final diagnosis in 90% of cases, with top-3 accuracy of 75%. Google also says patient trust in the AI system increased after the interaction and remained elevated at follow-up. Those signals suggest that conversational diagnostic systems may be clinically useful as intake and decision-support tools, especially when they help structure information before a visit.

  • Scale of test: 100 completed patient interactions, 98 subsequent appointments.
  • Safety monitoring: no intervention by the human supervisor was required.
  • Performance nuance: parity in some diagnostic measures, but physicians remained better on practicality and cost.

Google is careful not to overclaim. The company notes that this was a feasibility study, not a controlled proof of clinical efficacy. The system was text-based, run at a single center, and should not yet be read as a replacement for physician workflow. Even so, the study is notable because it moves diagnostic AI evaluation out of synthetic benchmarks and into real care settings, which is the harder test for any medical AI system.
