Artificial Intelligence Still Can’t Think Like a Doctor
Recent experiments from the research institute Mass General Brigham have raised fresh doubts about the hype surrounding AI‑driven medical consultations. The study evaluated 21 large language models—including well‑known names such as ChatGPT, Gemini and Claude—by feeding them 29 realistic patient cases. When the models received a complete dossier containing symptom descriptions, lab results and scan images, they produced the correct final diagnosis in more than ninety percent of the scenarios. The numbers look impressive, but the deeper analysis reveals a critical weakness: the AI systems stumble on the intermediate reasoning steps that are the heart of clinical practice.
Diagnosing Beyond the End Point
Human physicians spend most of their time generating a differential diagnosis, a prioritized list of possible conditions that could explain the presenting complaints. This process involves weighing uncertain clues, ordering appropriate tests, and revising hypotheses as new information arrives. The new evaluation method, named PrIME‑LLM, measures every stage of that mental workflow—from initial suspicion to the final treatment recommendation. While some models scored well on the final answer, they consistently fell short when required to suggest which tests to order or to articulate plausible alternative explanations.
Why the Gap Exists
According to lead researcher Marc Succi, standard AI models “are still bad at clinical reasoning.” The technology excels when all puzzle pieces are laid out in front of it, but it struggles to extract the pieces from an incomplete or ambiguous narrative. In real‑world clinics, patients rarely present a neatly compiled file; they usually start with vague symptoms and limited background information. When the researchers stripped away parts of the data, the accuracy of the models dropped sharply, exposing their heavy reliance on exhaustive input.
Progress, Yet Still Far From 100%
Even the most recent language models only reached about 78% performance on the PrIME‑LLM scoring system, a figure that trails far behind seasoned doctors who spend years honing their diagnostic intuition. Newer AI versions do perform better than older ones, yet the gap in differential diagnosis remains sizable. The study emphasizes that the ability to produce a single correct label does not equate to the skill of navigating the complex, iterative reasoning that medicine demands.
AI as a Tool, Not a Replacement
The authors stress that artificial intelligence can still serve as a valuable assistant—providing quick reference or suggesting possible investigations—but it cannot yet replace the nuanced judgment of a human clinician. “We want to separate hype from reality,” Succi says, underlining the importance of keeping a qualified physician in the loop. As the technology matures, it may gradually support doctors in data‑intensive tasks, yet the core of patient care—listening, hypothesizing, and adapting—remains a uniquely human endeavor.