Confidence Without Accuracy

When you ask a popular AI chatbot about the best supplements for cancer, the reply often sounds polished, thorough, and authoritative. However, new research from the Lundquist Institute for Biomedical Innovation reveals a starkly different reality: nearly half of the answers generated by leading chatbots on medical topics are incorrect or misleading, and the systems rarely signal their uncertainty.

How the Test Was Conducted

The investigators evaluated five widely used conversational agents (Gemini, DeepSeek, Meta AI, ChatGPT, and Grok) by feeding each of them the same set of 50 questions, spanning five thematic groups: cancer, vaccines, stem-cell therapy, nutrition, and sports performance. The resulting 250 responses (5 models × 50 questions) were judged by domain experts as non-problematic, somewhat problematic, or strongly problematic. A response was deemed problematic when it could steer a layperson, acting without professional medical advice, toward ineffective or harmful choices.
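To make the rating scheme concrete, here is a minimal sketch of how such a rubric tally could be organized in code. The three labels come from the study; the data structures, sample entries, and function names are illustrative, not the researchers' actual pipeline.

```python
from collections import Counter

# Three-level rubric applied by the study's expert reviewers.
LABELS = ("non-problematic", "somewhat problematic", "strongly problematic")

# Hypothetical ratings: one label per (model, question) pair.
# In the study, 5 models x 50 shared questions yield 250 rated responses.
ratings = {
    ("Gemini", 1): "non-problematic",
    ("Grok", 1): "strongly problematic",
    # ... the remaining expert judgments ...
}

def tally_by_model(ratings):
    """Count how many responses from each model fell into each rubric level."""
    counts = {}
    for (model, _question), label in ratings.items():
        counts.setdefault(model, Counter())[label] += 1
    return counts

for model, counter in tally_by_model(ratings).items():
    total = sum(counter.values())
    problematic = total - counter["non-problematic"]
    print(f"{model}: {problematic}/{total} problematic responses")
```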

Problematic Answers Were Common

Overall, 30% of the replies fell into the “somewhat problematic” category and almost 20% were “strongly problematic.” The gap between models was modest, with one notable exception: Grok produced problematic answers to 29 of its 50 questions, far more often than the other systems. Gemini emerged as the relative leader, while nutrition and sports-performance queries suffered the highest error rates. Vaccine- and cancer-related questions fared better, likely because the scientific consensus on those topics is stronger and more consistently documented.
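Those percentages translate straightforwardly back into raw counts. A back-of-the-envelope check, treating “almost 20%” as 20% for simplicity:

```python
total_responses = 5 * 50            # five models, 50 shared questions

somewhat = 0.30 * total_responses   # ~75 replies rated "somewhat problematic"
strongly = 0.20 * total_responses   # ~50 replies rated "strongly problematic"
print(somewhat + strongly)          # ~125.0, i.e. roughly half of all replies

grok_rate = 29 / 50                 # Grok: problematic answers on 29 of 50 questions
print(f"{grok_rate:.0%}")           # 58%, well above the ~50% overall rate
```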

Blind Certainty Is Dangerous

Across the board, the chatbots answered with unwarranted confidence, even when the information was wrong. Meta AI was the only system that refused to answer any questions, declining two (on anabolic steroids and alternative cancer treatments), a rare instance of self-restraint. Researchers warn that such overconfidence can be especially perilous in areas where scientific knowledge is still evolving; a cautious “I don’t know” would be far preferable to a fabricated, persuasive response.
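The refusals from Meta AI are an instance of what the machine-learning literature calls abstention, or selective prediction. A schematic sketch of the idea follows; the threshold, the confidence score, and the function names are hypothetical, not anything the tested chatbots are known to expose.

```python
ABSTAIN_THRESHOLD = 0.75  # hypothetical cutoff; a real system would tune this per domain

def answer_or_abstain(question, generate, confidence):
    """Return a generated answer only if the model's own confidence estimate
    clears the threshold; otherwise decline explicitly."""
    draft = generate(question)
    if confidence(question, draft) < ABSTAIN_THRESHOLD:
        return "I don't know; please consult a medical professional."
    return draft

# Toy demo with stand-in functions.
reply = answer_or_abstain(
    "Do supplements cure cancer?",
    generate=lambda q: "Yes, absolutely.",  # an overconfident draft answer
    confidence=lambda q, a: 0.4,            # low estimated reliability
)
print(reply)  # declines instead of asserting the draft
```

The hard part in practice is the confidence function itself; the study's central complaint is that current chatbots behave as if it always returned 1.0.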

Where Hallucinations Appear

Beyond factual inaccuracies, the study uncovered “hallucinations”: fabricated citations, nonexistent journal titles, and bogus DOI links that appeared in every model’s output. Because these systems rely on statistical patterns from massive, heterogeneous training data, including social media and unvetted forums, they can present non-scientific claims as if they were peer-reviewed facts.
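One practical defense follows directly from this: verify any citation a chatbot offers before trusting it. Below is a minimal sketch that checks whether a DOI is registered at all, using the public doi.org resolver. Note that a resolving DOI proves only that the reference exists, not that it supports the chatbot's claim.

```python
import urllib.error
import urllib.request

def doi_resolves(doi):
    """Return True if the DOI is registered with the public doi.org resolver."""
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        method="HEAD",
        headers={"User-Agent": "citation-check/0.1"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status < 400
    except urllib.error.HTTPError:
        return False   # e.g. 404: the DOI is not registered
    # Network failures propagate; they are not evidence either way.

print(doi_resolves("10.1000/182"))  # the DOI Handbook's own DOI; should print True
```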

Implications and Calls to Action

The researchers stress that the test framework was deliberately designed to elicit misinformation, meaning everyday queries might yield lower error rates. Nevertheless, the findings underscore an urgent need for better user education, professional oversight, and regulatory standards. Without such safeguards, AI chatbots risk amplifying medical misinformation rather than curbing it.

Source: https://scientias.nl/chatbots-medische-informatie-onbetrouwbaar/