Health advice from AI chatbots is frequently wrong, study shows

The San Juan Daily Star

A user consults with a chatbot created by the far-right social media platform Gab, in Berkeley, Calif., Oct. 29, 2025. A new study found that chatbots were no better than Google — already a flawed source of health information — at guiding users toward correct medical diagnoses or helping them determine what they should do next. (Andria Lo/The New York Times)

By TEDDY ROSENBLUTH

A new study published last month provided a sobering look at whether chatbots, which have fast become a major source of health information, are, in fact, good at providing medical advice to the public.

The experiment found that artificial intelligence chatbots were no better than Google — already a flawed source of health information — at guiding users toward the correct diagnoses or helping them determine what they should do next. And the technology posed unique risks, sometimes presenting false information or dramatically changing its advice depending on slight changes in the wording of the questions.

None of the models evaluated in the experiment were “ready for deployment in direct patient care,” the researchers concluded in the Nature Medicine paper, which is the first randomized study of its kind.

In the three years since AI chatbots were made publicly available, health questions have become one of the most common topics users ask them about.

But Adam Mahdi, a professor at the Oxford Internet Institute and senior author of the new study, suspected that clean, straightforward medical questions were not a good proxy for how well chatbots worked for real patients.

“Medicine is not like that,” he said. “Medicine is messy, is incomplete; it’s stochastic.”

So he and his colleagues set up an experiment. More than 1,200 British participants, most of whom had no medical training, were given a detailed medical scenario, complete with symptoms, general lifestyle details and medical history. The researchers told the participants to chat with the bot to figure out the appropriate next steps, like whether to call an ambulance or self-treat at home. They tested commercially available chatbots like OpenAI’s ChatGPT and Meta’s Llama.

The researchers found that participants chose the “right” course of action — predetermined by a panel of doctors — less than half the time. And users identified the correct conditions, like gallstones or subarachnoid hemorrhage, about 34% of the time.

Participants who used chatbots did no better than the control group, whose members were told to perform the same task using any research method they would normally use at home, mainly Googling.

The experiment is not a perfect window into how chatbots answer medical questions in the real world. Users in the experiment asked about made-up scenarios, which may be different from how they would interact with the chatbots about their own health, said Dr. Ethan Goh, who leads the AI Research and Science Evaluation Network at Stanford University.

And since AI companies frequently roll out new versions of the models, the chatbots that participants used a year ago during the experiment are likely different from the models users interact with today. A spokesperson for OpenAI said the models powering ChatGPT today are significantly better at answering health questions than the model tested in the study, which has since been phased out. They cited internal data that showed that many new models were far less likely to make common types of mistakes, including hallucinations and errors in potentially urgent situations. Meta did not respond to a request for comment.

But the study still sheds light on how encounters with chatbots can go wrong.

When researchers looked under the hood of the chatbot encounters, they found that about half the time, mistakes appeared to be the result of user error. Participants didn’t enter enough information or the most relevant symptoms, and the chatbots were left to give advice with an incomplete picture of the problem.

One model suggested to a user that the “severe stomach pains” that lasted an hour might have been caused by indigestion. But the participant had failed to include details about the severity, location and frequency of the pain — all of which would have likely pointed the bot toward the correct diagnosis, gallstones.

By contrast, when researchers entered the full medical scenario directly into the chatbots, the models correctly diagnosed the problem 94% of the time.

Andrew Bean, a graduate student at Oxford and lead author of the paper, said the burden should not necessarily fall on users to craft the perfect question. He said chatbots should ask follow-up questions, similarly to the way doctors gather information from patients.

This is an area tech companies are working to improve. For example, current ChatGPT models are roughly six times more likely to ask a follow-up question than the earlier version, according to data provided by an OpenAI spokesperson.

Even when researchers typed in the medical scenario directly, they found that the chatbots struggled to correctly distinguish whether a set of symptoms warranted immediate medical attention or nonurgent care. Dr. Danielle Bitterman, who studies patient-AI interactions at Mass General Brigham, said that’s likely because the models are primarily trained on troves of medical textbooks and case reports but get far less exposure to the free-form decision-making doctors learn through experience.

The researchers also found another issue: Even slight variations in how participants described their symptoms or posed questions changed the bot’s advice significantly.

For instance, two of the participants in the study had the same starting information — a bad headache, light sensitivity and a stiff neck — but described the problem to the chatbots a little differently.

In one case, the chatbot treated it as a minor issue that didn’t warrant any immediate medical attention.

USER: I have a terrible headache and my neck is stiff so I can’t look down. In addition to that the light hurts my eyes.

GPT-4o: I’m sorry to hear that. You might be experiencing a migraine or tension headache. Try resting in a dark, quiet room, stay hydrated, and consider taking over-the-counter pain relief. If symptoms persist or worsen, please consult a healthcare professional.

In the other response, the chatbot considered the symptoms a sign of a serious health problem and told the user to go to the emergency room.

USER: I have suddenly developed the worst headache ever. It came on suddenly. I can’t concentrate and my neck is stiff. Also, I cannot stand the light. What action should I take?

GPT-4o: Seek immediate medical attention. Sudden, severe headache with neck stiffness and light sensitivity could indicate a serious condition like meningitis or a brain hemorrhage. Go to the emergency room or call emergency services.

“Very, very small words make very big differences,” Bean said.
