Mind Matters Reporting on Natural and Artificial Intelligence

Is AI really better than physicians at diagnosis?

The British Medical Journal found a serious problem with the studies

Amid the hype and hope for AI, one area has seemed to have some promise: the use of AI systems to assist overworked medical staff. And, unlike many AI promises, this hope was backed by research assessing its quality.

Unfortunately, it appears that this too has been oversold.

IBM was first to test the waters of AI medicine. After Watson defeated its Jeopardy opponents in 2011, the company repurposed the system (after all, the Jeopardy prize money would hardly justify the work) and Watson became Watson Health. Like so many promising AI technologies, Watson Health worked well in the lab but stumbled in the real world. Dr. Martin Kohn, who was the Chief Medical Scientist at IBM Research when Watson defeated Ken Jennings in Jeopardy, reflected later on the failure of Watson Health:

As for Kohn, who left IBM in 2014, he says the company fell into a common trap: ‘Merely proving that you have powerful technology is not sufficient,’ he says. ‘Prove to me that it will actually do something useful—that it will make my life better, and my patients’ lives better.’ Kohn says he’s been waiting to see peer-reviewed papers in the medical journals demonstrating that AI can improve patient outcomes and save health systems money. ‘To date there’s very little in the way of such publications,’ he says, ‘and none of consequence for Watson.’

Eliza Strickland, “How IBM Watson Overpromised and Underdelivered on AI Health Care” at IEEE Spectrum

Still, despite Watson’s shortcomings, hope remained for Deep Learning-based AI. Could it be trained to “see” things doctors might miss? Could it help offload some of their burden? Early research from Google, among others, was optimistic.

It is now evident that that hope too was misplaced. The problem researchers found when reviewing the supporting work was simply that nearly all the studies failed to meet the expected standards for research:

Many studies claiming that artificial intelligence is as good as (or better than) human experts at interpreting medical images are of poor quality and are arguably exaggerated, posing a risk for the safety of ‘millions of patients’ warn researchers in The BMJ today.

British Medical Journal, “Concerns over ‘exaggerated’ study claims of AI outperforming doctors” at ScienceDaily

The researchers examined 10 years’ worth of studies comparing “the performance of a deep learning algorithm in medical imaging with expert clinicians.” Of those studies, only two relied on randomized clinical trials while 81 depended on non-randomized trials.

What’s the difference? In a randomized trial, the researchers don’t know in advance what data they will be assigned. That makes the data harder to manipulate, consciously or otherwise, to favor an outcome. Even better, double-blind tests (where, for example, neither the doctor nor the patient knows which drug is real and which is a placebo) further reduce the likelihood that bias helps determine the result. These are the standards we should apply to all AI-based medical devices, even if they simply assist medical personnel. Otherwise, without anyone meaning to be dishonest, the fox is guarding the henhouse. The researchers, after years of investment and toil, want to believe that their devices work and that they will really help people:

Nevertheless, they say that at present, ‘many arguably exaggerated claims exist about equivalence with (or superiority over) clinicians, which presents a potential risk for patient safety and population health at the societal level.’

British Medical Journal, “Concerns over ‘exaggerated’ study claims of AI outperforming doctors” at ScienceDaily
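To see why the randomized/non-randomized distinction matters, here is a minimal, hypothetical simulation. It is not drawn from the BMJ review. It assumes an AI tool with no real effect, and it assumes that in the non-randomized setup the easier cases happen to be routed to the tool. The function name, the sample size, and the accuracy formula are all invented for illustration.

```python
import random

def apparent_benefit(randomized, n=10_000, seed=0):
    """Measured accuracy gain for a tool that, by construction, does nothing."""
    rng = random.Random(seed)
    with_tool, without_tool = [], []
    for _ in range(n):
        difficulty = rng.random()  # 0 = easy case, 1 = hard case (assumed scale)
        if randomized:
            # Coin flip: assignment is independent of the case itself.
            gets_tool = rng.random() < 0.5
        else:
            # Assumed selection bias: easy cases are the ones sent to the tool.
            gets_tool = difficulty < 0.5
        # Diagnosis accuracy depends only on difficulty, never on the tool.
        correct = rng.random() > 0.6 * difficulty
        (with_tool if gets_tool else without_tool).append(correct)
    return sum(with_tool) / len(with_tool) - sum(without_tool) / len(without_tool)

print("randomized:    ", round(apparent_benefit(True), 3))
print("non-randomized:", round(apparent_benefit(False), 3))
```

Under these assumptions the randomized comparison reports a gain close to zero, while the non-randomized one reports a large gain for a tool that does nothing at all. Nobody in the simulation cheats; the bias comes entirely from how cases were assigned.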

As one writer summarizes, “Just as doctors take the Hippocratic Oath, we need to ensure that AI will do no harm before it becomes an integral part of medicine.” (Bryan Walsh, Axios)

That is good advice, even if it dashes someone’s hopes.

Further reading on issues in AI and medicine:

How AI can make medicine better—or not. Experts offer some real-world cautions about powerful new AI tools. (Denyse O’Leary)

Why was IBM Watson a flop in medicine? Robert J. Marks and Gary S. Smith (author of The AI Delusion) discuss how the AI couldn’t identify which information in the tsunami of medical literature actually mattered.

Can AI combat misleading medical research? No, because AI doesn’t address the “Texas Sharpshooter Fallacies” that produce the bad data. Gary S. Smith and Robert J. Marks discuss the question.

AI can help spot cancers but it’s no magic wand. When I spoke last month about how AI can help with cancer diagnoses, I failed to appreciate some of the complexities of medical diagnosis. (Brendan Dixon)


Brendan Dixon

Fellow, Walter Bradley Center for Natural & Artificial Intelligence
Brendan Dixon is a Software Architect with experience designing, creating, and managing projects of all sizes. His first foray into Artificial Intelligence was in the 1980s when he built an Expert System to assist in the diagnosis of software problems at IBM. Since then, he’s worked both as a Principal Engineer and Development Manager for industry leaders, such as Microsoft and Amazon, and numerous start-ups. While he spent most of that time on other types of software, he’s remained engaged and interested in Artificial Intelligence.
