Mind Matters Natural and Artificial Intelligence News and Analysis
Image Credit: ana - Adobe Stock

Is Microsoft’s New AI System Better Than Doctors? Probably Not.

A critic notes that Microsoft’s AI system didn’t solve the problems, but merely repeated the solutions that it was trained on

Microsoft’s new article, The Path to Medical Superintelligence, has been featured in The Guardian, Wired, Australian Financial Review, the Financial Times, and Newsweek. In the article, Microsoft claims its AI system can diagnose patients four times more accurately, and at lower cost, than doctors can. But is this true, or is it too good to be true?

At least one doctor, Dr. Dominic Ng (@DrDominicNg), has criticized Microsoft’s tests on X. In his view, Microsoft rigged the test so that its AI would outperform human doctors. Dr. Arvind Narayanan, professor of computer science at Princeton University, reposted Dr. Ng’s post, which is how I noticed it.

What are the problems?

Ng describes several problems with Microsoft’s methodology, including: 1) the researchers tested the system on already-solved problems; 2) they didn’t test real patients; and 3) they didn’t let the doctors with whom the AI was compared consult the internet, colleagues, or books.

The first problem is the biggest one because it suggests that Microsoft’s AI system didn’t solve the problems, but merely repeated the solutions that it was trained on. As Dr Ng says: “These cases were already SOLVED and published. Real medicine involves genuine uncertainty — sometimes the diagnosis is never found. Does the AI know when to stop investigating?”

It would be like a school giving students the exam questions before the test and letting them memorize both the questions and the answers, a task that is trivial for a computer.

What did Microsoft say it did?

According to the firm, it

created interactive case challenges drawn from the New England Journal of Medicine (NEJM) case series — what we call the Sequential Diagnosis Benchmark (SD Bench). This benchmark transforms 304 recent NEJM cases into stepwise diagnostic encounters where models – or human physicians – can iteratively ask questions and order tests. As new information becomes available, the model or clinician updates their reasoning, gradually narrowing toward a final diagnosis. This diagnosis can then be compared to the gold-standard outcome published in the NEJM.

The problem is that the case series was most likely part of the training for Microsoft’s AI system because these cases had been previously released to the public. As Microsoft says:

Each week, the New England Journal of Medicine (NEJM) — one of the world’s leading medical journals — publishes a Case Record of the Massachusetts General Hospital, presenting a patient’s care journey in a detailed, narrative format. These cases are among the most diagnostically complex and intellectually demanding in clinical medicine, often requiring multiple specialists and diagnostic tests to reach a definitive diagnosis.

Microsoft’s AI system most likely was trained on these cases, in addition to millions of other documents. A good test of AI vs humans would have to involve information that was not used to train the AI system. But Microsoft didn’t do this.

Benchmark contamination

This is not the first time an AI system was trained on the test, so to speak. In an article last March, The Atlantic called this problem “benchmark contamination.” It’s so common a problem that one industry newsletter concluded in October that “Benchmark Tests Are Meaningless.” Yet despite how common and widely known the contamination is, “AI companies keep citing these tests as the primary indicators of progress.” And Microsoft just did it again.

The Atlantic article stresses that

benchmark contamination is not necessarily intentional. Most benchmarks are published on the internet, and models are trained on large swaths of text harvested from the internet. Training data sets contain so much text, in fact, that finding and filtering out the benchmarks is extremely difficult.

Microsoft doesn’t mention the benchmark contamination problem in its article. But even if its researchers had tried to filter the 304 case studies out of the training set, doing so would have been difficult.
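To see why filtering is hard, here is a minimal sketch of the standard approach to detecting contamination: checking whether long word sequences (“n-grams”) from a benchmark case appear verbatim in a training document. This is an illustration, not Microsoft’s actual pipeline, and the patient description below is invented. The key weakness is visible immediately: a verbatim copy is flagged, but even a light paraphrase of the same case slips through.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of n-word sequences (shingles) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(benchmark_case: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark case's n-grams that also appear in a training document."""
    case_grams = ngrams(benchmark_case, n)
    if not case_grams:
        return 0.0
    return len(case_grams & ngrams(training_doc, n)) / len(case_grams)

# Invented example case (not from the NEJM series):
case = "a 47 year old woman presented with fever night sweats and a new cardiac murmur"

# A verbatim copy in the training data is caught...
print(overlap_fraction(case, "case record: " + case))  # → 1.0

# ...but a paraphrase with the same clinical content is not:
paraphrase = "fever and night sweats in a 47 year old female patient with a new heart murmur"
print(overlap_fraction(case, paraphrase))  # → 0.0
```

Real contamination scans on web-scale training corpora face this problem at enormous scale: published cases get quoted, summarized, and reworded across thousands of pages, so exact-match filtering can never guarantee a clean test set.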

No multiple-choice questions

One thing Microsoft did mention is that multiple-choice exams, which had been used in past comparisons, were not used to compare the AI and the doctors. As the article says,

in just three years, generative AI has advanced to the point of scoring near-perfect scores on the [United States Medical Licensing Examination] and similar exams. But these tests primarily rely on multiple-choice questions, which favor memorization over deep understanding. By reducing medicine to one-shot answers on multiple-choice questions, such benchmarks overstate the apparent competence of AI systems and obscure their limitations.

With respect to “similar exams,” Microsoft is referring to the many press releases from AI companies over the last two years extolling AI’s ability to outscore humans on a wide variety of tests in law, accounting, engineering, and science. But because the AI systems were trained on those exams, the results are meaningless. They don’t show that AI systems understand the discipline, be it accounting or law, better than humans do, and they certainly don’t show that AI can outperform humans in the workplace.

What should Microsoft have done with its most recent system? It should have compared the performance of its AI system and doctors in their treatment of real patients: some healthy, some sick, and some with unusual diseases. To be statistically meaningful, the tests would have to involve a fairly large number of patients.
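How large is “fairly large”? As a back-of-the-envelope illustration (my numbers, not Microsoft’s or Dr. Ng’s): suppose doctors diagnose correctly 60% of the time and the AI is claimed to reach 80%. A standard two-proportion sample-size formula, using the usual normal approximation, gives a rough lower bound on the number of patients each arm of such a trial would need.

```python
from math import sqrt, ceil

def two_proportion_sample_size(p1: float, p2: float,
                               z_alpha: float = 1.96,   # two-sided significance level 0.05
                               z_beta: float = 0.84) -> int:  # statistical power 0.80
    """Patients needed per arm (normal approximation) to detect accuracy p1 vs. p2."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Hypothetical: doctors at 60% diagnostic accuracy vs. an AI claimed to reach 80%
print(two_proportion_sample_size(0.60, 0.80))  # → 82 patients per arm
```

And that assumes a generous 20-point accuracy gap; detecting a smaller, more realistic difference would require far more patients. The point is that a credible comparison demands a properly powered prospective trial, not a retrospective run over 304 published cases.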

What do we really need from AI?

Even better, Microsoft should develop an AI system that helps doctors make decisions. Why begin with a goal of replacing doctors? Why not first develop a system that helps doctors, keep improving the system, and then try to replace doctors years in the future? Trying to understand the rationale of big monopolists such as Microsoft and Google is often a fool’s errand.

Unfortunately, the damage is already done, and Microsoft’s goals were achieved. Microsoft, along with other AI companies, has been hyping AI’s capabilities for years, and the release of its new AI system continues this pattern. Undoubtedly, many investors will see Microsoft’s article, but they will not see the pushback from doctors and other healthcare experts who recognize that the tests don’t prove anything.


Jeffrey Funk

Fellow, Walter Bradley Center for Natural and Artificial Intelligence
Jeffrey Funk is the winner of the NTT DoCoMo Mobile Science Award and the author of six books including his most recent one: Unicorns, Hype and Bubbles: A Guide to Spotting, Avoiding and Exploiting Investment Bubbles In Tech.
