Mind Matters Reporting on Natural and Artificial Intelligence

The Decline Effect: Why Most Medical Treatments Are Disappointing

Artificial intelligence is not an answer to this problem, as Walmart may soon discover

A prominent 17th-century Englishman with the odd English name of Sir Kenelm Digby (1603–1665) was a founding member of the Royal Society, established in 1660 to support and promote science. Yet he wrote a book, which went through 29 editions, recommending that wounds be treated with a “Powder of Sympathy”: take six or eight ounces of Roman vitriol [copper sulphate], beat it very small in a mortar, sift it through a fine sieve when the sun enters Leo, keep it in the heat of the sun and dry by night. The oddest thing about this potion is not that it must be made when the sun enters Leo but that the salve was to be applied not to the wound but to the weapon that caused the wound. Explicitly, the person who had been cut by a knife should rub the powder on the knife.

When the Powder of Sympathy was applied to weapons, wounds did sometimes heal — not because of the powder, but because of the body’s natural recuperative abilities — which gave credence to the power of the Powder.

Unfortunately, much present-day medical research is similarly flawed. For example, some healthy patients are misdiagnosed as sick and, when they are given a worthless treatment for a nonexistent illness, observers conclude that the treatment worked. A second problem is that some people who are actually ill improve as their bodies fight off what ails them, with the result that worthless treatments may again be given credit when no credit is due.

These misinterpretations are examples of the fallacy known as post hoc ergo propter hoc: “after this, therefore because of this.” The fact that one event happens shortly before another doesn’t mean that the first event caused the second. For example, ancient Egyptians noticed that the annual flooding of the Nile was regularly preceded by the sight of Sirius, the brightest star in the night sky, appearing to rise on the eastern horizon just before sunrise. Sirius did not cause the flooding, but it was a useful predictor because the two events were linked by timing: every year, Sirius rose before dawn in mid-July, and heavy rains that began in May in the Ethiopian Highlands caused the Nile to begin flooding in late July.

Was Donald Trump “cured” of COVID-19 because of remdesivir, dexamethasone, famotidine, a Regeneron antibody cocktail, or a double cheeseburger? It would be a post-hoc fallacy to credit any of these, or anything else that Trump did or had done to him, for the cure.

The gold standard for medical research is a randomized trial in which subjects are separated into a treatment group that receives the medication and a control group that does not. However, even the gold standard is not guaranteed to give the correct answer. Because the separation into treatment and control groups is random, it may happen by chance that the treatment group includes a disproportionate number of patients who are less severely ill or who are more likely to recover by themselves — leading to a misleading conclusion that a treatment under study works.
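The role of chance in a randomized trial can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the treatment is worthless, every patient has the same 50 percent chance of recovering unaided, and each trial has 40 patients split evenly. The function name, the trial size, and the 4-recovery margin are hypothetical choices, not figures from any actual study.

```python
import random

def share_of_lucky_trials(n_trials=10_000, n_patients=40,
                          p_recover=0.5, margin=4, seed=0):
    """Fraction of trials of a WORTHLESS treatment in which the treatment
    group still has at least `margin` more recoveries than the control group."""
    rng = random.Random(seed)
    half = n_patients // 2
    lucky = 0
    for _ in range(n_trials):
        # Each patient recovers (or not) regardless of treatment.
        recovered = [rng.random() < p_recover for _ in range(n_patients)]
        # Random assignment: first half is "treated", second half is "control".
        rng.shuffle(recovered)
        treated, control = recovered[:half], recovered[half:]
        if sum(treated) - sum(control) >= margin:
            lucky += 1
    return lucky / n_trials

print(f"Worthless treatment looked better by chance in "
      f"{share_of_lucky_trials():.1%} of simulated trials")
```

Under these assumptions, a noticeable minority of trials make the worthless treatment look helpful purely through the luck of the draw. Larger trials shrink that share but never bring it to zero.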

Such disappointments are magnified by the reality that the number of worthless treatments that might be tried is far larger than the number of genuinely useful treatments. To see what difference that reality makes, let’s assume a simple situation where 1) the medical treatments either work or don’t work and 2) a randomized trial will correctly identify 99 percent of all effective treatments as effective and 99 percent of all ineffective treatments as ineffective. Bayes’ rule implies that if only 1 out of every 100 tested treatments is genuinely useful, the probability that a treatment that is certified effective actually works is only 50 percent: out of every 10,000 tested treatments, 99 of the 100 useful ones are certified, but so are 99 of the 9,900 worthless ones. For a more pessimistic outcome, suppose that only 1 out of every 1,000 tested treatments is actually effective. If so, then 91 percent of all certified-effective treatments are worthless.
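The arithmetic behind these two figures can be checked with a short Python sketch. It uses the example's assumption of 99 percent accuracy for both effective and ineffective treatments. The function and parameter names are illustrative, not taken from any library.

```python
def prob_works_given_certified(prior, sensitivity=0.99, specificity=0.99):
    """Bayes' rule: P(treatment works | trial certifies it as effective).

    prior       -- share of tested treatments that genuinely work
    sensitivity -- share of effective treatments correctly certified
    specificity -- share of ineffective treatments correctly rejected
    """
    true_positives = prior * sensitivity
    false_positives = (1 - prior) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

for prior in (1 / 100, 1 / 1000):
    p = prob_works_given_certified(prior)
    print(f"1 in {round(1 / prior)} useful: "
          f"{p:.0%} of certified treatments work, {1 - p:.0%} are worthless")
```

With a prior of 1 in 100 the result is 50 percent. With 1 in 1,000 it drops to about 9 percent, which matches the 91 percent worthless figure above.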

Adding to the noise is the temptation to data mine, that is, to ransack data looking for encouraging correlations. This is how one study came to the erroneous conclusion that the chances of pancreatic cancer could be reduced by avoiding coffee, and how another flawed study concluded that people could be cured of some diseases by distant healing.
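A rough illustration of why data mining produces spurious findings: if a researcher checks many unrelated correlations, each with a 5 percent chance of looking statistically significant by luck alone, the chance that at least one of them passes grows quickly. The 5 percent threshold and the independence of the tests are assumptions made for this sketch, not details from the studies mentioned above.

```python
def chance_of_spurious_finding(n_tests, false_positive_rate=0.05):
    """Probability that at least one of n independent tests of worthless
    relationships looks 'significant' purely by chance."""
    return 1 - (1 - false_positive_rate) ** n_tests

for n in (1, 20, 100):
    print(f"{n:>3} correlations checked: "
          f"{chance_of_spurious_finding(n):.1%} chance of at least one false discovery")
```

Under these assumptions, checking 20 correlations gives roughly a 64 percent chance of at least one false discovery, and checking 100 makes one all but certain.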

Given the prevalence of random variation, post-hoc fallacies, and data mining, it is no surprise that many “proven” treatments are disappointing when they are released to the general public. The pattern is so common in medical research that it has a name — the “decline effect.”


Another reason why benefits may be exaggerated is that medical trials focus on patients who are known to have a specific illness. Outside the trials and inside doctors’ offices, treatments are often prescribed for patients who have different illnesses, who have a combination of illnesses, or who have the symptoms but not the illness. For example, antibiotics are widely viewed as miracle drugs and they are often very effective. However, some doctors seem to prescribe antibiotics reflexively despite possible side effects that include allergic reactions, vomiting, and diarrhea.

For childhood ear infections, the American Academy of Pediatrics now recommends that, instead of prescribing antibiotics, parents and doctors wait and watch to see if the body can fight the infection off unaided. More generally, The ICU Book, the best-selling and widely respected guidebook for intensive care units, advises: “The first rule of antibiotics is try not to use them, and the second rule is try not to use too many of them.”

The decline effect is likely to be even more pervasive when artificial intelligence (AI) algorithms are involved because computers are so good at finding correlations and so bad at judging whether the discovered patterns make sense. Thus, algorithms are likely to give flawed recommendations based on fleeting coincidences.

A team of physicians and biostatisticians at the Mayo Clinic and Harvard has just reported the results of a survey of clinical staff who were using AI-based clinical decision support (CDS) to improve glycemic control in patients with diabetes. When asked to rate the success of the AI-based CDS system in improving users’ ability to identify the correct intervention for individual patients on a scale of 0 (not helpful at all) to 100 (extremely helpful), the median score was 11. Only 14 percent of the users said that they would recommend the AI-based CDS to another clinic.

The most common complaints were that the suggested interventions were too similar for different patients and therefore were not sufficiently tailored to each patient, the suggested interventions were inappropriate or not useful, and the system inaccurately classified standard-risk patients as high-risk (a high false-positive rate).

IBM’s Watson has been a particularly notable disappointment in medical AI in that, so far, it has overpromised and underdelivered. Winning at Jeopardy and diagnosing cancer are very different tasks.

Walmart is now planning a major expansion into health care, based on Clover Health, which boasts that its Clover Assistant technology “gives your primary care doctor a complete view of your overall health and sends them care recommendations that are personalized for you, right when you’re in the appointment,” using “data, clinical guidelines, and machine learning to surface the most urgent patient needs.” That sounds a lot more like advertising copy written by a marketing executive than reliable advice written by doctors. If IBM’s Watson struggles with diagnoses and treatments, Walmart shoppers may want to be wary of the bargain healthcare aisle.

There have been many great successes in medical science: the triple-drug combination that has transformed HIV from a death sentence to a chronic condition, the benefit of statins, the effectiveness of antibiotics, and the treatment of diabetes with insulin. There have also been many successes in identifying the causes of diseases: asbestos can lead to mesothelioma, benzene can cause cancer, and smoking is associated with cancer.

Even though medical research has enjoyed such wonderful successes as these, blind faith is not warranted. Researchers should bear in mind the reasons for the decline effect and doctors and patients should anticipate the decline effect when deciding on treatments.


Gary N. Smith

Gary N. Smith is the Fletcher Jones Professor of Economics at Pomona College. His research on financial markets, statistical reasoning, and artificial intelligence, which often involves stock market anomalies, statistical fallacies, and the misuse of data, has been widely cited. He is the author of The AI Delusion (Oxford, 2018) and co-author (with Jay Cordes) of The 9 Pitfalls of Data Science (Oxford, 2019), which won the Association of American Publishers 2020 PROSE Award for “Popular Science & Popular Mathematics.”
