Get the FREE DIGITAL BOOK: The Case for Killer Robots
Mind Matters Reporting on Natural and Artificial Intelligence
vitamin-d-gelcaps-stockpack-unsplash.jpg
Vitamin D gelcaps.
Photo by Michele Blackwell at Unsplash

Vitamin D and COVID-19: Is It Data or Noise?

Because random clusters occur naturally in large numbers, only randomized, controlled trials can tell us

Two groups of researchers recently reported a link between vitamin D and COVID-19 mortality — more vitamin D meant lower mortality. A Northwestern University researcher who reported this association advised that “it is clear that vitamin D deficiency is harmful, and it can be easily addressed with appropriate supplementation.”

This news spread faster than the virus itself among worried, sometimes desperate, people who grasped at the Vitamin D lifeline. Too often today news stories are clickbait. Sink the hook and reel ‘em in. Headlines are written not to report facts, but to attract attention:

“Vitamin D linked to low virus death rate(World Pharma News, May 8, 2020)

“People with low levels of Vitamin D may be more likely to catch coronavirus and die from COVID-19 infection, study suggests(Daily Mail, May 1, 2020)

“Clear link between vitamin D deficiency and severity of coronavirus(NUTRAgredients.com , April 28, 2020)

The studies themselves were less exciting than these headlines. They were not scientific experiments. There were no rigorous randomized trials with the health of subjects given Vitamin D compared to that of subjects given placebos. Instead, the researchers compared data from several countries.

No one knows the average vitamin D level in a country’s residents so the researchers used proxies that had been reported in other studies. These vitamin D estimates were then compared to the COVID-19 mortality rates in these countries and the comparison revealed a negative correlation. For example, residents of Italy and Spain have relatively high COVID-19 mortality rates and are thought to have relatively low levels of vitamin D.

One very big problem with such studies is the inevitability of chance patterns and correlations in large data bases. Even if COVID-19 deaths are randomly distributed among the population (and they surely aren’t), data mining will, more likely than not, discover a geographic cluster of victims. To demonstrate this, I created a fictitious city with 10,000 residents evenly spaced throughout the city, each having a one-in-a-hundred chance of a COVID-19 infection. I used a computer-based random number generator to determine which residents in this imaginary town would be the victims. The resulting COVID-19 map is shown below. Each black dot represents a victim. There are no COVID-19 victims in the white spaces.

There is clearly a COVID-19 cluster in the lower part of the map. It would easily be discovered by any half-decent data-mining program. If this were a real city, data-mining software could go on to ransack data, looking for something unusual about the neighborhood or the people living there.

Perhaps a city park is nearby. If we now compare COVID-19 rates for people who live near parks with rates for people who live far away, guess what? The COVID-19 rates are higher near parks, suggesting that people living near parks are more susceptible.

Figure 1 also shows a COVID-19 “fortress,” a part of town where nobody has the disease. There is surely something unusual in this COVID-19-free neighborhood that could be discovered by data-mining software or a Sunday drive. Perhaps the town’s water tower is nearby. If we now compare the COVID-19 rates for people who live near the water tower with COVID-19 rates for people who live far away, COVID-19 rates are lower near the water tower. That’s precisely why we chose this neighborhood. Because nobody there has COVID-19.

In each case, living near the parks and living near the water tower, we face the same problem. If we use the data to invent the theory, then of course the data support the theory! How could it be otherwise? Would we make up a theory that did not fit the data?

The same argument scales up to cities within a country or countries within the world. Some cities will inevitably have higher COVID-19 rates than others; so will some countries. So what? Because random data contain geographic clusters, the identification of clusters is not necessarily meaningful. The association, after the fact, of these clusters with some characteristics of the area or the people living in the area is not convincing scientific evidence of anything.

A compelling example of such spurious associations happened in the 1970s when an epidemiologist looked at the homes of people living in Denver, Colorado, who had died of cancer before the age of 19. Some cancer clusters were found. If these homes had been near soccer fields, she might have concluded that proximity to soccer fields causes cancer. (The grass? The fertilizer? The weed killer? The noise?) Instead, she found that many of the homes were near power lines, which set off nationwide fear, fueled by frightening stories.

A journalist named Paul Brodeur wrote three New Yorker articles that reported an unusually large number of cancer cases among people living near a power station in Guilford, Connecticut, and fifteen cancer cases among people working at a school near power lines in Fresno, California. Brodeur later put these articles (with minor changes) into a book titled The Great Power-Line Cover-Up: How the Utilities and Government Are Trying to Hide the Cancer Hazard Posed by Electromagnetic Fields (1993). He ominously warned that, “Thousands of unsuspecting children and adults will be stricken with cancer, and many of them will die unnecessarily early deaths, as a result of their exposure to power-line magnetic fields.”

Never mind that scientists know a lot about electromagnetic fields and that there is no plausible theory for how power line electromagnetic fields might cause cancer. The electromagnetic energy from power lines is far weaker than that from moonlight and the magnetic field is weaker than the Earth’s magnetic field. Experiments were done worldwide that supported the science and refuted the journalist. Weighing the theoretical arguments and empirical evidence, the National Academy of Sciences concluded that power lines were not a public health danger and that there was no need to fund further research, let alone tear down power lines. One of the nation’s top medical journals editorialized that we should stop wasting research resources on this question.

In 1999, The New Yorker published an article titled “The Cancer-Cluster Myth,” which implicitly repudiated its earlier articles by Paul Brodeur. That was an implicit retraction. Nonetheless, some people still think that power lines cause cancer. Once the toothpaste is out of the tube, it is hard to put it back in.

Will the reported link between vitamin D and COVID-19 end similarly? Nobody knows. What we do know is that citizens want simple explanations, even in situations like COVID-19 where the reality is extremely complicated and not very well understood. Age seems to matter, as do social distancing and mobility. So do the usual suspects: obesity, smoking, high blood pressure, heart disease, and diabetes. Perhaps race and diet matter. Timing certainly matters, as an area can change from no infections to devastation in a few weeks.

There are lots of models, but some are just extrapolations without explanations and others are just simulations of overly simplistic models. None explain the enormous differences between Ecuador and Colombia, Brazil and Argentina, Thailand and Vietnam, Iran and Iraq, let alone differences within countries. Why was the impact was so much worse in Italy’s Lombardy region than in neighboring Veneto? Why is the impact so different in Connecticut and Vermont? Why do poor countries tend to have relatively low risk while poor people living in wealthy countries have relative high risk?

Vitamin D is surely not the most important causal factor for COVID-19 mortality. Does it matter more than a trivial amount? We cannot tell from cluster studies.

Maybe vitamin D supplements could help a lot. Maybe they could help a little. Maybe they would do nothing at all. We won’t know until randomized controlled trials are done. Before you start binging on vitamin D, let’s see how the scientific tests now underway in France and Spain turn out. In the meantime, megadoses of vitamin D are not a good substitute for face masks and distancing.


Further reading:

Data mining: A plague, not a cure It is tempting to believe that patterns are unusual and their discovery meaningful; in large data sets, patterns are inevitable and generally meaningless. (Gary Smith)


Gary N. Smith

Gary N. Smith is the Fletcher Jones Professor of Economics at Pomona College. His research on financial markets statistical reasoning, and artificial intelligence, often involves stock market anomalies, statistical fallacies, and the misuse of data have been widely cited. He is the author of The AI Delusion, (Oxford, 2018)  and co-author (with Jay Cordes) of The 9 Pitfalls of Data Science (Oxford 2019), which won the Association of American Publishers 2020 Prose Award for “Popular Science & Popular Mathematics”

Vitamin D and COVID-19: Is It Data or Noise?