Step Away From Stepwise Regression (and Other Data Mining)
Stepwise regression, which is making a comeback, is just another form of HARKing: Hypothesizing After the Results are Known
There is a strong correlation between the number of lawyers in Nevada and the number of people who died after tripping over their own two feet. There are similarly impressive correlations between U.S. crude oil imports and the per capita consumption of chicken, and between the number of letters in the winning word in the Scripps National Spelling Bee and the number of people killed by venomous spiders. If you find these amusing (as I do), there are many more at the website Spurious Correlations.
These silly statistical relationships are intended to demonstrate that correlation is not causation. But no matter how often or how loudly statisticians shout that warning, many people do not hear it.
When there is a correlation between variables A and B, it could be that:
● A causes B, or B causes A. A 2018 survey found that Millennials living with their parents had less income than did Millennials living on their own: “Contrary to received wisdom, it may be smarter for Millennials to fly the nest sooner than later.” Perhaps in many cases, it was not that Millennials had low income because they lived at home but, instead, that they lived at home because they had low income.
● Other factors may cause both A and B. Several years ago, a strong correlation was noticed between the annual number of marriages and beer sales in the United States. Does drinking lead to marriage? Does marriage lead to drinking? A more likely explanation is that population growth causes many things (including marriages and beer consumption) to increase over time. It is just a fleeting coincidence, like the correlation between crude oil imports and the per capita consumption of chicken.
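The common-cause case in the second bullet is easy to reproduce. The sketch below is a toy illustration, not real data: the growth rate, the per-person rates, and the names `population`, `marriages`, and `beer_sales` are all made-up assumptions. Two totals that are unrelated per person still move together because both scale with a growing population.

```python
import math
import random

def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    num = sum(x * y for x, y in zip(da, db))
    return num / math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))

random.seed(1)
# Hypothetical figures: population grows 1% a year for 60 years.
population = [100.0 * 1.01 ** t for t in range(60)]
# Each total is population times a per-person rate that wobbles
# independently by up to 5%; the two rates never influence each other.
marriages = [0.008 * p * random.uniform(0.95, 1.05) for p in population]
beer_sales = [20.0 * p * random.uniform(0.95, 1.05) for p in population]

r_total = pearson(marriages, beer_sales)
# Dividing by population removes the shared driver.
r_per_capita = pearson([m / p for m, p in zip(marriages, population)],
                       [b / p for b, p in zip(beer_sales, population)])
print(f"correlation of totals:     {r_total:.2f}")
print(f"correlation of per capita: {r_per_capita:.2f}")
```

Under these assumptions, the correlation between the totals should come out high, while the per capita series, with the shared driver removed, should show little relationship.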
Our distant ancestors benefitted from noticing that elephants could lead them to water and that wildebeest stampedes might warn them of predators. The best pattern spotters were more likely to survive and reproduce, passing on their pattern-recognition skills to future generations. The problem today is that instead of observing elephants trudging toward water holes and wildebeest fleeing lions, we are inundated by complex data for stock prices, election polls, tweeted words, and pretty much anything that can be measured. We are hard-wired to look for patterns and tempted to believe that the patterns we discover are meaningful. But the data deluge has made the number of bogus coincidences waiting to be discovered so large, relative to the number of real causal relationships, that it is almost certain that a randomly discovered pattern is just a statistical coincidence.
The search for meaningful coincidences can infect science too
Even scientists are sometimes seduced by the idea that correlation supersedes causation. The scientific method requires hypotheses to be tested empirically. With the data deluge and powerful computers, it is tempting to data mine by rummaging through data looking for correlations and other statistical patterns, unencumbered by preconceived theories. This is known as Hypothesizing After the Results are Known (HARKing), with the harsh sound of the word intended to convey the message that it is bad statistical practice. Yet, many now consider HARKing to be a virtue, not a vice.
Suppose that a researcher, trying to predict whether a person who has been infected by COVID will die, has data for hundreds of variables. Instead of choosing a relatively small number of logically relevant variables, the researcher lets a computer algorithm try a large number of possible combinations of a large number of explanatory variables. Even with modern computers, the number of calculations can quickly become overwhelming. An attempt to find the best 10 variables out of 100 candidate variables requires a consideration of 17.3 trillion possible combinations. With 1000 possible explanatory variables, there are 2.6 × 10²³ combinations; with one million candidate variables, the number of possibilities grows to about 2.8 × 10⁵³.
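The counts quoted above are binomial coefficients: the number of ways to choose 10 variables from n candidates. A short check with Python's standard-library `math.comb`:

```python
from math import comb

# Number of distinct 10-variable subsets that can be drawn from n candidates.
for n in (100, 1_000, 1_000_000):
    print(f"n = {n:>9,}: {comb(n, 10):.3e} possible combinations")
```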
Instead of trying all possible combinations, many researchers use a technique called stepwise regression that was first proposed more than 70 years ago. The stepwise procedure starts with no explanatory variables and then adds variables, one by one, based on which variable is the most statistically significant, until there are no remaining statistically significant variables.
This is straight-ahead HARKing. The problem is that the procedure discovers correlations without fretting about causation and is very likely to discover worthless coincidences instead of useful predictors. This is, of course, the fundamental problem with all AI and machine learning algorithms, which are ruthlessly efficient at finding statistical patterns but utterly incapable of judging whether these patterns are meaningful or meaningless. Nonetheless, far too many researchers believe that correlation is enough.
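The add-one-variable-at-a-time procedure described above can be sketched in a few dozen lines. This is a minimal illustration, not any particular package's implementation. The function name `forward_stepwise` is hypothetical, and the cutoff `t_crit=2.0` is an assumed stand-in for "statistically significant" at roughly the 5 percent level. Instead of refitting a full regression at each step, the sketch partials each chosen variable out of y and out of the remaining candidates, which gives the same t-statistic a candidate would have if it were added to the model.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def remove(v, z):
    """Return v minus its least-squares projection on z."""
    b = dot(v, z) / dot(z, z)
    return [x - b * w for x, w in zip(v, z)]

def forward_stepwise(columns, y, t_crit=2.0):
    """Add the most significant candidate until none has |t| >= t_crit.

    columns: list of candidate variables, each a list of n numbers.
    Returns the indices of the chosen columns in order of entry.
    """
    n = len(y)
    y_res = center(y)                     # centering plays the role of an intercept
    pool = {j: center(c) for j, c in enumerate(columns)}
    chosen = []
    while pool:
        df = n - len(chosen) - 2          # residual d.f. if one more variable enters
        syy = dot(y_res, y_res)
        if df < 1 or syy < 1e-12:
            break
        best_j, best_t = None, 0.0
        for j, c in pool.items():
            scc = dot(c, c)
            if scc < 1e-12:               # candidate already explained by chosen ones
                continue
            r2 = min(dot(c, y_res) ** 2 / (scc * syy), 1 - 1e-12)
            t = math.sqrt(df * r2 / (1 - r2))   # |t| of this candidate if added
            if t > best_t:
                best_j, best_t = j, t
        if best_j is None or best_t < t_crit:
            break
        z = pool.pop(best_j)
        chosen.append(best_j)
        # Partial the new variable out of y and out of every remaining candidate.
        y_res = remove(y_res, z)
        pool = {j: remove(c, z) for j, c in pool.items()}
    return chosen

# Demo: y and all 50 candidates are independent noise, so any pick is a fluke.
random.seed(0)
n = 100
noise_cols = [[random.gauss(0, 1) for _ in range(n)] for _ in range(50)]
y_noise = [random.gauss(0, 1) for _ in range(n)]
picked = forward_stepwise(noise_cols, y_noise)
print(f"{len(picked)} of 50 pure-noise variables entered the model")
```

Nothing in the loop asks whether a variable makes sense. The only test a candidate has to pass is a large t-statistic in this particular sample.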
The pitfalls of stepwise regression
I used a series of Monte Carlo simulations to demonstrate the dangers of stepwise regression (and, implicitly, other data mining techniques). In each simulation, I generated values for up to 1000 candidate explanatory variables and then used 5 randomly selected “true” variables to determine the values of the dependent variable that is to be predicted. The other candidate explanatory variables are nuisance variables that do not affect the dependent variable but might be coincidentally correlated with it.
The research question is how effective stepwise regression is at identifying the true explanatory variables so that reliable predictions can be made with fresh data. I found that, even with only 100 candidate variables, it is more likely than not that a variable chosen by the stepwise procedure is a nuisance variable, rather than a true variable. As the number of candidate variables increases, the chance that the true variables will be overlooked quickly approaches 1.
A Danish proverb says that “It is difficult to make predictions, especially about the future.” The stepwise models’ predictions were consistently impressive for the data used to estimate the models but disappointing when applied to fresh data. The models did best when only a small number of nuisance variables was considered; for example, 5 true variables and 5 nuisance variables rather than 5 true variables and hundreds of nuisance variables. It is better to choose variables wisely than to HARK your way to nonsense.
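A scaled-down experiment in the spirit of the simulations described above can be run with the standard library alone. This is not the simulation code behind the results reported here. The sample sizes, unit coefficients, noise level, number of trials, the cutoff `t_crit=2.0`, and every function name are assumptions chosen to keep the sketch short and fast. It restates a forward-selection routine so that it runs on its own, with one extension: slopes are estimated on the first `m` rows only, and the same transformations are applied to the remaining rows, so the residuals on those rows are prediction errors for fresh data.

```python
import math
import random

def dot(a, b, m):
    """Inner product over the first m (estimation) rows only."""
    return sum(a[i] * b[i] for i in range(m))

def center(v, m):
    mean = sum(v[:m]) / m
    return [x - mean for x in v]            # shift every row by the estimation mean

def remove(v, z, m):
    b = dot(v, z, m) / dot(z, z, m)         # slope estimated on the first m rows
    return [x - b * w for x, w in zip(v, z)]

def stepwise_with_holdout(columns, y, m, t_crit=2.0):
    """Forward selection fitted on rows [0, m); later rows are fresh data.

    Returns (chosen column indices, residuals of y for all rows).
    """
    y_res = center(y, m)
    pool = {j: center(c, m) for j, c in enumerate(columns)}
    chosen = []
    while pool:
        df = m - len(chosen) - 2
        syy = dot(y_res, y_res, m)
        if df < 1 or syy < 1e-12:
            break
        best_j, best_t = None, 0.0
        for j, c in pool.items():
            scc = dot(c, c, m)
            if scc < 1e-12:
                continue
            r2 = min(dot(c, y_res, m) ** 2 / (scc * syy), 1 - 1e-12)
            t = math.sqrt(df * r2 / (1 - r2))
            if t > best_t:
                best_j, best_t = j, t
        if best_j is None or best_t < t_crit:
            break
        z = pool.pop(best_j)
        chosen.append(best_j)
        y_res = remove(y_res, z, m)
        pool = {j: remove(c, z, m) for j, c in pool.items()}
    return chosen, y_res

def r_squared(y, resid):
    mean = sum(y) / len(y)
    return 1 - sum(e * e for e in resid) / sum((v - mean) ** 2 for v in y)

def one_trial(n_nuisance, rng, m=60, n_true=5):
    """One simulated data set: m rows to estimate on, m fresh rows to predict."""
    n = 2 * m
    cols = [[rng.gauss(0, 1) for _ in range(n)]
            for _ in range(n_true + n_nuisance)]
    # Columns 0..n_true-1 are the true variables; the rest never affect y.
    y = [sum(cols[j][i] for j in range(n_true)) + rng.gauss(0, 2)
         for i in range(n)]
    chosen, resid = stepwise_with_holdout(cols, y, m)
    nuisance_picked = sum(1 for j in chosen if j >= n_true)
    return (nuisance_picked, len(chosen),
            r_squared(y[:m], resid[:m]), r_squared(y[m:], resid[m:]))

rng = random.Random(0)
for n_nuisance in (5, 50, 150):
    trials = [one_trial(n_nuisance, rng) for _ in range(4)]
    picks = sum(t[1] for t in trials)
    share = sum(t[0] for t in trials) / max(picks, 1)
    r2_in = sum(t[2] for t in trials) / len(trials)
    r2_out = sum(t[3] for t in trials) / len(trials)
    print(f"{n_nuisance:>3} nuisance variables: nuisance share of picks {share:.2f}, "
          f"R2 in-sample {r2_in:.2f}, R2 on fresh data {r2_out:.2f}")
```

The quantities to compare across rows are the share of selected variables that are nuisance variables and the gap between the in-sample fit and the fit on fresh data. With so few trials the figures are noisy, so a serious version would use many more repetitions.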
Years ago, stepwise regression was so thoroughly discredited that a psychology journal announced that authors should not bother submitting papers using stepwise regression. Now, stepwise regression is back, more popular than ever, because the data deluge makes it impractical for data miners to try all possible combinations of variables.
Wouldn’t you know it, a few weeks ago a young professor with a PhD from Harvard gave a seminar at my college and used a stepwise procedure to deal with a very large data set. He spent more time talking about the math involved in his procedure than about the question of whether he had a useful model or spurious correlations—which is really the only question that matters.
BTW: Did you know that there is a strong correlation between the per capita consumption of margarine and the divorce rate in Maine?