Machine Learning, Part 3: Don’t Snoop on Your Data
You risk using a feature for prediction that is common to the dataset, but not to the problem you are studying
To make the technique for dividing data that we discussed in Part 2 work for us, we must divide it before we look at it. Otherwise, we risk the serious sin (in data science) of “data snooping.”
If we data snoop, and form our prediction model by looking at the whole dataset before dividing it, we can inadvertently use a feature for prediction that is common to the dataset, but not to the problem we are studying.
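To make "divide before we look" concrete, here is a minimal sketch in Python. The athlete records, the function name split_before_looking, and the 25% test fraction are all assumptions made up for illustration; they are not taken from Part 2. The point is only the order of operations: the test portion is set aside first, and everything we examine afterward comes from the training portion.

```python
import random

def split_before_looking(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and set aside a test portion, sight unseen."""
    rng = random.Random(seed)
    shuffled = list(records)          # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# Hypothetical (height_cm, weight_kg, label) records, invented for this sketch.
athletes = [
    (158, 52, "jockey"), (160, 54, "jockey"),
    (162, 50, "jockey"), (157, 53, "jockey"),
    (198, 95, "basketball"), (203, 101, "basketball"),
    (201, 99, "basketball"), (196, 97, "basketball"),
]

train, test = split_before_looking(athletes)

# Explore and build the model using `train` only.
# `test` stays unopened until the model is finished.
print(len(train), "training records;", len(test), "held back for testing")
```

A fixed seed is used here only so the split is repeatable; any split works, as long as it happens before we start studying the data.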
For example, suppose someone said, “I saw a rabbit-shaped cloud at the same time they had a sale on sneakers, so I assume rabbit-shaped clouds can predict sneaker sales.” The dataset has an accidental pattern that happens to be true in that instance, but it was merely a coincidence and will not help in the future.
This means, in turn, that merely achieving a high level of accuracy in prediction on the test dataset is not a good enough indication that we’ve discovered a good model. If the high accuracy scores are due to “data snooping” then the model is still suspect and may perform badly on new data.
In our example of athletes from Part 2, say that all the jockeys just happened to have weights that are even numbers and the basketball players all have weights that are odd numbers. We look at our data, notice this strange fact, and add it to our model, which then achieves perfect accuracy! But it completely misses the height and weight thresholds. Our model then goes on to fail abysmally in a competition for guessing which kind of athlete someone is. This is an example of the perils of data snooping.
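The even/odd trap can be sketched in a few lines of Python. All of the numbers below are invented for illustration, and the 180 cm height cutoff is an assumption standing in for whatever thresholds Part 2 arrived at. The sketch compares a "snooped" parity rule against a simple threshold rule, first on the data we peeked at and then on new athletes where the coincidence no longer holds.

```python
# Hypothetical (height_cm, weight_kg, label) records, invented for this sketch.
# By coincidence, every jockey has an even weight and every
# basketball player has an odd weight.
observed = [
    (158, 52, "jockey"), (160, 54, "jockey"), (162, 50, "jockey"),
    (198, 95, "basketball"), (203, 101, "basketball"), (201, 99, "basketball"),
]

# New athletes, where the even/odd coincidence does not hold.
new_athletes = [
    (159, 53, "jockey"), (161, 51, "jockey"),
    (199, 96, "basketball"), (204, 100, "basketball"),
]

def parity_rule(height_cm, weight_kg):
    """The 'snooped' model: even weight means jockey."""
    return "jockey" if weight_kg % 2 == 0 else "basketball"

def threshold_rule(height_cm, weight_kg, height_cutoff=180):
    """A threshold model; the 180 cm cutoff is an assumed value."""
    return "jockey" if height_cm < height_cutoff else "basketball"

def accuracy(rule, records):
    """Fraction of records the rule labels correctly."""
    correct = sum(rule(h, w) == label for h, w, label in records)
    return correct / len(records)

for name, rule in [("parity", parity_rule), ("threshold", threshold_rule)]:
    print(name,
          "observed:", accuracy(rule, observed),
          "new:", accuracy(rule, new_athletes))
```

On the observed records both rules look perfect, so accuracy alone cannot tell them apart. Only the new athletes reveal that the parity rule captured an accident of the dataset rather than anything about jockeys and basketball players.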
So, how can we guard against data snooping? That is often very difficult. For example, as we gather our data in the first place, perhaps through science experiments, we almost inevitably will observe it to some extent. The founder of modern statistical thought, Ronald Fisher, proposed that we formulate all our hypotheses before any encounter with the data. In our even/odd weight scenario, had we stated our hypotheses in advance, it’s unlikely we would have come up with the even/odd model by chance; we would instead have settled for some sort of threshold search.
However, is Ronald Fisher’s idea even realistic? How can we even begin to come up with theories before looking at our data? We surely must know something about the situations we wish to observe. And, it seems that we, in fact, come up with legitimate theories all the time in everyday life even after observing events around us. There must be some way to derive theories after the fact when we have already observed at least some of the data.
Intelligent design theory (IDT) has proposed such a method, the explanatory filter. The key to the filter is that we derive our theories from a knowledge source that is independent of the data. This is, incidentally, not really different from Fisher’s method, because proposing a theory before the fact makes it independent of the data. However, the explanatory filter generalizes Fisher’s method by removing the unnecessary restriction that the theory be declared before the fact.
This means that as long as we can establish that our theories, hypotheses, and/or models are independent of the data, then we can trust that their predictive power will generalize beyond the data we have observed. And there are a variety of ways to guarantee independence besides Fisher’s before-the-fact method. Next, we will examine one method in particular: Occam’s Razor.
Machine learning isn’t difficult, just different. A few simple principles open many doors:
Part 1 in this series by Eric Holloway is The challenge of teaching machines to generalize. Teaching students simply to pass tests provides a good illustration of the problems. We want the machine learning algorithms to learn general principles from the data we provide and not merely little tricks and nonessential features that score high but ignore problems.
Part 2: Supervised Learning. Let’s start with the most common type of machine learning, distinguishing between something simple, like big and small.
For more general background on machine learning:
Part 1: Navigating the machine learning landscape. To choose the right type of machine learning model for your project, you need to answer a few specific questions (Jonathan Bartlett)
Part 2: Navigating the machine learning landscape — supervised classifiers: Supervised classifiers can sort items like posts to a discussion group or medical images, using one of many algorithms developed for the purpose. (Jonathan Bartlett)