Last time, we talked about using the information theory concept of entropy to make better choices. At this point, we can bring the concept of entropy together with our other idea of minimizing the size of our model. As I noted earlier, there is a connection between entropy and model size. Identifying this connection will allow us to relate the two concepts in such a way that we can calculate the probability that we are generalizing instead of merely memorizing data, i.e. learning the concepts instead of the cheatsheet.
As we discussed previously, if we represent our data with a very large model, then it is unlikely we’ve learned a concept from the data. On the other hand, if the model is small, then we can be more confident that the model has generalized. This is where the connection between model size and entropy comes into play. Entropy is a measure of information. Additionally, we can measure our model in bits, which is likewise a measure of information. In fact, H bits of entropy require H bits of binary storage. So, a nice one-to-one correspondence allows us to calculate how many storage bits we need for a certain quantity of entropy.
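To make the correspondence concrete, here is a minimal sketch of computing entropy in bits from a frequency distribution (the function name `entropy_bits` is ours, not from the series):

```python
import math

def entropy_bits(counts):
    """Shannon entropy, in bits, of the frequency distribution given by counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log2(p)  # each outcome contributes -p * log2(p) bits
    return h

# A fair 50/50 source carries exactly 1 bit of entropy per draw,
# so N fair coin flips need N bits of storage.
print(entropy_bits([1, 1]))               # 1.0
# A skewed 90/10 source carries less than 1 bit per draw.
print(round(entropy_bits([9, 1]), 3))     # 0.469
```

The skewed source needs fewer storage bits per draw, which is exactly the one-to-one correspondence described above.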
Likewise, we can calculate a probability bound from entropy. Entropy also happens to be an upper bound on the binomial distribution.
Imagine we have a bin of black and white balls. If we draw J white balls and K black balls and calculate the entropy H from this frequency distribution, then the probability of drawing this particular sequence of balls from the binomial distribution is at most 2 to the power -H*(J+K).
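We can check this bound numerically. The sketch below (our illustration, with arbitrary example numbers) computes the entropy H of the observed frequencies and confirms that, whatever the true proportion of white balls in the bin, the probability of the drawn sequence never exceeds 2^(-H*(J+K)):

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a two-outcome distribution with probabilities p and 1-p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

J, K = 7, 3           # example draw: 7 white balls, 3 black balls
N = J + K
H = binary_entropy(J / N)
bound = 2 ** (-H * N)  # the entropy upper bound on the sequence probability

# For any bin proportion theta, the probability of drawing this exact
# sequence is theta^J * (1 - theta)^K, and it never exceeds the bound.
for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    p_seq = theta ** J * (1 - theta) ** K
    assert p_seq <= bound + 1e-12

print(f"bound 2^(-H*N) = {bound:.6f}")
```

The bound is met with equality when theta equals the observed frequency J/N (here 0.7), which is why the entropy of the empirical distribution is the right quantity to use.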
Now, to bring these quantities together, we need one more number. This number is the probability of events happening completely by chance, which is calculated using the uniform distribution. In our ball example, this means a 50% probability of drawing either a white or a black ball, the same as flipping a fair coin. Thus, the probability of drawing a particular sequence of N white and black balls is 2 to the power -N. This is the distribution we use to make guesses when we have absolutely no information about the problem. It also corresponds to the entropy of the distribution with no information: for N balls, or N fair coin flips, there are N bits of entropy.
Finally, we can perform our calculation that estimates a probability we’ve generalized based on our model size. We want our calculation to demonstrate the notion that if we have high accuracy and a small model, then we have high confidence of generalizing. Intuitively, then, we add the model size to the accuracy and subtract this quantity from the entropy of having absolutely no information about the problem.
Thus, for N guesses, we calculate the overall entropy of our guess accuracy as H*N bits, where H is the entropy per guess, and the size of our model as M bits. This gives us the quantity E = N - H*N - M. And based on the preceding logic, it turns out that we can use E to calculate an upper bound on the probability that the model achieved its result by chance, i.e., that it did not generalize. This probability is calculated simply as 2 to the power -E.
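Putting the pieces together, here is a sketch of the E calculation with hypothetical numbers (the function name `egi` and the example figures are ours):

```python
import math

def binary_entropy(p):
    """Entropy, in bits, of a two-outcome distribution with probabilities p and 1-p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def egi(n_guesses, accuracy, model_bits):
    """E = N - H*N - M: bits of evidence that the model generalized."""
    h = binary_entropy(accuracy)  # per-guess entropy of the accuracy
    return n_guesses - h * n_guesses - model_bits

# Hypothetical scenario: 1000 guesses at 90% accuracy, with a 100-bit model.
E = egi(1000, 0.9, 100)
p_chance = 2 ** (-E)  # upper bound on achieving this result by chance

print(f"E = {E:.1f} bits, chance probability <= {p_chance:.3e}")
```

With these numbers E comes out to roughly 431 bits, so the probability of getting the result by chance is astronomically small; if the model instead needed, say, 600 bits, E would be negative and the calculation would give no confidence of generalization.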
Thus, if E is very large, then it is very likely the model has generalized, since the probability of achieving the result by chance becomes vanishingly small.
We call the metric E the Empirical Generalized Information, or EGI for short.
Next: Part 8: Practical applications of empirical generalized information (EGI)
Machine learning isn’t difficult; just different. A few simple principles open many doors:
Part 1 in this series by Eric Holloway is The challenge of teaching machines to generalize. Teaching students simply to pass tests provides a good illustration of the problems. We want the machine learning algorithms to learn general principles from the data we provide and not merely little tricks and nonessential features that score high but ignore problems.
Part 2: Supervised Learning. Let’s start with the most common type of machine learning, distinguishing between something simple, like big and small.
Part 3: Don’t snoop on your data. You risk using a feature for prediction that is common to the dataset, but not to the problem you are studying.
Part 4: Occam’s Razor Can Shave Away Data Snooping. The greater an object’s information content, the lower its probability. If there is only a small probability that our model is mistaken with the data we have, there is a very small probability that our model will be mistaken with data we do not have.
Part 5: Machine Learning: Using Occam’s Razor to generalize. A simple thought experiment shows us how. This approach is contrary to Fisher’s method where we formulate all theories before encountering the data. We came up with the model after we saw the cards we drew.
Part 6: Machine Learning: Trees can solve murders. Decision trees increase our chance at a right answer. We can see how that works in a mystery board game like Clue. Entropy is a concept in information theory that characterizes the number of choices available when a probability distribution is involved. Understanding how it works helps us make better guesses.
For more general background on machine learning:
Part 1: Navigating the machine learning landscape. To choose the right type of machine learning model for your project, you need to answer a few specific questions (Jonathan Bartlett)
Part 2: Navigating the machine learning landscape — supervised classifiers. Supervised classifiers can sort items like posts to a discussion group or medical images, using one of many algorithms developed for the purpose. (Jonathan Bartlett)