
How the Father of Information Theory Invented Modern AI

In 1948 Claude Shannon used Andrey Markov’s 1906 process to formulate an approach that enabled the development of chatbots (large language models)

The large language models (LLMs) that power the modern generative AI revolution seem innovative and groundbreaking. But what if I told you that the fundamental idea behind them is older than the computer itself?

Andrey Markov/Public domain.

The fundamental idea behind them is known as a Markov process. It was identified by Russian mathematician Andrey Markov (1856–1922) in 1906. In a Markov process, the next state depends only on the immediately previous one. The physical world is a Markov process where each instant proceeds from the immediate prior instant.

Claude Shannon and the Markov process

In his influential 1948 paper, “A Mathematical Theory of Communication,” Claude Shannon (1916–2001) introduced the mathematical fundamentals of digital communication. He also described a rudimentary language-generating system based on a Markov process. His aim was to show that the information content of such a source could be measured, and therefore that we could determine how much information can be transmitted over a noisy communication channel with minimal loss.

Shannon’s system for generating language is based on a lookup table of English words, built by calculating the probability that a word follows a specific sequence of other words. Given a sequence of words, the table yields a probability distribution over possible next words, and the next word is selected at random according to that distribution.
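To make the idea concrete, here is a minimal Python sketch of such a lookup table. It is not Shannon’s original code: it conditions on only one preceding word for brevity, and the tiny corpus is invented purely for illustration.

```python
import random
from collections import defaultdict, Counter

# A tiny invented corpus, just for illustration.
corpus = ("i got the report . i am tied up this morning . "
          "i will call you this afternoon . i will call you this evening .").split()

# Build the lookup table: for each word, count the words that follow it.
table = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    table[prev_word][next_word] += 1

def pick_next(word):
    """Sample the next word from the distribution of words seen after `word`."""
    counts = table[word]
    return random.choices(list(counts), weights=list(counts.values()))[0]

# Generate text one word at a time, starting from "i".
word, generated = "i", ["i"]
for _ in range(10):
    word = pick_next(word)
    generated.append(word)
print(" ".join(generated))
```

Shannon’s versions conditioned on longer sequences of preceding words work the same way; the table simply grows much faster.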

We see a simple version of this process on cell phone messaging systems. For example, if the sequence typed is “Yes, I got the report. I am tied up this morning. I will call you this… ,” the system may suggest “afternoon,” “evening,” or “week” as choices that it can autocomplete. Based on general usage, those are the most likely words to appear in that slot.

Claude Shannon/Tekniska Museet CC 2.0

This technique is used by modern large language models. A sequence of word tokens is fed into the LLM, which produces a probability distribution over possible next tokens. That distribution is used to pick the next token, which is appended to the sequence, and the process repeats. This forms a Markov process. As Shannon put it,

“Stochastic processes of the type described above are known mathematically as discrete Markoff processes and have been extensively studied in the literature… To make this Markoff process into an information source we need only assume that a letter is produced for each transition from one state to another. The states will correspond to the ‘residue of influence’ from preceding letters.”

The only difference between Shannon’s model and today’s LLMs is that LLMs do not use an explicit lookup table. Instead, they encode a compressed form of that table in a sophisticated neural network known as a transformer. Yet despite the transformer’s sophistication, Shannon’s lookup table approach would achieve exactly the same result, although it is not practical to use.

The two approaches are equivalent because both are Markov processes in which the generation of each word depends only on the immediately previous state, namely the context window.
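In sketch form, the generation loop looks the same whether the probabilities come from a lookup table or from a transformer. In the minimal Python sketch below, predict_next_distribution is a hypothetical stand-in for the trained model (not any real library’s API), and the toy probabilities are invented for illustration.

```python
import random

def predict_next_distribution(context_window):
    """Hypothetical stand-in for a trained transformer: map the current
    context window to a probability distribution over the next token.
    A real LLM computes this with a neural network; this toy rule is
    only an illustration."""
    if context_window and context_window[-1] == "this":
        return {"afternoon": 0.5, "evening": 0.3, "week": 0.2}
    return {"I": 0.4, "will": 0.3, "call": 0.2, "this": 0.1}

WINDOW_SIZE = 8  # illustrative context-window length

def generate(prompt_tokens, n_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_tokens):
        # The Markov "state" is simply the current context window.
        state = tokens[-WINDOW_SIZE:]
        probs = predict_next_distribution(state)
        next_token = random.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(next_token)
    return tokens

print(" ".join(generate(["I", "will", "call", "you", "this"], 5)))
```

Swap the toy rule for a transformer and the structure is unchanged: the next token depends only on the current context window, which is exactly the Markov property.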

A theory ahead of its time

Like his theory of communication, Shannon’s theory of language generation was far ahead of its time. To use Shannon’s original lookup table concept to achieve state-of-the-art LLM performance today, we would need a table with a number of entries on the order of 200,000 tokens raised to the power of 100 million, the length of the current largest token window. Needless to say, such a table would not fit within our universe. Making Shannon’s idea practical required the breakthrough of modern neural network architectures.
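A rough back-of-the-envelope calculation shows why, taking 200,000 as the number of possible tokens and 100 million as the window length:

$$200{,}000^{100{,}000{,}000} = 10^{\,100{,}000{,}000 \times \log_{10} 200{,}000} \approx 10^{\,5.3 \times 10^{8}}$$

That is a number with roughly 530 million digits; for comparison, the observable universe is commonly estimated to contain only about $10^{80}$ atoms.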

However, Shannon anticipated such a breakthrough. He used the analogy of crossword puzzles to explain that the redundancy in human language is what allows it to be compressed into a structure like a neural network. If there is too little redundancy, every sequence of words is as valid as any other, there is nothing for compression to exploit, and astronomically large lookup tables result. If there is too much redundancy, the language is so constrained that everything is essentially determined in advance, leaving no room for variety or creativity. Like the perfect bowl of porridge in the folk tale of Goldilocks and the Three Bears, he reasoned that human language has just the right amount of redundancy to allow both a practically compressible structure and enough variety to keep things interesting:

The redundancy of a language is related to the existence of crossword puzzles. If the redundancy is zero any sequence of letters is a reasonable text in the language and any two-dimensional array of letters forms a crossword puzzle. If the redundancy is too high the language imposes too many constraints for large crossword puzzles to be possible. A more detailed analysis shows that if we assume the constraints imposed by the language are of a rather chaotic and random nature, large crossword puzzles are just possible when the redundancy is 50%. If the redundancy is 33%, three-dimensional crossword puzzles should be possible, etc.

Model collapse as a drift toward randomness

Because LLMs are Markov processes, they exhibit the key properties of Markov processes. Significantly, they tend toward the most probable states, much as the second law of thermodynamics drives physical systems toward their most probable states. This also explains why an LLM degrades when it is trained on its own output (model collapse): low-probability states are generated less and less frequently and eventually disappear from the model. At the same time, random transitions come to dominate the output.

The end result is a samey, yet incoherent, mishmash of words, similar to flipping a coin: every sequence of heads and tails you get will seem as random as every other, but no two sequences will match.
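The drift can be demonstrated with a small simulation. The Python sketch below uses an invented toy corpus and a simple bigram table (not a real LLM): it repeatedly re-trains a Markov model on its own output and tracks the probability of the rare continuation “week.” In most runs that probability drifts and the word eventually disappears, though the exact behavior varies from run to run.

```python
import random
from collections import defaultdict, Counter

def build_table(words):
    """Estimate a bigram lookup table (next-word counts) from a word list."""
    table = defaultdict(Counter)
    for prev_word, next_word in zip(words, words[1:]):
        table[prev_word][next_word] += 1
    return table

def generate(table, start, length):
    """Sample a word sequence from the table, starting from `start`."""
    word, out = start, [start]
    for _ in range(length):
        counts = table.get(word)
        if not counts:
            break  # dead end: no observed continuation
        word = random.choices(list(counts), weights=list(counts.values()))[0]
        out.append(word)
    return out

# Invented toy corpus: "afternoon" is the common continuation, "week" the rare one.
corpus = ("i will call you this afternoon . " * 19 + "i will call you this week . ").split()

table = build_table(corpus)
for generation in range(10):
    counts = table["this"]
    p_week = counts["week"] / max(sum(counts.values()), 1)
    print(f"generation {generation}: P(week | this) = {p_week:.3f}")
    # Model collapse: re-train the table on the model's own output.
    table = build_table(generate(table, "i", 200))
```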

So there you have it. A mathematical structure invented in the early 1900s, and applied to language in the middle of the century, is the basis of modern AI. In the AI world, just as in a Markov process, the more things change, the more they stay the same.


Eric Holloway

Senior Fellow, Walter Bradley Center for Natural & Artificial Intelligence
Eric Holloway is a Senior Fellow with the Walter Bradley Center for Natural & Artificial Intelligence, and holds a PhD in Electrical & Computer Engineering from Baylor University. A Captain in the United States Air Force, he served in the US and Afghanistan. He is the co-editor of Naturalism and Its Alternatives in Scientific Methodologies.
