AI and Human Text: Indistinct?
Here's a mathematical proof that challenges the assumption that AI and human-made text are the same
What is a poor teacher to do? With AI everywhere, how can he reliably detect when his students are having ChatGPT write their papers for them? To address this concern, a number of AI text detector tools have emerged. But do they work?
A recent paper claims that AI-generated text is ultimately indistinguishable from human-generated text. The authors illustrate the claim with a couple of experiments that fool AI text detectors by making simple variations to AI-generated text. They then go on to mathematically prove their big claim: that it is ultimately impossible to tell AI text and human text apart. However, the authors make a crucial assumption.
The proof assumes that AI-generated text will become closer and closer to human-generated text until the two are the same. The proof then concludes it will be impossible to distinguish the AI text from human text once this happens. In other words, the conclusion is baked into the premises: the key assumption is that AI will eventually reach human-level intelligence.
But what if we don’t make that assumption? Then it becomes easy to mathematically prove the opposite, that AI text is always ultimately distinguishable from human text.
To understand the forthcoming proof, we need a quick aside into computer science. Computer science consists of computer programs and mathematical results about programs. One of the crucial questions is whether a computer program can ever generate more information than is contained within the program. The surprising answer is no, and the proof is quite simple.
Programs generate sequences of ones and zeros, and for every sequence of ones and zeros there is a smallest possible program that generates that sequence. The length of this smallest possible program is called the Kolmogorov complexity of the sequence.
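To make the idea concrete, here is a minimal Python sketch. Kolmogorov complexity itself cannot be computed, so the sketch swaps in the size of a zlib-compressed copy of the sequence as a rough upper-bound stand-in. The helper name compressed_length is hypothetical and chosen only for this illustration.

```python
import os
import zlib


def compressed_length(data: bytes) -> int:
    """Return the size in bytes of the zlib-compressed form of data.

    Assumption for illustration: this is only an upper-bound proxy.
    True Kolmogorov complexity is uncomputable, and a general-purpose
    compressor will usually overestimate it.
    """
    return len(zlib.compress(data, 9))


# A highly regular sequence: a very short program ("repeat '01' 5000 times")
# is enough to generate it, so it should compress well.
regular = b"01" * 5000

# Bytes from the operating system's entropy source: with overwhelming
# probability there is no description much shorter than the data itself.
random_like = os.urandom(10000)

print("regular:    ", len(regular), "bytes ->", compressed_length(regular))
print("random-like:", len(random_like), "bytes ->", compressed_length(random_like))
```

Both inputs are 10,000 bytes long, but the regular one should shrink to a tiny fraction of its size while the random-looking one should barely shrink at all. That gap is the intuition behind low versus high Kolmogorov complexity.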
Generating New, Complex Information?
This brings us to the proof that programs can never generate more information than they contain. We can use the concept of Kolmogorov complexity to describe the information in a sequence as the amount of computer code required to generate the sequence. Similarly, the amount of information within a computer program is no more than the length of the program. Translating this back into the question of whether programs can generate more information than they contain, the question becomes whether a program can generate a sequence whose Kolmogorov complexity is greater than the length of the program itself.
Let’s assume the program can do so, and it outputs a sequence with Kolmogorov complexity greater than the program’s own length. Recalling the definition of Kolmogorov complexity, this would mean the smallest possible program that can generate that sequence is larger than the program that just generated it. This, my dear readers, is a contradiction. Therefore, programs can never generate more information than they contain.
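The argument can be written compactly. The notation below is an assumption added for illustration and does not appear in the original text: U is a fixed universal machine, |p| is the length of program p in bits, and programs are taken to run without any input.

```latex
% Assumed notation: U is a fixed universal machine, |p| is the length of
% program p in bits, and programs take no input.
\[
  K(x) \;=\; \min \{\, |q| \;:\; U(q) = x \,\}
\]
% If p outputs x, then p is itself one of the candidates q in the minimum, so
\[
  U(p) = x \;\implies\; K(x) \le |p| .
\]
% Supposing K(x) > |p| for some output x of p would contradict the
% minimality in the definition of K.
```

If the program is written in a different language from the one used to define K, the same bound holds up to an additive constant that depends only on the two languages, not on the sequence.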
Circling back to the question of AI-generated text and to what degree it can be distinguished from human-generated text, let’s bring in our new friend Kolmogorov complexity. Based on the previous proof, we know that AI-generated text will never have greater Kolmogorov complexity than the program that generated it. On the other hand, humans are constantly writing new things, in particular new computer programs, and this suggests the Kolmogorov complexity of human writing is constantly increasing.
So, this means that for every single AI system, regardless of any extra obfuscation programs, there is always a Kolmogorov complexity threshold that limits the output of the system. Once a piece of human writing surpasses this threshold, we know without a doubt it was not produced by the AI system in question.
Furthermore, this also means we can take every single AI system, stuff them together, and get a single Kolmogorov complexity threshold. Any piece of text above that threshold cannot have been generated by any of those AI systems.
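The threshold can be stated a little more precisely. This is a sketch under assumptions that are not in the original text: each AI system is treated as a fixed program that runs without outside input, U is a fixed universal machine, and the symbols p_i, T and t are introduced here only for illustration.

```latex
% Assumed notation: p_1, ..., p_n are the AI systems viewed as input-free
% programs for a fixed universal machine U, |p_i| is the length of p_i in
% bits, K is Kolmogorov complexity, and t is a piece of text.
\[
  T \;=\; \max_{1 \le i \le n} |p_i|
\]
\[
  K(t) > T \;\implies\; U(p_i) \ne t \quad \text{for every } i .
\]
% A combined program built by stuffing all n systems together has length
% about \sum_i |p_i| + c for some overhead constant c. That length is no
% smaller than T, so using it as the threshold is also valid, only looser.
```

Note that the threshold is finite but grows with the size of the systems considered, so the argument says nothing about text whose complexity falls below it.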
So, there you have it, a mathematical proof that AI and human text are always ultimately distinguishable. QED.