On Thursday, November 2, COSM 2023 hosted a panel on “The Quintessential Limits and Possibilities of AI,” addressing one of the fundamental questions COSM seeks to investigate: “Is AI ‘generative’ or degenerative?” If these experts are right, AI might be doomed to eventually degenerate into nonsense.
George Montañez, Assistant Professor of Computer Science at Harvey Mudd College, opened the session by explaining how AI works. Modern AIs and their “large language models” (LLMs) are trained on huge sets of real-world data — namely, text and images generated by humans. Panelist William Dembski, a mathematician and philosopher, pointed out that these LLMs “require a lot of input data and training” in order to work. For example, he noted that it took an immense amount of data, time, and money, along with collateral damage to humans, to train AI to recognize and reject pornography. Similarly, software engineer Walter Myers noted on the panel that ChatGPT had to train on millions of images of cats and dogs before it could reliably recognize them. In contrast, Montañez pointed out that a human child can see just a few pictures of an animal and immediately recognize that species for life.
Montañez further explained that after enough training, AI can interpret data “beyond the things it’s seeing” — but it can do so only because of “biases and assumptions” supplied by the humans who program those capabilities into it. This means that “human fingerprints are all over” the capabilities of AI, and “as impressive as these systems are,” they are “highly parasitic on human rationality and creativity.” Montañez gave the example of an AI that remixes rap with Shakespeare: you “might think it’s amazing,” but the reality is “it’s all based upon human programming,” he explained.
But there’s a pitfall to training AI on large datasets — something Denyse O’Leary recently wrote about — called “model collapse.” In short, AI works because humans are genuinely creative beings: AIs are trained on gigantic, diverse datasets of human-made material, and from that training they begin to appear to think and reason like a human. Until now, this has been possible because human beings have created almost everything we see on the Internet. As AIs scour the entire internet, they can trust that virtually everything they find was originally made by intelligent and creative beings (i.e., humans). Train AI on that material, and it begins to appear intelligent and creative (even if it really isn’t).
But what will happen as humans become more reliant on AI, and more and more of the Internet becomes populated with AI-generated material? If AI continues to train on whatever it finds on the Internet, but the web is increasingly an AI-generated landscape, then AI will end up training on itself. We know what happens when AIs train on themselves rather than the products of real intelligent humans — and it isn’t pretty. This is model collapse.
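The dynamic is easy to demonstrate in miniature. The sketch below is a toy caricature, not a real language model: each “generation” fits a simple statistical model (a mean and a spread) to its training data, then the next generation trains only on the previous model’s output. To mimic the way generative models tend to reproduce common patterns while rare cases drop out, the toy generator emits only “typical” samples near the mean — an assumption of this illustration, chosen to make the tail-loss effect visible. The diversity of the data shrinks generation after generation until almost nothing of the original spread remains.

```python
import random
import statistics

def fit(samples):
    """'Train' a model: estimate the mean and spread (stdev) of the data."""
    return statistics.fmean(samples), statistics.stdev(samples)

def generate_typical(mean, stdev, n, rng):
    """Generate n outputs, but only 'typical' ones within one stdev of the
    mean -- a caricature of a model reproducing common patterns while the
    rare tails of the distribution quietly disappear."""
    out = []
    while len(out) < n:
        x = rng.gauss(mean, stdev)
        if abs(x - mean) <= stdev:
            out.append(x)
    return out

rng = random.Random(0)

# Generation 0: diverse "human-made" data.
data = [rng.gauss(0.0, 1.0) for _ in range(500)]
spread0 = fit(data)[1]

# Each new generation trains only on the previous generation's output.
for gen in range(1, 11):
    mean, stdev = fit(data)
    data = generate_typical(mean, stdev, 500, rng)

print(f"spread of original human data: {spread0:.3f}")
print(f"spread after 10 generations:   {fit(data)[1]:.3f}")
```

Run it and the spread collapses toward zero: each generation’s outputs cluster ever more tightly around the most typical values, until the variety that made the original data useful is gone. Real model collapse, as described in the literature, is more complicated, but the recursive mechanism is the same.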
Enter Robert Marks, Distinguished Professor of Electrical and Computer Engineering at Baylor University. He noted that on the first day of COSM ’23, computer scientist and AI pioneer Stephen Wolfram warned that we’re at the edge of available training data for AI — essentially we’re hitting the limits of what we can feed AI to make it smart. Once AI runs out of training data, what will it do — train itself?
After taking the audience through a brief history of computing and the development of AI, Marks noted that “each jump [in computing ability] was done by humans, not AI. Each jump in AI happened due to human ingenuity.” But when AI runs out of human ingenuity to train on, will it itself hit a limit — i.e., model collapse? As Montañez put it, “After we’ve scraped the web of all human training data” then “it starts to scrape AI-generated data” because “that’s all you have.” That’s when you get model collapse, and we might be getting close to it.
Marks cited a study recently posted to arXiv, “The Curse of Recursion: Training on Generated Data Makes Models Forget,” which shows model collapse in action. An initial “generation” of AI is trained directly on human-created data, and its output generally makes sense. But after multiple generations of AI training on itself, the result is gibberish that’s obsessed with nonsensically colored jackrabbits:
In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-
A similar phenomenon happens with images. Marks showed how an AI trained to creatively make variations of the Mona Lisa painting initially produces some interesting, if disturbing, images. But eventually, as it trains on its own material, you end up not with art but with nonsensical lines and smudges.
The problem of model collapse is not entirely unlike human genetics, where siblings or cousins are warned never to marry because they may carry the same deleterious mutations that, when combined, can yield malformed offspring. Better to marry someone outside your immediate gene pool, since they will likely bring “fresh genetic material” that can combine with yours to produce healthy children.
In a similar way, AI training on itself needs fresh creative material on which to train or else the algorithms will end up feeding off themselves in recursive cycles that degenerate into nonsense. As Popular Mechanics put it recently, AI will end up “eating its own tail.”
To mix metaphors, the threat of model collapse is akin to digital inbreeding, and it guarantees that without humans constantly providing fresh creative material for AIs to train on, AIs are doomed to deteriorate. Their creativity may therefore be limited by the human datasets they’re given, meaning there are basic limits to what AI can do. AI will never surpass humans in fundamental ways, and will always be limited by what it can learn from us.