Chatbots are running out of training data
Chatbots (large language models, or LLMs) like ChatGPT get their chatter by ranging rapidly across the internet, scarfing up vast amounts of unattributed or copyrighted material in order to produce a plausible response (or maybe not). However, science writer Nicola Jones reports at Nature that the internet is not unlimited, and that fact is becoming a foreseeable problem for the programmers:
A prominent study made headlines this year by putting a number on this problem: researchers at Epoch AI, a virtual research institute, projected that, by around 2028, the typical size of data set used to train an AI model will reach the same size as the total estimated stock of public online text. In other words, AI is likely to run out of training data in about four years’ time (see ‘Running out of data’). At the same time, data owners — such as newspaper publishers — are starting to crack down on how their content can be used, tightening access even more. That’s causing a crisis in the size of the ‘data commons’, says Shayne Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge who leads the Data Provenance Initiative, a grass-roots organization that conducts audits of AI data sets.
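Epoch AI's projection is, at bottom, a simple crossover calculation: training-set sizes grow geometrically while the stock of public text stays roughly fixed. The sketch below illustrates the shape of the argument; the specific numbers (token counts, growth factor) are illustrative assumptions, not Epoch AI's actual estimates:

```python
# Toy crossover calculation: geometric growth of training-set size
# against a roughly fixed stock of public online text.
# All figures below are illustrative assumptions, not Epoch AI's estimates.

stock_tokens = 3e14       # assumed total stock of public online text
dataset_tokens = 1.5e13   # assumed typical training-set size in 2024
growth_per_year = 2.5     # assumed annual growth factor of training sets

year = 2024
while dataset_tokens < stock_tokens:
    year += 1
    dataset_tokens *= growth_per_year

print(f"Under these assumptions, training sets match the stock around {year}")
```

With these stand-in numbers the crossover lands around 2028, which is the kind of back-of-the-envelope result behind the "four years' time" claim; different assumed growth rates shift the date by only a year or two either way, because geometric growth overwhelms a fixed stock quickly.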
Nicola Jones, "The AI revolution is running out of data. What can researchers do?", Nature, December 11, 2024. The paper is open access.
So what happens when they run out of data?
The researchers say they have workarounds that include creating new data and searching out unconventional data sources. But how well will that work? It’s not just new “data” that is needed but new information. A quantifiable amount of human input goes into producing new information. And unconventional data sources may be unconventional because few are interested in the contents.
Besides, the publishers whose information has been snaffled by chatbots without attribution are beginning to fight back:
At the same time, content providers are increasingly including software code or refining their terms of use to block web crawlers or AI companies from scraping their data for training. Longpre and his colleagues released a preprint this July showing a sharp increase in how many data providers block specific crawlers from accessing their websites. In the highest-quality, most-often-used web content across three main cleaned data sets, the number of tokens restricted from crawlers rose from less than 3% in 2023 to 20–33% in 2024. "Running out of data."
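The "software code" in question is mostly robots.txt directives that name specific AI crawlers and deny them everything. A minimal sketch using Python's standard-library robots.txt parser; the crawler tokens (GPTBot, CCBot) are real user-agent names that publishers commonly block, but the robots.txt content and URL here are hypothetical:

```python
# Check which crawlers a hypothetical publisher's robots.txt blocks,
# using the standard-library robots.txt parser.
from urllib.robotparser import RobotFileParser

# A robots.txt of the kind publishers now deploy: it denies OpenAI's
# GPTBot and Common Crawl's CCBot everything, while allowing other bots.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ("GPTBot", "CCBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/article.html")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Note that robots.txt is only a request, not an enforcement mechanism: it blocks crawlers that choose to honor it, which is part of why publishers are also tightening terms of use and filing suit.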
This is to say nothing of lawsuits for copyright infringement.
One strategy under consideration is to refine the models so that they do more with less data. But some believe that “advances might soon come through self-reflection by an AI. ‘Now it’s got a foundational knowledge base, that’s probably greater than any single person could have,’ says [AI security researcher Andy] Zou, meaning it just needs to sit and think. ‘I think we’re probably pretty close to that point.’”
Sit and think? Hope springs eternal.