Tech writer: Chatbots are trained on a trove of pirated books
At The Atlantic, programmer and tech writer Alex Reisner pulls back the curtain on the massive copyright violations that may be required to train AI chatbots (large language models or LLMs). A current court case provided some insights from employees at Meta (Facebook’s parent company):
Court documents released last night show that the senior manager felt it was “really important for [Meta] to get books ASAP,” as “books are actually more important than web data.” Meta employees turned their attention to Library Genesis, or LibGen, one of the largest of the pirated libraries that circulate online. It currently contains more than 7.5 million books and 81 million research papers. Eventually, the team at Meta got permission from “MZ”—an apparent reference to Meta CEO Mark Zuckerberg—to download and use the data set.
This act, along with other information outlined and quoted here, recently became a matter of public record when some of Meta’s internal communications were unsealed as part of a copyright-infringement lawsuit brought against the company by Sarah Silverman, Junot Díaz, and other authors of books in LibGen.
“The Unbelievable Scale of AI’s Pirated-Books Problem,” March 20, 2025

As with any court case, accusations swirl back and forth. But the basic problem is simple: Because much more research and editing usually goes into books than into, say, online rants, the information gained from books is of higher quality and thus much more desirable as training data. But authors and their publishers understandably want money for their work. And the AI developers want to avoid paying them. Thus, a book piracy website is certainly a temptation.
The court must decide whether Meta succumbed in these specific cases. Meanwhile, Mark Zuckerberg admits that the Meta AI assistant is embedded in Facebook, Instagram, and WhatsApp, used by hundreds of millions of people.
Book piracy sites
Piracy sites, as Reisner reports, are often developed and domiciled in various jurisdictions across the globe. As a result, he tells us, prominent publishers have won high-dollar judgments against pirates in U.S. courts, but the fines have gone unpaid.
Worse, he writes,
… generative-AI chatbots are presented as oracles that have “learned” from their training data and often don’t cite sources (or cite imaginary sources). This decontextualizes knowledge, prevents humans from collaborating, and makes it harder for writers and researchers to build a reputation and engage in healthy intellectual debate. Generative-AI companies say that their chatbots will themselves make scientific advancements, but those claims are purely hypothetical.
“AI’s Pirated-Books Problem”
It’s nice to hear someone in tech admit that such claims are purely hypothetical…
Reisner asks, “Will these be better for society than the human dialogue they are already starting to replace?” We probably all know the answer to that one. And if copyright proves unsustainable in the age of chatbots, there will be much less incentive for humans to produce the creative works that the AI depends on.