Is AI Training on Libraries of Pirated Books?
The New York Times points out that so-called "shadow libraries," like Library Genesis, Z-Library or Bibliotik, "are obscure repositories storing millions of titles, in many cases without permission - and are often used as A.I. training data."A.I. companies have acknowledged in research papers that they rely on shadow libraries. OpenAI's GPT-1 was trained on BookCorpus, which has over 7,000 unpublished titles scraped from the self-publishing platform Smashwords. To train GPT-3, OpenAI said that about 16 percent of the data it used came from two "internet-based books corpora" that it called "Books1" and "Books2." According to a lawsuit by the comedian Sarah Silverman and two other authors against OpenAI, Books2 is most likely a "flagrantly illegal" shadow library. These sites have been under scrutiny for some time. The Authors Guild, which organized the authors' open letter to tech executives, cited studies in 2016 and 2017 that suggested text piracy depressed legitimate book sales by as much as 14 percent. Efforts to shut down these sites have floundered. Last year, the F.B.I., with help from the Authors Guild, charged two people accused of running Z-Library with copyright infringement, fraud and money laundering. But afterward, some of these sites were moved to the dark web and torrent sites, making it harder to trace them. And because many of these sites are run outside the United States and anonymously, actually punishing the operators is a tall task. Tech companies are becoming more tight-lipped about the data used to train their systems.
Read more of this story at Slashdot.