Anti-Piracy Group Takes AI Training Dataset 'Books3' Offline

by hubie from SoylentNews (#6E4TC)

Arthur T Knackerbracket writes:

Arthur T Knackerbracket has processed the following story:

One of the most prominent pirated book repositories used for training AI, Books3, has been kicked out from the online nest it had been roosting in for nearly three years. Rights-holders have been at war with online pirates for decades, but artificial intelligence is like oil seeping into copyright law's water. The two simply do not mix, and the fumes rising from the surface just need a spark to set the entire concept of intellectual property rights alight.

As first reported by TorrentFreak, the large pirate repository The Eye took down the Books3 dataset after the Danish anti-piracy group Rights Alliance sent the site a DMCA takedown. Now trying to access that dataset gives a 404 error. The Eye still hosts other training data for AI, but the portion allotted for books has vanished.

[...] The nonprofit research group EleutherAI originally released Books3 as part of the AI training set The Pile, an 800 GB open source chunk of training data comprising 22 other datasets specifically designed for training language models. Rights Alliance said the organization "denied responsibility" for Books3. Gizmodo reached out to EleutherAI for comment, but we did not receive a response.

The Eye claims it regularly complies with all valid DMCA requests, though that dataset was originally uploaded by AI developer and prominent open source AI proponent Shawn Presser back in 2020. His stated goal at the time was to open up AI development beyond companies like OpenAI, which trained its earlier large language models on the still-unknown "Books1" and "Books2" repositories. The Books3 repository contained 196,640 books, all in plain .txt format, and was supposed to give fledgling AI projects a leg up against the likes of ChatGPT-maker OpenAI.

Over Twitter DM, Presser called the attack on Books3 a travesty for open source AI. While other major companies and VC-funded startups get away with including copyrighted data in their training data, grassroots projects need something to compete, and that's what Books3 was for.
