OpenAI Desperate to Avoid Explaining Why It Deleted Pirated Book Datasets
upstart writes:
OpenAI desperate to avoid explaining why it deleted pirated book datasets:
OpenAI may soon be forced to explain why it deleted a pair of controversial datasets composed of pirated books, and the stakes could not be higher.
At the heart of a class-action lawsuit from authors alleging that ChatGPT was illegally trained on their works, OpenAI's decision to delete the datasets could end up being the deciding factor that hands the authors the win.
It's undisputed that OpenAI deleted the datasets, known as "Books 1" and "Books 2," prior to ChatGPT's release in 2022. Created by former OpenAI employees in 2021, the datasets were built by scraping the open web, with the bulk of the data drawn from a shadow library called Library Genesis (LibGen).
As OpenAI tells it, the datasets fell out of use within that same year, prompting an internal decision to delete them.
But the authors suspect there's more to the story than that. They noted that OpenAI appeared to flip-flop by retracting its claim that the datasets' "non-use" was a reason for deletion, then later claiming that all reasons for deletion, including "non-use," should be shielded under attorney-client privilege.
To the authors, it seemed like OpenAI was quickly backtracking after the court granted the authors' discovery requests to review OpenAI's internal messages about the datasets' "non-use."
In fact, OpenAI's reversal only made the authors more eager to see how OpenAI discussed "non-use," and now they may get to find out all the reasons why OpenAI deleted the datasets.
Last week, US magistrate judge Ona Wang ordered OpenAI to share all communications with in-house lawyers about deleting the datasets, as well as "all internal references to LibGen that OpenAI has redacted or withheld on the basis of attorney-client privilege."
According to Wang, OpenAI slipped up by arguing that "non-use" was not a "reason" for deleting the datasets while simultaneously claiming that it was a "reason" covered by attorney-client privilege.
Either way, the judge ruled that OpenAI couldn't block discovery on "non-use" just by deleting a few words from prior filings that had been on the docket for more than a year.
"OpenAI has gone back-and-forth on whether 'non-use' as a 'reason' for the deletion of Books1 and Books2 is privileged at all," Wang wrote. "OpenAI cannot state a 'reason' (which implies it is not privileged) and then later assert that the 'reason' is privileged to avoid discovery."
Additionally, OpenAI's claim that all reasons for deleting the datasets are privileged "strains credulity," she concluded, ordering OpenAI to produce a wide range of potentially revealing internal messages by December 8. OpenAI must also make its in-house lawyers available for deposition by December 19.
OpenAI has argued that it never flip-flopped or retracted anything. It simply used vague phrasing that led to confusion over whether any of the reasons for deleting the datasets were considered non-privileged. But Wang didn't buy into that, concluding that "even if a 'reason' like 'non-use' could be privileged, OpenAI has waived privilege by making a moving target of its privilege assertions."
Asked for comment, OpenAI told Ars that "we disagree with the ruling and intend to appeal."
Read more of this story at SoylentNews.