Reddit’s ‘AI Scraping’ Lawsuit Is An Attack On The Open Internet

by Mike Masnick, from Techdirt

When Reddit sued "data scraper" companies and AI firm Perplexity earlier this week, I assumed it was another predictable skirmish over AI training data, the kind of case we've been tracking as companies try to wall off the open internet and set up toll booths. But reading the actual complaint made it clear this is something far more dangerous: Reddit isn't just going after scrapers. It's mounting a fundamental attack on the very concept of an open internet, using a twisted reading of copyright law that, if it succeeds, would break how search engines, archives, and the web itself operate.

Even if you love Reddit and hate AI, you should be worried about this lawsuit. If it succeeds, it would fundamentally close off most of the open internet.

Most reporting on this is not actually explaining the nuances, which require a deeper understanding of the law, but fundamentally, Reddit is NOT arguing that these companies are illegally scraping Reddit, but rather that they are illegally scraping... Google (which is not a party to the lawsuit) and in doing so violating the DMCA's anti-circumvention clause, over content in which Reddit holds no copyright. And, then, Perplexity is effectively being sued for linking to Reddit.

This is... bonkers on so many levels. And, incredibly, within their lawsuit, Reddit defends its arguments by claiming it's filing this lawsuit to protect the open internet. It is not. It is doing the exact opposite.

The Background

It is totally reasonable to be concerned about the burden that data scrapers put on websites, and to talk about ways to deal with them. But that's not what this lawsuit really is. It's mostly focused on some companies that effectively have built unofficial APIs for getting search results data out of Google. That can be quite useful in some cases! But also, some of the companies in this space can be fairly sketchy. Reddit leans heavily on the sketchiness of the companies to imply they're "bad."

But an open web must mean a programmable web of some sort. Building on other services is a fundamental part of the open web and has always been there. If the building becomes abusive, then there are often technical ways of dealing with it. But here, the "abuse" seems to be that Reddit signed a $60 million scraping deal with Google, which was already kinda sketchy.

After all, Reddit has a license to the content users post in order to operate the service, but they don't hold the copyright on it. Indeed, Reddit's terms state clearly that users "retain any ownership rights you have in Your content." Because of Reddit's agreement that it can license content, the deal with Google could sorta squeeze under that term, but that doesn't give Reddit the right to then sue over users' copyrights (as it's doing in this case).

Either way, there's an indication that Reddit has gotten greedy. It's apparently reopened negotiations with Google recently, seeking more money and traffic. But it also wants money from other AI providers. Apparently, that includes Perplexity, which is a pretty useful AI "answer engine" that lets users select from a variety of underlying LLMs. (Perplexity has released its own LLMs, but they were modifications of open source LLMs, including Llama from Meta and Mistral, a popular open source LLM from France. Thus, while Perplexity has offered its own models, it didn't train them itself.)

Because Perplexity is much more focused on being an alternative to a search engine than a traditional "chat bot," its focus in answering your questions is to actually provide links as sources for the answers it gives. In effect, it combines a traditional search engine with an LLM, and it did this before many other chatbot LLMs added web search capabilities (though most now have them).

But that means, if an "answer" to a question from a user comes from a Reddit post, Perplexity is likely to link to it, just like a regular search engine. But Reddit wants to get paid. And because Reddit has become so closed and persnickety about things, it looks like Perplexity may have chosen to use these other data scraping firms' unofficial Google search results APIs to find Reddit posts and link to them.

This is... how the open internet is supposed to work, actually. But Reddit presents it as a sneaky "circumvention."

Recognizing that Reddit denies scrapers like them access to its site, Defendants SerpApi, Oxylabs, and AWMProxy scrape the data from Google's search results instead. They do so by masking their identities, hiding their locations, and disguising their web scrapers as regular people (among other techniques) to circumvent or bypass the security restrictions meant to stop them. For example, during a two-week span in July 2025, Defendants SerpApi, Oxylabs, and AWMProxy circumvented Google's technological control measures and automatedly accessed, without authorization, almost three billion search engine results pages ("SERPs") containing Reddit text, URLs, images, and videos.

That's Not How Circumvention Works

So you might notice something weird in the paragraph above. Namely, the claim that the API/scraping companies "circumvented Google's technological control measures."

The phrase "technological control measures" (TCMs) should set off alarm bells for copyright nerds. It's part of Section 1201 of the DMCA, or the "anti-circumvention" provision. We've talked about it for ages, how it's widely abused, how it threatens innovation, and how it should be abolished.

The fundamental issue is that it says any attempt to circumvent a "technological measure" that tries to protect a copyright-protected work is, itself, copyright infringement. And that's even if the goal of the circumvention is not even to infringe on the underlying copyright at all. That's why we've seen attempts by companies to use 1201 to, say, block people from using cheaper ink jet cartridges, or getting a cheaper garage door opener. Neither of those sounds like a copyright issue (because they're not), but companies tried to abuse 1201 by claiming they put "technological control measures" on those devices, and any "circumvention" should then be seen as infringement.

But here, Reddit is doing something even crazier. Because it's saying that since these companies (allegedly) get around Google's technological measures, then somehow Reddit can accuse them of violating 1201.

Reddit and Google have implemented technological measures that effectively control access to Reddit content. Both companies use advanced technological techniques, as described above, to control unauthorized, automated access to their server systems. These measures, in the ordinary course of their operation, limit the freedom and ability of users to access Reddit content, including by prohibiting automated entities from accessing search engine result pages and scraping search engine results that include Reddit content.

Defendants' actions violate 17 U.S.C. § 1201(a)(1)(A), under which no person shall circumvent a technological measure that effectively controls access to a copyrighted work. Defendants have circumvented these measures in one or more ways, including:

a. Avoiding or bypassing Reddit's measures entirely in order to obtain Reddit's content and services, and the content authored by its users, that appear in Google search results; and

b. Avoiding, removing, deactivating, impairing, and/or bypassing SearchGuard and Google's other technological control measures by using devices, systems, processes, and/or protocols, including large-volume proxy networks, to improperly gain access to Google Search results.

Let's break this down, because we have to look at how crazy this is.

  1. They're saying that these companies are "avoiding or bypassing" Reddit's TCMs. But the way they're doing that is by not scraping Reddit. You cannot claim that it is "circumventing a TCM" to get the same content... from Google. That's crazy.
  2. Even crazier is that they're arguing that the defendants are circumventing Google's TCM, even though Google isn't even a party.
  3. They're making this claim over content that Reddit holds no copyright over. The copyright remains with the original creator. Reddit holds a license, but a license does not grant Reddit the right to sue over that copyright.

Each one of these ideas is crazy. All three of them together is ludicrous. Reddit is claiming that these companies violated copyright law by (1) avoiding Reddit and (2) getting the content from publicly available Google searches over (3) content that Reddit has no copyright over.

And somehow that's supposed to be copyright infringement.

This Is Not Protecting the Open Internet

Even more obnoxiously, Reddit crowns itself a protector of the open internet with this nonsense:

Because Reddit has always believed in the open internet, it takes its role as a steward of its users' communities, discussions, and authentic human discourse seriously.

Elsewhere in the lawsuit, it says:

As articulated in its Public Content Policy, Reddit believes in an open internet, but it "do[es] not believe that third parties have a right to misuse public content just because it's public."

If that's the case, then... you don't believe in an open internet. Text and data mining is a part of the open internet. Building on the work of others is part of the open internet. You can't just claim "we support the open internet, but not if we say you're misusing it." It's not your call.

Yes, there are copyright restrictions on what you can do with others' content, but (again) Reddit has no copyright interest here. And it can't even legitimately claim a "circumvention" of a TCM just because these companies got the same data elsewhere.

This Isn't Even About Training

Some people will still insist this is bad because they hate all AI training based on scraping, but that's not even what's happening here. We discussed this a bit in our last piece on cutting off the open internet. It's one thing to argue that you want to block your content from being trained upon, but it's a wholly different thing to say "you can't retrieve this page based on a user search." That latter scenario is the basis of how search engines exist online, which are fundamental to an open web.

But, as Perplexity notes in its response to the lawsuit (ironically, in the Perplexity subreddit on Reddit), that's exactly what Reddit is looking to block:

What does Perplexity actually do with Reddit content? We summarize Reddit discussions, and we cite Reddit threads in answers, just like people share links to posts here all the time. Perplexity invented citations in AI for two reasons: so that you can verify the accuracy of the AI-generated answers, and so you can follow the citation to learn more and expand your journey of curiosity.

And that's what people use Perplexity for: journeys of curiosity and learning. When they visit Reddit to read your content it's because they want to read it, and they read more than they would have from a Google search.

The company also notes that Reddit demanded Perplexity license its data, but Perplexity explained to them (as mentioned above) that they don't train their own LLM so they don't need to license data for training.

Here's where we push back. Reddit told the press we ignored them when they asked about licensing. Untrue. Whenever anyone asks us about content licensing, we explain that Perplexity, as an application-layer company, does not train AI models on content. Never has. So it is impossible for us to sign a license agreement to do so.

A year ago, after explaining this, Reddit insisted we pay anyway, despite lawfully accessing Reddit data. Bowing to strong arm tactics just isn't how we do business.

For what it's worth, Perplexity also claims that this is part of Reddit's plan to "extort" more money from Google.

This is an Anti-Open Internet Lawsuit

If this lawsuit succeeds, it would signal a huge destruction of the open internet. It would fundamentally make it impossible for search engines to work without licensing all content. It would, in effect, close off huge parts of the open internet to only those with the largest wallets.

Beyond that, it would extend our understanding of Section 1201's anti-circumvention provisions to absurdity. Saying that not scraping your site is circumvention? Crazy. Saying that (allegedly) "bypassing" someone else's technological measures lets you sue? Absurd. And saying that you can do all that over content you don't even hold the copyright on? Preposterously stupid.

If this lawsuit succeeds, it would open up a cottage industry of frivolous lawsuits, while greatly diminishing the nature of the open web.

I've long considered Reddit one of the "good" examples of how narrow, more focused communities can operate. On our latest Ctrl-Alt-Speech, we talked about how it's one of the examples of the "good" parts of the internet. I know and respect many people at Reddit, including on their legal team.

But I just don't get this lawsuit. It seems massively destructive to the open internet in what appears to be a very misguided and mis-targeted attempt to shake down extra licensing revenue. There are better ways to do this, and I hope that Reddit reconsiders its approach.
