Crawlers And Agents And Bots, Oh My: Time To Clarify Robots.txt
Perplexity is an up-and-coming AI company with broad ambitions to compete with Google in the search market, using AI as its core technology to provide answers to user queries.
They've been in the news because their news feature repurposed content from an investigative article published on the Forbes website, which severely annoyed the Forbes editorial staff and the broader media community (never a good idea) and led to accusations of willful copyright infringement from Forbes' legal team. Now Wired is reporting that Perplexity's web hosting provider (AWS) is investigating its practices, focused on whether it respects robots.txt, the standard governing the behavior of web crawlers. (Or is it all robots? More on that later.)
We don't know everything about how Perplexity actually works under the hood, and I have no relationship to the company or special knowledge. The facts are still somewhat murky, and as with any dispute over the ethics or legality of digital copying, the technical details will matter. I worked on copyright policy for years at Google, and have seen this pattern play out enough times not to pass judgment too quickly.
Based on what we know today from press reports, it seems plausible to me that the fundamental issue at root here, i.e. what is driving Perplexity to dig in its heels, and what much of the reporting seems to cite as Perplexity's fundamental ethical failing, is what counts as a "crawler" for the purposes of robots.txt.
This is an ambiguity that will likely need to be addressed in years to come regardless of Perplexity's practices, so it seems worth unpacking a little bit. (In fact, similar questions are floating around Quora's chatbot Poe.)
Why do I think this is the core issue? This snippet from today's Wired article was instructive (Platnick is a Perplexity spokesperson):
"When a user prompts with a specific URL, that doesn't trigger crawling behavior," Platnick says. "The agent acts on the user's behalf to retrieve the URL. It works the same way as if the user went to a page themselves, copied the text of the article, and then pasted it into the system."
This description of Perplexity's functionality confirms WIRED's findings that its chatbot is ignoring robots.txt in certain instances.
The phrase "ignoring robots.txt in certain instances" sounds bad. There is, of course, the ethical conversation about what Perplexity is doing with news content, which is likely to be an ongoing and vigorous debate. The claim is that Perplexity is ignoring the wishes of news publishers, as expressed in robots.txt.
But we tend to codify norms and ethics into rules, and a reasonable question is: What does the robots.txt standard have to say? When is a technical system expected to comply with it, or ignore it? Could this be rooted in different interpretations of the standard?
First, a very quick history of robots.txt: In the early 90s, it was a lot more expensive to run a web server, and servers tended to be very prone to breaking under high loads. As companies began to crawl the web to build things like search engines (which requires accessing a large share of a website's pages), stuff started to break, and the blessed nerds who kept the web working came up with an informal standard in the mid 90s that allowed webmasters to put up road signs to direct crawlers away from certain areas. Most crawlers respected this relatively informal arrangement, and still do.
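For readers who haven't seen one, here is what those road signs look like in practice: a minimal, hypothetical robots.txt file (the paths and user-agent names are illustrative, not taken from any real site), served from the root of a domain, which compliant crawlers read before fetching anything else.

```
# Served from https://example.com/robots.txt (hypothetical)

# Rules for any crawler not matched by a more specific group:
User-agent: *
Disallow: /drafts/

# Ask one specific crawler to stay away entirely:
User-agent: ExampleBot
Disallow: /
```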
Thus, "crawlers" has for decades been understood to refer to systems that access URLs in bulk, systems that pick which URLs to access next based on a predetermined method written in code (presumably why it's described as "crawling"). And the motivating issue was mainly a coordination problem: how to enable useful services like search engines, that are good for everyone including web publishers, without breaking things.
It took nearly three decades, but robots.txt was eventually codified and adopted as the Robots Exclusion Protocol, or RFC 9309, by the Internet Engineering Task Force (IETF), part of the aforementioned blessed nerd community who maintain the technical standards of the internet.
RFC 9309 does not define "crawler" or "robot" in the way a lawyer might expect a contract or statute to define a term. It says simply that crawlers are "automatic clients," with the rest left up to context clues. Most of those context clues refer to issues posed by bulk access of URIs:
It may be inconvenient for service owners if crawlers visit the entirety of their URI space. This document specifies the rules [...] that crawlers are requested to honor when accessing URIs.
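In practice, honoring the protocol is mechanically simple for a traditional crawler. Here's a minimal sketch using Python's standard library (the user-agent string and URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt before crawling anything else.
robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

# Check each candidate URL against the rules for our user-agent.
if robots.can_fetch("ExampleBot", "https://example.com/articles/some-story"):
    print("Allowed to fetch")
else:
    print("Disallowed; skip this URL")
```

The hard question was never the mechanics; it's who is expected to run this check at all.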
Every year the web's social footprint expands, and the pressures on robots.txt grow with it. It has begun to absorb a broader set of challenges, beyond protecting webmasters from the technical inconveniences of bulk access: it now increasingly arbitrates massive economic interests, as well as the social and ethical questions AI has inspired in recent years. Google, whose staff are the listed authors of RFC 9309, has already started thinking about what's next.
And the technology landscape is shifting. Automated systems are accessing web content with a broader set of underlying intentions. We're seeing the emergence of AI agents that actually do things on behalf of users and at their direction, intermediated by AI companies using large language models. As OpenAI says, AI agents "may substantially expand the helpful uses of AI systems, and introduce a range of new technical and social challenges."
Automatic clients will continue to access web content. The user-agent might even reasonably have "Bot" in the name. But is it a crawler? It won't access content for the same purpose as a search engine crawler, nor at the scale and depth search requires. The ethical, economic, technical, and legal landscape for automatic AI agents will look completely different than for crawlers.
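To make the distinction concrete, here is a hypothetical sketch of the two access patterns. The function names and details are mine, for illustration only, and aren't drawn from Perplexity or any other company's actual system:

```python
import urllib.request
from collections import deque

def crawl(seed_url, fetch, extract_links, limit=10_000):
    """Bulk access: the program itself decides which URLs come next."""
    frontier, seen = deque([seed_url]), set()
    while frontier and len(seen) < limit:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)
        # A predetermined method picks what to visit next.
        frontier.extend(extract_links(page))
    return seen

def agent_fetch(url, user_agent="ExampleAgent"):
    """On-demand access: one URL, retrieved because a user asked for it."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()
```

The first pattern is what robots.txt was written for; the second is what Platnick describes. Whether the second should consult robots.txt is exactly the open question.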
It may very well be sensible to expand RFC 9309 to apply to things like AI agents directed by users, or any method of automated access of web content where the user-agent isn't directly a user's browser. We would then need to think through the cascading implications of applying the robots.txt standard and its requirements to those systems. Or maybe we need a new set of norms and rules to govern that activity, separate from RFC 9309.
Either way, disputes like this are an opportunity to consider improving and updating the rules and standards that guide actors on the web. To the extent this disagreement really is about the interpretation of "crawler" in RFC 9309, i.e. what counts as a robot or crawler and therefore what must respect listed disallows in the robots.txt file, that seems like a reasonable place to start thinking about solutions.
Alex Kozak is a tech policy consultant with Proteus Strategies, formerly gov't affairs and regulatory strategy at Google X, global copyright policy lead at Google, and open licensing advocate at Creative Commons.