Reddit Introduces Update to Protect Its Platform from Web Scrapers

Krishi Chowdhary

from Techreport on 2024-06-28 13:54 (#6NVSK)

Reddit has announced that it will update its "Robots Exclusion Protocol" file, or robots.txt, to prevent unknown bots and crawlers from scraping its content.
Unauthorized scraping not only violates Reddit's data policies but also slows the site down for other users, which is why this step matters.
The update will not affect regular Reddit users, companies that have a deal with Reddit to use its data, or researchers who are scraping in good faith.

On Tuesday (June 25), Reddit announced that it will introduce new safety features to prevent unauthorized parties from scraping its site for content.

With the recent boom in the AI industry, there has been a huge increase in demand for content for training purposes. A lot of companies have turned to public sites like Reddit to scrape content.

However, web scraping, which is the process of using tools or bots to extract content from a website, not only violates Reddit's data handling policy but has also made the site slower for other users.

To address this issue, Reddit earlier published a new "Public Content Policy" that outlined how its data should be used, both by researchers and AI companies. However, since that doesn't seem to have been effective enough, it has now turned to technical measures to keep freeloaders away.

What Exactly Is the New Reddit Feature?

Reddit's new feature is basically an update to an existing technology called the "Robots Exclusion Protocol," or robots.txt. Simply put, it's a file containing instructions that tell web crawlers which parts of a site they are allowed to access.

The update will block unknown bots and crawlers, including those that don't have an agreement with Reddit about using its data. Companies with such agreements are unaffected. For example, both OpenAI and Google have agreements with the platform to use its content for AI training.
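
To make the idea concrete, here is a minimal Python sketch of how a well-behaved crawler might consult a robots.txt file that allows a named partner and turns everyone else away. It uses the standard library's urllib.robotparser. The rules, the agent names PartnerBot and UnknownScraper, and the example.com URL are hypothetical and are not taken from Reddit's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Reddit's real robots.txt.
# One named partner agent may crawl everything, every other agent nothing.
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: PartnerBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(user_agent: str, url: str,
               robots_txt: str = HYPOTHETICAL_ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/r/python/"
    for agent in ("PartnerBot", "UnknownScraper"):
        verdict = "allowed" if is_allowed(agent, page) else "blocked"
        print(f"{agent}: {verdict}")
```

Note that this check runs on the crawler's side. The file only states the site's wishes, so it works only for crawlers that choose to read and honor it.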

It's worth noting that regular Reddit users will not be affected by this update. The company also added that "good faith actors," who are scraping for genuine research purposes, will not be hindered, either.

Why Is Reddit's Concern Valid?

Some might feel that Reddit is overreacting to the situation. However, the company has strong reasons to be concerned.

For starters, this update comes just a few days after an investigation revealed that popular AI company Perplexity is stealing content from public sites. It even ignores the do-not-scrape requests in the robots.txt file, and its CEO, Aravind Srinivas, said that since robots.txt is not a legal framework, there's no harm in ignoring it.

Secondly, when AI firms steal content through scraping, they degrade the experience of other users by slowing down the site, and Reddit doesn't even get anything in return.

On the other hand, when recognized partners like Google and OpenAI use content from Reddit, their users get access to AI features and the company gets paid for its work.

So, after a fair assessment of the issue, I believe Reddit has all the reasons (and right) to try and protect its platform from unwanted scrapers.

The post Reddit Introduces Update to Protect Its Platform from Web Scrapers appeared first on The Tech Report.

Source	RSS or Atom Feed
Feed Location	https://techreport.com/feed/
Feed Title	Techreport
Feed Link	https://techreport.com/