An AI Scraping Tool Is Overwhelming Websites With Traffic

BeauHD

from Slashdot on 2023-04-26 00:02 (#6B5CY)

An anonymous reader quotes a report from Motherboard: The creator of a tool that scrapes the internet for images in order to power artificial intelligence image generators like Stable Diffusion is telling website owners who want him to stop that they have to actively opt out, and that it's "sad" that they are fighting the inevitable rise of AI. "It is sad that several of you are not understanding the potential of AI and open AI and as a consequence have decided to fight it," Romain Beaumont, the creator of the image scraping tool img2dataset, said on its GitHub page. "You will have many opportunities in the years to come to benefit from AI. I hope you see that sooner rather than later. As creators you have even more opportunities to benefit from it."

Img2dataset is a free tool Beaumont shared on GitHub that allows users to automatically download and resize images from a list of URLs. The result is an image dataset, the kind used to train image-generating AI models like OpenAI's DALL-E, the open source Stable Diffusion model, and Google's Imagen. Beaumont is also an open source contributor to LAION-5B, one of the largest image datasets in the world, which contains more than 5 billion images and is used by Imagen and Stable Diffusion.

Img2dataset will attempt to scrape images from any site unless site owners add HTTP headers like "X-Robots-Tag: noai" and "X-Robots-Tag: noindex." That means the onus is on site owners, many of whom probably don't even know img2dataset exists, to opt out of img2dataset rather than opt in.

Beaumont defended img2dataset by comparing it to the way Google indexes all websites online in order to power its search engine, which benefits anyone who wants to search the internet. "I directly benefit from search engines as they drive useful traffic to me," Eden told Motherboard. "But, more importantly, Google's bot is respectful and doesn't hammer my site. And most bots respect the robots.txt directive. Romain's tool doesn't. It seems to be deliberately set up to ignore the directives website owners have in place. And, frankly, it doesn't bring any direct benefit to me."

Motherboard notes: "A 'robots.txt' file tells search engine crawlers like Google which part of a site the crawler can access in order to prevent it from overloading the site with requests."
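
For readers curious what the two opt-out signals mentioned above look like in practice, here is a minimal Python sketch of how a well-behaved crawler might check them before fetching an image. It is not img2dataset's code. The function names, the "examplebot" user agent, and the example.com URLs are hypothetical, and treating "noai" and "noindex" as the full set of opt-out directives is an assumption based only on the two headers named in the report. The header check is also simplified: it ignores the user-agent-scoped form of X-Robots-Tag.

```python
from typing import Mapping
from urllib.robotparser import RobotFileParser

# Directive names come from the report; treating exactly these two as
# opt-out signals is an assumption made for this illustration.
OPT_OUT_DIRECTIVES = {"noai", "noindex"}


def header_opts_out(headers: Mapping[str, str]) -> bool:
    """Return True if an X-Robots-Tag response header carries an opt-out directive.

    Simplification: values scoped to one bot (e.g. "googlebot: noindex")
    are not interpreted and are treated as no opt-out.
    """
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() != "x-robots-tag":
            continue
        # A single header value may list several comma-separated directives.
        directives = {part.strip().lower() for part in value.split(",")}
        if directives & OPT_OUT_DIRECTIVES:
            return True
    return False


def robots_txt_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical robots.txt that blocks every crawler from /private/.
    robots = "User-agent: *\nDisallow: /private/\n"
    print(header_opts_out({"X-Robots-Tag": "noai, noimageai"}))
    print(robots_txt_allows(robots, "examplebot", "https://example.com/private/cat.jpg"))
```

The difference between the two checks mirrors the complaint in the report: robots.txt can be read once per site before any image is requested, while an X-Robots-Tag header only becomes visible once the crawler has already made a request to the page or image in question.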
