AI is Scraping the Web, but the Web is Fighting Back

by janrinok from SoylentNews (#6YJNJ)

upstart writes:

AI Is Scraping the Web, but the Web Is Fighting Back:

AI is not magic. The tools that generate essays or hyper-realistic videos from simple user prompts can only do so because they have been trained on massive data sets. That data, of course, needs to come from somewhere, and that somewhere is often the stuff on the internet that's been made and written by people.

The internet happens to be quite a large source of data and information. As of last year, the web contained 149 zettabytes of data. That's 149 million petabytes, or 149 billion terabytes, or 149 trillion gigabytes, otherwise known as a lot. Such a collection of textual, image, visual, and audio-based data is irresistible to AI companies that need more data than ever to keep growing and improving their models.
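The unit arithmetic is easy to check by hand or with a few lines of code. The sketch below assumes decimal (SI) prefixes, where each step up is a factor of 1,000; the table name SI_BYTES and the helper convert are illustrative and not from the article. Under that assumption, 149 zettabytes works out to 149 million petabytes, 149 billion terabytes, and 149 trillion gigabytes.

```python
# Assumption: decimal (SI) prefixes, so 1 zettabyte = 10**21 bytes.
SI_BYTES = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "zettabyte": 10**21,
}


def convert(amount, from_unit, to_unit):
    """Convert a whole-number amount from a larger unit to a smaller one.

    Integer arithmetic keeps the result exact; going from a smaller unit
    to a larger one would floor the result, so this sketch avoids that.
    """
    return amount * SI_BYTES[from_unit] // SI_BYTES[to_unit]


web_zb = 149  # figure quoted in the article
for unit in ("petabyte", "terabyte", "gigabyte"):
    print(f"{web_zb} ZB = {convert(web_zb, 'zettabyte', unit):,} {unit}s")
```

If binary units (zebibytes, tebibytes) were meant instead, the factors would be powers of 1,024 and the round numbers would not come out as cleanly.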

So, AI bots scrape the World Wide Web, hoovering up any and all data they can to better their neural networks. Some companies, including Reddit, the Associated Press, and Vox Media, saw the business potential and inked deals to sell their data to AI companies. AI companies don't necessarily ask permission before scraping data across the internet, however, and many companies have taken the opposite approach, launching lawsuits against the likes of OpenAI, Google, and Anthropic. (Disclosure: Lifehacker's parent company, Ziff Davis, filed a lawsuit against OpenAI in April, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

Those lawsuits probably aren't slowing down the AI vacuum machines. In fact, the machines are in desperate need of more data: Last year, researchers found that AI models were running out of the data necessary to continue at the current rate of growth. Some projections saw the runway giving out sometime in 2028, which, if true, leaves AI companies only a few years to scrape the web for data. While they'll look to other data sources, like official deals or synthetic data (data produced by AI), they need the internet more than ever.

If you have any presence on the internet whatsoever, there's a good chance your data was sucked up by these AI bots. It's scummy, but it's also what powers the chatbots so many of us have started using over the past two and a half years.

But just because the situation is a bit dire for the internet at large, that doesn't mean it's giving up entirely. On the contrary, there is real opposition to this type of practice, especially when it goes after the little guy.

Read more of this story at SoylentNews.
