Abusive AI Crawlers Run Up Large Bandwidth Bills for their Targets
canopic jug writes:
A growing number of sites report losing bandwidth to AI crawlers. The documentation hosting site Read the Docs has published an analysis of the abuse it has seen from AI crawlers, including several examples.
We have been seeing a number of bad crawlers over the past few months, but here are a couple illustrative examples of the abuse we're seeing:
73 TB in May 2024 from one crawler
One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler. We emailed this company, reporting a bug in their crawler, and we're working with them on reimbursing us for the costs.
[...] This was a bug in their crawler that was causing it to download the same files over and over again. There was no bandwidth limiting in place, or support for Etags and Last-Modified headers which would have allowed the crawler to only download files that had changed. We have reported this issue to them, and hopefully the issue will be fixed.
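To illustrate the ETag and Last-Modified support mentioned above, here is a minimal Python sketch of a conditional HTTP request using only the standard library. It is not the crawler's code or anything from Read the Docs; the function names and the in-memory `cache` dictionary are hypothetical, chosen only for illustration. The idea is that a client saves the validators from an earlier response and sends them back as If-None-Match and If-Modified-Since, so the server can answer 304 Not Modified with no body instead of sending the file again.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build revalidation headers from validators saved on an earlier fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, cache):
    """Return the body for url, re-downloading only if the server says it changed.

    `cache` is a hypothetical in-memory dict: url -> (etag, last_modified, body).
    """
    etag, last_modified, body = cache.get(url, (None, None, None))
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()
            # Remember the validators so the next request can be conditional.
            cache[url] = (
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
                body,
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError; the cached body is still valid.
        if err.code != 304:
            raise
    return body
```

With something along these lines, a second call to `fetch_if_changed` for an unchanged file costs a few hundred bytes of headers rather than a full download. A real crawler would also need rate limiting and persistent storage, which this sketch leaves out.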
Many of the bots even ignore the robots.txt file and its contents.
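For comparison, honoring robots.txt takes little effort on the crawler's side. The sketch below uses Python's standard `urllib.robotparser` module; the bot name `ExampleBot`, the sample rules, and the example.org URLs are made up for illustration and do not come from any real site's robots.txt.

```python
from urllib import robotparser

# Hypothetical rules: ExampleBot must stay out of /downloads/ and wait
# 10 seconds between requests; every other agent is unrestricted.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /downloads/
Crawl-delay: 10

User-agent: *
Disallow:
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been fetched."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the parsed rules allow user_agent to request url."""
    return parser.can_fetch(user_agent, url)
```

A well-behaved crawler would call `may_fetch` before each request and sleep for `parser.crawl_delay(user_agent)` seconds between requests when the site sets a delay.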