This is pretty wild — our friends at "Read The Docs" saw file download traffic drop by 75% after blocking AI bots.

https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/

AI crawlers need to be more respectful

We talk a bit about the AI crawler abuse we are seeing at Read the Docs, and warn that this behavior is not sustainable.

Read the Docs
Why the Data Ocean Is Being Sectioned Off

Bigger is better approaches in AI create an inexhaustible appetite for users’ data, leading to a rise in user data expropriation, sectioning off of the internet, and “data feudalism.”


@stefan @paulshryock not sure about this modern technique of scaling up endlessly. I've always been pretty happy with “when I get effectively DDoSed, I go down; I'm not paying for their abuse”.

I think the web would really be better overall if more people did that, possibly with “yeah the AI bots are trying to screw us again” announcements :(

@stefan "One crawler downloaded 73 TB of zipped HTML files in May 2024" 😬
@stefan most irritatingly, we creators are now paying at both ends for people taking our work
@chrischinchilla @stefan 💯 the crappy part. Unionized creators (e.g., SAG-AFTRA) have a hard enough time getting fair pay for their work post-AI manipulation. Can’t imagine being an independent creator trying to protect their work.
@stefan while this is important and good and I'm happy for them, the listed traffic costs are ridiculous
@canteen @stefan I see you don't use AWS. Good.

They charge $0.09/GB of egress.
@privateger @stefan The point I was making is that nobody should use AWS or any related exploitative cloud platforms
@stefan they should have known it was AI bots, humans don't read documentation /s
@stefan ugh.... so annoying that these AI companies are just chowing down on everyone's public data... which I personally don't mind to an extent, but when they then turn around and sell to everyone else the stuff they ingested for free... while knowing exactly how much they are scraping, they really need to at least donate back to the sources they are scraping AT MINIMUM

@stefan I've overridden the robots.txt at my nginx load balancer for over 300 websites.

Within the nginx server block I now have this:

location = /robots.txt {
    add_header Content-Type text/plain;
    return 200 "User-agent: *\nDisallow: /\n";
}

Am simply returning a disallow to all bots now... and backing that up with about a hundred block rules implemented against ASNs, user agents, specific IPs, and rate limits across the board.
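The user-agent and rate-limit side of that setup can be sketched in nginx like this (the bot names and limits here are illustrative examples, not the poster's actual rules):

```nginx
# Sketch only: map common AI crawler user agents to a flag (http block).
map $http_user_agent $ai_bot {
    default        0;
    ~*GPTBot       1;
    ~*CCBot        1;
    ~*ClaudeBot    1;
    ~*Bytespider   1;
}

# Basic per-IP rate limit, applied across the board.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    listen 80;

    # Refuse matched bots outright.
    if ($ai_bot) {
        return 403;
    }

    location / {
        limit_req zone=perip burst=20 nodelay;
        # ... normal site config ...
    }
}
```

ASN blocking would sit in front of this (e.g., firewall rules built from an ASN-to-prefix lookup) since nginx itself only sees IPs and headers.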

@stefan I've been seeing a very similar pattern on my much smaller site: https://androiddev.social/@msfjarvis/112848867904557404
Harsh Shandilya (@[email protected]): "I started redirecting #AI crawlers away based on their user agents and the 307s are now the bulk of my traffic. I sure do love the AI revolution"
@stefan I never thought about the cost that AI crawling causes for the host. So not only does the AI bot steal content, it costs the host money and gives nothing in return.
Haha no surprises there, I had to block a lot of these bots as well since they literally DDoS you.
Unreal that people are paying to have their content stolen
@stefan I hope they make their blocklist available, so that others can also limit that kind of traffic :)
@stefan so, hypothetically speaking, what if they just served made up garbage to those clients?
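Hypothetically, that could be as simple as an nginx rule that serves a pre-generated junk file to matched bot user agents instead of the real content (the map entries and file path below are illustrative, not anyone's actual setup):

```nginx
# Illustrative sketch: serve a static decoy page to flagged crawlers.
map $http_user_agent $ai_bot {
    default     0;
    ~*GPTBot    1;
    ~*CCBot     1;
}

server {
    listen 80;
    root /var/www/site;

    location / {
        if ($ai_bot) {
            # Hypothetical decoy file of generated nonsense text.
            rewrite ^ /decoy.html break;
        }
    }
}
```

A small static decoy also keeps egress costs low, unlike serving the real pages.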

@stefan
Cumulative traffic for cromwell-intl.com plus toilet-guru.com running on one FreeBSD host in the Google Cloud has been averaging a little over $1/day in outbound traffic over the past year. On July 2, I added some commonly recommended AI-bot-blocking to robots.txt. A week later, traffic had dropped to a little under 50% of what it had been.

Blue in top of each = traffic Americas -> Americas
Yellow, #3 in each = traffic Americas -> EMEA
Purple at bottom = offset for free CPU/RAM tier
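The "commonly recommended AI-bot-blocking" additions to robots.txt usually look something like this fragment (these agent names are widely published crawler tokens, not necessarily the exact list added here):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

This only works for crawlers that actually honor robots.txt; anything that ignores it needs the server-side blocking discussed elsewhere in this thread.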

@stefan
[narrator voice]
Only later did he realize that he had broken a Python dependency for his two tootbots early on July 9th.
@stefan
Dark Visitors has an API for updating .htaccess to keep up with the ever-changing AI user agents (freemium service):
https://darkvisitors.com/docs/robots-txt