This is pretty wild — our friends at "Read The Docs" saw file download traffic drop by 75% after blocking AI bots.

https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/

AI crawlers need to be more respectful

We talk a bit about the AI crawler abuse we are seeing at Read the Docs, and warn that this behavior is not sustainable.

Read the Docs
Why the Data Ocean Is Being Sectioned Off

Bigger is better approaches in AI create an inexhaustible appetite for users’ data, leading to a rise in user data expropriation, sectioning off of the internet, and “data feudalism.”


@stefan @paulshryock not sure about this modern technique of scaling up endlessly. I've always been pretty happy with “when I get effectively DDoSed, I go down; I'm not paying for their abuse”.

I think the web would really be better overall if more people did that, possibly with “yeah the AI bots are trying to screw us again” announcements :(

@stefan "One crawler downloaded 73 TB of zipped HTML files in May 2024" 😬
@stefan most irritatingly, we creators are now paying at both ends for people taking our work
@chrischinchilla @stefan 💯 the crappy part. Unionized creators (e.g., SAG-AFTRA) have a hard enough time getting fair pay for their work post-AI manipulation. Can’t imagine being an independent creator trying to protect their work.
@stefan while this is important and good and I'm happy for them, the listed traffic costs are ridiculous
@canteen @stefan I see you don't use AWS. Good.

They charge $0.09/GB of egress.
@privateger @stefan The point I was making is that nobody should use AWS or any related exploitative cloud platforms
@stefan they should have known it was AI bots, humans don't read documentation /s
@stefan ugh.... so annoying that these AI companies are just chowing down on everyone's public data... which I personally don't mind to an extent, but when they then turn around and sell to everyone else the stuff they ingested for free... while knowing exactly how much they are scraping, they really need to at least donate back to the sources they are scraping AT MINIMUM

@stefan I've overridden the robots.txt at my nginx load balancer for over 300 websites.

Within the nginx server block I now have this:

location = /robots.txt {
    add_header Content-Type text/plain;
    return 200 "User-agent: *\nDisallow: /\n";
}

Am simply returning a disallow to all bots now... and backing that up with about a hundred block rules implemented against ASNs, user agents, specific IPs, and rate limits across the board.
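The user-agent and rate-limit side of that setup can be sketched in nginx like this (the bot names and limits here are illustrative examples, not the poster's actual rules):

```nginx
# Sketch only: map common AI crawler user agents to a flag (http block).
map $http_user_agent $ai_bot {
    default        0;
    ~*GPTBot       1;
    ~*CCBot        1;
    ~*ClaudeBot    1;
    ~*Bytespider   1;
}

# Basic per-IP rate limit, applied across the board.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    listen 80;

    # Refuse matched bots outright.
    if ($ai_bot) {
        return 403;
    }

    location / {
        limit_req zone=perip burst=20 nodelay;
        # ... normal site config ...
    }
}
```

ASN blocking would sit in front of this (e.g., firewall rules built from an ASN-to-prefix lookup) since nginx itself only sees IPs and headers.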

@stefan I've been seeing a very similar pattern on my much smaller site: https://androiddev.social/@msfjarvis/112848867904557404
Harsh Shandilya (@[email protected]): "I started redirecting #AI crawlers away based on their user agents and the 307s are now the bulk of my traffic. I sure do love the AI revolution"
@stefan I never thought about the cost that AI crawling causes for the host. So not only does the AI bot steal content, it costs the host money and gives nothing in return.
Haha no surprises there, I had to block a lot of these bots as well since they literally DDoS you.
Unreal that people are paying to have their content stolen
@stefan I hope they make their blocklist available, so that others can also limit that kind of traffic :)
@stefan so, hypothetically speaking, what if they just served made up garbage to those clients?
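Hypothetically, that could be as simple as an nginx rule that serves a pre-generated junk file to matched bot user agents instead of the real content (the map entries and file path below are illustrative, not anyone's actual setup):

```nginx
# Illustrative sketch: serve a static decoy page to flagged crawlers.
map $http_user_agent $ai_bot {
    default     0;
    ~*GPTBot    1;
    ~*CCBot     1;
}

server {
    listen 80;
    root /var/www/site;

    location / {
        if ($ai_bot) {
            # Hypothetical decoy file of generated nonsense text.
            rewrite ^ /decoy.html break;
        }
    }
}
```

A small static decoy also keeps egress costs low, unlike serving the real pages.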

@stefan
Cumulative traffic for cromwell-intl.com plus toilet-guru.com running on one FreeBSD host in the Google Cloud has been averaging a little over $1/day in outbound traffic over the past year. On July 2, I added some commonly recommended AI-bot-blocking to robots.txt. A week later, traffic had dropped to a little under 50% of what it had been.

Blue in top of each = traffic Americas -> Americas
Yellow, #3 in each = traffic Americas -> EMEA
Purple at bottom = offset for free CPU/RAM tier
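The "commonly recommended AI-bot-blocking" additions to robots.txt usually look something like this fragment (these agent names are widely published crawler tokens, not necessarily the exact list added here):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

This only works for crawlers that actually honor robots.txt; anything that ignores it needs the server-side blocking discussed elsewhere in this thread.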

@stefan
[narrator voice]
Only later did he realize that he had broken a Python dependency for his two tootbots early on July 9th.
@stefan
Dark Visitors has an API for updating .htaccess to keep up with the ever-changing AI user agents (freemium service):
https://darkvisitors.com/docs/robots-txt