My entire website is 44MB in size. Most of this is images, of course.

Yesterday 1.2GB of data was transferred from my webserver, including EVERY image several dozen times.

Either a lot of people discovered my blog last week and spent the whole day reading ALL of my posts, or there's some AI scraping going on again.

I'd hate to do it, but I'm seriously considering putting some Anubis or Cloudflare protection in front of it. Stuff like this really pisses me off...

@82mhz

With images, you could keep them from loading if they don't have the HTTP referrer header set to something on the domain? That would prevent hotlinking and probably most scraping.

@amin
How would I do that? Sorry for the stupid question, but I'm not a web hosting expert, I truly don't know...

@82mhz

Hmm, what's your hosting setup like? Where/how do you host your site?

@amin
I'm with a small web host and just have some webspace there... so I'm not in control of how the server is configured. So if it isn't something I can set in the HTML on my end or through some modules I can load in .htaccess, I think I'm out of luck there...

@82mhz

If you don't have access to the reverse proxy I don't think you'd be able to set up Anubis anyway. :(

Hm, there might be a way to do that hotlink protection thing in htaccess, though! Been a long time since I've written one, but it's the kind of thing they do well… some sort of rewrite rule for requests to certain file extensions (ie images) that don't have referrers?

@82mhz

You can definitely block access to certain user agents (ie known LLM crawlers) that way.
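
For example, a minimal sketch (assuming Apache with mod_rewrite enabled; the crawler names here are just a few illustrative entries, and the blocklists mentioned further down the thread cover far more):

RewriteEngine on
# Refuse requests whose User-Agent matches known AI crawlers (example names only)
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider) [NC]
RewriteRule .* - [F,L]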

@amin
I'll look into this, thanks for the suggestion!
I was also recommended this repo which I implemented:
https://github.com/ai-robots-txt/ai.robots.txt
GitHub - ai-robots-txt/ai.robots.txt: A list of AI agents and robots to block.

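For reference, the robots.txt generated from that list boils down to a long run of entries roughly like this (only a few example names shown here; note that robots.txt is purely advisory, so it only stops crawlers that choose to honor it):

User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
Disallow: /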

@82mhz

Neil Clarke's article has recommendations on blocking via .htaccess, worth a look. https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

@82mhz

If it's running Apache (which is what uses the htaccess files) and you have the rewrite module enabled (it's pretty common), you should be able to create a .htaccess file containing something like


RewriteEngine on
RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?82mhz\.net [NC]
RewriteRule \.(jpg|jpeg|png|gif|svg)$ - [NC,F,L]

which should check the condition "if the referrer is anything other than your site" ([NC] = case insensitive), and then apply the rule that any request matching image filetypes (case insensitive) should be "F"orbidden, and that rule-processing for this should stop here rather than continuing to look for other rules ([L]).

If the server isn't Apache, you'd have to suss out what it is running, and use the corresponding rewrite rules for that.

@amin

@gumnos
That's cool, I gotta give this a try! Thanks for the suggestion :)

@amin

@gumnos @82mhz

Brilliantly done, thanks for this. :)

@82mhz

The problem with Anubis is RSS—if you let it through, LLM scrapers can (maybe) just parse it. If you don't, genuine users are gonna have problems.

The default config mostly only bothers known bad actors or regular web browsers, though, and leaves feed readers and terminal web browsers alone, which I like.

@82mhz
Could you just use something that slows down scrapers?

@82mhz

I don't have a lot of photos on mine, but #AVIF image compression helps keep the file sizes crazy low while still looking acceptable.

@82mhz I think I saw your website recently in some Mastodon post. The link-preview infoboxes in Fediverse software are fetched independently by every participating server (depending on followers, hashtags etc. that can be a wide net), which can amount to a lot of traffic.

I documented my Mastodon-stampede-optimization-with-a-side-of-AI-blockage at https://patrick.georgi-clan.de/posts/caching-mastodon-preview-card-responses/, although that still won't reduce the traffic, just the CPU overhead for creating the same response a thousand times...

For further optimization, the server would have to send fediverse servers an optimized response that only contains OpenGraph information (which is all they care about), which is more involved... (a rough sketch follows below the preview card)

Caching Mastodon Preview Card Responses

I ran into two instances recently where people remarked that the Fediverse can be a bit of a Distributed Denial of Service attack: When posts link to an URL, some Fediverse software helpfully tries to collect some metadata from the page to show a preview card, like any modern social media software is supposed to do. The problem is that in the Fediverse, the post gets replicated to all servers that are supposed to receive the post, through subscriptions or reposts, and every single one of these servers will download the same file for the same data, usually within a very short period of time.

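A rough, hypothetical sketch of that OpenGraph-only idea in .htaccess terms (not something from Patrick's post): it assumes Apache with mod_rewrite, that fediverse preview fetchers can be recognized by their User-Agent, and that a small /og/<slug>.html containing only the OpenGraph meta tags is pre-generated for each post; the path layout and user-agent names are illustrative.

RewriteEngine on
# Assumption: preview fetchers identify themselves with these names in the
# User-Agent header; check your access logs and adjust accordingly.
RewriteCond %{HTTP_USER_AGENT} (Mastodon|Pleroma|Akkoma) [NC]
# Only rewrite when a pre-generated OpenGraph-only file exists for the post
RewriteCond %{DOCUMENT_ROOT}/og/$1.html -f
RewriteRule ^posts/([^/]+)/?$ /og/$1.html [L]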

@patrick
Hi, thanks for the suggestion! I heard this phenomenon being called the "Mastodon hug of death" 😄

I don't know why they don't just implement a random delay before fetching the metadata; that would immediately mitigate the problem of hundreds or thousands of instances hammering a server at the same time, but whatever.

This hasn't been an issue for me so far; I guess my web host has enough capacity, and my account is quite small, so there aren't too many Mastodon instances showing up. I think it was fewer than 200 last time I checked.

Andre suggested blocking AI bots via .htaccess which is similar to what you're doing as far as I can tell:

https://fedi.jaenis.ch/@andre/statuses/01JY8TCDWF2PA4QFC0NQMF4CWH

@82mhz @patrick you know, I just thought about this phenomenon today. Wouldn't it be super easy to generate the link preview once on the origin server of the post and have all other instances grab it from there or link directly to it? Somewhat similar to how pictures are handled.
It wouldn't be compatible with all servers right away, but it sounds like the obvious solution to me and shouldn't be too hard to implement.
@irgndsondepp @patrick
Could be easy I guess, but this problem has existed in Mastodon for years and nothing is being done to fix it, so I guess it's not much of a priority -.-