My entire website is 44MB in size. Most of that is images, of course.

Yesterday 1.2 GB of data was transferred from my webserver, including EVERY image several dozen times.

Either a lot of people discovered my blog last week and spent the whole day reading ALL of my posts, or there's some AI scraping going on again.

I'd hate to do it, but I'm seriously considering putting some Anubis or Cloudflare protection in front of it. Stuff like this really pisses me off...

@82mhz

With images, you could keep them from loading if they don't have the HTTP referrer header set to something on the domain? That would prevent hotlinking and probably most scraping.

@amin
How would I do that? Sorry for the stupid question, but I'm not a web hosting expert, I truly don't know...

@82mhz

Hmm, what's your hosting setup like? Where/how do you host your site?

@amin
I'm with a small web host and just have webspace there, so I'm not in control of how the server is configured. If it isn't something I can set in the HTML on my end or through some modules I can load in .htaccess, I think I'm out of luck there...

@82mhz

If you don't have access to the reverse proxy I don't think you'd be able to set up Anubis anyway. :(

Hm, there might be a way to do that hotlink protection thing in .htaccess, though! Been a long time since I've written one, but it's the kind of thing they do well… some sort of rewrite rule for requests to certain file extensions (ie images) that don't have referrers?
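That rewrite-rule idea might look something like this (a rough sketch, untested; assumes the host runs Apache with mod_rewrite enabled, and `example.com` stands in for the real domain):

```apache
RewriteEngine On
# Let requests with no Referer through (direct visits; privacy tools often strip it)
RewriteCond %{HTTP_REFERER} !^$
# Block image requests whose Referer is NOT our own domain
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(png|jpe?g|gif|webp|svg)$ - [F,NC]
```

Dropping the first condition would also catch scrapers that send no Referer at all, but it would break direct visits and visitors whose browsers suppress the header, so there's a tradeoff.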

@82mhz

You can definitely block access to certain user agents (ie known LLM crawlers) that way.
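A user-agent block in .htaccess could look roughly like this (again a sketch; the agent strings are a few known AI crawlers, nowhere near a complete list):

```apache
RewriteEngine On
# Refuse requests from known LLM crawler user agents (case-insensitive match)
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Bytespider) [NC]
RewriteRule .* - [F,L]
```

Worth keeping in mind that user-agent strings are trivially spoofed, so this only stops crawlers that identify themselves honestly.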

@amin
I'll look into this, thanks for the suggestion!
I was also recommended this repo which I implemented:
https://github.com/ai-robots-txt/ai.robots.txt
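For what it's worth, the list that repo maintains boils down to a robots.txt along these lines (a sketch with just a couple of entries; robots.txt is purely advisory, so only crawlers that choose to honor it are affected):

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```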

@82mhz

Neil Clarke's article has recommendations on blocking via .htaccess, worth a look. https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

Block the Bots that Feed "AI" Models by Scraping Your Website – Neil Clarke