Yesterday my VPS triggered an alert after being hit by a huge spike in incoming traffic, peaking at 55 GB at 2:15pm and lasting for an hour.

Upon investigating, it turned out my PeerTube instance was the target.

Where did the traffic come from?

meta-externalagent (aka Meta's web crawler, used to grab content to train its AI systems).

I feel a little bit violated thinking my Fediverse promo video was grabbed by it, sigh.

#AIcritic #NoAI

@_elena I was forced to take down my SearXNG instance because of these stupid bots.
@ml @_elena I have mine behind Authelia so only I can use my own SearXNG

@andypiper @ml @_elena

I got hit by this as well last week, 30% of all hits from the bot in the last 14 days.

I've had no response from the email address they published on their bot page, so all those requests are getting 301'd to a 100 GiB gzip bomb for now

https://blog.hardill.me.uk/2026/03/12/wtf-facebook-doing/
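For anyone curious how the gzip-bomb trick works: a minimal shell sketch (sizes and paths are illustrative, not from Ben's actual setup). A stream of zeros compresses at roughly 1000:1 with gzip, so ~100 GiB of decompressed payload fits in about 100 MiB on disk.

```shell
# Sketch: build a gzip bomb -- a small file on disk that decompresses
# into a huge stream of zeros. Sizes and paths here are illustrative.
make_gzip_bomb() {
  size_mb=$1
  out=$2
  # zeros compress at roughly 1000:1 with gzip -9
  dd if=/dev/zero bs=1M count="$size_mb" 2>/dev/null | gzip -9 > "$out"
}

# e.g.: make_gzip_bomb 102400 /var/www/bomb.gz   # ~100 GiB uncompressed
```

To make the bomb bite, the web server has to serve the file with a `Content-Encoding: gzip` header so the crawler transparently decompresses it; how to do that depends on your server.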

@andypiper @ml @_elena my SearXNG has been fine so far (at least to my knowledge), but thanks for the heads up, I should really put it behind my SSO!
@kate @andypiper @_elena I was running a public instance. Also didn't use Cloudflare as requested by some users.
@ml @andypiper @_elena mine is public and I also don't use Cloudflare (just my own VPS with WireGuard for tunneling the traffic)
@ml @_elena when I set up my own instance I also set up Anubis, and all IPs that fail it get added to a firewall drop list for 90 days
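The drop-list side of that setup can be sketched with nftables (a hypothetical ruleset; table and set names are illustrative, and feeding failed IPs in from Anubis would be a separate script). Entries added to the set expire automatically after 90 days:

```
# Hypothetical nftables ruleset: a drop set with a built-in 90-day timeout.
table inet filter {
    set ai_block {
        type ipv4_addr
        flags timeout
        timeout 90d
    }
    chain input {
        type filter hook input priority 0; policy accept;
        ip saddr @ai_block drop
    }
}
```

An offending IP is then banned with `nft add element inet filter ai_block '{ 203.0.113.7 }'`; the 90-day timeout is inherited from the set.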
@_elena You can block such AI crawlers with a robots.txt file. If the crawlers don't comply, you can also use Fail2Ban
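For the Fail2Ban route, here's a hypothetical filter/jail pair. It assumes nginx's combined access log format; the filter name, bot list, and paths are all illustrative:

```ini
# /etc/fail2ban/filter.d/ai-bots.conf  (hypothetical name)
[Definition]
# Match the quoted User-Agent field at the end of nginx's combined log line
failregex = ^<HOST> .*"[^"]*(?:GPTBot|ClaudeBot|Bytespider|meta-externalagent)[^"]*"$

# /etc/fail2ban/jail.d/ai-bots.local
[ai-bots]
enabled  = true
port     = http,https
filter   = ai-bots
logpath  = /var/log/nginx/access.log
maxretry = 1
# 90 days, in seconds
bantime  = 7776000
```

With `maxretry = 1`, a single request from a matching user agent is enough to get the IP banned at the firewall.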
@jools @_elena
Oh you sweet innocent child…

@RealGene

Never judge people you know nothing about... 😉

@_elena

@_elena on the plus side, Meta's LLMs are so gullible they might start extolling the Fediverse.

@_elena

I recently updated my robots.txt regarding "AI" crawlers, and already had some crawlers on the blacklist. Maybe more should be blocked at an earlier stage.

@nick
Which entries did you add?

Could you post an example? That would definitely be helpful for others as well.

@_elena

@juergen

Sure.

I checked my logs, did some research, and glued this together.

Keep in mind that robots.txt doesn't actually block anything; compliance is voluntary.


# Block "AI" crawlers, disallow "AI" training
User-Agent: AI2Bot
User-Agent: AI2Bot-Dolma
User-Agent: aiHitBot
User-Agent: Amazonbot
User-Agent: anthropic-ai
User-Agent: Applebot
User-Agent: Applebot-Extended
User-Agent: AwarioBot
User-Agent: AwarioSmartBot
User-Agent: AwarioRssBot
User-Agent: Bytespider
User-Agent: CCBot
User-Agent: ChatGPT-User
User-Agent: ClaudeBot
User-Agent: Claude-User
User-Agent: Claude-SearchBot
User-Agent: Claude-Web
User-Agent: cohere-ai
User-Agent: cohere-training-data-crawler
User-Agent: Cotoyogi
User-Agent: DataForSeoBot
User-Agent: Diffbot
User-Agent: DuckAssistBot
User-Agent: FacebookBot
User-Agent: Factset_spyderbot
User-Agent: FirecrawlAgent
User-Agent: Google-CloudVertexBot
User-Agent: Google-Extended
User-Agent: GPTBot
User-Agent: ICC-Crawler
User-Agent: ImagesiftBot
User-Agent: img2dataset
User-Agent: Kangaroo Bot
User-Agent: Meltwater
User-Agent: Meta-ExternalAgent
User-Agent: Meta-ExternalFetcher
User-Agent: OAI-SearchBot
User-Agent: Omgili
User-Agent: Omgilibot
User-Agent: PanguBot
User-Agent: peer39_crawler
User-Agent: PerplexityBot
User-Agent: Perplexity-User
User-Agent: Petalbot
User-Agent: Scrapy
User-Agent: Seekr
User-Agent: SemrushBot-OCOB
User-Agent: Sentibot
User-Agent: webzio-extended
User-Agent: TikTokSpider
User-Agent: Timpibot
User-Agent: TurnitinBot
User-Agent: VelenPublicWebCrawler
User-Agent: Youbot
Disallow: /
DisallowAITraining: /

User-Agent: *
DisallowAITraining: /
Content-Usage: ai=n
Allow: /
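Since robots.txt is advisory only (as noted above), enforcement has to happen at the web server. A minimal sketch, assuming nginx and covering only a handful of the agents above (the variable name is illustrative; the `map` goes in the `http {}` context):

```nginx
# Hypothetical nginx sketch: hard-block a few known AI crawler user agents.
map $http_user_agent $ai_bot {
    default                 0;
    "~*meta-externalagent"  1;
    "~*GPTBot"              1;
    "~*ClaudeBot"           1;
    "~*Bytespider"          1;
    "~*CCBot"               1;
}

server {
    # ... existing listen/server_name/location config ...
    if ($ai_bot) {
        return 403;
    }
}
```

Unlike robots.txt, this returns 403 regardless of whether the crawler chooses to behave.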

@_elena

dang I hope I didn't trigger anything by sharing your video on Facebook. I'm just trying to get some friends and family to come to the fediverse and hopefully delete Facebook (again).
@_elena
I can imagine that is a terrible feeling.
@_elena not sure if you've seen this https://bluetoot.hardill.me.uk/@ben/116243885816341998, I particularly like his response of using a 301 redirect to a massive file!

@_elena ugh. That’s just so aggravating. I have read several people mention that the meta bot is being aggressive and crashing sites.

That they can so blatantly steal data is just…

Really hope that the eu is going to do something about their theft.

@_elena Maybe Meta's AI bots might finally start giving people good advice.
@_elena I have had that bot trapped in an iocaine maze for the last week or so on my selfhosted forgejo instance. It is relentless.
@_elena Ew. Gross. I feel icky and violated just reading this.
@_elena They’re doing that on purpose. My hosting provider has already contacted me to say that my site (SearxNG) is causing major traffic issues. Because of this, many small instances may have to be taken offline again. It’s like a digital war...
just curious, was your searxng instance listed on searx.space? Do you know if many people were using it? I've just set it up recently and so far no issues, but it's mostly just being used by my wife, son, and myself.
@sam At first, there were no problems. I had only been using the instance myself; however, I hadn’t specifically hidden it. Then one day, my hosting provider contacted me to say that an enormous amount of traffic was passing through the instance... So I deactivated it. I later found similar cases online.

@_elena that’s frustrating — especially when it spikes traffic like that without warning.

I’m a Linux/Windows system administrator, and this kind of load can be managed. You can limit or block such crawlers and also protect your VPS with anti-DDoS, rate limiting, and traffic filtering.

If you want, I can help you secure and optimize your setup — or we can provide a VPS with built-in protection.

@_elena I had a similar experience with my AWS hosted websites, but from Chinese crawlers. 😞

@_elena

Can't understand much of this thread, but get the gist. Seems like the rebel alliance at work. You guys are wonderful!

@_elena I've been considering setting up a PeerTube site for my personal videos. Is there any defense against AI bots doing a DDoS? Can they be pre-perma-banned?

@_elena Would you be able to use a user agent block list like ai.robots.txt? I have a cron job that updates it daily from their git repo and then restarts nginx.

Except I strip out the part that refers known agents to robots.txt and just give them a 403, because none of them ever honor the robots file anyway.

https://github.com/ai-robots-txt/ai.robots.txt
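The daily update job described above could look something like this cron entry (hypothetical paths; the repo publishes a generated nginx snippet, assumed here to be `nginx-block-ai-bots.conf` on the `main` branch, and a reload is gentler than a full restart):

```
# /etc/cron.d/ai-robots-blocklist  (hypothetical)
# Fetch the repo's generated nginx denylist nightly; only reload nginx if
# the resulting configuration still validates.
0 4 * * * root curl -fsSL https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/nginx-block-ai-bots.conf -o /etc/nginx/conf.d/block-ai-bots.conf && nginx -t && systemctl reload nginx
```

Gating the reload on `nginx -t` means a malformed download can't take the site down.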


@_elena that's so stupid... 🫤

Here's a repo that blocks AI crawlers at the web server level, in this case Apache: https://codeberg.org/creatura85/htaccess
There's probably a similar repo for nginx as well?
