Yesterday my VPS triggered an alert after being hit by a huge spike in incoming traffic, peaking at 55 GB at 2:15pm and lasting for an hour.

Upon investigating, it turned out my PeerTube instance was the target.

Where did the traffic come from?

meta-externalagent (aka Meta's web crawler, used to grab content to train its AI systems).

I feel a little bit violated thinking my Fediverse promo video was grabbed by it, sigh.

#AIcritic #NoAI

@_elena I was forced to take down my SearXNG instance because of these stupid bots.
@ml @_elena I have mine behind Authelia so only I can use my own SearXNG

@andypiper @ml @_elena

I got hit by this as well last week, 30% of all hits from the bot in the last 14 days.

I've had no response from the email address they published on their bot page, so all those requests are getting 301'd to a 100 GiB gzip bomb for now

https://blog.hardill.me.uk/2026/03/12/wtf-facebook-doing/
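For anyone curious how the gzip-bomb trick works: a minimal shell sketch (sizes and paths are illustrative, not from Ben's actual setup). A stream of zeros compresses at roughly 1000:1 with gzip, so ~100 GiB of decompressed payload fits in about 100 MiB on disk.

```shell
# Sketch: build a gzip bomb -- a small file on disk that decompresses
# into a huge stream of zeros. Sizes and paths here are illustrative.
make_gzip_bomb() {
  size_mb=$1
  out=$2
  # zeros compress at roughly 1000:1 with gzip -9
  dd if=/dev/zero bs=1M count="$size_mb" 2>/dev/null | gzip -9 > "$out"
}

# e.g.: make_gzip_bomb 102400 /var/www/bomb.gz   # ~100 GiB uncompressed
```

To make the bomb bite, the web server has to serve the file with a `Content-Encoding: gzip` header so the crawler transparently decompresses it; how to do that depends on your server.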

@andypiper @ml @_elena my SearXNG has been fine so far (at least to my knowledge), but thanks for the heads up, I should really put it behind my SSO!
@kate @andypiper @_elena I was running a public instance. Also didn't use Cloudflare as requested by some users.
@ml @andypiper @_elena mine is public and I also don't use Cloudflare (just my own VPS with WireGuard for tunneling the traffic)
@ml @_elena when I set up my own instance I also set up Anubis, and all IPs that fail it get added to a firewall drop list for 90 days
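The drop-list side of that setup can be sketched with nftables (a hypothetical ruleset; table and set names are illustrative, and feeding failed IPs in from Anubis would be a separate script). Entries added to the set expire automatically after 90 days:

```
# Hypothetical nftables ruleset: a drop set with a built-in 90-day timeout.
table inet filter {
    set ai_block {
        type ipv4_addr
        flags timeout
        timeout 90d
    }
    chain input {
        type filter hook input priority 0; policy accept;
        ip saddr @ai_block drop
    }
}
```

An offending IP is then banned with `nft add element inet filter ai_block '{ 203.0.113.7 }'`; the 90-day timeout is inherited from the set.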
@_elena You can block such AI crawlers with a robots.txt file. If the crawlers don't comply, you can also use Fail2Ban
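For the Fail2Ban route, here's a hypothetical filter/jail pair. It assumes nginx's combined access log format; the filter name, bot list, and paths are all illustrative:

```ini
# /etc/fail2ban/filter.d/ai-bots.conf  (hypothetical name)
[Definition]
# Match the quoted User-Agent field at the end of nginx's combined log line
failregex = ^<HOST> .*"[^"]*(?:GPTBot|ClaudeBot|Bytespider|meta-externalagent)[^"]*"$

# /etc/fail2ban/jail.d/ai-bots.local
[ai-bots]
enabled  = true
port     = http,https
filter   = ai-bots
logpath  = /var/log/nginx/access.log
maxretry = 1
# 90 days, in seconds
bantime  = 7776000
```

With `maxretry = 1`, a single request from a matching user agent is enough to get the IP banned at the firewall.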
@jools @_elena
Oh you sweet innocent child…

@RealGene

Never judge people you know nothing about... 😉

@_elena

@_elena on the plus side, Meta's LLMs are so gullible they might start extolling the Fediverse.

@_elena

I recently updated my robots.txt regarding "AI" crawlers, and already had some crawlers on the blacklist. Maybe more should be blocked at an earlier stage.

@nick
Which entries did you add?

Could you post an example? That would definitely be helpful for others as well.

@_elena

@juergen

Sure.

I checked my logs, did some research, and glued this together.

Keep in mind that robots.txt doesn't actually block anything; compliance is voluntary.


# Block "AI" crawlers, disallow "AI" training
User-Agent: AI2Bot
User-Agent: AI2Bot-Dolma
User-Agent: aiHitBot
User-Agent: Amazonbot
User-Agent: anthropic-ai
User-Agent: Applebot
User-Agent: Applebot-Extended
User-Agent: AwarioBot
User-Agent: AwarioSmartBot
User-Agent: AwarioRssBot
User-Agent: Bytespider
User-Agent: CCBot
User-Agent: ChatGPT-User
User-Agent: ClaudeBot
User-Agent: Claude-User
User-Agent: Claude-SearchBot
User-Agent: Claude-Web
User-Agent: cohere-ai
User-Agent: cohere-training-data-crawler
User-Agent: Cotoyogi
User-Agent: DataForSeoBot
User-Agent: Diffbot
User-Agent: DuckAssistBot
User-Agent: FacebookBot
User-Agent: Factset_spyderbot
User-Agent: FirecrawlAgent
User-Agent: Google-CloudVertexBot
User-Agent: Google-Extended
User-Agent: GPTBot
User-Agent: ICC-Crawler
User-Agent: ImagesiftBot
User-Agent: img2dataset
User-Agent: Kangaroo Bot
User-Agent: Meltwater
User-Agent: Meta-ExternalAgent
User-Agent: Meta-ExternalFetcher
User-Agent: OAI-SearchBot
User-Agent: Omgili
User-Agent: Omgilibot
User-Agent: PanguBot
User-Agent: peer39_crawler
User-Agent: PerplexityBot
User-Agent: Perplexity-User
User-Agent: Petalbot
User-Agent: Scrapy
User-Agent: Seekr
User-Agent: SemrushBot-OCOB
User-Agent: Sentibot
User-Agent: webzio-extended
User-Agent: TikTokSpider
User-Agent: Timpibot
User-Agent: TurnitinBot
User-Agent: VelenPublicWebCrawler
User-Agent: Youbot
Disallow: /
DisallowAITraining: /

User-Agent: *
DisallowAITraining: /
Content-Usage: ai=n
Allow: /
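Since robots.txt is advisory only (as noted above), enforcement has to happen at the web server. A minimal sketch, assuming nginx and covering only a handful of the agents above (the variable name is illustrative; the `map` goes in the `http {}` context):

```nginx
# Hypothetical nginx sketch: hard-block a few known AI crawler user agents.
map $http_user_agent $ai_bot {
    default                 0;
    "~*meta-externalagent"  1;
    "~*GPTBot"              1;
    "~*ClaudeBot"           1;
    "~*Bytespider"          1;
    "~*CCBot"               1;
}

server {
    # ... existing listen/server_name/location config ...
    if ($ai_bot) {
        return 403;
    }
}
```

Unlike robots.txt, this returns 403 regardless of whether the crawler chooses to behave.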

@_elena

dang I hope I didn't trigger anything by sharing your video on Facebook. I'm just trying to get some friends and family to come to the fediverse and hopefully delete Facebook (again).
@_elena
I can imagine that is a terrible feeling.
@_elena not sure if you've seen this https://bluetoot.hardill.me.uk/@ben/116243885816341998, I particularly like his response of using a 301 redirect to a massive file!

@_elena ugh. That’s just so aggravating. I have read several people mention that the meta bot is being aggressive and crashing sites.

That they can so blatantly steal data is just…

Really hope that the eu is going to do something about their theft.

@_elena Maybe Meta's AI bots might finally start giving people good advice.
@_elena I have had that bot trapped in an iocaine maze for the last week or so on my selfhosted forgejo instance. It is relentless.
@_elena Ew. Gross. I feel icky and violated just reading this.
@_elena They’re doing that on purpose. My hosting provider has already contacted me to say that my site (SearxNG) is causing major traffic issues. Because of this, many small instances may have to be taken offline again. It’s like a digital war...
just curious, was your searxng instance listed on searx.space? Do you know if many people were using it? I've just set it up recently and so far no issues, but it's mostly just being used by my wife, son, and myself.
@sam At first, there were no problems. I had only been using the instance myself; however, I hadn’t specifically hidden it. Then one day, my hosting provider contacted me to say that an enormous amount of traffic was passing through the instance... So I deactivated it. I later found similar cases online.

@_elena that’s frustrating — especially when it spikes traffic like that without warning.

I’m a Linux/Windows system administrator, and this kind of load can be managed. You can limit or block such crawlers and also protect your VPS with anti-DDoS, rate limiting, and traffic filtering.

If you want, I can help you secure and optimize your setup — or we can provide a VPS with built-in protection.

@_elena I had a similar experience with my AWS hosted websites, but from Chinese crawlers. 😞

@_elena

Can't understand much of this thread, but get the gist. Seems like the rebel alliance at work. You guys are wonderful!

@_elena I've been considering setting up a PeerTube site for my personal videos. Is there any defense against AI bots doing a DDoS? Can they be pre-perma-banned?

@_elena Would you be able to use a user agent block list like ai.robots.txt? I have a cron job that updates it daily from their git repo and then restarts nginx.

Except I strip out the part that refers known agents to robots.txt and just give them a 403, because none of them ever honor the robots file anyway.

https://github.com/ai-robots-txt/ai.robots.txt
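The daily update job described above could look something like this cron entry (hypothetical paths; the repo publishes a generated nginx snippet, assumed here to be `nginx-block-ai-bots.conf` on the `main` branch, and a reload is gentler than a full restart):

```
# /etc/cron.d/ai-robots-blocklist  (hypothetical)
# Fetch the repo's generated nginx denylist nightly; only reload nginx if
# the resulting configuration still validates.
0 4 * * * root curl -fsSL https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/nginx-block-ai-bots.conf -o /etc/nginx/conf.d/block-ai-bots.conf && nginx -t && systemctl reload nginx
```

Gating the reload on `nginx -t` means a malformed download can't take the site down.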


@_elena that's so stupid... 🫤

Here's a repo that blocks AI crawlers at the web server level, in this case Apache: https://codeberg.org/creatura85/htaccess
There's probably a similar repo for nginx as well?
