Yesterday my VPS set off a warning, as it was hit by a huge spike in incoming traffic, peaking at 55GB at 2:15pm and lasting for an hour.

Upon investigating, it turned out that my PeerTube instance was the target.

Where did the traffic come from?

meta-externalagent (aka Meta's web crawler which is used to grab content to train its AI system).

I feel a little bit violated thinking my Fediverse promo video was grabbed by it, sigh.

#AIcritic #NoAI

@_elena

I recently updated my robots.txt regarding "AI" crawlers, and I already had some crawlers on the blocklist. Maybe more of them should be blocked at an earlier stage.
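To illustrate what blocking "at an earlier stage" could look like, here is a minimal sketch in Python, assuming the site is served through a WSGI application. That is an assumption for illustration only: in practice this kind of filter usually lives in the reverse proxy or firewall in front of the service. The names `BLOCKED_AGENT_TOKENS`, `is_blocked` and `BlockCrawlers` are hypothetical, and the token list is just a small sample.

```python
# Hypothetical sample of crawler tokens; extend it from your own logs.
BLOCKED_AGENT_TOKENS = ("meta-externalagent", "gptbot", "claudebot", "bytespider")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


class BlockCrawlers:
    """WSGI middleware that answers 403 before the wrapped app is reached."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden\n"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]
        # Everything else is passed through untouched.
        return self.app(environ, start_response)
```

Unlike robots.txt, this refuses the request itself, but it only works for crawlers that send an honest User-Agent header.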

@nick
Which entries did you add?

Could you post an example? That would definitely be helpful for others as well.

@_elena

@juergen

Sure.

I checked my logs and did some research and glued this together.

Keep in mind that robots.txt does not actually block anything: it only asks crawlers to stay away, and a crawler is free to ignore it.


# Block "AI" crawlers, disallow "AI" training
User-Agent: AI2Bot
User-Agent: AI2Bot-Dolma
User-Agent: aiHitBot
User-Agent: Amazonbot
User-Agent: anthropic-ai
User-Agent: Applebot
User-Agent: Applebot-Extended
User-Agent: AwarioBot
User-Agent: AwarioSmartBot
User-Agent: AwarioRssBot
User-Agent: Bytespider
User-Agent: CCBot
User-Agent: ChatGPT-User
User-Agent: ClaudeBot
User-Agent: Claude-User
User-Agent: Claude-SearchBot
User-Agent: Claude-Web
User-Agent: cohere-ai
User-Agent: cohere-training-data-crawler
User-Agent: Cotoyogi
User-Agent: DataForSeoBot
User-Agent: diffbot
User-Agent: Diffbot
User-Agent: DuckAssistBot
User-Agent: Facebookbot
User-Agent: FacebookBot
User-Agent: Factset_spyderbot
User-Agent: FirecrawlAgent
User-Agent: Google-CloudVertexBot
User-Agent: Google-Extended
User-Agent: GPTBot
User-Agent: ICC-Crawler
User-Agent: ImagesiftBot
User-Agent: img2dataset
User-Agent: Kangaroo Bot
User-Agent: Meltwater
User-Agent: Meta-ExternalAgent
User-Agent: Meta-ExternalFetcher
User-Agent: OAI-SearchBot
User-Agent: Omgili
User-Agent: Omgilibot
User-Agent: PanguBot
User-Agent: peer39_crawler
User-Agent: PerplexityBot
User-Agent: Perplexity-User
User-Agent: Petalbot
User-Agent: Scrapy
User-Agent: Seekr
User-Agent: SemrushBot-OCOB
User-Agent: Sentibot
User-Agent: webzio-extended
User-Agent: TikTokSpider
User-Agent: Timpibot
User-Agent: TurnitinBot
User-Agent: VelenPublicWebCrawler
User-Agent: Youbot
Disallow: /
DisallowAITraining: /

User-Agent: *
DisallowAITraining: /
Content-Usage: ai=n
Allow: /
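
If you want to check how a well-behaved crawler would read rules like these, a rough sketch with Python's standard `urllib.robotparser` could look like the following. It uses a shortened excerpt of the list above, and `SomeBrowser` is a made-up agent name for the comparison. Python's parser only understands the standard directives and silently skips lines such as `DisallowAITraining` and `Content-Usage`, so the excerpt leaves them out.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the robots.txt above (standard directives only).
ROBOTS_TXT = """\
# Block "AI" crawlers
User-Agent: GPTBot
User-Agent: Meta-ExternalAgent
Disallow: /

User-Agent: *
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching a URL."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Agent names are matched case-insensitively.
meta_allowed = parser.can_fetch("meta-externalagent", "/videos/")
other_allowed = parser.can_fetch("SomeBrowser", "/videos/")  # hypothetical agent

print("meta-externalagent allowed:", meta_allowed)
print("SomeBrowser allowed:", other_allowed)
```

This only tells you what a crawler that honours robots.txt would do. It says nothing about crawlers that never read the file.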

@_elena