Mastodawn

Picnic on the Data Highway

This week I was surprised by an unusual wave of requests to my server. At first I thought I might have misconfigured something, but after talking to Uberspace support it was clear that my WordPress Multisite setup, which hosts this site, Gefährliches Halbwissen, and Um' Pudding, is being bombarded and therefore overloaded. As an ordinary user of a shared hosting service, there is little you can do about it other than try to figure out what is going on and watch the site get taken apart. Funnily enough, it reminded me of a book I read in 2017.

https://niklasbarning.de/2025/12/02/picknick-an-der-datenautobahn/

niklasbarning

KI & Koffein 5d ago

Traffic jam on the data highway. The data collection by AI companies has now reached such proportions that it is causing real congestion.
#KI #kuenstlicheintelligenz #crawler

cutterkom Nov 28

Look who was here

#crawler

Frehi Nov 27

apache2-ai-bots: a Debian package which configures Apache to block AI crawlers.

https://packages.debian.org/sid/main/apache2-ai-bots

After installing this package, you need to run
# a2enconf block-ai-bots
# systemctl reload apache2
to load it.

Unfortunately, it also blocks access to robots.txt, which I think it should not, because some of these bots will stop crawling if you instruct them to do so in robots.txt.

#apache #Debian #AI #crawler #bot

Debian -- Details of package apache2-ai-bots in sid

list of AI agents and robots to block (apache2)

Tomas Norre

Nov 26

#TYPO3 #Crawler has a v12.x branch now. I will start making the main branch compatible with TYPO3 14. I'll drop support for TYPO3 12 in the main branch.

#HappyCrawling

Frehi Nov 26

RewriteCond %{HTTP:Accept-Language} zh
RewriteCond %{HTTP:Connection} "keep-alive, close"
RewriteRule ^.* - [F,L]

Blocks a bot disguising itself as a normal browser.

#bot #crawler #apache
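
To try the rule above against sample requests before deploying it, the same check can be sketched in Python. This is only an approximation of what mod_rewrite does: the function name `would_be_blocked` and the plain-dict header format are assumptions for illustration, not part of the original post.

```python
import re

# Patterns copied from the two RewriteCond lines. mod_rewrite treats them as
# unanchored, case-sensitive regular expressions (no [NC] flag is set).
ACCEPT_LANGUAGE_PATTERN = re.compile(r"zh")
CONNECTION_PATTERN = re.compile(r"keep-alive, close")


def would_be_blocked(headers):
    """Return True when both conditions match, i.e. the rule would answer 403."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in headers.items()}
    # A missing header is treated as an empty string, which matches neither pattern.
    accept_language = normalised.get("accept-language", "")
    connection = normalised.get("connection", "")
    return bool(
        ACCEPT_LANGUAGE_PATTERN.search(accept_language)
        and CONNECTION_PATTERN.search(connection)
    )
```

Both conditions have to match, so a regular browser that sends `Accept-Language: zh-CN` with a plain `Connection: keep-alive` header would not be affected.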

Benny Nov 17

What kind of #Crawler has been roaming my website for the past few days? I don't usually see this many connections on the web server.

Let's see when it's done. According to a check of the IPs: CHINANET, 21ViaNet(China),Inc., Tencent cloud computing (Beijing)
#China #Webcrawler
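
A quick way to see which addresses are behind such a spike is to count requests per client in the access log. The sketch below assumes a log format in which the client address is the first whitespace-separated field (as in the common and combined log formats); the function name `top_clients` is made up for this example.

```python
from collections import Counter


def top_clients(log_lines, limit=10):
    """Count requests per client address, assuming the address is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    # Most frequent clients first.
    return counts.most_common(limit)
```

The resulting addresses can then be looked up with a whois service to find the network operator behind them.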

Downshift 🍁 Nov 17

How to block AI Crawler Bots using robots.txt file

#robots #ai #badbots #crawler

https://www.cyberciti.biz/web-developer/block-openai-bard-bing-ai-crawler-bots-using-robots-txt-file/

Here is how to block generative AI (OpenAI ChatGPT, Google Bard, CCBot Crawler bots) using robots.txt to protect your content.

nixCraft
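
As a rough illustration of the robots.txt approach, the sketch below checks a small rule set with Python's standard `urllib.robotparser`. The rules are an example written for this purpose, not copied from the linked article; `GPTBot` and `CCBot` are the crawler tokens published by OpenAI and Common Crawl, and the helper name `is_allowed` is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Example rules: shut out two AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Keep in mind that robots.txt is only a request: it helps with crawlers that choose to honor it and does nothing against those that ignore it.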

indigo Nov 5

I think what a lot of people don't really realize is how aggressively AI is trying to crawl the web. Forget about robots.txt. If you look at this article, it gives you an idea of how complicated it is to prevent it. https://journal.code4lib.org/articles/18489

Especially when you think about how crawlers are getting extremely sophisticated. Here is another example of how the Duke University Library tried to defend against this aggressive crawling.

https://dukespace.lib.duke.edu/server/api/core/bitstreams/816ef134-55cf-49f6-9a8b-1e8a2324b1ff/content

#ai #web #crawler

The Code4Lib Journal – Mitigating Aggressive Crawler Traffic in the Age of Generative AI: A Collaborative Approach from the University of North Carolina at Chapel Hill Libraries

Just Jim Oct 30

"Mitigating Aggressive #Crawler Traffic in the Age of Generative #AI: A Collaborative Approach from the University of North Carolina at Chapel Hill #Libraries" - code{4}lib Journal

#bot #code4lib

https://journal.code4lib.org/articles/18489