Jonathan Corbet Jan 21, 2025

Should you be wondering why @LWN #LWN is occasionally sluggish... since the new year, the DDoS onslaught from AI-scraper bots has picked up considerably. Only a small fraction of our traffic is serving actual human readers at this point. At times, some bot decides to hit us from hundreds of IP addresses at once, clogging the works. They don't identify themselves as bots, and robots.txt is the only thing they *don't* read off the site.

This is beyond unsustainable. We are going to have to put time into deploying some sort of active defenses just to keep the site online. I think I'd even rather be writing about accounting systems than dealing with this crap. And it's not just us, of course; this behavior is going to wreck the net even more than it's already wrecked.

Happy new year :)

Adelie Jan 21, 2025

@corbet @LWN

"Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick."

If you're using iptables, ipset can block individual IPs (hash:ip) and subnets (hash:net).

Just set it up last night for my much-smaller-traffic instances, feel free to DM

https://ipset.netfilter.org/
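As a rough illustration of the hash:ip / hash:net split mentioned above, here is a minimal Python sketch that turns a mixed list of addresses and CIDR blocks into input for `ipset restore`. The set names `blocked_ips` and `blocked_nets` and the function name are made up for this example, and only IPv4 is handled; it is not the setup described in the post.

```python
import ipaddress


def ipset_restore_lines(entries, ip_set="blocked_ips", net_set="blocked_nets"):
    """Build `ipset restore` input from a mix of single addresses and CIDR blocks.

    The set names are hypothetical; pick whatever fits your firewall rules.
    """
    lines = [
        f"create {ip_set} hash:ip -exist",
        f"create {net_set} hash:net -exist",
    ]
    for entry in entries:
        # strict=False normalizes "203.0.113.77/24" to its network address.
        net = ipaddress.ip_network(entry, strict=False)
        if net.version != 4:
            continue  # IPv6 would need separate sets created with "family inet6"
        if net.prefixlen == 32:
            # A single host goes into the hash:ip set.
            lines.append(f"add {ip_set} {net.network_address} -exist")
        else:
            # Anything wider goes into the hash:net set.
            lines.append(f"add {net_set} {net} -exist")
    return lines


print("\n".join(ipset_restore_lines(["198.51.100.7", "203.0.113.77/24"])))
```

The output can be piped into `ipset restore`, and a rule such as `iptables -I INPUT -m set --match-set blocked_nets src -j DROP` would then drop traffic from every listed subnet.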

Jonathan Corbet

@adelie @LWN Blocking a subnet is not hard; the harder part is figuring out *which* subnets without just blocking huge parts of the net as a whole.
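One simple heuristic for the "which subnets" question is to count how many distinct offending addresses fall into each fixed-size block and only flag blocks above a threshold. The sketch below assumes a /24 block size and a threshold of three hosts, both arbitrary choices for illustration; nothing in the thread says this is how LWN approaches it.

```python
import ipaddress
from collections import defaultdict


def busy_subnets(addresses, prefix_len=24, min_hosts=3):
    """Return {subnet: distinct host count} for subnets with >= min_hosts offenders.

    prefix_len and min_hosts are illustrative defaults, not recommendations.
    """
    hosts = defaultdict(set)
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        if ip.version != 4:
            continue  # keep the sketch IPv4-only
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        hosts[net].add(ip)  # a set, so repeat hits from one host count once
    return {str(net): len(ips) for net, ips in hosts.items() if len(ips) >= min_hosts}


offenders = ["203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.3", "198.51.100.9"]
print(busy_subnets(offenders))
```

Raising `min_hosts` trades missed scrapers for fewer legitimate readers caught in a blocked range, which is the tension described above.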

K. Ryabitsev-Prime 🍁 Jan 21, 2025

@corbet @adelie @LWN I have been using pyasn to block entire subnets. It's effective, but only in the same way carpet bombing is. I'm sure I've blocked legitimate systems, but c'est la vie.
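A minimal sketch of what prefix-level grouping might look like follows. It takes the lookup as a parameter so it runs without a routing database; pyasn's `pyasn.pyasn(db_path).lookup(ip)` returns an `(asn, prefix)` pair of that shape. The `fake_lookup` table, the AS numbers in it, and the function names are invented for the example, and this is not the poster's actual script.

```python
from collections import defaultdict


def group_by_prefix(addresses, lookup):
    """Group addresses by the announced prefix that `lookup` reports.

    `lookup` is any callable returning (asn, prefix) for an IP string.
    A prefix of None is assumed to mean "no covering route found".
    """
    groups = defaultdict(list)
    for addr in addresses:
        asn, prefix = lookup(addr)
        if prefix is None:
            continue
        groups[(asn, prefix)].append(addr)
    return dict(groups)


def fake_lookup(addr):
    """Hypothetical stand-in for a real ASN database lookup."""
    table = {
        "203.0.113.": (64500, "203.0.113.0/24"),
        "198.51.100.": (64501, "198.51.100.0/24"),
    }
    for stem, result in table.items():
        if addr.startswith(stem):
            return result
    return (None, None)


print(group_by_prefix(["203.0.113.5", "203.0.113.9", "192.0.2.1"], fake_lookup))
```

Blocking every prefix that turns up this way is the carpet-bombing trade-off described above: one noisy address can take a whole announced range with it.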

Adelie Jan 21, 2025

@corbet @LWN

Probably a good question for the fedi as a whole. I started with any 40x response in my logs, added any Spamhaus hits from my mail server, and any user-agents with "bot" in the name. Plus Facebook in particular has huge IPv4 blocks just for scraping, also easy to block.
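The first two signals (40x responses and "bot" in the user-agent) can be pulled out of a web server log with a few lines of Python. This sketch assumes the common "combined" access log format and invents the function name and sample lines; the Spamhaus step is not covered, and the poster's real pipeline may look quite different.

```python
import re

# Assumes the "combined" log format:
# addr ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)


def suspect_addresses(lines):
    """Collect client addresses that got a 40x status or sent a bot-like agent."""
    suspects = set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        addr, status, agent = match.groups()
        if status.startswith("40") or "bot" in agent.lower():
            suspects.add(addr)
    return suspects


sample_log = [
    '203.0.113.5 - - [21/Jan/2025:10:00:00 +0000] "GET /x HTTP/1.1" 404 153 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [21/Jan/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '192.0.2.44 - - [21/Jan/2025:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(sorted(suspect_addresses(sample_log)))
```

A single 404 is a weak signal on its own, so in practice a count threshold per address would probably be needed before blocking anything.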

Adelie Jan 21, 2025

@corbet @LWN

Also tarpits! And nepenthes and nepenthes-adjacent tech!

https://tldr.nettime.org/@asrg/113867412641585520

https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62fdded3

ASRG (@asrg@tldr.nettime.org)

## **Sabot in the Age of AI**

A list of offensive methods & strategic approaches for facilitating (algorithmic) sabotage, framework disruption, & intentional data poisoning.

### **Selected Tools & Frameworks**

- **Nepenthes**: [Endless crawler trap.](https://zadzmo.org/code/nepenthes)
- **Babble**: [Standalone LLM crawler tarpit.](https://git.jsbarretto.com/zesterer/babble)
- **Markov Tarpit**: [Traps AI bots & feeds them useless data.](https://git.rys.io/libre/markov-tarpit)
- **Sarracenia**: [Loops bots into fake pages.](https://github.com/CTAG07/Sarracenia)
- **Antlion**: [Express.js middleware for infinite sinkholes.](https://github.com/shsiena/antlion)
- **Infinite Slop**: [Garbage web page generator.](https://code.blicky.net/yorhel/infinite-slop)
- **Poison the WeLLMs**: [Reverse proxy for LLM confusion.](https://codeberg.org/MikeCoats/poison-the-wellms)
- **Marko**: [Dissociated Press CLI/lib.](https://codeberg.org/timmc/marko/)
- **django-llm-poison**: [Serves poisoned content to crawlers.](https://github.com/Fingel/django-llm-poison)
- **konterfAI**: [Model-poisoner for LLMs.](https://codeberg.org/konterfai/konterfai)
- **Quixotic**: [Static site LLM confuser.](https://marcusb.org/hacks/quixotic.html)
- **toxicAInt**: [Replaces text with slop.](https://github.com/portasynthinca3/toxicaint)
- **Iocaine**: [Defense against unwanted scrapers.](https://iocaine.madhouse-project.org)
- **Caddy Defender**: [Blocks bots & pollutes training data.](https://defender.jasoncameron.dev)
- **GzipChunk**: [Inserts compressed junk into live gzip streams.](https://github.com/gw1urf/gzipchunk)
- **Chunchunmaru**: [Go-based web scraper tarpit.](https://github.com/BrandenStoberReal/Chunchunmaru)
- **IED**: [ZIP bombs for web scrapers.](https://github.com/NateChoe1/ied)
- **FakeJPEG**: [Endless fake JPEGs.](https://github.com/gw1urf/fakejpeg)
- **Pyison**: [AI crawler tarpit.](https://github.com/JonasLong/Pyison)
- **HalluciGen**: [WP plugin that scrambles content.](https://codeberg.org/emergentdigitalmedia/HalluciGen)
- **Spigot**: [Hierarchical Markov page generator.](https://github.com/gw1urf/spigot)

---

*This is a living resource, regularly updated to reflect the shifting terrain of collective techno-disobedience and algorithmic Luddism.*

Adelie Jan 21, 2025

@corbet @LWN You know, what we need is a clearinghouse for this like there are for piholes and porn and such. Could someone with some followers get #AIblacklist trending?

Post your subnets with that hashtag. If we get any traction, I'll host the list.