Mastodawn

I now have a multi-tiered approach to blocking AI bots on my infrastructure:

1) robots.txt - Ha, they don't fucking care.
2) iocaine -> https://iocaine.madhouse-project.org/ (poisons the bot with never-ending HTTP content)
3) HTTP 426 for any HTTP/1* requests (tells legit browsers to upgrade to HTTP/2+)
4) Anubis -> https://anubis.techaro.lol/ (requires JavaScript proof-of-work)
5) Injecting kill strings as HTTP headers
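
A minimal sketch of layer 3, assuming the server exposes the request's protocol version as a string such as "HTTP/1.1". The function name `upgrade_required` and the response body are made up for illustration; the `Upgrade` and `Connection` headers follow the 426 example in RFC 9110, and "HTTP/2.0" as the advertised protocol is an assumption about what this setup would send.

```python
from typing import List, Optional, Tuple

Response = Tuple[str, List[Tuple[str, str]], bytes]


def upgrade_required(server_protocol: str) -> Optional[Response]:
    """Return a 426 response for HTTP/1.x requests, or None to let the request through."""
    if server_protocol.upper().startswith("HTTP/1"):
        headers = [
            # A 426 response must name the protocol(s) the client should switch to.
            ("Upgrade", "HTTP/2.0"),
            ("Connection", "Upgrade"),
            ("Content-Type", "text/plain; charset=utf-8"),
        ]
        body = b"This site requires HTTP/2 or newer.\n"  # hypothetical wording
        return "426 Upgrade Required", headers, body
    # HTTP/2 and later fall through to the next layer.
    return None
```

In practice this check would more likely live in the reverse proxy configuration than in application code; the function only shows the decision.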

The next layer is going to be prompt injection attacks in every resource served, via comments in all the documents.

This is war.

#fuck_ai #fuck_with_ai #ai

Simon Dassow Mar 15

@reyjrar 6) Block offenders on a network/ASN level.

Brad L.

@simondassow As @ainmosni mentioned, AI scrapers are using proxy services to come from residential IP space when you block their ASN/IP blocks. There are companies like Zscaler that provide access to residential proxies under the guise of legitimacy.

I had to take a layered approach. The robots.txt tells them to go away. If they don't, but they identify themselves, they get fed to iocaine. If they fake their UA (Anthropic and OpenAI do) and come in over HTTP/1*, they get asked to upgrade to HTTP/2+. If they manage that (most do not), then Anubis does a PoW. (I am considering behavior-based alternatives to Anubis.) Finally, all my sites inject kill string headers into every response.
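
The routing described above can be sketched as a single decision function. This is an illustration under assumptions, not the actual configuration: the `Request` and `Action` types are hypothetical, and the user-agent tokens are a small illustrative list of crawlers that identify themselves. The kill string headers are left out because they are added to every response regardless of which branch is taken.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    POISON = "feed to iocaine"
    UPGRADE_REQUIRED = "respond with HTTP 426"
    PROOF_OF_WORK = "hand off to Anubis"


# Illustrative, incomplete list of self-identifying crawler tokens (lowercase).
DECLARED_AI_AGENTS = ("gptbot", "claudebot", "ccbot")


@dataclass
class Request:
    user_agent: str
    protocol: str  # e.g. "HTTP/1.1" or "HTTP/2.0"


def route(req: Request) -> Action:
    """Pick the layer that handles a request that has already ignored robots.txt."""
    ua = req.user_agent.lower()
    # Bots that identify themselves get the endless poisoned content.
    if any(token in ua for token in DECLARED_AI_AGENTS):
        return Action.POISON
    # Anything still speaking HTTP/1.x is asked to upgrade.
    if req.protocol.upper().startswith("HTTP/1"):
        return Action.UPGRADE_REQUIRED
    # Whatever is left has to pass the proof-of-work challenge.
    return Action.PROOF_OF_WORK
```

The checks run from cheapest to most expensive, so most unwanted traffic is turned away before the proof-of-work step is reached.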

Unfortunately, I can't find a curated list of AI kill strings, and in testing, not every AI platform sees the headers. It looks like Anthropic and OpenAI have a non-AI layer that does the fetching over HTTP/1.1 and returns just the body of the response to the agent. If I were designing that layer, I would strip all non-visible content from the body, so I'm not sure fake content added in HTML comments will make it through to the agents themselves.
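
To make that concern concrete, here is a rough sketch of what such a fetch layer could do before handing a page to the agent. It is a guess at the design, not a description of how any vendor's fetcher works. It uses Python's standard `html.parser`; the class and function names are hypothetical. Comments are dropped because the parser's default comment handler does nothing, and text inside a few non-rendered elements is skipped. Content hidden with CSS is not detected.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping comments and non-rendered elements."""

    HIDDEN = {"script", "style", "template", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self.parts = []
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep only text that sits outside hidden elements.
        if not self._hidden_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

If a fetcher works roughly like this, anything placed in an HTML comment never reaches the model, which is why injected text would probably need to sit in rendered content to have any effect.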

I think it would be interesting to extend iocaine to perform prompt injections at that layer. AFAICT, most AI companies _try_ to scrape the site honestly first. If that fails, they gradually add more and more obfuscation because they need that sweet, sweet content.

Simon Dassow Mar 16

@reyjrar @ainmosni My point was that only specific ASNs owned by the offending companies would be affected by the block, which won't include residential ranges, but possibly cloud ranges.

Simon Dassow Mar 16

@reyjrar @ainmosni It could be another layer, not saying it should replace other defences.
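
A minimal sketch of the network-level layer suggested in this thread, assuming the prefixes announced by an offending ASN have already been looked up elsewhere. The prefixes below are reserved documentation ranges used as placeholders, not real operator ranges, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder prefixes (documentation ranges); replace with the prefixes
# actually announced by the ASNs you want to block.
BLOCKED_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked prefix."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        addr.version == net.version and addr in net
        for net in BLOCKED_PREFIXES
    )
```

For example, `is_blocked("192.0.2.10")` is true and `is_blocked("203.0.113.5")` is false. As noted earlier in the thread, this does nothing against scrapers that come in through residential proxies, so it can only be one more layer.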