I now have a multi-tiered approach to blocking AI bots on my infrastructure:

1) robots.txt - Ha, they don't fucking care.
2) iocaine -> https://iocaine.madhouse-project.org/ (poisons the bot with never ending HTTP content)
3) HTTP 426 for any HTTP/1* requests (tells legit browsers to upgrade to HTTP/2+)
4) Anubis -> https://anubis.techaro.lol/ (requires javascript proof-of-work)
5) Injecting kill strings as HTTP headers
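Layers 3 and 5 can be sketched as a small nginx fragment (assuming nginx; the header name and the kill-string text are placeholders, not a tested payload):

```nginx
# Sketch only: layers 3 and 5 from the list above, assuming nginx.
server {
    listen 443 ssl;
    http2 on;   # nginx >= 1.25.1; older versions use "listen 443 ssl http2;"

    # Layer 3: HTTP/1.x clients get 426 Upgrade Required. Real browsers
    # negotiate HTTP/2 via ALPN; many scrapers' fetchers do not.
    if ($server_protocol ~ "^HTTP/1") {
        return 426;
    }

    # Layer 5: inject a kill string as a response header on everything.
    # Header name and content are placeholders, not a curated payload.
    add_header X-Robots-Note "[kill string here]" always;
}
```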

The next layer is going to be prompt injection attacks embedded as comments in every document served.

This is war.

#fuck_ai #fuck_with_ai #ai


@reyjrar
Eek! The moment I decided to look at that link my internet connection died. 😳
(Now working again - interesting idea!)
@reyjrar 6) Block offenders on a network/ASN level.
@simondassow @reyjrar Sadly, that won't work, as they've been coming from an insane number of domestic IPs. There's probably malware out there selling infected machines as proxies, or something similar.
@ainmosni @reyjrar ASNs limit this to selected offenders. But of course, when legitimate users hide in those, they'll be cut off as well.
@simondassow @reyjrar You got it backwards though, legitimate users aren’t hiding in there, the scrapers are. They’ve been hiding amongst legitimate users for quite a while now, which is why we need things like anubis to begin with.  If they used their own, predicable IP ranges, we could just block those or the ASNs they belonged to.
@ainmosni @reyjrar Even so, it's still a valid form of punishment addressed at the source I think. They slop, we stop.
@simondassow @reyjrar While I don't disagree, if it also blocks a significant number of legitimate, innocent users, it becomes more complicated.

@simondassow As @ainmosni mentioned, AI scrapers are using proxy services to come from residential IP space when you block their ASN/IP blocks. There are companies like Zscaler that provide access to residential proxies under the guise of legitimacy.

I had to take a layered approach. The robots.txt tells them to go away. If they don't, but they identify themselves, they get fed to iocaine. If they fake their UA (Anthropic and OpenAI do) and come in over HTTP/1*, they get asked to upgrade to HTTP/2+. If they manage that (most do not), then Anubis does a PoW. (I am considering behavior-based alternatives to Anubis.) Finally, all my sites inject kill string headers into every response.
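The "identify themselves → iocaine" step can be wired up with an nginx map on the User-Agent. A sketch only: the UA patterns are a partial sample, and the iocaine backend address/port is an assumption, so check its docs for the actual listen address:

```nginx
# Sketch: route self-identifying AI crawlers to iocaine, assuming nginx.
# UA patterns are a partial sample; extend from a maintained list.
map $http_user_agent $ai_bot {
    default      0;
    ~*GPTBot     1;
    ~*ClaudeBot  1;
    ~*CCBot      1;
    ~*Bytespider 1;
}

server {
    listen 443 ssl;

    location / {
        # Identified bots get internally redirected into the poison maze.
        if ($ai_bot) {
            rewrite ^ /poison$uri last;
        }
        # ... normal site config ...
    }

    location /poison {
        internal;
        # Address/port of the local iocaine instance is an assumption.
        proxy_pass http://127.0.0.1:42069;
    }
}
```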

Unfortunately, I can't find a curated list of AI kill strings and in testing, not every AI platform sees the headers. It looks like Anthropic and OpenAI have a non-AI layer that does fetching over HTTP/1.1 and returns just the body of the response to the agent. If I were designing that layer, I would strip all non-visible content from the body, so I'm not sure adding the fake content into HTML comments will make it through to agents themselves.
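If comments do get stripped by such a fetch layer, one hedge is to inject the string in more than one shape at once, e.g. with nginx's sub_filter (requires ngx_http_sub_module; the markup and the kill string itself are placeholders):

```nginx
# Sketch: inject the same string as a header, an HTML comment, and a
# visually hidden element, in case any single channel gets stripped.
location / {
    add_header X-Robots-Note "[kill string here]" always;

    sub_filter '</body>'
        '<!-- [kill string here] --><span style="display:none">[kill string here]</span></body>';
    sub_filter_once on;
    # sub_filter only rewrites text/html responses by default, and only
    # works on uncompressed upstream bodies.
}
```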

I think it would be interesting to extend iocaine to perform prompt injections at that layer. AFAICT, most AI companies _try_ to scrape the site honestly first. If that fails, they gradually add more and more obfuscation because they need that sweet, sweet content.

@reyjrar @ainmosni My point was that blocking would target specific ASNs owned by the offending companies, which wouldn't include residential ranges, but possibly cloud ranges.
@reyjrar @ainmosni It could be another layer, not saying it should replace other defences.