OpenAI's crawler just found our family server / cloud services and immediately proceeded to crash Nextcloud within minutes. Fucking fantastic.

Is there some nice, up-to-date write-up on the different tools to protect yourself against this?
#AI #AISlop #AttackOfTheMachines #selfHosting

@Natanox most use iodine nowadays.
@kura @Natanox iocaine
@Natanox @lunareclipse close. sorry. that one it was. i am not using anything for now. ai has mostly spared me up to now.
@kura @Natanox yeah I haven't needed these either because my services are either private and login-walled or just a static site

nevertheless this looks like a good read for the OP
xaselgio.net/posts/26.poisoning-knowledge/
Poisoning the knowledge

A rant about current state of the internet (LLM crawlers), and some observations & conclusions, along with some techniques to help you protect your own services.

Indigo's den

@lunareclipse @kura Ours are also login-walled, but OpenAI crashed our worker threads nevertheless somehow.

Thanks for the link! 

@Natanox load on my server reduced massively once it was no longer reachable via IPv4. Seems the slobsters aren't as progressive as they always claim to be. 😁
@marix @Natanox oh yea, come to think of it when I fixed v4 for my site was when it kicked off...
@Natanox I don't know of a write-up comparison. There's AI.robots.txt, which will give you a list of user agents to block and a robots.txt, for what it's worth. I hear they're just solving JS challenges now, so I'm not sure Anubis will cut it. I'm running iocaine and it seems good: I've had Meta and now Claude stuck in the tarpit for a week plus, but then again that seems nowhere near the DDoS levels others have gotten.
@Natanox oh yeah, with the DDoS-tier stuff people have had substantial servers overwhelmed, so dropping traffic in the network stack might be required. IIRC they usually look up the ASN of the company and add a rule for that. Idk if anyone has wired up layer 7 detection to fail2ban yet.
@arichtman @Natanox it doesn't matter if they solve a POW challenge, the point is slowing them down and making it expensive, much like captchas
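To make the "cheap to check, expensive to solve" point concrete, here is a minimal hash-based proof-of-work sketch. It is not how Anubis or any specific tool implements its challenge; the function names, the `challenge:nonce` format and the difficulty value are assumptions made for illustration only.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # bit_length() of a non-zero byte tells us where its first 1 bit is.
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash, so checking an answer is cheap."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute force, roughly 2**difficulty hashes on average."""
    for nonce in count():
        if verify(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    nonce = solve("demo-challenge", difficulty=12)
    print("nonce:", nonce, "valid:", verify("demo-challenge", nonce, 12))
```

Each extra bit of difficulty roughly doubles the work for the client while the server still does one hash per check, which is why a crawler that solves the challenge still pays for every page it fetches.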
@Natanox iocaine and anubis
@nelle @Natanox (Anubis is probably closer to what you would want to reduce server load specifically)
@lunareclipse @Natanox yeah, iocaine is more about allowing them in and then poisoning the datasets

@nelle @lunareclipse Out of curiosity: Would it also poison "respecting" crawlers that read the robots.txt? 🤔

I noticed an *immense* difference in company behaviour between the US shitheads (which all behave maliciously), data brokers (also highly maliciously) and e.g. EU companies who seem to actually abort if instructed.

In general I'm more than happy to serve some malicious compliance to those dickheads, our server should have the resources for that.

@Natanox It's a bit of work, but I'd suggest something like #NetBird or #tailscale to keep your private things private.

The only real downside I see so far is that on mobile devices (iOS in my case) it increases battery consumption to a noticeable degree.

@Fishd Not all family members are inclined to install these tools everywhere, and it would cause e.g. Nextcloud password-protected Share Links to stop working for anyone we want to send things to.

I'll probably go with something like Anubis + iocaine instead.

@Natanox Fair points.

My concern with those tools is that you're just playing whack-a-mole ... and your opponent has more resources than you and is sufficiently motivated (by way of investment capital) to defeat you.

Similarly for those folks suggesting fighting back by 'poisoning the well' ... that assumes you've the spare compute power and significant energy supplies.

@Fishd Also fair points.

Though these tools that fight back are a community effort, so there's a lot of brainpower going in there as well. Basically the giant well-funded army with only a few bright minds against a motivated guerilla force, but digital.

It's a shitty situation, but for now I want to keep our infrastructure as accessible as possible to convince some family members that corpo clouds indeed aren't inherently better.

@Natanox @Fishd
Robots.txt doesn't help?
OpenAI even seems to publish lists of IP addresses so you can block it: https://developers.openai.com/api/docs/bots
Overview of OpenAI Crawlers
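If you do go the IP-list route, checking a client address against a set of published ranges is simple with Python's standard `ipaddress` module. This is only a sketch: the ranges below are reserved documentation blocks used as placeholders, not OpenAI's real ranges, and `BLOCKED_RANGES`, `parse_networks` and `is_blocked` are made-up names. In practice you would load the current list from the publisher and enforce it in the firewall or reverse proxy rather than in application code.

```python
import ipaddress

# Placeholder CIDR ranges (reserved documentation blocks), NOT a real crawler list.
BLOCKED_RANGES = ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"]


def parse_networks(cidrs):
    """Turn CIDR strings into network objects once, up front."""
    return [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]


def is_blocked(client_ip, networks):
    """Return True if client_ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr.version == net.version and addr in net for net in networks)


if __name__ == "__main__":
    networks = parse_networks(BLOCKED_RANGES)
    for ip in ("192.0.2.17", "203.0.113.5", "2001:db8::1"):
        print(ip, "blocked" if is_blocked(ip, networks) else "allowed")
```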

@FoxVK @Fishd Now that AI crawlers have found us I'd rather go directly with things like Anubis. Even *if* some of them care enough, plenty of them do not, or even actively circumvent blocks (e.g. Anthropic is known for that).

To be fair though, I didn't check the default robots.txt of the Nextcloud docker before, so this might've been preventable. But at least now I know the bad actors will get to us soon.

@Natanox it/they haven't got round to mine yet but I too will be interested to see what answers there are, as it's only a matter of time. My family/friends Nextcloud gets the top 'A+' security rating from the Nextcloud scanner thing web page but beyond that what to do I don't know either.

@nigelharpur I learned today that the robots.txt that comes with a default Nextcloud install (at least in a docker) is perfectly basic. I guess a basic measure would be to replace that one with a robots.txt that includes all the anti-AI stuff telling OpenAI and others to screw off.
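A rough sketch of generating such a robots.txt group is below. The agent names are a small illustrative sample, not a complete or maintained list (projects like the AI.robots.txt one mentioned earlier in the thread track those), and `build_ai_block` is a made-up helper name. It only helps against crawlers that actually honour robots.txt.

```python
# Small illustrative sample of crawler user-agent tokens; a maintained list
# will be much longer and change over time.
AI_USER_AGENTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "CCBot"]


def build_ai_block(agents):
    """Build one robots.txt group that disallows everything for the given agents.

    Several User-agent lines may share a single rule set, so all agents are
    listed first and followed by one Disallow line.
    """
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Meant to be appended to whatever robots.txt the install already ships.
    print(build_ai_block(AI_USER_AGENTS), end="")
```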

Other than that most answers mentioned Anubis and its alternatives as well as poisoning tools like iocaine.

@Natanox I can't recommend, but will someone please comment with things to look out for?

Obviously, I can look for average traffic off the chart, but does the creepy-crawler traffic come from one IP address, or a narrow range? Does it deliberately ignore robots.txt ? These things could trigger a goooo_sloooow, or return a "Go stick your head in a pig" page, or something.
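One way to start answering those two questions from your own logs is sketched below. It assumes you have already parsed the access log into `(ip, path)` pairs; the helper names and the /24 and /48 grouping sizes are arbitrary choices for illustration. Note that a hit on a disallowed path is only a hint, since humans and well-behaved tools can end up there too.

```python
import ipaddress
from collections import Counter


def prefix_of(ip, v4_bits=24, v6_bits=48):
    """Map an address to its surrounding network, e.g. 192.0.2.17 -> 192.0.2.0/24."""
    addr = ipaddress.ip_address(ip)
    bits = v4_bits if addr.version == 4 else v6_bits
    return str(ipaddress.ip_network(f"{addr}/{bits}", strict=False))


def requests_per_prefix(requests):
    """Count requests per network, to see whether traffic comes from a narrow range."""
    return Counter(prefix_of(ip) for ip, _path in requests)


def hits_on_disallowed(requests, disallowed_prefixes):
    """Return the IPs that requested a path robots.txt tells crawlers to avoid."""
    return {
        ip
        for ip, path in requests
        if any(path.startswith(prefix) for prefix in disallowed_prefixes)
    }


if __name__ == "__main__":
    # Made-up sample data using reserved documentation addresses.
    sample = [
        ("192.0.2.1", "/"),
        ("192.0.2.9", "/private/a"),
        ("198.51.100.3", "/"),
    ]
    print(requests_per_prefix(sample).most_common())
    print(hits_on_disallowed(sample, ["/private/"]))
```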

@Natanox Anubis is great, also try blocking data center IPs and known bot IPs. I am currently on a train, but as soon as I get home I’ll post links to some useful services

@Natanox here are the resources I promised:

Database of IP reputation: https://nerd.cesnet.cz/

Lots of useful categorized IP lists: https://iplists.firehol.org/

https://www.blocklist.de/en/index.html

NERD - Network Entity Reputation Database