Almost 29 million requests from AI crawlers defeated by essentially one simple check: if the user agent contains Chrome/ or Firefox/ and there's no Sec-Fetch-Mode header, it goes into the maze.

Billions of dollars poured into AI, yet their crawlers are broken by two ifs in an nginx config.

If this all wasn't so sad, I'd laugh.
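The "two ifs" check above can be sketched in nginx roughly like this. This is my reconstruction under stated assumptions, not the author's actual config; the `/maze` location and the map variable name are hypothetical, and real browsers that identify as Chrome or Firefox always send `Sec-Fetch-Mode`, which is the property the check relies on:

```nginx
# Flag user agents that claim to be Chrome or Firefox.
map $http_user_agent $claims_browser {
    default     0;
    "~Chrome/"  1;
    "~Firefox/" 1;
}

server {
    listen 8080;

    location / {
        set $suspect "";
        if ($claims_browser) {
            set $suspect "ua";
        }
        # $http_sec_fetch_mode is empty when the header is absent.
        if ($http_sec_fetch_mode = "") {
            set $suspect "${suspect}+nosfm";
        }
        # Claims to be a browser, but sends no Sec-Fetch-Mode: into the maze.
        if ($suspect = "ua+nosfm") {
            rewrite ^ /maze$uri last;   # /maze is a hypothetical location
        }
        # ... normal handling ...
    }
}
```

Since nginx `if` blocks can't be combined with a boolean AND directly, the sketch accumulates both conditions into one variable, which is a common workaround for this limitation.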

Post by iocaine powder, @[email protected]

#iocaine has been up for 4 days 15h 5m 28s, and spent 1 day 6h 26m 31s dealing with - *gestures hands wildly* - [everything](https://monitor.madhouse-project.o…

come-from.mad-scientist.club

@algernon Thanks! I added this to my HAProxy configuration, in a named defaults section for HTTP(S) services, and it works:

```
acl firefox-or-chrome hdr_sub(User-Agent) -i 'Chrome/'
acl firefox-or-chrome hdr_sub(User-Agent) -i 'Firefox/'
# note: empty-sfm is true when the header *is* present,
# so !empty-sfm below means the header is absent
acl empty-sfm req.fhdr(Sec-Fetch-Mode) -m found
http-request silent-drop rst-ttl 60 if firefox-or-chrome !empty-sfm
```

I wonder if the following user agents are legit 🤔

```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582
Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1
```

The client using these user agents has Sec-Fetch-Mode header values as well.

@ayushnix Edge/18 is... not likely to be a legit browser. Edge/18 was the legacy EdgeHTML-based Edge from 2018, which has long been EOL; the Chromium-based Edge that replaced it started at version 79 in January 2020, and it identifies with an Edg/ token rather than Edge/. So a client sending Edge/18 today is almost certainly not a real browser.

In the case of Edge, I'd have a look at the sec-ch-ua header and see if it contains "Chromium" and "Microsoft Edge". The problem with sec-ch-ua, though, is that the best way to use it is to check whether it is parseable, and that's hard to do within an HAProxy config, I think.

Safari 14 (signified by the Version/ component) was released in 2020, along with iOS 14. That could be a legit browser, but iOS 14 has been EOL since 2021. My gut feeling is that even if it is a legit browser, it is unlikely to be human operated.

@algernon In a similar vein, does it make sense to reject clients containing Chrome/ in their user-agent but without Sec-CH-UA?

@ayushnix It does, yes, but: in my experience, the Chrome-pretenders that do not send Sec-CH-UA do not send Sec-Fetch-Mode either, so they're already caught by that.

Those that do send Sec-Fetch-Mode will also send Sec-CH-UA. However! I'm seeing a significant number of Sec-CH-UA headers that are either incorrect (missing components that should be there based on the user agent) or don't even parse.

Also, if JavaScript is disabled, Chrome will not send the Sec-CH-UA header, but will send Sec-Fetch-Mode, so the latter is more reliable.

So the Sec-CH-UA check I'd recommend is not the verification that it contains Chromium when the user agent has a Chrome/ component, but simply that it can be parsed. In other words: if there's a Sec-CH-UA header, and it fails to parse, it's a bot. If there's a Sec-CH-UA header and it doesn't contain components that should be there based on the user agent, it's also very likely a bot. But the header being absent does not necessarily imply the agent is a pretender.

@algernon Got it, thanks a lot!

I was skeptical about self-hosting services on my residential broadband connection, but ever since I started implementing basic measures like banning IP addresses, banning the UAs in ai.robots.txt, and adding the Sec-Fetch-Mode check, I'm seeing a lot of bot connections getting dropped in my router firewall and my HAProxy logs.

Looking forward to the iocaine 3.0 release and the upcoming blog post you mentioned you'll write!