Quick Lunchtime Hack: Reducing AI Bots
So yeah, one of my little pet servers (as opposed to a hugely well-resourced work box) decided to unleash the OOM killer and whack my long-running tmux session just before I went and made some lunch, which was rude.
A bit of digging suggested a bunch of AI crawlers were responsible, and since I'm hosting a bunch of little domains and sites for people on this machine, I wanted to put something in place to reduce the amount of traffic from these sources.
I poked around and found a nice list of known scummy user agents over on the ai.robots.txt project, so I downloaded a copy, then whipped up a quick little something to turn its robots.txt file into a little snippet of Apache config (because I'm not hiding said websites behind haproxy or Caddy or anything; see above, little pet server).
```shell
#!/usr/bin/bash

input_file="$1"

if [[ ! -f "$input_file" ]]; then
    echo "No input_file '$input_file'"
    exit 1
fi

# Pull out every User-agent line, strip spaces and anything that could
# break out of the regexp, then join the names together with '|'
ua_regexp=$( grep "^User-agent:" "$input_file" \
    | cut -d: -f 2 \
    | tr -d ' ' \
    | tr -c -d 'a-zA-Z0-9/\._\-\n' \
    | tr '\n' '|' \
    | sed -e 's/|$//' )

# Emit an Apache snippet that tags matching requests and then denies them
if [[ -n "$ua_regexp" ]]; then
    cat <<EOF
BrowserMatchNoCase ".*(${ua_regexp}).*" is_an_ai
<LocationMatch ".*">
    <RequireAll>
        Require all granted
        Require not env is_an_ai
    </RequireAll>
</LocationMatch>
EOF
fi
```

Then I can just IncludeOptional the file I generate from that into my vhosting macro, and it's a one-liner that blocks at least quite a lot of AI scrapers.
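To illustrate, here's the extraction pipeline run against a tiny made-up input (the bot names here are just examples; the real list in ai.robots.txt is much longer):

```shell
# A made-up three-bot robots.txt in the same shape as the real one
cat > /tmp/sample-robots.txt <<'EOF'
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
EOF

# Same pipeline as the script: grab the agents, sanitise, join with '|'
ua_regexp=$( grep "^User-agent:" /tmp/sample-robots.txt \
    | cut -d: -f 2 | tr -d ' ' \
    | tr -c -d 'a-zA-Z0-9/\._\-\n' \
    | tr '\n' '|' | sed -e 's/|$//' )

echo "$ua_regexp"
# → GPTBot|CCBot|Google-Extended
```

That alternation ends up inside the BrowserMatchNoCase pattern, so any request whose User-Agent header contains one of those names gets the is_an_ai env var and is refused by the RequireAll block.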
Nice :) Now just to see if it helps reduce the OOMing or if something else is contributing.
#code #complaining #internet