building a little log analyser thing and testing it out on my site and blog

holy shit, people are not joking about the AI bots. 92.2% of my total site traffic is from eight user agents directly affiliated with AI. another 6.1% is from SEO-related companies that have some sort of AI offering.

only 0.3% of my traffic comes from a regular browser. most of the remaining 1.4% is fedi servers pulling previews, plus some RSS readers grabbing posts.

I don't get a ton of traffic there anyway; it's not like thousands of people read my blog. but wow, that's far more bleak than I ever imagined.
one thing I did notice: there are quite a few user agents pointing to bots that purport to be news aggregators or similar types of sites. when you check the site out, it has a flashy facade that looks like some sort of "collect your favourite stories and news sources" thing, except there's no signup, no login, nothing. and when you look up the company owners, they've got AI stuff all over their linkedin. almost certain these are just fronts for training data collection.
@gsuberland your stats track what I’m seeing

@fbarton @gsuberland Ditto. They don't even seem to check that the stuff they're pulling down has changed since last time. I was burning 10 gig a month serving exactly the same stuff to the same bots.

I've got some mitigation in place now.
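
for context on the re-downloading: a well-behaved crawler would use HTTP conditional requests, so unchanged pages cost a 304 and no response body. a minimal sketch of what the bots *should* be doing, in Python with the requests library (the URL is a placeholder, and it assumes the server sends ETag/Last-Modified headers):

```python
# Sketch of a polite conditional fetch. Assumes the server emits
# ETag and/or Last-Modified validators; the URL is a placeholder.
import requests

url = "https://example.com/feed.xml"

# First fetch: remember the validators the server hands back.
first = requests.get(url)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# Later fetch: send them back; unchanged content returns 304 with no body.
headers = {}
if etag:
    headers["If-None-Match"] = etag
if last_modified:
    headers["If-Modified-Since"] = last_modified

second = requests.get(url, headers=headers)
if second.status_code == 304:
    print("not modified; nothing re-downloaded")
else:
    print(f"changed; got {len(second.content)} bytes")
```

the bots described above skip all of this and just re-fetch the full body every time.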

@dtl @gsuberland ya’know… if you’re paying for egress, that’s very fair. Self-hosting means that’s a lot less important to me
@gsuberland how does an amateur do the same check, and then block the bots?

@europlus for blocking, search "apache2 block user agent" or "nginx block user agent" depending on which web server you're using. tons of guides. feed in a list.

there's a bad bot list here: https://raw.githubusercontent.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/master/_generator_lists/bad-user-agents-htaccess.list
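
as a rough illustration of what "feed in a list" ends up looking like on apache (a sketch, not a drop-in config — the bot names below are placeholders; paste real entries from the list above):

```apache
# .htaccess sketch: return 403 when the User-Agent matches a known bad bot.
# "BadBot" and "EvilScraper" are placeholders; use entries from the list above.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
```

the [NC] flag makes the match case-insensitive and [F] sends the 403.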

I mentioned more here: https://chaos.social/@gsuberland/114611003632638741

re: doing the analysis, no idea. I'm writing my own thing for simple visualisation because most of the "proper" tools for doing it are cumbersome.
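
for a sense of scale, the parsing core of a quick-and-dirty analyser like that can be tiny. a sketch in Python, assuming the standard nginx/apache "combined" log format and a placeholder access.log path:

```python
# Minimal user-agent tally over an access log in "combined" format.
# The log path is a placeholder; real tools would also group related UAs.
import re
from collections import Counter

# In combined format the user agent is the last double-quoted field
# on each line, after the referer.
UA_RE = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open("access.log") as f:  # placeholder path
    for line in f:
        m = UA_RE.search(line)
        if m:
            counts[m.group(1)] += 1

total = sum(counts.values())
for ua, n in counts.most_common(15):
    print(f"{100 * n / total:5.1f}%  {n:6d}  {ua[:80]}")
```

a real visualiser would bucket the AI user agents together and chart the percentages, but the parsing problem is about this big.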

@gsuberland thanks! Sorry to have been lazy, I’m a little overwhelmed right now.
@europlus all good :)

@gsuberland @europlus have you tried Anubis?
https://github.com/TecharoHQ/anubis
Heard it's pretty effective at blocking the bots, because it doesn't rely on user agents.

I have yet to try it myself; it's in the big pile of "things I want to try"...

@hannsr @europlus it is good, but not compatible with my specific requirements on my site (zero server side code, zero JS)
@hannsr @gsuberland @europlus it does use the UA; it only triggers the challenge when the user agent string contains "Mozilla".
That supposedly matches most LLM scrapers, but it also definitely matches most of a site's legitimate visitors, since every mainstream browser engine either is Mozilla or claims Mozilla compatibility in its UA string.
@gsuberland OK, I’ve set up a useragent.rules file following https://www.xmodulo.com/block-specific-user-agents-nginx-web-server.html, using the lists you provided. It seems to be working and is already picking up some hits 🤞🏻
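
for anyone replicating this setup: the file from that guide is an nginx map block plus a return rule. a sketch of what useragent.rules might contain (the bot names below are illustrative examples only; populate it from the bad-bot lists above):

```nginx
# useragent.rules — sketch following the xmodulo map approach.
# Bot names here are examples; fill in real entries from the lists above.
map $http_user_agent $badagent {
    default        0;
    ~*GPTBot       1;
    ~*CCBot        1;
    ~*Bytespider   1;
}
```

the include for this file goes inside the http block, and each server block then needs something like if ($badagent) { return 403; } to actually reject the matched requests.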