Alright my web friends! 👋 Hands up who has experienced a surge in (LLM) bot traffic recently and maybe even had to take steps against them? I’m writing a blog post about this atm and it would be great to hear whether others are experiencing the same with their #blogs and personal #websites. #RT == 💚

@matthiasott This is what I currently do https://rknight.me/blog/blocking-bots-with-nginx/

But also yes, I’m seeing more of it. Specifically Chinese bots that go over every single page on my site one by one.

Blocking Bots with Nginx

How I've automated updating the bot list to block access to my site
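For reference, the core of the linked nginx technique is a `map` on the user agent. This is a minimal sketch, not the exact config from the post; the bot names and domain are illustrative:

```nginx
# Map known bot user agents to a flag; the UA substrings here are
# examples, not the full list the linked post automates.
map $http_user_agent $is_blocked_bot {
    default      0;
    ~*GPTBot     1;
    ~*CCBot      1;
    ~*Bytespider 1;
}

server {
    listen 80;
    server_name example.com;  # placeholder domain

    # Refuse any request whose user agent matched above.
    if ($is_blocked_bot) {
        return 403;
    }
}
```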

@matthiasott Hard to tell for my personal sites b/c I don't monitor the traffic/stats at all. But in client projects that have trackers there's been a huge increase in bot-ish visits in recent months, and even those that claim to filter known bots out of their traffic stats have seen a jump in suspicious-looking visits
Denial

The best of the web is under continuous attack from the technology that powers your generative “AI” tools.

@matthiasott Yes, a while ago 99% of our load came from AI crawlers. After updating the robots.txt it became less, but there are still a lot of bad actors out there, and those are just the ones that at least set a user agent to identify themselves. As user-agent blocking becomes more common, I really wonder how bad they'll get
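The `robots.txt` approach mentioned here only works against crawlers that honor it, which is exactly the caveat in the post above. A minimal sketch, with a few commonly seen AI crawler names as examples:

```
# robots.txt — only helps against crawlers that actually honor it.
# The crawler names below are common AI bots, listed as examples.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```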

@matthiasott hell yes! This is a terrible problem on my site codepoints.net, even with CloudFlare in front of it.

On single code point pages I deep-link to my site search for similar code points. Over the last year it got worse and worse: "users" from China on Chrome follow those links and bring the site down with excessive DB load from the search.

I hated having to add rate limiting etc, and I know of at least one legitimate user who was bitten by it.

Such a pest on the open web!

@matthiasott I only heard stories until this week, when I was hit. I have a pet project that's fetching a lot of data and is a bit slow, but perfectly fine for human use. However, I suddenly had both Baidu and Meta's crawlers loading several pages at once, slowing everything down dramatically. Blocking both bots from that section of my site with robots.txt solved the problem... at least for now!
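Scoping the `Disallow` rules to just the slow section, as described above, might look like this. The `/data/` path is a hypothetical stand-in for the slow pages, and the user-agent tokens are the commonly documented ones for Baidu and Meta:

```
# Block only the expensive section, not the whole site.
User-agent: Baiduspider
Disallow: /data/

User-agent: meta-externalagent
Disallow: /data/
```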
@matthiasott Yep. Not as bad as others, but most recently I’ve seen a surge in traffic from Singapore. Will be updating my `robots.txt` list and looking into blocking soon.
@matthiasott As of a few minutes ago I'm now blocking known bots via `.htaccess` (not just `robots.txt`) 🤞
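A minimal sketch of what blocking via `.htaccess` can look like on Apache with mod_rewrite; the UA substrings are illustrative, not the poster's actual list:

```
# .htaccess — deny matching user agents before anything else is served.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Bytespider) [NC]
RewriteRule ^ - [F,L]
```

The `[F]` flag returns 403 Forbidden, so matched bots get no content at all, unlike `robots.txt`, which relies on their cooperation.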
@matthiasott Following up, doesn’t seem to have made any difference
@matthiasott yeah… and it's one of the big reasons i stopped blogging.
@matthiasott 👋 yes, after a few times of an unexpectedly larger Netlify bill due to astronomical traffic (I think due to LLM bots) I switched hosts so that I can have a larger plan at lower cost.
@matthiasott thanks this reminded me to update my AI statement
@matthiasott I'm listing AI (and other poorly behaved crawlers) in my robots.txt while also sending them a 403 for anything else they request should they not honor that. I also had to block all traffic from China due to bots fetching endless pages sequentially. https://www.coryd.dev/posts/2026/blocking-entire-countries-because-of-scrapers
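The combination described here (list crawlers in `robots.txt`, 403 everything else they request) can be sketched in nginx like this. This is an illustrative sketch under assumed names and paths, not the author's actual config:

```nginx
map $http_user_agent $ai_bot {
    default     0;
    ~*GPTBot    1;
    ~*ClaudeBot 1;
}

server {
    listen 80;
    server_name example.com;  # placeholder
    root /var/www/site;       # placeholder

    # Still serve robots.txt so the bot can read the Disallow rules.
    location = /robots.txt { }

    # Everything else gets a 403 if the user agent matched.
    location / {
        if ($ai_bot) {
            return 403;
        }
    }
}
```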
Blocking entire countries because of scrapers

Cory Dransfeldt

OSM mentioned this as a problem in https://en.osm.town/@osm_tech/115968544599864782 and other messages.

@matthiasott

OpenStreetMap Ops Team (@[email protected])

If you write about the messy reality behind "free" internet services: we're seeing #OpenStreetMap hammered by scrapers hiding behind residential proxy/embedded-SDK networks. We're a volunteer-run service and the costs are real. We'd love to talk to a journalist about what we're seeing + how we're responding. #AI #Bots #Abuse

OSM Town | Mapstodon for OpenStreetMap
@matthiasott I get this once or twice a week and have now started geo blocking, currently Singapore. There were even some bots saying "sorry, we are beta and break things, block us if you want" in their user agent string... so I blocked some user agents too. But most of the requests look like normal desktop browsers, so I needed geo blocking. The requests came in so rapidly that I couldn't use the Kirby panel anymore, even though I already had caching enabled.
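Geo blocking as described above can be sketched in nginx with the `geo` module. The CIDR ranges below are documentation placeholders; in practice the list would come from a GeoIP database (e.g. via the ngx_http_geoip2 module) or be enforced at the firewall:

```nginx
# Map client IPs to a "blocked" flag; these CIDRs are placeholders,
# standing in for a country's real address ranges.
geo $blocked_country {
    default         0;
    203.0.113.0/24  1;
    198.51.100.0/24 1;
}

server {
    listen 80;
    server_name example.com;  # placeholder

    if ($blocked_country) {
        return 403;
    }
}
```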
@matthiasott I have some stats, and the max was an increase in "visits" of about 1000% when those crawlers hit.

@matthiasott I had a short but interesting conversation with the owner of a hosting company.

Here is the gist of it:

Part of the explanation for the surge in traffic can be an endpoint that takes variables via HTTP GET, because the bots then try all possible combinations of those variables. By making content available via only one URL per piece (so that, say, articles cannot be linked to with a …?related=tag1,tag2,tag3…), you should be able to reduce the load.

@stairjoke That’s really interesting – because this is actually part of what I did to reduce the load a bit! My notes page used to work with multiple tags as URL params. And I indeed saw a lot of requests by bots trying all kinds of combinations. I now reduced this to one tag, which already helped a bit. Although I also did a lot of other stuff, so I can’t say for sure how much which step helped exactly. 😅
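One way to enforce "one URL per piece" at the server level, sketched in nginx. The `related` parameter name is taken from the example in this thread; this is an illustrative sketch, not either poster's actual setup:

```nginx
server {
    listen 80;
    server_name example.com;  # placeholder

    # If the request carries a "related" query parameter, redirect to
    # the bare path ($uri excludes the query string), so crawlers
    # can't enumerate parameter combinations as distinct pages.
    if ($arg_related) {
        return 301 $uri;
    }
}
```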