Wondered why our latest podcast episode didn’t show up on https://workingdraft.de this morning. In our headless WP we preschedule releases and @11ty builds the front-facing site daily. Turns out an AI bot broke the build: our log-parsing stats step choked on its UA string:

Mozilla/5.0 (compatible; Thinkbot/0.5.8; +In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.)

"if_the_Thinkbot_brings_you_trouble" 🖕
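We don't know what the stats step actually looks like, but here is a minimal sketch of how a UA string like this can trip naive log parsing. The log line is invented around the real Thinkbot UA (IP, path, and timestamp are placeholders): a plain whitespace split mangles every field after the first quoted one containing spaces, while a regex that respects the combined log format's quoting recovers the UA intact.

```python
import re

# Invented combined-log line carrying the real Thinkbot UA string.
line = (
    '203.0.113.7 - - [12/Mar/2025:06:00:01 +0100] '
    '"GET /feed/ HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Thinkbot/0.5.8; '
    '+In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.)"'
)

# Naive approach: split on whitespace and hope field 11 is the UA.
# The spaces inside the quoted request and UA shift every later field.
naive_ua = line.split()[11]  # only the first fragment of the UA

# Sturdier approach: parse the combined log format's quoted fields.
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$'
)
m = LOG_RE.match(line)
ua = m.group(7) if m else None  # the full, unmangled UA string
```

The whitespace split yields the fragment `"Mozilla/5.0`, while the regex returns the whole UA; a stats step built on the former is exactly the kind of thing an exotic UA can break.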

Working Draft

Weekly podcast for frontend devs, design engineers, and web developers

@Schepp @11ty 🤦‍♂️
@heydon @Schepp @11ty haha this bot also went straight into my honeypot*… repeatedly.
* a directory on my website that is only mentioned in robots.txt with a Disallow and not linked anywhere.
So this motherboardfucker (excuse my French) is actually reading the robots.txt, but then sees a Disallow as an invite.
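For context, such a honeypot is nothing more than a disallowed path in robots.txt that nothing else links to (the directory name here is made up; the real one obviously stays unpublished):

```text
User-agent: *
Disallow: /secret-trap/
```

Any client that requests `/secret-trap/` can only have learned about it from robots.txt, so every hit there is a crawler treating the Disallow as a map.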

@webrocker @heydon @Schepp @11ty

I thought robots.txt was completely disregarded, but most AI companies publish their IP address ranges, so you can write some redirect rules to block them from scraping your site.

I think there's one specific company that was completely opaque about that and published false IP addresses: Perplexity (I couldn't think of the name straight away). So surely there are other companies doing the same thing.

https://rknight.me/blog/perplexity-doesnt-give-a-shit-about-consent/

Perplexity Doesn’t Give a Shit About Consent

Perplexity proving yet again they don't care about the rules

@lukeharby @heydon @Schepp @11ty I wonder how else, if not via my robots.txt entry, the bots would discover my unlinked directory. To be fair, there are only a few hits per day in there, but this "Thinkbot" (and its user-agent string) made a lasting impression.
@webrocker @heydon @Schepp @11ty I had an interesting AI encounter the other day about which rules AIs obey. Maybe I need to write a blog article about it…
@MoritzGlantz @heydon @Schepp @11ty Inspired by this incident, I have now completed my feeble defense against those bots that visit my hidden directory. Their IP is saved in a NoSQL store, and the single entry point to my website checks the current visitor's IP against that store and returns a 403 if the IP matches. I successfully locked myself out of my website by visiting my hidden dir. Yay.
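The trap-then-403 flow described above can be sketched as a tiny WSGI app (an assumption for illustration; the actual site's entry point and storage are different). A plain set stands in for the NoSQL store, and the honeypot path is made up.

```python
# Minimal sketch of the honeypot blocklist, assuming a WSGI entry point.
# A set stands in for the NoSQL store; /secret-trap/ is a made-up path.
blocked_ips = set()
HONEYPOT_PREFIX = "/secret-trap/"

def app(environ, start_response):
    ip = environ.get("REMOTE_ADDR", "")
    path = environ.get("PATH_INFO", "")

    # Anyone requesting the honeypot directory gets remembered...
    if path.startswith(HONEYPOT_PREFIX):
        blocked_ips.add(ip)

    # ...and every remembered IP gets a 403 from then on -- including
    # you, if you visit the trap yourself without an allowlist.
    if ip in blocked_ips:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]

    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]
```

The self-lockout in the post falls straight out of this logic: the trap doesn't distinguish the site owner from a bot, so an allowlist for your own address is the missing piece.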

@webrocker Why not send them a multigigabyte file to crawl? Or pollute them with Heydon's script?

@MoritzGlantz @heydon @Schepp @11ty

@Lippe @webrocker @heydon @Schepp @11ty A zip bomb maybe? 🤔

@MoritzGlantz Thought of this, but will they crawl zip files at all?

@webrocker @heydon @Schepp @11ty

@Lippe @MoritzGlantz @heydon @Schepp @11ty Well, if I don't want them to waste resources on my site, how would serving them gazillions of bytes help?

@Schepp By accident I yesterday discovered that the website I host for the local hiking club my parents are in consumed a whopping 170 GB of traffic in February, 230 GB in January, 190 in December, you get it... Over the past 12 months this accumulated to 1.2 TB of traffic, picking up steam since July.
Just now, looking into some of the logs, I see lots of Bytedance, lots of Facebook crawler, and such...

It's a tiny WordPress site, where the club shares some pictures of recent hikes and info on the next...