Wondered why our latest podcast episode didn’t show up on https://workingdraft.de this morning. In our headless WP we preschedule releases, and @11ty builds the front-facing site daily. Turns out an AI bot broke the build: our log-parsing stats step choked on its UA string:

Mozilla/5.0 (compatible; Thinkbot/0.5.8; +In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.)

"if_the_Thinkbot_brings_you_trouble" 🖕

Working Draft

Weekly podcast for frontend devs, design engineers, and web developers
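The thread doesn't say what exactly tripped the stats step, so this is only a rough sketch of one way to keep a log-parsing step from taking the whole build down. It assumes access logs in Combined Log Format; `parse_line` and `count_user_agents` are made-up names, not anything from the actual Working Draft build.

```python
import re

# Assumed format: Combined Log Format
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def count_user_agents(lines):
    """Tally user agents; skip and count unparseable lines instead of raising."""
    counts, skipped = {}, 0
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            skipped += 1
            continue
        counts[entry["ua"]] = counts.get(entry["ua"], 0) + 1
    return counts, skipped
```

The point of the design is that a weird line ends up in `skipped` rather than in an exception, so the daily build can still go out and the odd entries can be looked at later.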

@Schepp @11ty 🤦‍♂️
@heydon @Schepp @11ty haha this bot also went straight into my honeypot*… repeatedly.
* a directory on my website that is only mentioned in the robots.txt with a disallow and not linked anywhere.
so this motherboardfucker (excuse my French) is actually looking in the robots.txt but then sees a disallow as an invite
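A honeypot like that is easy to evaluate once the log is parsed: anything requesting the disallowed path has either ignored robots.txt or used it as a map. A minimal sketch, assuming each log entry has already been reduced to an `(ip, path, user_agent)` tuple; the path `/secret-trap/` and the name `honeypot_hits` are hypothetical.

```python
from collections import Counter

# Hypothetical path: listed only under Disallow in robots.txt, linked nowhere.
HONEYPOT_PREFIX = "/secret-trap/"


def honeypot_hits(entries, prefix=HONEYPOT_PREFIX):
    """Count requests per (ip, user agent) whose path starts with the honeypot prefix."""
    hits = Counter()
    for ip, path, user_agent in entries:
        if path.startswith(prefix):
            hits[(ip, user_agent)] += 1
    return hits
```

Calling `honeypot_hits(entries).most_common()` then lists the repeat offenders first.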

@webrocker @heydon @Schepp @11ty

I thought robots.txt was completely disregarded, but most AI companies publish their IP address ranges, and you can write some redirect rules to block them from scraping your site.
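For the IP range idea, here is a rough sketch of the matching logic using Python's standard `ipaddress` module. The ranges below are placeholders from the documentation-reserved blocks, not any company's real crawler ranges, and `is_blocked` is a made-up helper; in practice the same check would more likely live in the web server or firewall config.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), NOT real crawler ranges.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(ip, ranges=BLOCKED_RANGES):
    """Return True if ip is a valid address inside any of the given networks."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a valid IP address string: treat as not blocked.
        return False
    return any(addr in net for net in ranges)
```

This only helps against crawlers that actually come from the ranges they publish.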

I think there's one specific company that was completely opaque about that and published false IP addresses: Perplexity (I couldn't think of the name straight away). So surely there are other companies doing the same thing.

https://rknight.me/blog/perplexity-doesnt-give-a-shit-about-consent/

Perplexity Doesn’t Give a Shit About Consent

Perplexity proving yet again they don't care about the rules

@lukeharby @heydon @Schepp @11ty I wonder how else, if not via my robots.txt entry, the bots would discover my unlinked directory. to be fair, there are only a few hits per day in there, but this "thinkbot" (and its user agent string) made a lasting impression.