Working on some poison-as-a-service (PaaS). Looking to launch in the next few days.

#AI #enjoythinking

Also working on a zip bomb, to randomly scatter in among the links.

Thanks to @anaiscrosby I came across this excellent method, using LZ77:

https://natechoe.dev/blog/2025-08-04.html

TBH I was just going to `dd if=/dev/urandom` my way to a titanic RAM-flooding *.gz, but I'm getting great results with the above, plus some honeypot site data inside to keep bots on the chase.
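
For the record, the naive version I had in mind looks roughly like this, a sketch with zeros in place of urandom (truly random bytes barely compress, so a bomb needs repetitive input) and an illustrative size:

```python
import gzip

def make_gzip_bomb(path: str, size: int, chunk: int = 1 << 20) -> None:
    """Stream `size` bytes of zeros through gzip. DEFLATE squeezes long
    runs of identical bytes to roughly a thousandth of their input size,
    so the archive on disk stays tiny relative to what it inflates to."""
    with gzip.open(path, "wb", compresslevel=9) as f:
        zeros = b"\x00" * chunk
        remaining = size
        while remaining > 0:
            n = min(remaining, chunk)
            f.write(zeros[:n])
            remaining -= n

# e.g. make_gzip_bomb("bomb.gz", 128 * 1024**3) for a ~128 GB payload
```

DEFLATE's ratio tops out around 1000:1, which is why the LZ77 construction in the linked post gets so much further.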

natechoe.dev - A googol byte zip bomb that's also valid HTML

@anaiscrosby After seeing ChatGPTBot blow 123 seconds on my drip-feed poison tarpit and then never come back, I got to reading about how modern LLM scrapers might detect tarpits and blacklist them.

During that research I came across this tarpit-evading scraper, which offers some interesting insights into how modern LLM scrapers might do it.

https://github.com/Draconiator/Ipema

This gives me pause and has me looking at other solutions for counter-detection.

The GeoCities CSS is going nowhere.

GitHub - Draconiator/Ipema: A script designed to counter the Nepenthes tarpit - designed with the help of A.I. itself.

@anaiscrosby Running a non-Markov tarpit for half an hour on one public link, and already have Claude lost in my swamp. Waiting to see if it runs into my ZIP bomb.

---
216.73.216.124 - - [07/Apr/2026:03:28:49 +0200] "GET /tarpit/until/same/drive/harmattan_leftmost_intranscalency_few_ministries_few_between HTTP/2.0" 200 10132 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email protected])" "-"
---

@anaiscrosby It hit it, but I guess it decompressed the archive in a thread. It's a 127M archive that decompresses to 128GB. The bot kept scraping for a bit and then dropped off, so it's difficult to know whether the bomb actually discouraged it.

What's strange is that soon after, other IPs were reaching statistically non-guessable, randomly generated URL paths without touching the webroot or any other tarpit URL first. They all had iOS UA strings (readily forged).
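
By "non-guessable" I mean something roughly like this (the vocabulary here is a stand-in; the real list runs to thousands of words, so a few segments drawn from it can't be hit by chance):

```python
import secrets

# Stand-in vocabulary; the real wordlist is far larger.
WORDS = ["harmattan", "leftmost", "intranscalency", "few",
         "ministries", "between", "drive", "until", "same"]

def tarpit_path(segments: int = 4, words_per_segment: int = 2) -> str:
    """Mint a random tarpit URL path from underscore-joined word segments."""
    parts = [
        "_".join(secrets.choice(WORDS) for _ in range(words_per_segment))
        for _ in range(segments)
    ]
    return "/tarpit/" + "/".join(parts)
```

So any second client landing on one of these paths got the URL from somewhere, not from probing.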

It is quite wild how persistent Claude is, and an eerie feeling watching it just roam ever deeper into the endless rhizome of generated linked pages. It's been like this for a couple of hours now, and is not touching any other pages on the server, solely those in the tarpit. So that PoC does seem to check out.

CPU spikes are worrying, so I'll need to work on the threading a bit and provision a couple more cores.

It has a rhythm of ~10-15s of gorging, then a pause for 20-30s, and then it's at it again.

Claude is still going. There is now a robots.txt with a clear `User-agent: ClaudeBot [...] Disallow: /` and it is being ignored.
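
For contrast, this is what a compliant crawler is supposed to do with those rules (stdlib `urllib.robotparser`; the directives below mirror what my server returns):

```python
from urllib.robotparser import RobotFileParser

# Rules matching what /robots.txt serves for ClaudeBot.
rules = """\
User-agent: ClaudeBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A crawler that actually checks would refuse every path on the site:
print(rp.can_fetch("ClaudeBot/1.0", "/tarpit/anything"))  # False
print(rp.can_fetch("Mozilla/5.0", "/"))                   # True (no rule for others)
```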

I will say there's a contradiction in setting up a tarpit like this. Sure these crawlers are DoSing anyway - they're uninvited ultra-demanding company - but when you have an infinite maze it feels like volunteering for an exhaustion contest.

My end is CO2e neutral, or at least on traceable renewables. But the other end, who knows. That dimension of it cannot be avoided.

@JulianOliver I wonder how often bots should re-load robots.txt. Not for each request, right?
@benjaoming On session initiation, I think. I'm going to try to restart the service now.

@benjaoming Terminated HTTP session, restarted service, connection established by ClaudeBot, and robots.txt in webroot was not read. Same src IP.

Oddly, the pace of the crawl is now about 1 request every 5 seconds, whereas before it was closer to the inverse (roughly 5 per second).

@JulianOliver I tried to find some info on this, and it seems the convention, per Google, is that robots.txt may be cached for up to 24 hours. Source: https://bsky.app/profile/johnmu.com/post/3lfud4v4lf22l
John Mueller (@johnmu.com)

It's a bad idea because robots.txt can be cached up to 24 hours ( https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#caching ). We don't recommend dynamically changing your robots.txt file like this over the course of a day. Use 503/429 when crawling is too much instead.

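
On my side, that 429 route could look roughly like this (a stdlib `http.server` sketch; in practice this would live in the nginx config, and whether ClaudeBot honours Retry-After is an open question):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class ThrottleHandler(BaseHTTPRequestHandler):
    """Answer over-eager crawlers with 429 + Retry-After instead of
    flipping robots.txt mid-day."""

    def do_GET(self):
        if "ClaudeBot" in self.headers.get("User-Agent", ""):
            self.send_response(429)
            self.send_header("Retry-After", "86400")  # ask it to back off a day
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n")

    def log_message(self, *args):  # keep the sketch quiet
        pass
```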