Working on some poison-as-a-service (PaaS). Looking to launch in the next few days.

#AI #enjoythinking

Also working on a zip bomb, to randomly scatter in among the links.

Thanks to @anaiscrosby I came across this excellent method, using LZ77:

https://natechoe.dev/blog/2025-08-04.html

TBH I was just going to `dd if=/dev/urandom` my way to a titanic RAM-flooding *.gz, but I'm getting great results with the above, with bonus site-data honey inside to keep bots on the chase.
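For anyone wanting to roll their own: the key is that DEFLATE compresses long runs of identical bytes at roughly 1000:1, whereas `/dev/urandom` output barely compresses at all. A minimal sketch of the general idea (filename and sizes are just examples, not the natechoe.dev construction):

```python
import gzip

# Sketch: gzip a long run of zeros. DEFLATE encodes repeated bytes very
# compactly, so the archive on disk stays tiny while the decompressed
# stream is huge. Scale MIB up (e.g. 128 * 1024) for a ~128GB payload.
MIB = 100  # uncompressed size in MiB for this small demo
chunk = b"\x00" * (1 << 20)  # 1 MiB of zeros
with gzip.open("bomb.gz", "wb", compresslevel=9) as f:
    for _ in range(MIB):
        f.write(chunk)
```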

natechoe.dev - A googol byte zip bomb that's also valid HTML

@anaiscrosby After seeing ChatGPTBot blow 123 seconds on my drip-feed poison tarpit and then never come back, I got to reading about how modern LLM scrapers might detect tarpits and blacklist them.

During that research I came across this tarpit-evading scraper, which provides some interesting insights into how modern LLM scrapers might do this.

https://github.com/Draconiator/Ipema

This gives me pause and has me looking at other solutions for counter-detection.

The GeoCities CSS is going nowhere.

GitHub - Draconiator/Ipema: A script designed to counter the Nepenthes tarpit - designed with the help of A.I. itself.


@anaiscrosby Running a non-Markov tarpit for half an hour on one public link, and I already have Claude lost in my swamp. Waiting to see if it runs into my ZIP bomb.

```
216.73.216.124 - - [07/Apr/2026:03:28:49 +0200] "GET /tarpit/until/same/drive/harmattan_leftmost_intranscalency_few_ministries_few_between HTTP/2.0" 200 10132 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email protected])" "-"
```

@anaiscrosby It hit it, but I guess it decompressed in a thread. It's a 127MB archive that decompresses to 128GB. The bot kept scraping for a bit and then dropped off. Difficult to know whether it was discouraged.

Stranger still, soon after, other IPs were reaching statistically non-guessable, randomly generated URL paths without touching the webroot or any other tarpit URL first. They all had iOS UA strings (readily forged).

It is quite wild how persistent Claude is, and an eerie feeling watching it just roam ever deeper into the endless rhizome of generated linked pages. It's been like this for a couple of hours now, and is not touching any other pages on the server, solely those in the tarpit. So that PoC does seem to check out.
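For anyone curious how a maze like this stays consistent without storing anything: each page can be derived deterministically from its URL path. A toy sketch of the idea (the word list and path scheme are made up, not my actual generator):

```python
import hashlib
import random

# Tiny stand-in vocabulary; a real tarpit would use a large word list.
WORDS = ["until", "same", "drive", "harmattan", "leftmost", "ministries",
         "few", "between", "intranscalency"]

def page_links(path, n=5):
    # Seed a PRNG from the path so the same URL always yields the same
    # page, while every page links to n deeper, never-ending pages.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [path.rstrip("/") + "/" + "_".join(rng.sample(WORDS, 3))
            for _ in range(n)]
```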

CPU spikes are worrying, so I'll need to work the threading a bit and provision a couple more cores.

It has a rhythm of ~10-15s of gorging, then a pause of 20-30s, and then it's at it again.

Claude is still going. There is now a robots.txt with a clear `User-agent: ClaudeBot [...] Disallow: /` and it is ignored.

I will say there's a contradiction in setting up a tarpit like this. Sure these crawlers are DoSing anyway - they're uninvited ultra-demanding company - but when you have an infinite maze it feels like volunteering for an exhaustion contest.

My end is CO2e neutral, or at least on traceable renewables. But the other end, who knows. That dimension of it cannot be avoided.

ClaudeBot crashed my tarpit. Working on some rate limiting at the reverse proxy to buy me time to improve the threading.
Rate limit in place. Seems stable, and a little less like siege warfare now. ClaudeBot, at least, is still very much captive.
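For anyone replicating this: at the reverse proxy a stock nginx `limit_req` zone does the job. The zone name, rate, paths, and upstream below are placeholders, not my exact config:

```nginx
# Hypothetical nginx snippet: cap each client IP at ~30 requests/minute
# on tarpit paths, with a small burst allowance so the bots queue up
# rather than erroring out and wandering off.
limit_req_zone $binary_remote_addr zone=tarpit:10m rate=30r/m;

server {
    listen 443 ssl http2;

    location /tarpit/ {
        limit_req zone=tarpit burst=10;
        proxy_pass http://127.0.0.1:8080;
    }
}
```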

Still solely ClaudeBot, a page every 2 seconds, but a new source IP: 216.73.216.37. The crawler at that address has been at it all night. A ceaseless zombie walk through infinitely hyperlinked, randomly generated babble.

My non-Markov text seems far stickier to the ClaudeBot.

In fact, ClaudeBot has now stopped reading; it switched to Anthropic's Claude 'searchbot' during the night. It seems it either gave up or decided to respect the robots.txt.

I misread that in the logs a few hours ago. The above address is in fact that of "[email protected]", not "[email protected]".

An interesting development.

I moved the project to a giant of a server, getting ready for launch.

Within 15 minutes (no exaggeration) of bringing up the reverse proxy, with one link in a wiki, OpenAI's GPTBot found the link and hooked into the maze. Very aggressive, so there's still some rate limiting and threading massage to do before it's good to go.

GPTBot is still at it, throughout the night. It has a different pattern to ClaudeBot: it pack-feeds from 2 or 3 different endpoints and is slower, as though processing content in some way in the course of each page scrape. Its rhythm changes too, but I've not yet looked into any correlation between payload size or complexity and pattern. ClaudeBot is far faster, with an even step as it moves through the babble; it seems solely concerned with collection.
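Those rhythms fall straight out of the access log: group timestamps by user agent and diff them. A quick sketch against combined-format lines like the ones quoted in this thread:

```python
import re
from datetime import datetime

# Matches the timestamp and user-agent fields of an nginx/Apache
# combined-format log line.
LINE = re.compile(r'\[([^\]]+)\] "GET [^"]*" \d+ \d+ "[^"]*" "([^"]*)"')

def intervals(lines):
    # Return seconds between consecutive requests, keyed by user agent.
    last, gaps = {}, {}
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S %z")
        ua = m.group(2)
        if ua in last:
            gaps.setdefault(ua, []).append((ts - last[ua]).total_seconds())
        last[ua] = ts
    return gaps
```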

OK here goes. You get the first look at this madness.

https://scienceispoetry.net/noodles

Go easy on it. I did my best to thread & rate limit while LLM bots are, this very moment, lost deep inside. They never stop. It's a siege.

Link it in your sites and AI crawlers will get caught when they hit it, veering off into the maze of babble.

Going to come out & say I channelled my inner GeoCities 180% in a nostalgic grief walk back to a happier & weedier www, so that part's for us that still remember.


Heaps to share on the outcome, but that's coming for an explainer on the landing page, which I'll lock out from the bots so I can share methods and arguments freely.

TL;DR: after waay too much study and testing I dropped Markov and drip-feeding in favour of a mixed dictionary and 'stop words' model, as it is far more reliable for catching the main LLM crawlers. I also have good reason to believe that GPTBot and Claude, at least, detect Markov text, probably due to the initial infamy of Nepenthes.
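To give a flavour of the approach: interleave dictionary words with common function words ('stop words') at natural-ish frequencies, with no chain statistics for a detector to fingerprint. A toy sketch with tiny stand-in word lists:

```python
import random

# Stand-in word lists; a real deployment loads a large dictionary so
# pages never statistically repeat.
STOP_WORDS = ["the", "of", "and", "to", "in", "that", "for", "with"]
DICTIONARY = ["harmattan", "leftmost", "intranscalency", "ministries",
              "rhizome", "noodle", "babble", "swamp"]

def babble_sentence(rng, length=9):
    # Roughly every third word is a function word, so the text has a
    # plausible stop-word density without any Markov transition model.
    words = [rng.choice(STOP_WORDS if i % 3 == 1 else DICTIONARY)
             for i in range(length)]
    return " ".join(words).capitalize() + "."
```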

Oh yep, be sure to mouse over links in the main paragraphs before you click. I've randomly laced the links with 127MB zip bombs. Those links go to a file at `scienceispoetry.net/files/docs.gzip`. It contains valid HTML, unpacking to 128GB on disk. You probably don't want that hehe.
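For the curious, the usual delivery trick (and, as I understand it, the one in the natechoe.dev post) is serving the pre-compressed file with a `Content-Encoding: gzip` header, so an honest client inflates it itself. A minimal sketch; the handler and filename are illustrative, not my production setup:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class BombHandler(BaseHTTPRequestHandler):
    # Serve a pre-built gzip bomb ("bomb.gz" here is a stand-in name).
    # The Content-Encoding header tells the client the body is gzip,
    # so a well-behaved HTTP stack decompresses it in memory on receipt.
    def do_GET(self):
        with open("bomb.gz", "rb") as f:
            body = f.read()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("0.0.0.0", 8080), BombHandler).serve_forever()
```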
OpenAI rn: an appetite that seeks to eat itself up. Like that endlessly hungry beast that came to dinner in Spirited Away.
Tell me it looks ridiculous and I'll laugh warmly and say "yes, yes"
Yep it's DoS'd now, server overloaded. Throwing more resources at it soon.

Turns out the new huge load spike is coming from Meta:

```
2a03:2880:f814:14:: - - [10/Apr/2026:22:29:54 +0200] "GET /noodles/were/are HTTP/2.0" 200 10308 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)" "-"
```

Alongside the other bots it's hard to generate & serve them endless garbage fast enough. I can't shape them down with rate limiting too much or they get hungry and drop off. This happened when trialling Nepenthes.

An interesting challenge. Looking forward to digging into it.

Meta Web Crawlers - Sharing - Documentation - Meta for Developers

In good shape now after another tune, running stable under pretty high load.
@aburka Hmm, seems I get access denied through my VPN to CNBC atm, but will look up this horror show. TY!
@JulianOliver try switching to an Iceland VPN. Hetzner will be blocked.
@themadhatter Good call. Yes no problem now.
@JulianOliver feel free to send me any IP ranges and #UserAgents to firewall away.
Kevin Karhan (@[email protected])

@[email protected] @[email protected] I tend to literally #Blocklist [entire IP ranges](https://github.com/greyhat-academy/lists.d/blob/main/scrapers.ipv4.block.list.tsv) for #hosting said #Malware (#Scrapers) and treat them like the malicious actors they are! - Feel free to report ranges. - You can auto-update from [`https://raw.githubusercontent.com/greyhat-academy/lists.d/main/scrapers.ipv4.block.list.tsv`](https://raw.githubusercontent.com/greyhat-academy/lists.d/main/scrapers.ipv4.block.list.tsv)…
