*KICKS DOOR DOWN*

Hey everyone! Hate AI web crawlers? Have some spare CPU cycles you want to use to punish them?

Meet Nepenthes!

https://zadzmo.org/code/nepenthes

This little guy runs nicely on low power hardware, and generates an infinite maze of what appear to be static files with no exit links. Web crawlers will merrily hop right in and just .... get stuck in there! Optional randomized delay to waste their time and conserve your CPU, optional markovbabble to poison large language models.
Nepenthes - ZADZMO.org

Making web crawlers eat shit since 2023

@aaron am wondering what it would take to swap the text with Rick Astley lyrics.
@Workshopshed Trivial. It starts with no corpus by design; you provide one and POST it into a specific training input with curl.
@aaron @Workshopshed I wonder if there would be worse but more efficient algorithms to replace the probably very accurate Markov Chains you're using now...
@mdione Markov chains are extremely simple - and thus, fast. The way I put this one together also trades increased corpus size for more speed. In Nepenthes it has a depth of two, which is rather incoherent but the fastest you'll get with realistic text. I consider that extra incoherence to be a positive thing in this use case.
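To make the "depth of two" concrete: the model keys on the last two words and samples the next word from whatever followed that pair in the corpus. Nepenthes itself is written in Lua with the corpus in SQLite; this is just an illustrative in-memory sketch in Python.

```python
import random
from collections import defaultdict

def train(corpus_words):
    """Build a depth-two Markov model: each pair of consecutive
    words maps to the list of words observed right after that pair."""
    model = defaultdict(list)
    for i in range(len(corpus_words) - 2):
        key = (corpus_words[i], corpus_words[i + 1])
        model[key].append(corpus_words[i + 2])
    return model

def babble(model, length=30, seed=None):
    """Generate pseudo-text by repeatedly sampling the next word
    from the distribution that follows the last two words emitted."""
    rng = random.Random(seed)
    key = rng.choice(list(model.keys()))
    out = list(key)
    for _ in range(length - 2):
        followers = model.get(key)
        if not followers:  # dead end: restart from a random pair
            key = rng.choice(list(model.keys()))
            followers = model[key]
        word = rng.choice(followers)
        out.append(word)
        key = (key[1], word)
    return " ".join(out)
```

The shallow depth is what makes it fast: training and generation are both a single dictionary lookup per word, and the incoherence of the output is a feature here, not a bug.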

It's slowed, however, by the fact that the corpus is stored in SQLite, not RAM. This makes IO throughput to disk the bottleneck, somewhat mitigated by OS buffering if you have spare memory for it.

Holding the corpus entirely in memory is a thing I've done, but it both consumes a huge amount of RAM and requires retraining at every restart. @Workshopshed
@mdione I tried several different SQLite schemas with various amounts and styles of normalization, and succeeded in reducing table or index sizes or simplifying query plans - but the current dead-simple basic one in use won every time, often by huge margins. I tried LightningMDB - its performance is truly exceptional. But ultimately, it was half as fast, because there's no way to represent the Markov corpus purely in key-value pairs. I got it to work by serializing a Lua table; that step completely swamped all performance gains and then some.

Feel free to try to find something faster. I'll be impressed if you do :)
@aaron thanks for all the details. I keep asking myself if we shouldn't document failures more...