It had to happen eventually. My AI crawler antagoniser, https://www.ty-penguin.org.uk/~auj/spigot/ has been seeing sustained traffic of between 300 and 500 thousand hits per hour. I've not been particularly bothered by that, but a couple of days ago my provider, @bitfolk, sent me a bandwidth warning: I'm on track to hit 2 TBytes of outbound bandwidth this month and end up paying for the excess.

So I've added firewalling: if more than 5% of the machines in a /23 network hit Spigot within an hour, the entire network gets a temporary block until it stops hitting my server completely. Hopefully that will cut things back enough to avoid charges.
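A minimal sketch of that blocking rule, as I understand it (function and threshold names are mine, not Spigot's): count distinct source addresses per /23 over the last hour and flag any network where more than 5% of its 512 addresses have made requests.

```python
from collections import defaultdict
from ipaddress import ip_address

# Hypothetical sketch of the rule described above: block a /23 when
# more than 5% of its 512 possible addresses hit the site in an hour.
HOSTS_PER_23 = 512
THRESHOLD = 0.05

def networks_to_block(recent_ips):
    """recent_ips: iterable of source-IP strings seen in the last hour."""
    seen = defaultdict(set)
    for ip in recent_ips:
        # A /23 spans 512 addresses; drop the low 9 bits to key by network.
        net = int(ip_address(ip)) >> 9
        seen[net].add(ip)
    return [
        f"{ip_address(net << 9)}/23"
        for net, ips in seen.items()
        if len(ips) / HOSTS_PER_23 > THRESHOLD
    ]
```

The actual blocking would then be handed to the firewall (nftables, ipset, or similar); this only does the counting.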

The thing that amazes me is that the block list has already accumulated nearly 10,000 entries. Put another way, I'm already blocking 0.12% of the entire IPv4 address space because it's being used for web crawling.

An infinite maze of twisty little pages

Well that seems to have annoyed them.

Over the past few hours, request rates have been ramped up to nearly 900,000 hits per hour from nearly 700,000 distinct IP addresses. That doesn't include the many thousands of machines that are firewalled off but still trying their best. I'm turning page generation off for a bit while I ponder what to do next.

And... ouch! I couldn't sleep, so I ended up getting up for a deeper look at what's going on. Even with the system returning a 300-byte "Load too high" page, the Bitfolk dashboard was still showing much higher bandwidth usage than the nginx logs would imply.

After a lot of messing around with tcpdump and friends, I finally realised: it's SSL overheads. While the payload data only amounts to around 40 kbytes per second, the protocol overheads caused by hundreds of new SSL connections per second raise that to around 700 kbytes/s. This is, of course, made worse by the fact that I'm being hit by hundreds of thousands of machines, each making only one or two requests per hour. Almost no connections are reused, so the SSL negotiation happens for almost every request.
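A back-of-envelope check of those figures, using the ~3.9 kbytes-per-request wire cost from the 404 test described later in this thread (the exact per-handshake byte count depends on certificate chain and cipher suite, so treat these as rough numbers):

```python
# Rough arithmetic behind the overhead claim. All figures are the
# post's own observations; nothing here is measured independently.
payload_rate = 40_000        # bytes/s of actual page content served
payload_per_request = 300    # bytes: the "Load too high" page
bytes_per_request = 3_900    # total wire bytes per request (404 test)

requests_per_sec = payload_rate / payload_per_request   # ~133 req/s
total_rate = requests_per_sec * bytes_per_request       # ~520 kbytes/s

# TLS handshake overhead dominates: the content itself is well
# under a tenth of what actually goes out on the wire.
overhead_fraction = 1 - payload_rate / total_rate       # ~0.92
```

That lands in the same ballpark as the ~700 kbytes/s the dashboard showed, which is consistent with handshakes, not pages, being the cost.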

The big issue is that I can't think of a simple answer to this - even turning Spigot off completely is still going to result in SSL sessions being established before my web server can even tell whether the request is being directed towards Spigot. In the short term, I'm just going to have to weather the storm.

Edit: Just did a quick test. Pulling a 404 page with 49 bytes of content involves 3.9 kbytes of network traffic from my web server. Short of removing www.ty-penguin.org.uk from the DNS, there's nothing I can do to prevent this traffic.

To clarify: crawlers seem to operate with massive lag. I changed the structure of Spigot URLs several months ago, and over 30% of requests to Spigot URLs are still in the old format. So even if I turn off Spigot right now, I can expect to be receiving requests for its pages for months or years. And I'll only know what URLs are being requested after the SSL session has been established. So every failed request is going to cost me around 3.9k of bandwidth.

With hindsight, I should have put Spigot on its own virtual host, so that turning it off would be just a matter of getting rid of its DNS entry. 18 months ago, though, I liked the idea of mixing the garbage in among my real content.

On the bright side, this is going to be a very fertile source of abusive IP addresses.

It's actually daylight now and I'll need to get ready for work, so maybe I'll shut up soon. Anyway, pondering a bit further...

Spigot generates page content and links using Python's random number generator. To make it deterministic (i.e. the same URL will always give the same page), it seeds the random number generator just before creating the content, with the seed being a 64 bit hash of the page URL.
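A minimal sketch of that deterministic scheme. Spigot's real hash function and vocabulary aren't shown in the post, so truncated SHA-256 stands in for "a 64 bit hash of the page URL" and the word list is illustrative:

```python
import hashlib
import random

# Illustrative vocabulary; Spigot's actual content generator differs.
WORDS = ["twisty", "little", "maze", "of", "passages", "all", "alike"]

def url_seed(url: str) -> int:
    # 64-bit hash of the URL (first 8 bytes of SHA-256, as a stand-in).
    return int.from_bytes(hashlib.sha256(url.encode()).digest()[:8], "big")

def render_page(url: str) -> str:
    # Seed just before generating, so the same URL always yields
    # the same page without storing anything.
    rng = random.Random(url_seed(url))
    return " ".join(rng.choice(WORDS) for _ in range(20))
```

The key property is statelessness: no page is ever stored, yet every URL is stable under re-crawling.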

Effectively, this means that Spigot's entire "page space" is 2^64, or around 1.8×10^19, pages. In terms of trapping crawlers, that's near enough infinite - at a million requests per second, it would take over half a million years to exhaust all possible pages.
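The arithmetic, spelled out:

```python
# A 64-bit seed gives 2**64 distinct pages; at a million requests
# per second, exhausting them takes over half a million years.
pages = 2 ** 64                         # ~1.8e19
seconds = pages / 1_000_000             # at 1e6 requests/s
years = seconds / (365 * 24 * 3600)     # ~585,000 years
```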

My problem, right now, is that the crawlers have made around 1.2 billion requests to Spigot, which means their (aggregate) index probably holds around 30 billion Spigot URLs, most of which are going to be in a backlog for later scanning (hurrah). I can't get away from that, and I guess I'll need to live with it until the AI bubble bursts.

I don't really want to get rid of Spigot completely - if for no other reason than I've enjoyed tinkering with it.

And it's struck me that I could have a tunable "site size" value. When the random number generator gets seeded, rather than using the full 64 bit number, it would use that number modulo the "site size". So if the site size were only 100, the RNG would only ever be seeded with one of 100 values, meaning only 100 possible pages would ever be created. I'd need to restructure things a bit so that internal links remained internal - when generating an internal link, it would choose a random number, take it modulo the site size, and run the page generator seeded with the result to produce the page title and URL. That's fiddly, but not a major problem.
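A sketch of how that might look, under the same assumptions as before (SHA-256 standing in for the real hash, names mine): seeds collapse to the reduced space, and internal links are generated by picking a seed from that same space, so the maze stays closed.

```python
import hashlib
import random

SITE_SIZE = 100  # tunable: the number of pages that can ever exist

def url_seed(url: str) -> int:
    h = int.from_bytes(hashlib.sha256(url.encode()).digest()[:8], "big")
    return h % SITE_SIZE        # collapse 2**64 seeds down to SITE_SIZE

def internal_link_seed(rng: random.Random) -> int:
    # Pick a link target from the same reduced seed space, so links
    # can only ever point at one of the SITE_SIZE possible pages.
    return rng.randrange(SITE_SIZE)

def render_page(url: str) -> str:
    rng = random.Random(url_seed(url))
    body = " ".join(str(rng.randrange(1000)) for _ in range(10))
    return f"{body} -> page {internal_link_seed(rng)}"
```

With SITE_SIZE at, say, a million, the maze still looks effectively endless to a crawler that doesn't revisit URLs, while the set of pages it can ever accumulate is bounded.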

It's a reasonable assumption that crawlers don't go round and round and round hitting the exact same URL thousands of times per day, so a spigot with (say) a million possible pages could be useful in poisoning models without exposing me to ongoing load.

Food for thought. Talking of food: time for breakfast!

@pengfold tarpitting?
@avatastic already doing that for the worst offenders.

@pengfold have you tried linkmaze and quixotic?

https://marcusb.org/hacks/quixotic.html

Quixotic

Quixotic is a nonsense generator designed to help static website operators confuse and confound bots and content-stealing LLM scrapers.

@mxfraud spigot is the one I wrote which does a very similar job. It's been a fun project. The obvious solution, if I wanted to stop tinkering, would be to turn it off completely. Almost everything else on my site is static, and I doubt I'd even notice the load caused by a million requests an hour!

@pengfold I didn't realise, nice one!

I also have go-pot running, which instead of giving the attacker shitty pages faster, gives fake secrets slower: https://github.com/ryanolee/go-pot

I've seen people using crowdsec, and it seems to work well for them.
I looked at the config and didn't quite manage to get it running.
It does seem like it would help with the 700k unique IP problem tho.
https://github.com/crowdsecurity/crowdsec

@mxfraud it's a fine balancing act. I want to poison their well by supplying garbage, but that does mean engaging with these abusive bots. Over the past 18 months, it's generally been relatively easy. But they're all engaging in increasingly DDoS-like behaviour and it's getting less easy to provide the garbage while maintaining service.

At least this is just a toy that I can turn off. I pity folks trying to cope with this sort of thing professionally. As it happens, I was doing exactly that until about 4 years ago, when I was given a chance at playing in a different field.

@pengfold mine seemed to have changed their algorithm and lost interest, so I get no more DDoS.
That or my ISP blocked it without telling me.

Since my server is here, I don't really pay for the bandwidth, so I was just so happy to waste their CPU cycles.

I guess the gap between hobby and professional (e.g. cloud) solutions is wider than before, and I'm happier doing the hobby part than the professional part, that's for sure.
I agree that being able to ignore it is good. But pushing it to maximum nonsense until it breaks is just so liberating :)

@pengfold Is it really not feasible to send compressed responses, I wonder? Even if you absolutely have to compress on the fly, a fast implementation of deflate, or zstd, might be able to keep up.
@zaire I'm already compressing (and sending "interestingly tweaked" gzip, png and jpeg payloads for their pleasure). It's worth remembering, though, that decompression is typically less CPU-heavy than compression. If I'm doing the compressing and they're doing the decompressing (and they're also able to deploy hundreds of thousands of machines to do it), then they're always going to be able to exhaust my resources, be they bandwidth or CPU.
@pengfold @bitfolk 0.12%?????????? jesus
@pengfold
Just to be clear, this is interesting work, we are happy to comp you a bit more transfer allowance to take the pressure off so you don't have to rush to implement any changes. Send a support ticket to discuss if you want…

@bitfolk and this is why I really love Bitfolk!

Thanks for the offer. I'll monitor things over the next day or so and decide if I need to do more.

@pengfold @bitfolk I'm blocking the whole Facebook ASN at home because their AI scrapers kept polling gigabytes of iso and targz files from a directory literally named garbage and I was annoyed by the noise of the hard drives.