grebedoc had its highest share yet of serving garbage requests yesterday (a wave peaking at 150 req/sec)
these waves are getting bigger and bigger, which is somewhat concerning. it's nowhere near the hardware capacity yet but i'm hitting some software bottlenecks that i never thought would be relevant

git-pages has a sophisticated multilayer cache system which fails to perform well in exactly one case: if someone sends a lot of requests to domains that don't even have valid sites deployed

because i figured that nobody would do this. certainly that nobody would do it regularly and at incredibly high speed

well. fucking scrapers

i'm going to have to add a Bloom filter and another cache invalidation mechanism, which i'm not enthusiastic about, but it seems prudent to do it before it results in an outage (grebedoc has never had a scraper-induced outage so far, and neither has the codeberg git-pages instance)
@whitequark Is the additional cache invalidation to handle removals from the bloom filter? Are you just planning to rebuild the bloom filter periodically or ...?
@e_nomem rebuild whenever a domain is added or removed (or on a superset of those operations, ideally a small superset to avoid waste of resources) but not more often than e.g. 60s
@whitequark do you need to rebuild when a domain is removed? Given that there'll be false positives anyway... (and inserting a domain should be relatively cheap with a Bloom filter, until the false positive rate gets higher than you want it to)
@Taneb yeah now that you mention it, not really
@whitequark bloom filter was also the first thing that came to my mind when reading this thread. It’s the Wild West out there, apparently.

@whitequark nginx has the amazing limit_req_module that can easily throttle IPs that do some nasty shit, like making a lot of 404 requests. You can just tell it to spit 1bps to connections that fall in a given zone.

It'll cost you having open connections, but it's otherwise a cheap way to solve this without doing another layer of caching.

But given you have caddy for tls provision, it’s not immediately obvious how to front it with nginx

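for reference, the limit_req approach described above looks roughly like this (zone name, rate, and upstream are illustrative, not from the thread; `limit_req_zone` goes in the `http` context):

```nginx
# track clients by IP; allow 5 req/s per client with a small burst
limit_req_zone $binary_remote_addr zone=scrapers:10m rate=5r/s;

server {
    location / {
        limit_req zone=scrapers burst=10 nodelay;
        limit_req_status 429;
        proxy_pass http://git_pages_backend;
    }
}
```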

@whitequark now that I actually look at git pages I realize it’s exactly the thing I wanted to make a few months back to replace ReadTheDocs, I’ll have to try it out, thanks
@cinebox oh nice, how much replacing are we talking about? like for your own needs or as a service for others?
@whitequark just for myself and some community projects

@cinebox oh yeah that's basically how i started grebedoc

one thing git-pages intentionally omits is any sort of "run user-provided code in a container" because i believe that most of the solutions here cannot be left unattended if you expect to not be compromised by malware at some point. maybe firecracker vms would work but this still has a lot of issues. so i just let people use forgejo actions or something if they need processing

@whitequark yeah I already have forgejo actions for that. I just needed a solution for deploying the resulting html.
Forgejo Action for uploading a directory to a git-pages site
@whitequark how does one even mess up scraping that badly?

@truh @whitequark Believing that LLMs are in fact AI.

I help manage a site where 'deep' URLs follow obvious patterns. The elements are obvious & one can build millions of possible URLs for the site using public info, most of which don’t exist.

The so-called "AI scrapers" have been asking for thousands of such invented URLs all at once, and the responses for the ones that could be valid take a few seconds each to construct from mostly-archived data. The scrapers don’t even wait.

@whitequark How do you define/measure garbage requests vs non-garbage?

@dpk with these waves it's pretty easy because they send Host: requests whose A/AAAA records don't point to grebedoc and never did (it's something like every domain from the Cloudflare 1M list, last time we investigated it)

requests that have no business ever being sent to grebedoc

some others try to brute-force wp-admin paths and such
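the classification described above — "does the Host: header even resolve to this server?" — can be sketched like this (a guess at the shape, not git-pages' actual code; the resolver is injected so the sketch runs without real DNS, and serverIPs uses a placeholder documentation address):

```go
package main

import (
	"fmt"
	"net"
)

// serverIPs holds the addresses this instance actually answers on.
// 203.0.113.10 is a placeholder documentation address.
var serverIPs = map[string]bool{"203.0.113.10": true}

// fakeResolve stands in for net.LookupIP so the sketch needs no network.
func fakeResolve(host string) ([]net.IP, error) {
	if host == "docs.example.org" {
		return []net.IP{net.ParseIP("203.0.113.10")}, nil
	}
	return []net.IP{net.ParseIP("198.51.100.7")}, nil
}

// hostPointsHere reports whether any A/AAAA record for the requested host
// points at this server; requests failing this check are "garbage" in the
// sense used above.
func hostPointsHere(host string, resolve func(string) ([]net.IP, error)) bool {
	ips, err := resolve(host)
	if err != nil {
		return false
	}
	for _, ip := range ips {
		if serverIPs[ip.String()] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hostPointsHere("docs.example.org", fakeResolve))    // true
	fmt.Println(hostPointsHere("top1m-entry.example", fakeResolve)) // false
}
```

(doing this lookup per request would be far too slow, of course — it's how the traffic was classified during investigation, not a serving-path check)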

@whitequark @dpk It's jaw-dropping quite how awful some of this stuff is. Things that will never (to within probability of epsilon) work. I just don't know if it's ignorance (with unlimited resources) or malice (with unlimited malice). Either way, it engenders great anger.
@whitequark Are the source IPs from China? Have seen people assert their "new" way to block sites in their country is to randomize the IP addresses that are returned.
@whitequark like what bottlenecks?
@solonovamax i send an S3 request to Wasabi every time there's a cache miss, including for domains that have never been served by grebedoc. if i'm getting, say, 100k requests in a row to 100k domains i've never seen, these S3 calls really start to plug up the worker process. i still have good latencies overall, but only on most of the waves, not every single one anymore
wonder at which point having the domain list cached on each node starts being an optimisation. i assume it'd be more resource usage than it's worth right now

@avioletheart.me @solonovamax the current plan is:

  • make a bloom filter out of the domain list (strictly speaking i could use a hashset here but i think using a bloom filter is basically as much work and should serve even very large instances)
  • put a marker object with the timestamp of the last 'significant update'
  • fetch this marker object once per 60s and update the bloom filter if it got stale
  • 'significant update' ideally would be 'a new domain was added' but given the (lack of) directory structure that s3 has i'm not sure that's feasible, it might just have to be 'a new site was added' which is basically as good. i would like to avoid refreshing the list on every site update, that seems like it'll cause a lot of traffic on big instances
what is the difference between a new domain and a new site being added?
@avioletheart.me @solonovamax a site is like domain.tld or domain.tld/directory, you know like github lets you have a main site and project-specific ones on the same domain

@whitequark @avioletheart.me please use cuckoo or xor filters

they're (to my understanding) simply better than bloom filters

@solonovamax @avioletheart.me is there a go library i should use

@whitequark @avioletheart.me it seems like there's packages for cuckoo, xor, ribbon, and binary fuse filters

I'm not sure which out of all of those is best, however afaik any of those will be significantly better than bloom filters.

though idk if those packages are good quality or not, but realistically making your own implementation is relatively straightforward as filters are not really doing a lot. they're just a bit of math and then some lookups into a big byte array

@solonovamax @avioletheart.me the amount of verification effort i want to spend on this is very little so if i have to do more work than "vet a library" i'll just use a hashset instead
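for comparison, the hashset fallback mentioned above is only a few lines of Go — a map behind an RWMutex so worker goroutines can check concurrently while a refresh swaps the whole set (illustrative, not git-pages code):

```go
package main

import (
	"fmt"
	"sync"
)

// domainSet is an exact-membership alternative to a probabilistic filter:
// more memory per domain, but zero false positives and trivial to swap
// atomically on each refresh.
type domainSet struct {
	mu  sync.RWMutex
	set map[string]struct{}
}

// replace swaps in a freshly fetched domain list.
func (d *domainSet) replace(domains []string) {
	next := make(map[string]struct{}, len(domains))
	for _, dom := range domains {
		next[dom] = struct{}{}
	}
	d.mu.Lock()
	d.set = next
	d.mu.Unlock()
}

func (d *domainSet) contains(domain string) bool {
	d.mu.RLock()
	defer d.mu.RUnlock()
	_, ok := d.set[domain]
	return ok
}

func main() {
	var ds domainSet
	ds.replace([]string{"docs.example.org", "blog.example.net"})
	fmt.Println(ds.contains("docs.example.org")) // true
	fmt.Println(ds.contains("unknown.example"))  // false
}
```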

@whitequark @avioletheart.me that's fair

libraries exist for basically all of them, and then just searching for a quick comparison between like cuckoo, xor, and (maybe) binary fuse filters should probably give you a good enough idea of which one to choose that you don't need to think about it further

@whitequark @avioletheart.me though also you don't have to worry about there being any AI in most of these, as they were written like several years ago lol
@whitequark that's... insane, wtf
@solonovamax i think someone is trying to probe grebedoc trying to figure out which of the top-N domains it's serving. instead of you know. using fuckin dns
@whitequark @solonovamax yecch part of me wants to flail at it with a bloom filter but the rest resents [expansive gesture] externality

@garthk @solonovamax i knew what i was getting into, that's why git-pages has so many layers of defense woven into it from the start

i just hadn't expected people to send millions of requests to domains that don't even resolve to grebedoc