grebedoc had its highest share yet of garbage requests yesterday (a wave peaking at 150 req/sec)
these waves are getting bigger and bigger, which is somewhat concerning. it's nowhere near the hardware capacity yet but i'm hitting some software bottlenecks that i never thought would be relevant
@whitequark like what bottlenecks?
@solonovamax i send an S3 request to Wasabi every time there's a cache miss, including for domains that have never been served by grebedoc. if i get, say, 100k requests in a row to 100k domains i've never seen, these start to really plug up the worker process. i still have good latencies on most of the waves, but not on every single one anymore
wonder at which point having the domain list cached on each node starts being an optimisation. i assume it'd be more resource usage than it's worth right now

@avioletheart.me @solonovamax the current plan is:

  • make a bloom filter out of the domain list (strictly speaking i could use a hashset here but i think using a bloom filter is basically as much work and should serve even very large instances)
  • put a marker object with the timestamp of the last 'significant update'
  • fetch this marker object once per 60s and update the bloom filter if it got stale
  • 'significant update' ideally would be 'a new domain was added' but given the (lack of) directory structure that s3 has i'm not sure that's feasible, it might just have to be 'a new site was added' which is basically as good. i would like to avoid refreshing the list on every site update, that seems like it'll cause a lot of traffic on big instances

@whitequark @avioletheart.me please use cuckoo or xor filters

they're (to my understanding) simply better than bloom filters

@solonovamax @avioletheart.me is there a go library i should use

@whitequark @avioletheart.me it seems like there are packages for cuckoo, xor, ribbon, and binary fuse filters

I'm not sure which of those is best; afaik, though, any of them will be significantly better than bloom filters.

idk if those packages are good quality or not, but realistically making your own implementation is relatively straightforward, as filters are not really doing a lot. they're just a bit of math and then some lookups into a big byte array

@solonovamax @avioletheart.me the amount of verification effort i want to spend on this is very little so if i have to do more work than "vet a library" i'll just use a hashset instead

@whitequark @avioletheart.me that's fair

libraries exist for basically all of them, and a quick search comparing cuckoo, xor, and (maybe) binary fuse filters should give you a good enough idea of which one to choose that you don't need to think about it further

@whitequark @avioletheart.me though also you don't have to worry about there being any AI in most of these, as they were written like several years ago lol