one of the reasons it's useful that git-pages is so efficient storage-wise is that i don't actually need to worry about what people are uploading. no reasonable-seeming use of the website generates problematic amounts of resource use, so i don't have to keep an eye on it. it's Fine.
someone uploaded two copies of a site containing Every Manpage, which is half a gigabyte (each)? it only costs 36 MB to store it (total). it's Fine.
@whitequark is this compression or just file-level deduplication (or maybe block-level?)
@solonovamax compression and file-level deduplication across the whole service instance
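file-level deduplication across a whole instance is commonly done with a content-addressed store: each file is keyed by a hash of its bytes, so a second upload of an identical file costs nothing extra. a minimal sketch of the idea (`BlobStore` and its methods are made up for illustration, this is not git-pages' actual code, and compression is omitted):

```python
import hashlib

class BlobStore:
    """Toy content-addressed store: identical files share one copy."""

    def __init__(self):
        self.blobs = {}  # digest -> raw bytes (real stores would compress)

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # identical content hashes to the same key, so storing the
        # same file twice keeps only one copy
        self.blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]

store = BlobStore()
a = store.put(b"every manpage, copy 1")
b = store.put(b"every manpage, copy 1")  # same bytes, same digest
assert a == b and len(store.blobs) == 1
```

the two copies of the manpage site collapse this way: every file in the second copy hashes to a key that already exists.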
@solonovamax block-level deduplication is something that's on the table but doesn't seem necessary from the data so far
@whitequark block level deduplication just also sounds annoying to do
@solonovamax @whitequark I'm working on something using block level deduplication and I can confirm it's annoying.
@solonovamax it's mostly just that figuring out where the block boundaries are is a real pain; integrating with the compressor pays off, but i'm unfamiliar with the domain
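the usual answer to "where do the block boundaries go" is content-defined chunking: a rolling hash over the data declares a boundary whenever its low bits are zero, so boundaries follow the content rather than fixed offsets and mostly survive insertions and deletions. a toy gear-hash sketch (the parameters and names are illustrative, not any particular tool's actual chunker):

```python
import random

random.seed(0)
GEAR = [random.getrandbits(32) for _ in range(256)]  # random value per byte
MASK = (1 << 6) - 1  # expect a boundary every ~64 bytes (tiny, for demo)

def chunks(data: bytes, min_size: int = 16, max_size: int = 256):
    start, h = 0, 0
    for i, byte in enumerate(data):
        # shift-and-add gear hash: only the most recent bytes influence h
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

data = bytes(random.getrandbits(8) for _ in range(4096))
parts = list(chunks(data))
assert b"".join(parts) == data           # chunking is lossless
assert all(len(p) <= 256 for p in parts)
```

dedup then works by hashing each chunk and storing chunks content-addressed; an insertion early in a file only disturbs the chunks around the edit, so the rest still dedups against the old version.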