one of the reasons it's useful that git-pages is so efficient storage-wise is that i don't actually need to worry about what people are uploading. there are no cases where reasonable-seeming use of the website would generate problematic amounts of resource use, so i don't have to watch over resource use. it's Fine.
someone uploaded two copies of a site containing Every Manpage, which is half a gigabyte (each)? it only costs 36 MB to store it (total). it's Fine.
by designing the whole thing with scale and efficiency in mind, i can provide a hyperscale-style service without having hyperscale infrastructure, and i think that's a worthy achievement

an amusing outcome of this is that it is almost too efficient. i examined one site because it showed up at the top of the list of biggest sites; it turned out to contain, by accident i assume, an entire copy of an ESP-IDF build directory full of object files

git-pages is efficient enough that you don't usually notice these things the way you normally would, by something getting slow and prompting you to investigate

@whitequark i love a good efficiently-designed piece of software

@whitequark As someone who's done a bit of scale here and there over the decades, I can definitely appreciate it. It's nicely done.

I also like this project because maybe it means I can finally let go of this long-neglected project: https://joshisanerd.com/projects/undertaker/ (tl;dr: adds tooling to make it a single command to generate the site, with a static HTTP git repo, and then scp/sftp/whatever-s it up. I don't think I've sent this your way before, apologies if I have.)


@AJ9BM ah yeah definitely! also re: archivability, git-pages lets you download the entire thing from the /.git-pages/archive.tar endpoint (although not in all cases because of concerns about enumeration)
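A minimal sketch of pulling that archive with the Python standard library. Only the `/.git-pages/archive.tar` path comes from the post above; the helper names are made up for illustration, and whether a given site actually serves the archive (or needs authentication) is not something this sketch knows.

```python
import io
import tarfile
import urllib.request


def archive_url(site_url: str) -> str:
    # Append the endpoint path mentioned in the thread to a site's base URL.
    return site_url.rstrip("/") + "/.git-pages/archive.tar"


def fetch_archive(site_url: str) -> bytes:
    # Plain GET; a site that does not expose the archive will raise HTTPError.
    with urllib.request.urlopen(archive_url(site_url)) as resp:
        return resp.read()


def list_members(tar_bytes: bytes) -> list[str]:
    # Open the downloaded tarball from memory and return the paths inside it.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
        return tar.getnames()
```

Usage would be something like `list_members(fetch_archive("https://example.org"))`, with the placeholder swapped for a real git-pages site.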
@whitequark is this compression or just file-level deduplication (or maybe block-level?)
@solonovamax compression and file-level deduplication across the whole service instance
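To make "compression and file-level deduplication across the whole instance" concrete, here is a toy sketch. It is not how git-pages is implemented; `BlobStore`, the choice of SHA-256 and zlib, and the in-memory dicts are all assumptions picked for brevity. The idea is just that each distinct file content is compressed and stored once, keyed by its hash, and every site is a manifest of paths pointing at those hashes.

```python
import hashlib
import zlib


class BlobStore:
    """Hypothetical in-memory store: one compressed blob per distinct file content."""

    def __init__(self):
        self.blobs = {}  # sha256 hex digest -> zlib-compressed bytes
        self.sites = {}  # site name -> {path: digest}

    def put_site(self, name, files):
        # files maps path -> raw bytes; identical content is stored only once,
        # no matter which site or path it arrives under.
        manifest = {}
        for path, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = zlib.compress(content)
            manifest[path] = digest
        self.sites[name] = manifest

    def read(self, name, path):
        # Look up the digest in the site's manifest and decompress the blob.
        return zlib.decompress(self.blobs[self.sites[name][path]])

    def stored_bytes(self):
        # Total compressed size actually held, across all sites.
        return sum(len(blob) for blob in self.blobs.values())
```

With a store like this, uploading a second copy of the same site only adds a manifest; `stored_bytes()` does not change, which is the effect described for the two manpage sites.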
@solonovamax block-level deduplication is something that's on the table but doesn't seem necessary from the data so far
@whitequark block level deduplication just also sounds annoying to do
@solonovamax @whitequark I'm working on something using block level deduplication and I can confirm it's annoying.
@solonovamax it's mostly just that figuring out where the block boundaries are is a real pain; integration with the compressor pays off but i'm unfamiliar with the domain
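For a sense of what "figuring out where the block boundaries are" involves, one common approach is content-defined chunking: run a rolling hash over the bytes and cut wherever its low bits are zero, so boundaries follow the content rather than fixed offsets. The sketch below uses a Gear-style hash with made-up parameters; nothing in it is taken from git-pages or from the project mentioned in the replies, and it leaves out the minimum and maximum chunk sizes that practical chunkers usually enforce.

```python
import random

# Hypothetical parameters for illustration only.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]  # one random 64-bit value per byte value
MASK = (1 << 10) - 1  # cut when the low 10 bits are zero: about 1 KiB per chunk on average
U64 = (1 << 64) - 1


def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries using a Gear-style rolling hash."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Shifting left by one per byte means the low 10 bits of h depend
        # only on the 10 most recent bytes, so boundaries are purely local.
        h = ((h << 1) + GEAR[byte]) & U64
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks
```

Because each boundary depends only on a short window of preceding bytes, inserting a byte near the start of a file shifts later boundaries along with the content, and almost all chunks stay byte-identical and deduplicate. Fixed-size blocks would all change after such an insertion, which is a large part of why boundary selection is the painful bit.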