Anna's Archive backed up Spotify. They got 99.9% of the metadata, plus 300TB of music representing 86 million tracks - original 160kbps OGG files for tracks with popularity > 0, and re-encoded 75kbps files for popularity = 0. absolutely wild project.

the metadata in particular is a hugely useful data source. MusicBrainz catalogues 5 million unique ISRCs (like ISBNs but for sound recordings), whereas this archive has a whopping 186 million.

https://annas-archive.li/blog/backing-up-spotify.html

Backing up Spotify

We backed up Spotify (metadata and music files). It’s distributed in bulk torrents (~300TB). It’s the world’s first “preservation archive” for music which is fully open (meaning it can easily be mirrored by anyone with enough disk space), with 86 million music files, representing around 99.6% of listens.

this solves a major problem I ran into when writing automation tools for maintaining my own music library: the metadata sources are missing so much stuff, so you generally end up needing to query the Spotify API, which isn't sustainable. even with it inherently being a snapshot in time, archived metadata solves a ton of headaches.
@gsuberland damn 300tb is like. An amount you could feasibly just store at home

@halcy @gsuberland it is, if you have 10k-15k to spare. Plus you'd have to not shy away from an extra 200 to 300W of home heating, 24/7.

It’s not that bad; that money could also get you 256GB of the latest RAM and a top-tier GPU. Or 2 top-spec MBPs. OK, it is bad.

@friedrich @halcy with recertified enterprise disks and second hand HBAs you could store this for a lot less than 10k.
@gsuberland @friedrich @halcy never underestimate the power of walking into a Micro Center and looking for sales on the cheapest, slowest Barracudas; you can generally get the 4TB ones for around $50.
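a rough back-of-the-envelope sketch of those numbers, taking the thread's own figures as assumptions ($50 per 4TB drive from the post above, and a ballpark ~5W idle per spinning disk, which is an assumption, not a measured value):

```shell
# assumptions: 300 TB archive, $50 per 4 TB drive, ~5 W idle per disk
drives=$(( (300 + 3) / 4 ))   # ceil(300/4) = 75 drives
cost=$(( drives * 50 ))       # raw disk cost in dollars
idle_w=$(( drives * 5 ))      # rough idle power draw in watts
echo "$drives drives, \$$cost, ~${idle_w} W idle"
```

which lands well under the 10k-15k figure upthread, for disks alone - controllers, chassis, and power supplies not included.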
@raptor85 @friedrich @halcy those are typically SMR, which is terrible for RAID due to write amplification. you're better off getting recertified enterprise drives for that reason; they're all CMR.
@gsuberland @raptor85 @friedrich @halcy if the goal is mass ammounts of cheap storage, I wouldn't bother with raid, presumably because other people also have copies, you could just re-download stuff if a drive failed
@ignaloidas @halcy @raptor85 @friedrich yes, depending on how the torrents are organised you could indeed have one drive per chunk and do it that way.
@gsuberland @halcy @raptor85 @friedrich I don't see why you couldn't, unless the torrents contain massive 1TB files; otherwise you can do partial downloads even on a single large torrent
@ignaloidas @halcy @gsuberland @friedrich just set the disks up as a good ol' JBOD; the torrent doesn't need to be broken up then, and with no striping, if a drive fails you only need to re-sync the files that were on it after replacing it.
@raptor85 @halcy @gsuberland @friedrich if you have a separate fs on each of them (and IMO you should, to isolate failures), then you kinda have to split up where the parts of the torrent are stored

@ignaloidas @halcy @gsuberland @friedrich a linear volume w/ ext4 cleans up pretty painlessly if a drive in the array fails. the nice part about that setup is you can just dump the torrent onto it and not care; since the torrent's directory structure stays intact, if files go missing or get corrupted you can just resume the download on that torrent again.
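a minimal sketch of that linear-volume setup using LVM (one possible way to do it; the device names, volume group name, and mount point are placeholders, not anything from the thread):

```shell
# Pool three disks into one linear (non-striped) volume with ext4 on top.
# Requires root; /dev/sd{a,b,c} are placeholder device names.
pvcreate /dev/sda /dev/sdb /dev/sdc
vgcreate archive /dev/sda /dev/sdb /dev/sdc
lvcreate -l 100%FREE -n torrents archive   # lvcreate defaults to linear allocation
mkfs.ext4 /dev/archive/torrents
mount /dev/archive/torrents /mnt/archive
```

because the allocation is linear rather than striped, each file lives on (mostly) one physical disk, which is what makes a single-drive failure recoverable by re-downloading just the affected files.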

Is it the cleanest, safest way to do things? No, but it's cheap AF and easy, and failures should be rare since you shouldn't really be writing to it after the initial fill

@raptor85 @halcy @gsuberland @friedrich on one hand, it's probably fine; on the other, I really wouldn't want any chance that a dead drive results in dead/corrupted directory structures for the data on the other drives, and without really good knowledge of the filesystem I'd be using, I wouldn't go this way