Anna's Archive backed up Spotify. They got 99.9% of metadata, and 300TB of music representing 86 million tracks - original 160kbps OGG for tracks with popularity>0, and re-encoded 75kbps for popularity=0. absolutely wild project.

the metadata in particular is a hugely useful data source. MusicBrainz catalogues 5 million unique ISRCs (like ISBNs but for music releases), whereas this archive has a whopping 186 million.

https://annas-archive.li/blog/backing-up-spotify.html

Backing up Spotify

We backed up Spotify (metadata and music files). It’s distributed in bulk torrents (~300TB). It’s the world’s first “preservation archive” for music which is fully open (meaning it can easily be mirrored by anyone with enough disk space), with 86 million music files, representing around 99.6% of listens.

this solves a major problem I ran into when writing automation tools for maintaining my own music library: the metadata sources are missing so much stuff, so you generally end up needing to query the Spotify API, which isn't sustainable. even with it inherently being a snapshot in time, archived metadata solves a ton of headaches.
this massively escaped containment so imma mute now
@gsuberland damn 300tb is like. An amount you could feasibly just store at home

@halcy @gsuberland it is, if you have 10k-15k in spare money. Plus don’t shy away from an extra 200 to 300W of home heating 24/7.

It’s not that bad, that could also just get you 256GB of the latest RAM and a top-tier GPU. Or 2 top-spec MBPs. It is bad.

@friedrich @halcy with recertified enterprise disks and second hand HBAs you could store this for a lot less than 10k.
@gsuberland @friedrich @halcy never underestimate the power of walking into a microcenter and looking for sales on the cheapest, slowest barracudas, you can generally get the 4tb ones around $50.
@raptor85 @friedrich @halcy those are typically SMR which is terrible for RAID due to write amplification. you're better off getting recert enterprise drives for that reason; they're all CMR.
@gsuberland @friedrich @halcy for <$50 per drive and at worst a lifespan of a decade even under moderate loads it barely matters, especially for something like this that's going to be mostly write-once then read only at a moderate rate for 1-2 users you'll get WAY more bang for your buck just going as cheap as possible. Don't forget the second version of the RAID acronym...redundant array of INEXPENSIVE disks
@raptor85 @friedrich @halcy keep in mind I'm speaking from experience of running a 48TB NAS at home here. enough 4TB disks to reach 300TB+ is going to be an absolute horror to stack, especially if they're SMR.

@gsuberland @friedrich @halcy oh, I've done worse.

I also have some pretty large clusters of SMR disks that have been running for ages, unless you're doing some intensive tasks like using it to record 4k video streams from a security system you really barely see any reliability differences. I use enterprise disks when building large clusters for clients but tbh most of the difference you get in those for the price is the warranty, plus who am i to judge if they want to pay me more.

@gsuberland @friedrich @halcy CMR drives are fantastic, don't get me wrong, but the cost difference doesn't really amount to any benefit if you don't need to constantly be writing large amounts of data to them.
@gsuberland @raptor85 @friedrich @halcy oh yes you definitely don't want to go with drives smaller than ~12TB for such a project even if your power is cheap ^^'
SMR could be ok-ish if you do archiving with something that isn't a RAID. With new 18TB drives we are talking about 6k€ in drives if bought new and about half for used/recert drives.
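
For a rough sanity check of the drive counts being thrown around here, a minimal sketch. It assumes decimal TB, no redundancy, and no filesystem overhead; the helper names are made up for illustration, and the $50 figure is the 4TB price quoted earlier in the thread.

```python
import math

ARCHIVE_TB = 300  # approximate archive size from the blog post


def drives_needed(drive_tb, archive_tb=ARCHIVE_TB):
    """Whole drives needed to hold the archive, with no redundancy."""
    return math.ceil(archive_tb / drive_tb)


def total_cost(drive_tb, price_per_drive, archive_tb=ARCHIVE_TB):
    """Cost of the bare drives only (no HBAs, cases, or power)."""
    return drives_needed(drive_tb, archive_tb) * price_per_drive


print(drives_needed(4), total_cost(4, 50))  # 4TB drives at $50 each
print(drives_needed(18))                    # 18TB drives
```

That works out to 75 of the 4TB drives versus 17 of the 18TB ones, which is the real argument for bigger drives: the per-drive price matters less than the number of ports and bays you need.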
@raptor85 @gsuberland @friedrich @halcy Per gigabyte, the currently cheapest disks are 8-12 TB. Prices have gone up, but a little over a year ago I got a bunch of “refurbished” (really just erased) 12 TB Ultrastars for about $85 each. Helium-filled (quieter, cooler, and more reliable than air-filled), CMR, SATA (SAS are slightly cheaper because fewer people can use them).
@gsuberland @raptor85 @friedrich @halcy if the goal is mass amounts of cheap storage, I wouldn't bother with raid. presumably other people also have copies, so you could just re-download stuff if a drive failed
@ignaloidas @halcy @raptor85 @friedrich yes, depending on how the torrents are organised you could indeed have one drive per chunk and do it that way.
@gsuberland @halcy @raptor85 @friedrich I don't see how you couldn't unless the torrents contain massive 1tb files, because you can do partial downloads even on a single large torrent otherwise
@ignaloidas @halcy @gsuberland @friedrich just set the disks up as a good ol jbod, torrent doesn't need to be broken up then, and no striping so if a drive fails you only need to re-sync files that were on that part after replacing it.
@raptor85 @halcy @gsuberland @friedrich if you have a separate fs on each of them (and IMO you should to isolate the failures), then you kinda have to split up where parts of the torrent are stored
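
A minimal sketch of what that splitting could look like, assuming you already have a list of (path, size) pairs from the torrent and every drive has the same usable capacity. This is a plain greedy first-fit-decreasing packing, not something a torrent client does for you, and the function name and data layout are hypothetical.

```python
def assign_files_to_drives(files, drive_capacity):
    """Greedy first-fit-decreasing packing.

    files: list of (path, size) pairs; drive_capacity uses the same unit.
    Returns a list of drives, each a dict with "free" space and "files".
    """
    drives = []
    # Place the largest files first so they are not left stranded.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if size > drive_capacity:
            raise ValueError(f"{path} does not fit on a single drive")
        for drive in drives:
            if drive["free"] >= size:
                break  # first drive with enough room wins
        else:
            # No existing drive has room, so start a new one.
            drive = {"free": drive_capacity, "files": []}
            drives.append(drive)
        drive["files"].append(path)
        drive["free"] -= size
    return drives
```

Each resulting group could then be pointed at its own mount, so a dead drive only costs you the files assigned to it.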

@ignaloidas @halcy @gsuberland @friedrich a linear volume w/ ext4 cleans up pretty painlessly if a drive in the array fails, the nice part about that setup is you can just dump the torrent to it and not care, and since you can keep the entire torrent structure intact if files go missing or get corrupted you can just resume download on that torrent again.

Is it the cleanest, safest way to do things? No, but it's cheap AF and easy, and failures should be rare since you shouldn't really be writing

@raptor85 @halcy @gsuberland @friedrich on one hand, it's probably fine, on another, I really wouldn't want any chances that a dead drive results in dead/corrupted directory structures of data in other drives, and without having really good knowledge about the filesystem I'd be using, I wouldn't go this way
@ignaloidas @halcy @gsuberland @friedrich yeah, honestly the drives and array setup unless you're using expensive 20+ TB drives are less of a problem than physically fitting them, I have some nice 8 port pcie SATA cards that I use for some of my stuff but even on risers having 10 of those cards in a case is, at the very least, a wiring nightmare, lol
@gsuberland @raptor85 @friedrich @halcy second this. You can get HGST 4TB used enterprise drives right now for $50. I have a raid5 of a bunch of 2TB ones which has given me very little trouble .. except for the one 2TB barracuda that a relative gave me that I retired when it started seeing very high seek error rates.
@raptor85 @gsuberland @friedrich @halcy well, if you are into enterprise pirating, hp 3par is the way to go
@gsuberland @halcy at 300W you can rule out 4TB disks here. I was talking about a system with 20TB plus per drive. Just saying that with new HDDs at somewhere below 25 EUR/TB, plus a prebuilt 16-bay NAS and some spare/error-correction drives, 300 TB is in the price range of a single high-end gaming system/workstation these days. Mostly due to the sad state of RAM, but it's also in the price range of a 70s to 90s high-end PC. So not out of reach for domestic setups.
@gsuberland @halcy You can certainly get the cost down significantly by shucking external drives and buying used server disks; I wanted to give a realistic price for new hardware as a ballpark.
@friedrich @gsuberland @halcy 20+ TB drives still have significantly higher failure rates according to the last report I read though, something to consider. 16TB drives were more stable.
@friedrich @halcy @gsuberland It's about half that unless you want to use RAID. And the important thing is that if you have like 100 people serving the content in their home servers, it's feasible for them to have all of it.
Open Source Storage Server: 60 Hard Drives 480TB Storage

Get the details on Storage Pod 6.0 and learn how to build your own data storage server.

Backblaze Blog | Cloud Storage & Cloud Backup

@gsuberland this is a *super* interesting dataset

I'm also a bit surprised that you often encounter missing metadata in MusicBrainz, since it already often feels super overwhelmingly comprehensive to me, to the point that it's often hard to search. but then Spotify contains 15x more songs (at least?). and there's a bunch of stuff not on Spotify!

@operand @gsuberland I have at least three albums that are not in the MusicBrainz database, and I don’t even have that many albums, maybe 30.
@gsuberland
Not to mention, MusicBrainz is contaminated with a lot of input from the wrong end of the Bell Curve…
@gsuberland This site can’t be reached
The connection was reset.
Try:
• Checking the connection
• Checking the proxy and the firewall
• Running Windows Network Diagnostics
ERR_CONNECTION_RESET
@gsuberland Fuck. I'm gonna need a bigger NAS.
@gsuberland lol apparently i need to read that blog over tor now in nl
@dysfun just switch your dns to cloudflare or google
@dysfun @gsuberland (Not if you add 'de.' as a subdomain of the domain, because lmao)
@joepie91 not working (anymore?) on German O2 connection
@luc Germany also blocks the site?
@joepie91 indeed, they inject a redirect to the screenshotted page (idk if with dns or IP interception, could check if you like)
@gsuberland will there also be a port of the spotify app? xD
@gsuberland
Christ, the RIAA is going to have a pink fit!
@gsuberland Very cool. Very, very cool. Hats off 🎩
@gsuberland is this legal?
@cubeofcheese @gsuberland illegal only if you don't use it to train ai
@cubeofcheese better question: is the cause noble?
@cubeofcheese @gsuberland of course not. I think it will end really badly for them
@hiphopheaven @cubeofcheese given the rest of what they do I hardly think this is a meaningful escalation.
@gsuberland Just say it's for LLM training, that'll allow them to skirt the law.

@gsuberland This is so cool.

Anna's Archive, like The Internet Archive, is doing Good Things to archive the internet, just slightly differently. While TIA builds their own infra and storage, AA stores the collected archives in a distributed way using torrents.

So if you have some spare GBs/TBs of drive space and a headless torrent node (I use #deluge, it's quite awesome), go to https://annas-archive.org/torrents and seed some low-seed torrents. Their selector even right-sizes the torrents to fit your free space.

Torrents - Anna’s Archive

The world’s largest open-source open-data library. Mirrors Sci-Hub, Library Genesis, Z-Library, and more.

@tezoatlipoca @gsuberland as the site seems to be offline due to a copyright problem, here is a backup.
https://archive.ph/HQP3T

Maybe add the link to the original post in addition.

(Edit: Not down, but DNS blocked in Germany (Telekom))
@gsuberland not complaining but like... doesn't Spotify also have lossless flacs of tracks too? Or is it only OGGs?
@Starcross @gsuberland isn’t that a much newer feature that is still not widely available?
@Starcross read the blog post, they explain why they chose to archive the OGGs.