Backing up Spotify
112 points by Aks
Scraping Spotify is based but I also disagree with every point in their reasoning.
Over-focus on the highest possible quality. Since these are created by audiophiles with high end equipment and fans of a particular artist, they chase the highest possible file quality (e.g. lossless FLAC).
Yes, that's great for archival.
This inflates the file size and makes it hard to keep a full archive of all music that humanity has ever produced.
No, you can just transcode media. I believe private trackers incentivize creating torrents for both a lossless and a "decent" lossy version of an album too, for instance.
No authoritative list of torrents aiming to represent all music ever produced. An equivalent of our book torrent list (which aggregate torrents from LibGen, Sci-Hub, Z-Lib, and many more) does not exist for music.
Private trackers? I do agree that they're not perfect for preservation: they're, well, private, and some rules are ridiculously strict (why the hell do you insist people use their home IPs on a website dedicated to committing felonies?). On the other hand, I've heard the incentives for long-term seeding are good... but I doubt that's necessary [1, 2].
Still - I've (sadly) used Spotify enough to see the tail of badly tagged albums, missing tracks, etc. I would absolutely not call a Spotify scrape authoritative.
Over-focus on the most popular artists. There is a long tail of music which only gets preserved when a single person cares enough to share it.
...is also hilarious considering how many artists I know aren't on Spotify, with even pretty popular ones removing their music from the platform [1, 2, 3]. Just yesterday I noticed that half of the tracks from an EP I used to listen to a lot are missing.
I also don't really see the issue with the "only gets preserved when a single person cares enough to share it" part. That's how most pirate libraries function, and they're doing great. It can be spun as a positive - there's a lot of AI slop on Spotify nowadays, and the "someone has to care about it" step filters it out. Not that it's a high bar either. I've heard of people buying obscure records and ripping them just to get some buffer on a private tracker.
For popularity>0, we got close to all tracks on the platform. The quality is the original OGG Vorbis at 160kbit/s. [...] For popularity=0, we got files representing about half the number of listens (either original or a copy with the same ISRC). The audio is reencoded to OGG Opus at 75kbit/s
Jesus. So we're re-encoding from one lossy codec to another? That doesn't strike me as a great idea.
For now this is a torrents-only archive aimed at preservation, but if there is enough interest, we could add downloading of individual files to Anna’s Archive. Please let us know if you’d like this.
Please don't. The last thing we need is RIAA on your ass. Anna's Archive is absolutely amazing for ebook piracy, but we already have much better alternatives available for music (that are just as easily accessible, but superior in both quality and available material).
This makes me think of the "free lending" blunder archive.org made around the lockdowns (IIRC). It put them in a lot of unnecessary legal trouble, threatening their other, IMO much more valuable, endeavors. Obviously the situation here is a bit different but I'm still uneasy about this whole thing.
It's surprising they've gone with Spotify, since I think major services like Deezer, Tidal, and Qobuz all have fairly large collections that can be ripped as well and they're FLAC.
It's a bit funny how private trackers have become bastions of media archival. I'm increasingly convinced that the copyright illusion must crumble at some point and get reformed to adapt to the digital world.
I've heard it claimed that the major private tracker for movies is actually commonly used by people in Hollywood. For example, if a prop creator needs to print out reference screenshots, they'll reach for a DRM-free bluray rip. If you need access to a giant library of every movie ever made, there aren't any better options that I'm aware of.
The private tracker best known for audiobooks has a surprisingly deep collection, but I wonder how it compares to the full Audible collection. I think you're correct that generally if there's even the slightest bit of interest in an audiobook it'll get requested and ripped by some contributor, so the stuff that doesn't get archived tends to be fairly niche or unpopular. With that being said, I think some academic-related content exists which hasn't made it onto the Internet. For example: I've found conflicting information about whether an audiobook recording of The Road to Reality by Roger Penrose was ever made, but all the places that offer to sell it only have shadow listings.
the stuff that doesn't get archived tends to be fairly niche or unpopular
I do think niche and/or unpopular stuff is worth preserving too, FWIW. I also think it deserves better than a crappy transcode :)
edit: Wait. Audiobooks! I hadn't even thought of that. It would be really interesting if Anna's Archive branched out into them. Maybe the Spotify release is meant to be a warm-up of sorts?
It's a bit funny how private trackers have become bastions of media archival. I'm increasingly convinced that the copyright illusion must crumble at some point and get reformed to adapt to the digital world.
In theory, this is what the AI companies are doing. I don't think artists are liking that though.
I sort of agree with all your points, but I think there are good reasons for what they're doing, mostly involving hard tradeoffs.
Over-focus on the highest possible quality. Since these are created by audiophiles with high end equipment and fans of a particular artist, they chase the highest possible file quality (e.g. lossless FLAC).
Yes, that's great for archival.
It is great for archival, but it'd also 10x the size of their archive. Given their budgetary constraints, they're prioritizing storing a very large number of decent quality cultural artifacts, and a very very large number of unpopular low quality cultural artifacts, over storing a smaller number of very high quality artifacts.
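To put rough numbers on that tradeoff, here's a back-of-envelope sketch. The 160 and 75 kbit/s figures are from the post; the 4-minute track length and the ~900 kbit/s average for CD-quality FLAC are my own assumptions (real FLAC bitrates vary a lot with the material), and container overhead is ignored.

```python
def audio_size_bytes(bitrate_kbit_s: float, duration_s: float) -> float:
    """Approximate stream size in bytes: bitrate times duration, no container overhead."""
    return bitrate_kbit_s * 1000 / 8 * duration_s


TRACK_SECONDS = 240  # assumption: a 4-minute track
FLAC_KBIT_S = 900    # assumption: rough average for CD-quality FLAC

vorbis = audio_size_bytes(160, TRACK_SECONDS)        # popularity>0 tier in the post
opus = audio_size_bytes(75, TRACK_SECONDS)           # popularity=0 tier in the post
flac = audio_size_bytes(FLAC_KBIT_S, TRACK_SECONDS)  # hypothetical lossless copy

print(f"Vorbis 160 kbit/s: {vorbis / 1e6:.2f} MB")
print(f"Opus 75 kbit/s:    {opus / 1e6:.2f} MB")
print(f"FLAC ~900 kbit/s:  {flac / 1e6:.2f} MB")
print(f"FLAC vs Vorbis: {flac / vorbis:.1f}x, FLAC vs Opus: {flac / opus:.1f}x")
```

Under those assumptions lossless comes out at roughly 6x the Vorbis tier and 12x the Opus tier, so "10x" looks like the right ballpark for the archive as a whole.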
Private trackers?
Even Redacted has a bit under 3 million releases. I do think getting everything from there would be amazing, and it'd be a good complement to get the stuff missing from Spotify, but they want to get stuff that isn't well preserved, and a lot of stuff on private trackers is actually pretty well preserved. I wonder if metadata regarding the total size of it all is available somehow?
I wonder if metadata regarding the total size of it all is available somehow?
The sidebar should list some of these statistics at the very least (unless they removed that feature from their Gazelle fork).
Jesus. So we're re-encoding from one lossy codec to another? That doesn't strike me as a great idea.
I doubt that 75kbit Opus ripped from 160kbit Vorbis sounds much different from 75kbit Opus ripped from CD/FLAC. Once you've made the decision to store the low-demand stuff with lower file-size, compressing whatever you have to Opus is about the best you can do.
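For anyone curious what that step looks like in practice, here's a minimal sketch of a Vorbis-to-Opus transcode via ffmpeg. This is not their actual pipeline (the post doesn't describe it); it assumes an ffmpeg build with the libopus encoder, and the file names and the `opus_transcode_cmd` helper are made up for illustration.

```python
import shlex
from pathlib import Path


def opus_transcode_cmd(src: Path, dst: Path, bitrate_kbit_s: int = 75) -> list[str]:
    """Build an ffmpeg command that re-encodes the audio of src to Opus.

    The default of 75 kbit/s matches the figure quoted in the post; it is a
    target average bitrate, not a guarantee about the output.
    """
    return [
        "ffmpeg",
        "-i", str(src),               # input, e.g. the 160 kbit/s Ogg Vorbis file
        "-c:a", "libopus",            # encode audio with libopus
        "-b:a", f"{bitrate_kbit_s}k", # target audio bitrate
        str(dst),
    ]


# Hypothetical file names; this only prints the command instead of running it.
cmd = opus_transcode_cmd(Path("track.ogg"), Path("track.opus"))
print(shlex.join(cmd))
```

To actually run it you'd pass `cmd` to `subprocess.run(cmd, check=True)`. The same command works with a FLAC input, which is the comparison I'm talking about above.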
I doubt that 75kbit Opus ripped from 160kbit Vorbis sounds much different from 75kbit Opus ripped from CD/FLAC.
True, insofar as it'll accumulate the behaviour of both lossy codecs at the extreme. 160 kbit/s Vorbis is generally considered transparent whereas 75 kbit/s Opus is definitely not.
Once you've made the decision to store the low-demand stuff with lower file-size, compressing whatever you have to Opus is about the best you can do.
Best we can do for now.
I, for one, have been around long enough to remember when a different codec and bitrate were the best we could do.
While the reasoning is interesting to discuss, I wonder how it was done technically. I mean, how much infrastructure does "annas-archive" have to build to rip/consolidate stuff and make it available? How much does it cost? How many engineers build and maintain that? Having written that, I realize I have pretty much the same questions about the books side of the service. How the hell are they doing that?
(UPD: a bit on technicalities is here https://archive.is/XjXqk )
One of the very rare websites being blocked here in Germany (copyright stuff and via DNS I suppose, not in the mood to investigate right now).
In Germany it's copyright, but it's not done by the government. It's lobbied ISPs banding together and blocking everything they don't deem servable, for instance Anna's Archive, sometimes archive.is, and a whole lot of tracker sites like The Pirate Bay. But it's DNS-based, so easy to circumvent.
While this article talks about the back side of Spotify (its data and metadata), I’m more interested in alternative frontends for Spotify, esp. on iPads. Does anyone have good recommendations? My kids listen to a lot of audiobooks in Spotify’s catalog, and iOS parental controls do not apply to Spotify’s app for some reason, so I can’t limit their listening time :(
Have you opened a ticket with Spotify to inform them of this shortcoming? I don't develop for iOS, but integrating parental controls seems like something they'd consider if it got them more subscriptions from folks in your situation.
No authoritative list of torrents aiming to represent all music ever produced.
What.CD aimed to do this, but has since gone dark. Redacted rose from the ashes, but that also takes time.
downloading lossy files and then re-encoding at even lower kbps is laughable and invalidates the entire project.
While 300TB is a lot for a person at home, having a distributed mirror in IPFS would be nice.
And songs would end up cached all over the place.