How The Heck Does Shazam Work?
89 points by indigo
This was a great read. Those interactive bits were top notch!
Very cool! I saw AcoustID a few days ago, which is an open-source song identification database. Does that also work the same way as Shazam?
AcoustID seems to be built around Chromaprint, which has a nice little explanation of how it works. It seems like Chromaprint just takes the spectrum of the audio and pulls 12 equal-temperament pitches per octave out of it, and then there's some further cooking to get it into a hashable "fingerprint".
The biggest difference I see vs how Shazam works is that Shazam only cares about the onset of frequencies (they call them "landmarks"), so you end up with discrete 2D points (x axis is time, y axis is frequency) rather than "spans" of frequency like Chromaprint.
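To make the "discrete 2D points" idea concrete, here is a rough sketch that treats a spectrogram as a grid and keeps only the cells that are louder than all of their neighbours. This is an assumption-heavy simplification: the function name `find_landmarks`, the threshold, and the 3x3 neighbourhood are all made up for illustration, and plain local-maximum picking stands in for whatever onset detection Shazam actually uses.

```python
def find_landmarks(spectrogram, threshold=0.5):
    """Return (time, freq) points that are strict local maxima.

    spectrogram[t][f] is the magnitude of frequency bin f in frame t.
    Hypothetical helper for illustration, not Shazam's actual method.
    """
    n_frames = len(spectrogram)
    n_bins = len(spectrogram[0])
    landmarks = []
    for t in range(n_frames):
        for f in range(n_bins):
            value = spectrogram[t][f]
            if value < threshold:
                continue  # too quiet to count as a landmark
            # Collect the (up to 8) surrounding cells in time and frequency.
            neighbours = [
                spectrogram[tt][ff]
                for tt in range(max(0, t - 1), min(n_frames, t + 2))
                for ff in range(max(0, f - 1), min(n_bins, f + 2))
                if (tt, ff) != (t, f)
            ]
            if all(value > n for n in neighbours):
                landmarks.append((t, f))
    return landmarks


# Toy 4-frame x 4-bin spectrogram with two clear peaks.
toy = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.9, 0.0],
]
points = find_landmarks(toy)
```

The output is a sparse list of (time, frequency) pairs rather than a dense per-frame feature, which is the contrast with Chromaprint's "spans".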
A funny note here is that Shazam is patented to the absolute gills, so their competitors have had to build their "landmarks" on anything except raw frequency. I've seen competitors use MFCCs and the Bark scale, and I have a dim recollection of the late Echo Nest's Echoprint project just using an arbitrary 8-band scale. There are upsides and downsides to each approach, but in my opinion they all exist just to dodge the Shazam patents.
Looks similar, according to this write-up about the underlying library:
You can get this kind of image by splitting the original audio into many overlapping frames and applying the Fourier transform on them
But instead of peaks:
Chromaprint processes the information further by transforming frequencies into musical notes
And then:
[...] we apply a pre-defined set of 16 filters that capture intensity differences across musical notes and time [...] You can basically take any of the six filter images, place it anywhere on the subimage and also make it as large as you want (as long as it fits the 16x12 pixel subimage). Then you calculate the sum of the black and white areas and subtract them. The result is a single real number.
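The two steps quoted above (frequencies to musical notes, then a black-minus-white area filter) can be sketched roughly as follows. Everything here is an assumption for illustration: `pitch_class`, `fold_to_chroma` and `two_area_filter` are hypothetical names, A = 440 Hz is an arbitrary reference, and only one filter shape (left half versus right half) is shown. It is not Chromaprint's actual code.

```python
import math


def pitch_class(freq_hz, ref_hz=440.0):
    """Map a frequency to one of 12 equal-temperament pitch classes (0 = A)."""
    semitones = 12 * math.log2(freq_hz / ref_hz)
    return round(semitones) % 12


def fold_to_chroma(frame, bin_hz):
    """Sum spectrum magnitudes into 12 note bins, ignoring the octave.

    frame[i] is the magnitude at frequency i * bin_hz.
    """
    chroma = [0.0] * 12
    for i, magnitude in enumerate(frame):
        if i == 0:
            continue  # skip the DC bin, it has no pitch
        chroma[pitch_class(i * bin_hz)] += magnitude
    return chroma


def two_area_filter(image, x, y, width, height):
    """Sum of the left half of a rectangle minus the sum of its right half.

    image[row][col] holds intensities; (x, y) is the top-left corner.
    """
    half = width // 2
    rows = range(y, y + height)
    left = sum(image[r][c] for r in rows for c in range(x, x + half))
    right = sum(image[r][c] for r in rows for c in range(x + half, x + width))
    return left - right


# A 16-frame x 12-note subimage: low notes loud, high notes silent.
subimage = [[1.0] * 6 + [0.0] * 6 for _ in range(16)]
response = two_area_filter(subimage, 0, 0, 12, 16)
```

Each filter placement yields a single real number, as the quote says; how those numbers are then quantised into fingerprint bits is not covered by this sketch.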
There are similarities, but the Chromaprint system doesn't support recognising a track from a few seconds of recording, which is what the Shazam system specialises in.