How Music-Identification Apps Work


Since Shazam was founded in 1999, it has been used to identify songs over fifty billion times, and that’s not even counting the IDs from SoundHound, MusicID, and other sound-recognition apps.

From a user’s perspective, it’s simple: Start the app, press a button, and let your phone listen to the song. After a few seconds, even with background noise and distortion, the app will tell you what the song is. It works so quickly and so well that it almost seems like magic – but, as with most magical things these days, it’s mostly run by algorithms.

What’s the idea behind these apps?


Shazam, SoundHound, and other music-identification services all work basically the same way: they have a big database of song information, an algorithm that can quickly extract information from your song sample, and an app to let you interface with those things. Technically, you don’t even need a smartphone.

Shazam originally worked on ordinary mobile phones: you dialed a short code, held the phone up to the music, and received the song’s name by text message. SoundHound has gone a few steps further by also letting you sing or hum into its app, which it matches against a user-submitted database of other singing and humming recordings.

How do they work?


In simple terms, the process looks like this:

  1. The app’s database has a massive collection of song “fingerprints,” or small pieces of data about the song’s unique sound patterns.
  2. When a user hits the “Record” button, the app listens to the music and creates a fingerprint based on the few seconds of audio it hears.
  3. This fingerprint is checked against the database of existing fingerprints. If the fingerprint from your sample matches part of a song, you get your (hopefully correct) song result. If it doesn’t, you’ll get back an error.

If you’re just looking for a surface-level explanation, that’s all you need to know. The really interesting part is how you actually get that fingerprint.

Song fingerprints

music-recognition-hashing

It all starts with a spectrogram, like the one in the graph above, taken from a paper written by one of Shazam’s founders, Avery Wang. This is essentially a graph with time on the x-axis (horizontal), frequency on the y-axis (vertical), and amplitude represented by different levels of color intensity. Any sequence of sounds can thus be converted into a spectrogram, and any point on the spectrogram can be assigned a set of coordinates. Just like that, notes can be numbers.
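To make the idea concrete, here is a minimal sketch of how a spectrogram could be computed, using only Python's standard library. The function name `spectrogram`, the window size, the hop size, and the test tone are all illustrative assumptions; a real app would use a fast Fourier transform library rather than this slow, direct calculation.

```python
import cmath
import math

def spectrogram(samples, window_size=64, hop=32):
    """Return a list of time frames; each frame lists the magnitude of every frequency bin."""
    frames = []
    for start in range(0, len(samples) - window_size + 1, hop):
        window = samples[start:start + window_size]
        # Taper the slice with a Hann window so its edges fade out smoothly.
        tapered = [
            s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (window_size - 1)))
            for n, s in enumerate(window)
        ]
        frame = []
        for k in range(window_size // 2):  # bins from 0 Hz up to half the sample rate
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / window_size)
                for n, x in enumerate(tapered)
            )
            frame.append(abs(total))  # amplitude at this (time, frequency) point
        frames.append(frame)
    return frames

# Toy input (assumed): a pure 1,000 Hz tone sampled 8,000 times per second.
sample_rate = 8000
window_size = 64
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]

spec = spectrogram(tone, window_size=window_size)
loudest_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(loudest_bin * sample_rate / window_size)  # 1000.0
```

Each entry `spec[t][f]` is one point on the graph: `t` is the position in time, `f` is the frequency bin, and the value is the amplitude that a plot would show as color intensity.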

If all you needed to do was match a few sounds to each other, you could stop here. If you want to look through a database full of millions of songs, though, a full-detail spectrogram has way too many data points to look through at any sort of speed.

The big breakthrough in music recognition was the realization that you can identify sounds with only a few pieces of data: the peaks, or the most intense parts. Getting rid of most of a song’s lower-energy parts not only shrinks the spectrogram, it also makes the apps less likely to mistake dull, consistent background noise for part of the target sound. Imagine a city skyline – the most identifiable parts are the tops of buildings, not the middle floors, and that’s what you can see from farthest away.
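As a rough illustration of "keeping only the skyline," the sketch below picks the strongest local maxima from each time frame of a tiny made-up spectrogram. The function name `find_peaks`, the `per_frame` limit, and the `min_magnitude` threshold are assumptions for this example, not details of any particular app.

```python
def find_peaks(spec, per_frame=2, min_magnitude=1.0):
    """Keep only the strongest peaks in each frame as (time, frequency_bin) pairs."""
    peaks = []
    for t, frame in enumerate(spec):
        # A candidate must clear the loudness floor and beat both of its neighbours.
        candidates = [
            (mag, f)
            for f, mag in enumerate(frame)
            if mag >= min_magnitude
            and (f == 0 or mag > frame[f - 1])
            and (f == len(frame) - 1 or mag > frame[f + 1])
        ]
        strongest = sorted(candidates, reverse=True)[:per_frame]
        # Report the survivors in frequency order for readability.
        peaks.extend((t, f) for _, f in sorted(strongest, key=lambda c: c[1]))
    return peaks

# Made-up spectrogram: three time frames, seven frequency bins each.
toy_spec = [
    [0.1, 0.2, 5.0, 0.3, 0.1, 3.0, 0.2],
    [0.1, 4.0, 0.2, 0.1, 0.3, 0.2, 0.1],
    [0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.1],  # quiet frame: nothing clears the floor
]

print(find_peaks(toy_spec))  # [(0, 2), (0, 5), (1, 1)]
```

Twenty-one data points shrink to three, and the quiet third frame, which stands in for low-level background noise, contributes nothing at all.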

So every second of every song is stripped down to just a few of its most intense data points; everything on the city skyline is removed except the very top. But that’s still not efficient enough to be immediately searchable, so the next step is to “hash” these peaks. Hashing takes a set of inputs, runs them through an algorithm, and produces a compact output, here an integer. In this case each hash is generated by pairing two high-intensity peaks and combining three values: the frequency of the first peak, the frequency of the second, and the time between them.
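One way to picture this step is to pack the two frequencies and the time gap into the bits of a single 32-bit integer. The sketch below assumes 10 bits per frequency bin and 12 bits for the time gap, and pairs each peak with the next few peaks that follow it; these bit widths, the `fan_out` and `max_dt` values, and the function names are illustrative assumptions rather than the layout any real service uses.

```python
def pack_hash(f1, f2, dt):
    """Pack two frequency bins and a time gap into one 32-bit integer (assumed 10+10+12 bits)."""
    if not (0 <= f1 < 1024 and 0 <= f2 < 1024 and 0 <= dt < 4096):
        raise ValueError("value does not fit in its bit field")
    return (f1 << 22) | (f2 << 12) | dt

def unpack_hash(h):
    """Recover (f1, f2, dt) from a packed hash."""
    return (h >> 22) & 0x3FF, (h >> 12) & 0x3FF, h & 0xFFF

def hash_peaks(peaks, fan_out=3, max_dt=64):
    """peaks: (time, frequency_bin) pairs sorted by time. Returns (hash, anchor_time) pairs."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        # Pair this anchor peak with the next few peaks that follow it.
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= max_dt:
                hashes.append((pack_hash(f1, f2, dt), t1))
    return hashes

# Made-up peaks: (time frame, frequency bin).
example_peaks = [(0, 40), (3, 85), (5, 12), (9, 40)]
fingerprint = hash_peaks(example_peaks)

for h, anchor_time in fingerprint:
    print(anchor_time, h, unpack_hash(h))
```

Because only the gap between two peaks is stored, not their absolute position, the same pair produces the same hash whether it is found in the full song or in a short sample recorded from the middle of it. The anchor time is kept alongside each hash so that the position can still be checked later.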

The result is a number that is easy to store and search. When a computer reads one of these hashes, it can interpret it as two frequencies and the time gap between them. Once all the peaks in the song have been identified and hashed, the transformation is complete: the song is now represented by a long list of these compact numbers (32-bit integers in Wang’s paper), each stored alongside the song’s ID and the time at which it occurs. More importantly, every second of the song is represented by the numbers.

When your phone hears music, the app goes through this exact process: it filters out everything but the highest peaks, hashes them, and creates a fingerprint for the few seconds it has recorded. The app then just needs to find where the same hashes appear in the database. When many of them line up with one song in the same order and with the same timing, the service has its match and returns the song to you in seconds.
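The lookup can be sketched as a vote: every hash in the sample that also appears in a song votes for that song together with the time offset at which the sample would have to start. A real match shows up as many votes agreeing on one song and one offset. The code below is a simplified illustration with made-up hash values; `build_index`, `identify`, and the `min_votes` threshold are hypothetical names and settings, not any service's actual API.

```python
from collections import Counter, defaultdict

def build_index(songs):
    """songs: {title: [(hash, time), ...]} -> {hash: [(title, time), ...]}"""
    index = defaultdict(list)
    for title, fingerprint in songs.items():
        for h, t in fingerprint:
            index[h].append((title, t))
    return index

def identify(index, sample, min_votes=2):
    """Return (title, offset) for the best match, or None if nothing is convincing."""
    votes = Counter()
    for h, sample_time in sample:
        for title, song_time in index.get(h, []):
            # Hashes from a true match all agree on where the sample starts.
            votes[(title, song_time - sample_time)] += 1
    if not votes:
        return None
    (title, offset), count = votes.most_common(1)[0]
    return (title, offset) if count >= min_votes else None

# Made-up fingerprints: (hash, time frame) pairs.
songs = {
    "Song A": [(101, 0), (202, 4), (303, 9), (404, 15), (505, 21)],
    "Song B": [(606, 0), (202, 3), (707, 8), (808, 14), (303, 30)],
}
index = build_index(songs)

# A sample that starts 9 frames into Song A, with one noise hash (999) mixed in.
sample = [(303, 0), (404, 6), (999, 8), (505, 12)]
print(identify(index, sample))  # ('Song A', 9)
```

Hash 303 also appears in Song B, but it is the only vote for that song, so the three hashes that agree on Song A at offset 9 win. A sample with no consistent run of matches returns `None`, which corresponds to the app reporting that it could not identify the song.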

Music and more

This technology has been most widely used for music recognition, but sound recognition apps can also work with movies, commercials, TV shows, bird songs, and more. Shazam and SoundHound are the best known, but you can also now ask Google what song is playing and get an accurate response.

And if you’re wondering, “Do these companies keep track of which songs get asked about?” the answer is “yes.” Music identification statistics have actually been able to predict the success of songs and artists with a fairly high level of accuracy, and big record labels like Warner have contracted with apps like Shazam to help find up-and-coming artists. So, if you want to support an artist, you may as well do your part and look up their song! You may just help them take off.


By Andrew Braun
