These days, the most effective way to search for music is either to query a large music database with the song name, or to use metadata keywords embedded in the music file to narrow the search until only a single match is found. But what happens if we have neither the song name nor any metadata for the song we are looking for? This situation arises more often than one might realize. While watching a television commercial or show, one might hear a very catchy tune from a song one has perhaps long forgotten, and then want to find the song that the tune belongs to. One technique that humans use effortlessly, but for which few computer applications have been built, is searching for music using another piece of music. When looking for a song, a person can go through every song in their playlist, listening to each one and judging whether it is the song they are looking for. In this project we explore this idea, and we ask: given a short audio fragment extracted from a piece of music, can we retrieve the full-length song (or its equivalent) that the fragment came from?
Previous work in this area concentrates on audio retrieval in monophonic music, i.e., songs with a single dominant melody, or music in which all notes are sung in unison (monophony). Under the assumption of a single note at a time, the audio search problem becomes easier: the notes can be represented as strings, and string matching algorithms can be applied to align an input audio string against a database of strings representing the full-length songs. Other work has borrowed ideas from speech recognition, where each note in the audio search term is decoded using an HMM, and a top-level HMM is used to ``stitch'' the notes together into a song. This idea works well when there is only one dominant note in any given time interval (as in monophonic music), but produces poor results for polyphonic music.
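To make the string-matching idea concrete, the following is a minimal sketch, not taken from the cited work, of aligning a monophonic query against note strings in a database using edit distance; the one-character note encoding, the toy database, and the function names are hypothetical illustrations.

\begin{verbatim}
# Sketch of string matching for monophonic retrieval (illustrative only).
# Each song is reduced to a string of note symbols; the query is compared
# against every query-sized window of each song using edit distance.

def edit_distance(a, b):
    """Standard dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def best_match(query, database):
    """Return the song containing the window closest to the query."""
    best_song, best_cost = None, float("inf")
    for song, notes in database.items():
        for start in range(max(1, len(notes) - len(query) + 1)):
            cost = edit_distance(query, notes[start:start + len(query)])
            if cost < best_cost:
                best_song, best_cost = song, cost
    return best_song

# Hypothetical toy database: one character per note.
songs = {"song_a": "CCGGAAGFFEEDDC", "song_b": "EDCDEEEDDDEGG"}
print(best_match("AAGFF", songs))   # -> "song_a"
\end{verbatim}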
In this project, we will consider different ways of representing music, and assess how well these feature descriptors/vectors produce matches between the search term and the target song. We plan to exploit the episodic (sequential) structure of the audio search terms and match it against the target song. At a high level, for a good match, the feature vectors of the search term should align almost perfectly with those of the region of the longer queried song from which the fragment was taken. In such a match, both the order in which these feature vectors occur in the audio search term and their relative placement matter, so any search technique has to model both types of information. We will use an HMM to find the song most likely to contain the same sequence of feature descriptors as the audio search term, as sketched below. The feature descriptors considered will include the average pitch and amplitude of the audio waveform, MFCC feature vectors, and other descriptors that may prove useful for describing audio. The feature descriptor(s) that provide the best performance will be used in the final search algorithm.
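As a concrete sketch of the matching step, the code below illustrates scoring a query against per-song HMMs with the Viterbi algorithm; it is a minimal illustration under assumed details, not the final pipeline. The quantized observation symbols, the randomly drawn HMM parameters, and the function names are placeholders.

\begin{verbatim}
import numpy as np

def viterbi_log_score(obs, log_pi, log_A, log_B):
    """Log-probability of the best state path for a discrete observation
    sequence under an HMM with initial distribution pi, transition matrix A
    (n_states x n_states) and emission matrix B (n_states x n_symbols),
    all given in log space."""
    delta = log_pi + log_B[:, obs[0]]
    for t in range(1, len(obs)):
        # Best predecessor for each state, then emit the observation at time t.
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, obs[t]]
    return np.max(delta)

def most_likely_song(query_symbols, song_hmms):
    """Score the query against each candidate song's HMM; return the best."""
    scores = {name: viterbi_log_score(query_symbols, *params)
              for name, params in song_hmms.items()}
    return max(scores, key=scores.get)

# Toy example: two hypothetical 3-state HMMs over 4 quantized feature symbols.
rng = np.random.default_rng(0)
def random_hmm(n_states=3, n_symbols=4):
    pi = rng.dirichlet(np.ones(n_states))
    A = rng.dirichlet(np.ones(n_states), size=n_states)
    B = rng.dirichlet(np.ones(n_symbols), size=n_states)
    return np.log(pi), np.log(A), np.log(B)

song_hmms = {"song_a": random_hmm(), "song_b": random_hmm()}
query_symbols = [0, 2, 2, 1, 3, 0]   # quantized feature frames (placeholder)
print(most_likely_song(query_symbols, song_hmms))
\end{verbatim}

In the actual pipeline, the observation symbols would come from quantizing the chosen feature descriptors (pitch, amplitude, MFCCs), and the per-song HMM parameters would be estimated with the forward-backward algorithm rather than drawn at random.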
Any library of music can be used as a dataset; for this project I will use my own music library, 9GB of labelled mp3 files spanning various genres (House, Classical, Reggae, etc.), as both training and test data.
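One possible way to turn this library into test queries is sketched below: cut a short excerpt out of each mp3 to serve as a query fragment, keeping the source filename as the ground-truth label. This is an assumption about the evaluation setup rather than a stated part of the dataset, and it presumes librosa with an mp3-capable backend; the offset and duration values are arbitrary.

\begin{verbatim}
import pathlib
import librosa

def make_query_fragments(library_dir, fragment_seconds=10.0, offset_seconds=30.0):
    """Cut one short excerpt per mp3 to use as a labelled test query.

    Returns a list of (ground_truth_song_name, waveform, sample_rate) tuples.
    The fixed offset/duration are arbitrary choices for illustration.
    """
    fragments = []
    for path in sorted(pathlib.Path(library_dir).rglob("*.mp3")):
        y, sr = librosa.load(path, sr=22050,
                             offset=offset_seconds, duration=fragment_seconds)
        fragments.append((path.stem, y, sr))
    return fragments

# Example: fragments = make_query_fragments("/path/to/music_library")
\end{verbatim}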
I will conduct a small feasibility study during the first week using HTK, a packaged HMM toolkit, and select appropriate feature descriptors for the music. Once I am confident in the performance of the suggested technique, the next two weeks will be dedicated to coding the HMM and implementing the Viterbi and forward-backward algorithms. The final part of the project will be dedicated to testing and improving the search pipeline, generating results, and writing the final report. I hope to have preliminary classification results generated from the HMM and the forward algorithm by the project milestone. In detail, here is the timeline: