Unsupervised Learning of Basic Music Features

Richard Lange


Introduction

Analysis of music reveals fundamental mathematical rules governing musical progressions and harmony. Although many musical features obey simple mathematical relationships, they combine in countless ways to elicit a wide range of emotions, and this aspect of music composition is not fully understood. Because we know that music obeys some structure but cannot fully identify it, the problems of music classification and artificial composition lend themselves well to machine learning algorithms.

The processes of parsing music structure and composing new music artificially both require a foundation of musical features. In other words, before we can describe how the parts of a composition interact, we must have a way to define those "parts." Once simple features have been defined, we can investigate how they combine in complex ways to form more abstract structures. In most cases, the simple features are derived from music theory: chord progressions, melodic riffs, syncopation, etc.

Other areas of machine learning, such as speech and image recognition, have tended toward the same conclusion: hand-coding the low-level features of a recognition system introduces a strong bias. For example, many image recognition algorithms define their simplest features in terms of "edges" (low-to-high or high-to-low contrast at various orientations), from which more complex features such as longer lines, corners, and curves can be built. There has been promising work [1, 2, 3] on learning these simple edge and contrast features directly from a set of images.
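
As a concrete illustration (my own, not drawn from the cited work), this is the kind of hand-coded edge detector that feature-learning methods aim to discover automatically. A minimal MATLAB sketch:

    % A hand-coded vertical edge kernel: responds to left-to-right
    % contrast changes (one orientation of many).
    edge_kernel = [-1 0 1;
                   -1 0 1;
                   -1 0 1];

    % Convolving it over a grayscale image yields a map of how strongly
    % the "edge feature" is present at each location.
    img = rand(64, 64);                         % stand-in for a real image
    feature_map = conv2(img, edge_kernel, 'same');

Feature-learning methods replace the hand-coded kernel with kernels fit to the data.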

In the past few years, this concept of learning simple features has been extended to music with varying degrees of success [4], often by directly transcribing the algorithms developed for image processing. I plan to take a similar approach.


Project Statement

The Big Question: Will unsupervised learning of simple music features result in better categorization (and therefore recognition) of musical styles?

Bonus Questions: To what extent do these automatically learned features resemble features informed by music theory? And can they be inverted, turning recognition into a generative process (see artificial composition below)?


Data

I have downloaded MIDI files of piano pieces. I chose piano because I want to focus on a single instrument, at least to begin with, and it is the instrument I know best. I will begin with music by Scott Joplin because his style is particularly recognizable and relatively simple in its timing. While training the system on a single composer may reveal interesting characteristics of that composer's music, the most general feature set will require training on multiple composers across multiple genres. Once the algorithm works on Joplin, I will test it on other composers as well.

I will be using a third-party MATLAB toolbox for MIDI interaction. This toolbox is not complete: when importing into MATLAB, it discards some data, such as velocity (loudness), sustain, and tempo. Whether this is a problem is debatable, and there are strong arguments on both sides, which I will briefly outline. The obvious argument is that any data thrown away in preprocessing, whether intentional or incidental, removes information, and less information means a less comprehensive feature set. The other argument, which I support, is that too much data can distract an algorithm searching for universal descriptors. One study found that its system could only parse music after preprocessing it into a simple representation like the one I will be using [5].
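
To make the simplified representation concrete, here is a minimal sketch of rendering notes into a binary piano roll (pitch by time), with velocity, sustain, and tempo already discarded. The note-matrix layout below (onset, duration, pitch columns) is an assumption for illustration; the actual columns depend on the toolbox:

    % Hypothetical note matrix: one row per note,
    % columns = [onset_beat, duration_beats, midi_pitch].
    notes = [0   1   60;    % C4 on beat 0, one beat long
             1   1   64;    % E4
             2   2   67];   % G4, two beats

    ticks_per_beat = 4;                        % time resolution
    n_ticks = ceil(max(notes(:,1) + notes(:,2)) * ticks_per_beat);
    roll = zeros(128, n_ticks);                % pitch x time, binary

    for i = 1:size(notes, 1)
        t0 = floor(notes(i,1) * ticks_per_beat) + 1;
        t1 = ceil((notes(i,1) + notes(i,2)) * ticks_per_beat);
        roll(notes(i,3) + 1, t0:t1) = 1;       % mark the note as "on"
    end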


Method

As mentioned above, the machine vision algorithms that learn simple features extend naturally to music. The basic idea behind all of them is to maintain a bank of convolution kernels, each of which represents some feature. The learning task is to find the set of kernels that best represents the salient or most common aspects of the training data. Many algorithms additionally encourage sparsity in the feature activations, so that only a few kernels are active at any point in the input. Algorithms I am considering include Convolutional Sparse Coding, Invariant Predictive Sparse Decomposition, and Shift-Invariant Probabilistic Latent Component Analysis.
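
As a minimal sketch of the shared idea (not any one of these algorithms in full): given a fixed kernel bank, feature activations can be computed by correlating each kernel with the input and soft-thresholding the result, an ISTA-style inference step used in convolutional sparse coding. Learning would alternate this with kernel updates that reduce reconstruction error. All sizes and the threshold below are illustrative choices:

    % Stand-in piano roll (pitch x time); in practice this comes from
    % the representation described in the Data section.
    roll = double(rand(128, 64) > 0.9);

    % Bank of small pitch-x-time kernels, randomly initialized; learning
    % would iteratively update these to minimize reconstruction error.
    n_kernels = 8;
    kernels = randn(12, 8, n_kernels);         % 12 semitones x 8 time steps

    % Inference: correlate each kernel with the roll, then soft-threshold
    % to encourage sparse activations.
    lambda = 0.5;
    activations = zeros([size(roll), n_kernels]);
    for k = 1:n_kernels
        a = conv2(roll, rot90(kernels(:,:,k), 2), 'same');     % correlation
        activations(:,:,k) = sign(a) .* max(abs(a) - lambda, 0);
    end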

In Choice of Music Representation, below, I discuss how I plan to format a song in a 3D array in order to highlight the important dimensions (timing/beat and pitch).

Once I have generated some features, I will both inspect them by hand and try them in a music genre or artist classifier to automatically gauge performance. The classifier will most likely be an SVM.
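
One way the classification step might look, assuming MATLAB's Statistics and Machine Learning Toolbox (fitcecoc is one option for a multiclass SVM, not a committed design): pool each song's activation maps into a fixed-length vector so songs of different lengths are comparable, then train on labeled songs. The data below is a stand-in:

    % Pool a song's activation maps (from the Method sketch) into one
    % fixed-length vector: mean absolute activity per kernel.
    activations = randn(128, 64, 8);           % stand-in activations
    song_vector = squeeze(mean(mean(abs(activations), 1), 2))';

    % Toy classification run. In practice each row of X is one song's
    % pooled vector and Y holds composer labels; fitcecoc trains
    % one-vs-one SVMs and scales past two composers.
    X = randn(20, 8);
    Y = [repmat({'Joplin'}, 10, 1); repmat({'Other'}, 10, 1)];
    Mdl = fitcecoc(X, Y);
    predicted = predict(Mdl, X);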


Milestone

By the milestone deadline, I hope to have coded and tested at least one of the algorithms mentioned above and have a preliminary set of features. I expect that interpreting the results I get will be an ordeal in itself; I would like to compare my results to classic music-theory-based features in addition to evaluating the classification accuracy of my new features.



References

  1. Kavukcuoglu, Koray, et al. Learning Convolutional Feature Hierarchies for Visual Recognition. NIPS, 2010.
  2. Kavukcuoglu, Koray, et al. Learning Invariant Features through Topographic Filter Maps. CVPR, 2009.
  3. Olshausen, Bruno, and David Field. Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature, 1996.
  4. Nieto, Oriol. Unsupervised Music Motifs Extraction. 2010.
  5. Dubnov, Shlomo, et al. Using Machine-Learning Methods for Musical Style Modeling. IEEE Computer, 2003.