Analysis of music reveals fundamental mathematical rules that govern musical progressions and harmony. Although many musical features obey simple mathematical relationships, they combine in countless ways to elicit a wide range of emotions, and this aspect of music composition is not fully understood. Because we know that music obeys some structure but cannot fully identify it, the problems of music classification and artificial composition lend themselves well to machine learning algorithms.
The processes of parsing music structure and composing new music artificially both require a foundation of musical features. In other words, before we can describe how the parts of a composition interact, we must have a way to define those "parts." Once simple features have been defined, we can investigate how they combine in complex ways to form more abstract structures. In most cases, the simple features are derived from music theory: chord progressions, melodic riffs, syncopation, etc.
Other fields of machine learning such as speech and image recognition all tend to reach the same conclusion: hand-coding low-level features in a recognition system introduces a strong bias. For example, many image recognition algorithms define the simplest features in terms of "edges" (low-high or high-low contrast at different orientations), from which more complex features such as longer lines, corners, and curves can be built. There has been promising work [1, 2, 3] on learning the set of simple edge and contrast features from a set of images.
This concept of learning simple features has been extended to music in the past few years with varying degrees of success [4], often by transplanting, almost verbatim, the algorithms developed for image processing. I plan to take a similar approach.
The Big Question: Will unsupervised learning of simple music features result in better categorization (and therefore recognition) of musical styles?
Bonus Questions: To what extent do these automatically learned features resemble features informed by music theory? And can they be inverted to drive a generative process rather than a recognition one (see artificial composition below)?
I have downloaded MIDI files of piano songs. I chose piano because I want to focus on a single instrument, at least to begin with, and I am most familiar with piano. I will begin with music by Scott Joplin because his music is particularly recognizable and relatively simple temporally. While training the system on a single composer may reveal some interesting characteristics of that composer's music, it should be trained on multiple composers in multiple genres to develop the most general set of features. Once the algorithm works for Joplin, I will test it on other composers as well.
I will be using a third-party MATLAB toolbox for MIDI interaction. This toolbox is not complete, and it discards some data when importing into MATLAB, such as velocity (loudness), sustain, and tempo. Whether this will be a problem is up for debate; there are reasonable arguments on both sides, which I will briefly outline. The obvious argument is that any data thrown away in preprocessing, whether intentional or incidental, removes information, and less information means less expressive features. The other argument, which I support, is that too much data can distract an algorithm that is looking for universal descriptors. One study found that their system was only able to parse music once it had been preprocessed into a simple representation like the one I will be using [5].
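As a rough illustration of what the simplified representation keeps, here is a minimal sketch; the note-matrix column layout is my own assumption, not the toolbox's documented format:

    % notes: one row per note event; the column layout below is a guess at the
    % toolbox's note matrix (adjust to whatever it actually returns):
    %   [onset_beats, duration_beats, channel, pitch, velocity]
    function simple = simplify_notes(notes)
        onset    = notes(:, 1);    % when each note starts, in beats
        duration = notes(:, 2);    % how long it lasts, in beats
        pitch    = notes(:, 4);    % MIDI pitch number (21-108 on a piano)
        % velocity, channel, and anything tempo-related are deliberately dropped
        simple   = [onset, pitch, duration];
    end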
As mentioned above, the algorithms from machine vision that generate simple features can be extended to music fairly directly. The basic idea behind all of these algorithms is to keep a bank of convolution kernels, each of which represents some feature. The machine learning task is to find the set of kernels that best represents the salient or most common aspects of the training data. Many of these algorithms also impose a sparseness constraint, so that only a few kernels are active at any given point. Algorithms I am considering include Convolutional Sparse Coding, Invariant Predictive Sparse Decomposition, and Shift-Invariant Probabilistic Latent Component Analysis.
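To make the shared convolution step concrete, here is a minimal sketch (not any one of the named algorithms, and the kernel sizes are placeholders) of sliding a bank of kernels over a song array and recording each kernel's activation:

    % S: one song as a 3D array (numBeats x 12 x numOctaves), described below
    % kernels: cell array of smaller 3D arrays, e.g. 8 beats x 12 notes x numOctaves
    function maps = kernel_activations(S, kernels)
        maps = cell(size(kernels));
        for k = 1:numel(kernels)
            % 'valid' keeps only positions where the kernel fits entirely inside
            % the song; convn flips the kernel, i.e. this is true convolution
            maps{k} = convn(S, kernels{k}, 'valid');
        end
    end

The learning algorithms differ in how they update the kernels and how they enforce sparseness, but they all rely on an activation step like this one.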
In Choice of Music Representation, below, I discuss how I plan to format a song as a 3D array in order to highlight the important dimensions (timing/beat and pitch).
Once I have generated some features, I will both inspect them by hand and try them in a music genre or artist classifier to automatically gauge performance. The classifier will most likely be an SVM.
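For the classification step, a plausible pipeline, sketched below assuming MATLAB's Statistics and Machine Learning Toolbox and a placeholder pooling scheme of my own, is to pool each song's kernel activations into a fixed-length vector and train a multi-class SVM on those vectors:

    % X: numSongs x numKernels matrix of pooled activations, e.g. X(i,k) is the
    %    mean activation of kernel k over song i (one simple pooling choice)
    % y: numSongs x 1 numeric or categorical vector of composer/genre labels
    template  = templateSVM('KernelFunction', 'linear');
    model     = fitcecoc(X, y, 'Learners', template);  % one-vs-one multi-class SVM
    predicted = predict(model, X);                      % swap in held-out songs here
    accuracy  = mean(predicted == y);

Training and testing on the same songs, as written, only checks that the pipeline runs; real evaluation would use held-out songs or cross-validation.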
By the milestone deadline, I hope to have coded and tested at least one of the algorithms mentioned above and have a preliminary set of features. I expect that interpreting the results I get will be an ordeal in itself; I would like to compare my results to classic music-theory-based features in addition to evaluating the classification accuracy of my new features.
In image recognition, an image is naturally represented as a 3D array (width x height x RGB), and the value at a location is the brightness. The convolution kernels, then, are simply smaller 3D arrays that are multiplied element-wise with a chunk of the original image and summed. In music, however, time is the only obvious independent variable (i.e., all pitch, interval, etc. features sit at one index of a 1D array of beats). For kernel convolution, I think it will work better with at least one more independent dimension in the song data. Pitch alone would add a new dimension with 88 possible values. However, because of the periodic nature of octaves (and because I expect note intervals to emerge as a strong feature), I plan to represent a song as a 3D array of (beat, note, octave). In this way, a 2D slice of the array such as beat-note will immediately highlight intervals, while a beat-octave slice will show trends in the range of a song. The next question is what the values at given indices should mean. For mathematical simplicity, if a note is not being played at some time, its value will be 0. When a note is being played, that index in the array could take on other values: perhaps just a boolean on/off, or the note's duration in beats.
To summarize, a song S is a 3D matrix, and S(b, n, o) will be nonzero if and only if note n in octave o is being played on beat b. I expect this representation will allow for learning the most robust possible features.
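A minimal sketch of this construction follows; the pitch-to-(note, octave) mapping and the beat quantization are placeholder choices, and I use the note's duration in beats as the nonzero value, one of the options mentioned above:

    % simple: one note per row, columns [onset_beats, pitch, duration_beats]
    % Returns S of size (numBeats x 12 x 9), storing each note's duration in beats.
    function S = song_to_array(simple, numBeats)
        numOctaves = 9;                           % floor(pitch/12) spans 1..9 for pitches 21-108
        S = zeros(numBeats, 12, numOctaves);
        for i = 1:size(simple, 1)
            b = round(simple(i, 1)) + 1;          % quantize the onset to the nearest beat
            if b < 1 || b > numBeats, continue; end
            pitch  = simple(i, 2);
            note   = mod(pitch, 12) + 1;          % pitch class, C = 1
            octave = floor(pitch / 12);           % octave index (A0 -> 1, C8 -> 9)
            S(b, note, octave) = simple(i, 3);    % nonzero value = duration in beats
        end
    end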
Turning to artificial composition: given features P learned from a song or composer, it should be possible to invert them to recreate a similar phrase P'. If P is represented as, say, a tree where the leaves are simple features like chords, notes, or intervals, the level above represents temporal combinations, and so on, it is relatively easy to build a phrase P'. I am unsure at this point whether the features learned by my system will lend themselves well to reconstruction of novel phrases.
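To make the idea of inversion concrete, here is a minimal, entirely hypothetical sketch that assumes the learned features are the convolution kernels described above: a new phrase is synthesized by stamping one kernel onto an empty song array at a few chosen beats.

    % kernel:   one learned 3D feature, size (kBeats x 12 x numOctaves)
    % onsets:   beats at which to "stamp" the feature into the new phrase
    % numBeats: length of the phrase to synthesize
    function P = synthesize_phrase(kernel, onsets, numBeats)
        [kBeats, ~, numOctaves] = size(kernel);
        P = zeros(numBeats, 12, numOctaves);
        for b = onsets(:)'                                   % each chosen onset beat
            last = min(b + kBeats - 1, numBeats);            % clip at the end of the phrase
            P(b:last, :, :) = P(b:last, :, :) + kernel(1:(last - b + 1), :, :);
        end
    end

A real generative system would combine many kernels and would need a principled way to choose the onsets, which is exactly the part I am unsure about.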