The MNIST data set contains 60,000 training examples and 10,000 test examples. Each example is a 28x28-pixel grey-scale image, with pixel values scaled so that 0 is black and 1 is white. There are many approaches to hand-written digit recognition in the literature, for example linear classifiers, pairwise linear classifiers, k-nearest-neighbors, SVMs and neural networks. LeCun et al. (1998) discuss these classifiers; here I summarize their performance and what is unattractive about each.
1) Linear classifier: each output unit computes a weighted sum of the pixel vector as its score. Error rate: 12%, or 8.4% on deslanted images. The error rate is unattractively high.
2) Pairwise classifier: create 45 units i/j, e.g. 0/1, 0/2, ..., 0/9, 1/2, 1/3, ..., 8/9. Unit i/j outputs +1 when i is more likely and -1 when j is more likely. The final score for class i is sum(i/x) - sum(y/i) over all x and y (a small sketch of this score combination follows the list). Error rate: 7.6% on non-deslanted images, which is still high.
3) K-nearest neighbors: error rate of 5%, or 2.4% on deslanted images with k = 3. However, K-NN is too memory- and computation-intensive given the 60,000 training examples.
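To make the pairwise voting concrete, here is a minimal sketch of how the 45 unit outputs could be combined into 10 class scores; the dictionary representation and function name are my own choices, not prescribed by LeCun et al. (1998).

```python
import numpy as np

def pairwise_scores(unit_outputs):
    """Combine the 45 pairwise unit outputs into 10 class scores.

    unit_outputs: dict mapping (i, j) with i < j to a value in [-1, +1],
    where +1 means class i is more likely and -1 means class j is.
    (Hypothetical representation; the paper does not prescribe one.)
    """
    scores = np.zeros(10)
    for (i, j), out in unit_outputs.items():
        scores[i] += out   # evidence for i when the unit output is positive
        scores[j] -= out   # the same unit counts against j
    return scores

# Example: units that all favour class 3 make 3 the argmax.
units = {(i, j): (1.0 if i == 3 else (-1.0 if j == 3 else 0.0))
         for i in range(10) for j in range(i + 1, 10)}
print(int(np.argmax(pairwise_scores(units))))  # -> 3
```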
In contrast, convolutional neural networks are particularly effective, especially given the large data set. LeNet-5, one of the most successful digit classifiers, is a convolutional neural network.
Initially I tried to implement and optimize LeNet-5, but I ran into several issues. The architecture has many layers, and although I could propagate activations down to layer S4, it was unclear how to go from S4 to C5. Instead, I searched for a simpler algorithm to implement. I found a paper on LeNet-1, which uses only 4 hidden layers, and chose to implement that as a starting point.
Using convolution reduces the number of parameters to optimize, since the same small set of weights is shared across every position in the image. Each convolution also extracts a specific feature, and by forcing specific connectivity within the hidden layers of the network we break the symmetry of the weight vectors.
Above: Representation of LeNet-5, showing how feature maps are created using convolutions.
LeNet-1: 4 convolutional feature maps are subsampled; 12 convolutional feature maps are then built from the 4 subsampled maps; and these 12 maps are fully connected to 10 output units corresponding to the labels. Each convolution has a 5x5-pixel receptive field (a small sketch of one such convolution follows the table below).
Layer       | Input | C1 | S1 | C2 | S2 | Output
# of layers |     1 |  4 |  4 | 12 | 12 |      1
Dimension   |    28 | 24 | 12 |  8 |  4 |     10
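To illustrate the parameter saving from weight sharing mentioned above, here is a rough sketch of a single C1 feature map of LeNet-1, using the sizes from the table (5x5 kernel, 28x28 input, 24x24 output). The loop-based convolution and the tanh squashing are my own simplifications, not the original implementation.

```python
import numpy as np

def convolve_valid(image, kernel, bias):
    """Slide a 5x5 kernel over a 28x28 image -> 24x24 feature map."""
    k = kernel.shape[0]
    out = np.zeros((image.shape[0] - k + 1, image.shape[1] - k + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + k, c:c + k] * kernel) + bias
    return np.tanh(out)  # squashing nonlinearity; the exact choice is an assumption

image = np.random.rand(28, 28)
kernel, bias = np.random.randn(5, 5), 0.0
fmap = convolve_valid(image, kernel, bias)
print(fmap.shape)  # (24, 24)

# Parameter count with weight sharing: each of the 4 maps needs only
# 5*5 weights + 1 bias = 26 parameters, i.e. 104 for all of C1,
# versus 24*24*(5*5 + 1) = 14,976 per map without sharing.
```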
LeNet-5: more layers, and the initial image is padded to 32x32 so that the receptive fields can be centered on features at the corners and edges of the source image.
Layer       | Input | C1 | S2 | C3 | S4 |  C5 | F6 | Output
# of layers |     1 |  6 |  6 | 16 | 16 |   1 |  1 |      1
Dimension   |    32 | 28 | 14 | 10 |  5 | 120 | 84 |     10
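As a sanity check on the dimensions in the table, the following sketch applies the usual output-size formulas, assuming 5x5 valid convolutions and 2x2 subsampling as described in LeCun et al. (1998).

```python
# Quick check of the LeNet-5 dimensions, assuming 5x5 convolutions
# (valid, stride 1) and 2x2 subsampling as in LeCun et al. (1998).
def conv_out(size, kernel=5):       # valid convolution: out = in - kernel + 1
    return size - kernel + 1

def subsample_out(size, factor=2):  # 2x2 subsampling halves each dimension
    return size // factor

s = 32                       # padded input
s = conv_out(s);      print("C1:", s)   # 28
s = subsample_out(s); print("S2:", s)   # 14
s = conv_out(s);      print("C3:", s)   # 10
s = subsample_out(s); print("S4:", s)   # 5
s = conv_out(s);      print("C5:", s)   # 1 -> 120 feature maps of size 1x1
```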
Unlike a linear classifier, a convolutional neural network cannot be solved analytically; its weights are learned using gradient back-propagation. Back-propagation rests on the idea that the gradient of the loss with respect to every weight can be computed by propagating error terms from the output layer back towards the input using the chain rule.
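In symbols (the notation here is my own shorthand, not taken from the references), the error term for a unit with net input z_j and activation a_j = f(z_j) is obtained from the error terms of the units it feeds into, and each weight gradient then needs only locally available quantities:

```latex
% delta_j: error term for unit j; the sum runs over the units k that j feeds into.
% Notation is my own shorthand for the standard back-propagation recursion.
\delta_j = \frac{\partial E}{\partial z_j} = f'(z_j)\sum_k w_{jk}\,\delta_k,
\qquad
\frac{\partial E}{\partial w_{ij}} = a_i\,\delta_j
```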
We face a design decision: which method to use to update the parameters. Second-order methods are unsuitable for large neural networks on large data sets since they are O(N^3) or O(N^2) in the number of weights. As such, we shall implement gradient descent.
Gradient descent can be implemented in either batch or stochastic form. Stochastic updates are particularly suited to the problem at hand: there is a large amount of redundancy in the data, so there is less need to compute the gradient over all data points before each update. They also have the advantage of sometimes escaping local minima that are not the global minimum. As such, I used stochastic gradient descent (a sketch of one such update loop follows).
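A minimal sketch of the stochastic update loop, in contrast to a batch method that would sum the gradient over all 60,000 examples before a single update; the helper name grad_fn and the learning rate are placeholders, not the actual implementation.

```python
import numpy as np

def sgd_epoch(weights, train_images, train_labels, grad_fn, lr=0.01):
    """One epoch of stochastic gradient descent: update after every example
    instead of accumulating the gradient over the whole training set.
    grad_fn(weights, x, y) stands in for the back-propagation step."""
    order = np.random.permutation(len(train_images))  # visit examples in random order
    for idx in order:
        g = grad_fn(weights, train_images[idx], train_labels[idx])
        weights -= lr * g          # immediate, noisy update
    return weights
```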
Before implementing LeNet-1, I first created a simple 2-layer neural network (one hidden layer) to test the back-propagation gradient-descent methodology. Similar 2-layer NNs in the literature give a test-set error of 4.7% with 300 hidden units and 4.5% with 1000 hidden units.
I now have a fully functional 2-layer neural-network classifier that optimizes the learning objective using back-propagation gradient descent (a sketch of this kind of network is given below).
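For reference, a minimal sketch of this kind of 2-layer network with a back-propagation SGD step; the tanh/softmax choices, layer sizes and initialization are illustrative assumptions, not the actual classifier.

```python
import numpy as np

class TwoLayerNet:
    """Minimal 784 -> hidden -> 10 network trained with back-propagation
    and stochastic gradient descent (illustrative sketch only)."""
    def __init__(self, n_hidden=300, rng=np.random.default_rng(0)):
        self.W1 = rng.normal(0, 0.01, (784, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.01, (n_hidden, 10))
        self.b2 = np.zeros(10)

    def forward(self, x):
        self.h = np.tanh(x @ self.W1 + self.b1)               # hidden activations
        z = self.h @ self.W2 + self.b2
        self.p = np.exp(z - z.max()); self.p /= self.p.sum()  # softmax output
        return self.p

    def sgd_step(self, x, label, lr=0.01):
        """Back-propagate the cross-entropy gradient and update in place."""
        p = self.forward(x)
        d_out = p.copy(); d_out[label] -= 1.0                 # dE/dz at the output
        d_hidden = (d_out @ self.W2.T) * (1 - self.h ** 2)    # chain rule through tanh
        self.W2 -= lr * np.outer(self.h, d_out); self.b2 -= lr * d_out
        self.W1 -= lr * np.outer(x, d_hidden);  self.b1 -= lr * d_hidden
```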
The next step is to create a 3-layer neural network, as these have yielded error rates of 2.95% and 2.50% in the literature. I then want to implement LeNet-1, since the idea of receptive fields interests me. Convolution using receptive fields allows the extraction of specific features, such as horizontal lines or end points, which intuitively match the machine-learning task we are trying to solve. I have already written code to extract receptive fields from images given the feature-map location, as well as the subsampling function (sketched below). I will implement feature maps as vectors, where each pixel is a unit in a hidden layer.
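A rough sketch of what receptive-field extraction and subsampling can look like; stride-1 valid convolution, 2x2 averaging and the function names are assumptions for illustration.

```python
import numpy as np

def receptive_field(image, row, col, size=5):
    """Return the size x size patch of `image` feeding feature-map unit (row, col),
    assuming a stride-1 valid convolution (which matches 28 -> 24)."""
    return image[row:row + size, col:col + size]

def subsample(feature_map, factor=2):
    """2x2 average subsampling: 24x24 -> 12x12, 8x8 -> 4x4, etc."""
    h, w = feature_map.shape
    return feature_map.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

image = np.random.rand(28, 28)
print(receptive_field(image, 0, 0).shape)        # (5, 5)
print(subsample(np.random.rand(24, 24)).shape)   # (12, 12)
```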
I think overall I am slightly behind where I would have liked to be at this point, because I hit a wall trying to implement LeNet-5 when I should have tackled a simpler version first and built my way up. However, now that back-propagation is working, I should be close to implementing LeNet-1. My milestone target was to have a rudimentary implementation of one of the algorithms, and I feel that I have reached that objective.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems 2 (NIPS '89), David Touretzky, Ed., Denver, CO: Morgan Kaufmann, 1990.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998. doi: 10.1109/5.726791.