Handwritten digit classification remains a relevant machine learning problem today. It is widely used in ATM cheque deposits, postal mail sorting, and the processing of credit card receipts.
There are many approaches to handwritten digit recognition in the literature, for example linear classifiers, pairwise linear classifiers, k-nearest neighbours, SVMs, and neural networks. LeCun et al. (1998) discuss these classifiers; here I summarize their performance and what makes each of them unattractive.
1) Linear classifier: each class score is a weighted sum of the input pixel vector. Error rate = 12%, or 8.4% on deslanted images. The error rate is unattractively high.
2) Pairwise classifier: create 45 units i/j, e.g. 0/1, 0/2, …, 0/9, 1/2, 1/3, …, 8/9. Unit i/j outputs +1 when i is more likely and -1 when j is more likely. The final score for class i is sum(i/x) - sum(y/i) over all x and y. Error rate = 7.6% on un-deslanted images, which is still high.
3) k-nearest neighbours: error rate of 5%, or 2.4% on deslanted images, with k = 3. However, k-NN is too memory- and compute-intensive given the 60,000 training examples.
In contrast, convolutional neural networks are particularly effective, especially given the large data set. LeNet-5, one of the most successful digit classifiers, uses convolutional layers.
The MNIST data set contains 60,000 training examples and 10,000 test examples. Each is a 28x28 pixel grey-scale image where each pixel value lies in [0,1]: 0 is black and 1 is white. Other data sets are available, but MNIST is widely used in the literature, giving me a benchmark to compare my results against, so I selected it.
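As a minimal sketch of preparing this data (assuming scikit-learn and its fetch_openml loader are available; this is an illustration, not the loader used in the original work), the images can be fetched and scaled to [0,1] as follows:

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Fetch the 70,000 MNIST images as 784-dimensional vectors (28x28 flattened).
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Scale raw pixel intensities from [0, 255] to [0, 1].
X = X.astype(np.float64) / 255.0
y = y.astype(int)

# Use the conventional split: first 60,000 for training, last 10,000 for testing.
X_train, y_train = X[:60000], y[:60000]
X_test, y_test = X[60000:], y[60000:]
```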
Initially I tried to implement and optimize LeNet-5; however, I ran into several issues. LeNet-5 uses convolution, which reduces the number of parameters to optimize: a feature map is constructed using a 5x5 weight matrix, so there are 25 weight parameters per feature map instead of #input-features x #hidden-nodes. Each convolution also extracts specific features, and by forcing specific connectivity within the hidden layers of the network we break the symmetry of the weight vectors. The algorithm has many layers, and although I could forward-propagate activations, I could not find a clear discussion in the literature of how to back-propagate the error through the convolutions, meaning I could not train my classifier. Figure 1 shows a graphical representation of the feature maps and hidden layers of LeNet-5.
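To make the parameter-sharing point concrete, here is a minimal NumPy sketch (not the author's code) of a single convolutional feature map: one 5x5 kernel, i.e. 25 shared weights plus a bias, slides over a 28x28 image and produces a 24x24 feature map.

```python
import numpy as np

def feature_map(image, kernel, bias):
    """Valid 2-D convolution of a 28x28 image with a 5x5 kernel -> 24x24 map."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Every output unit reuses the same 25 weights (weight sharing).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return np.tanh(out)  # squashing non-linearity, as in the LeNet family

rng = np.random.default_rng(0)
img = rng.random((28, 28))
fmap = feature_map(img, rng.standard_normal((5, 5)) * 0.1, bias=0.0)
print(fmap.shape)  # (24, 24)
```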
Instead, I searched for a simpler algorithm to implement. There is a simpler version of LeNet-5, LeNet-1, with fewer hidden layers; however, it has the same problem as LeNet-5 in that I could not back-propagate the error through the convolutions. This was disappointing, as I had researched the LeNet algorithms thoroughly and would have liked to implement them to investigate their heuristic aspects, such as the number of feature maps at each layer.
Figure 1: Representation of LeNet-5, showing how feature maps are created using convolutions.
LeNet-1: 4 convolutional feature maps are subsampled, then 12 convolutional feature maps are created from the 4 subsampled maps. These 12 maps are fully connected to 10 output units that correspond to the labels. Each convolution has a 5x5 receptive field.
Layer             | Input | C1 | S1 | C2 | S2 | Output
# of feature maps |   1   |  4 |  4 | 12 | 12 |   1
Dimension         |  28   | 24 | 12 |  8 |  4 |  10
LeNet-5: more layers; the input image is padded to 32x32 so that the 5x5 receptive field can be centred on features at the edges and corners of the source image.
Layer             | Input | C1 | S2 | C3 | S4 |  C5 | F6 | Output
# of feature maps |   1   |  6 |  6 | 16 | 16 |   1 |  1 |   1
Dimension         |  32   | 28 | 14 | 10 |  5 | 120 | 84 |  10
Since I could not implement LeNet, I changed my objective and instead investigated the effect of the number of hidden nodes in "vanilla" neural networks.
My survey of the literature found that 2-layer neural networks used 300 hidden nodes and 10 output nodes, and that 3-layer networks used a 300-100-10 architecture. I implemented both architectures successfully and achieved results equivalent to those presented in the literature.
Each node's activation is the weighted sum of the nodes in the previous layer, passed through an activation function. I used tanh() as the activation function for the hidden layers. Since digit recognition is a multi-class classification problem, I used a softmax activation function for the output layer. With softmax, the output values represent the probability of each node being the correct label, and they form a distribution, so the sum over all output values is 1. The output node with the highest probability is selected as the label; a minimal sketch of this forward pass is given below.
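A minimal forward-pass sketch (NumPy only; the weight shapes and names are illustrative, not taken from my implementation):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def forward(x, W1, b1, W2, b2):
    """2-layer network: tanh hidden layer, softmax output layer."""
    h = np.tanh(W1 @ x + b1)   # weighted sum into the hidden layer, then tanh
    y_hat = softmax(W2 @ h + b2)  # class probabilities, summing to 1
    return h, y_hat

# Example shapes for a 784-300-10 network.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((300, 784)) * 0.01, np.zeros(300)
W2, b2 = rng.standard_normal((10, 300)) * 0.01, np.zeros(10)
h, y_hat = forward(rng.random(784), W1, b1, W2, b2)
print(np.argmax(y_hat))        # predicted digit label
```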
It is important to keep track of the activation function used at each node, since back-propagation requires the gradient of the activation function to send the errors back through the network.
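For the two activation functions above, the required derivatives are the standard textbook results (see e.g. Bishop, 2006); they are stated here for completeness, assuming a cross-entropy error, which the report does not specify explicitly:

\[
\frac{d}{da}\tanh(a) = 1 - \tanh^{2}(a), \qquad
\frac{\partial E}{\partial a_k} = \hat{y}_k - t_k \quad \text{(softmax output with cross-entropy error)},
\]

where \(\hat{y}_k\) is the predicted probability for class \(k\) and \(t_k\) is the one-hot target.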
We face the design decision of which method to use to update the parameters. Second-order methods are unsuitable for large neural networks on large data sets since they are O(N³) or O(N²) in the number of weights N. As such, I implement gradient descent.
As discussed, the gradient of the error function used in gradient descent is computed using back-propagation. Gradient descent can be implemented in either batch or stochastic form. Stochastic methods are particularly suited to the problem at hand since there is a large amount of redundancy in the data, so there is less need to compute the gradient over all data points. Stochastic gradient descent also requires significantly fewer epochs to converge, since the weights are adjusted after each forward propagation of a data point instead of after all 60,000 data points. Each epoch takes approximately 140 s to run on a 2-layer neural network, and longer on a 3-layer one. Stochastic gradient descent also has the advantage of sometimes escaping local minima that are not the global minimum. As such, I implemented stochastic gradient descent in my classifier.
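The following is a minimal sketch of one epoch of stochastic gradient descent for the 2-layer network above (NumPy only; the learning rate, the cross-entropy error, and the per-example update order are illustrative assumptions, not the exact settings used in my classifier):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def sgd_epoch(X, T, W1, b1, W2, b2, lr=0.01):
    """One pass over the data, updating the weights after every single example."""
    order = np.random.permutation(len(X))     # visit examples in random order
    for i in order:
        x, t = X[i], T[i]                     # t is a one-hot label vector

        # Forward pass.
        h = np.tanh(W1 @ x + b1)
        y_hat = softmax(W2 @ h + b2)

        # Backward pass.
        delta_out = y_hat - t                            # softmax + cross-entropy gradient
        delta_hid = (W2.T @ delta_out) * (1.0 - h ** 2)  # tanh'(a) = 1 - tanh^2(a)

        # Stochastic update: adjust the weights immediately for this example.
        W2 -= lr * np.outer(delta_out, h)
        b2 -= lr * delta_out
        W1 -= lr * np.outer(delta_hid, x)
        b1 -= lr * delta_hid
    return W1, b1, W2, b2
```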
I first created a simple 2-layer neural network (one hidden layer) to test the back-propagation and gradient-descent methodology. Similar 2-layer NNs in the literature give a test-set error of 4.7% with 300 hidden units and 4.5% with 1000 hidden units; other papers report errors of ~3.0%. My 2-layer NN with 300 hidden nodes gave a test error of 4.42%, which indicated that I had successfully implemented my neural network. Figure 2 shows the error vs. the number of epochs for my 2-layer NN with 300 hidden units.
I have several explanations for the differences between my results and the literature. One is that the initial weights are chosen randomly, and the final error is non-deterministic since the error function is non-convex and gradient descent can get trapped in local minima. As a result, my results could have differed had I run the training again and taken the best error over several runs. In practice, training a classifier with 1000 hidden nodes is computationally very intensive and takes much longer, with diminishing returns in the reduction of error.
Another reason why more nodes gave a higher error rate, and one worse than the literature, is that I trained my classifier for only 4 epochs. Having more nodes not only makes the training time per epoch longer, but also means it takes more epochs for the classifier to converge. This would explain why, for a fixed number of epochs, 50 nodes outperformed 1000 nodes. I would expect that if the classifier were trained over, say, 50 epochs, the results would show that more hidden units improve the classifier accuracy, though I would also expect diminishing returns as the number of nodes increases.
After implementing my 2-layer neural network, I moved on to a 3-layer neural network, as this is an intermediate step between a 2-layer NN and LeNet-1. Back-propagating errors through a 3-layer neural network was more complicated than through a 2-layer NN, but I successfully trained the classifier. I achieved results in line with the literature, and better than the 2-layer neural network: my misclassification rate was 2.52% after 12 epochs, compared to 2.50% reported in the literature. Figure 4 shows the error rate vs. the number of epochs.
Figure 4: Error vs number of epochs for a 3-layer neural network with 300-100-10 nodes.
It is interesting to note the divergence between the training and test error rates after 7 epochs. This suggests the classifier is overfitting the training data, since the continued improvement in training error is not matched by a corresponding improvement in the test error.
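The extra complication relative to the 2-layer case mentioned above is propagating the error signal through a second hidden layer. A minimal sketch of the backward pass for a single example (NumPy, with illustrative shapes for a 784-300-100-10 network and a cross-entropy error; this is not my actual training code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative 784-300-100-10 network and a single (input, one-hot target) pair.
W1, W2, W3 = (rng.standard_normal(s) * 0.01 for s in [(300, 784), (100, 300), (10, 100)])
x, t = rng.random(784), np.eye(10)[3]

# Forward pass (biases omitted for brevity).
h1 = np.tanh(W1 @ x)
h2 = np.tanh(W2 @ h1)
z = W3 @ h2
y_hat = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))

# Backward pass: the error signal is pushed through each layer in turn.
d3 = y_hat - t                        # output layer (softmax + cross-entropy)
d2 = (W3.T @ d3) * (1.0 - h2 ** 2)    # second hidden layer (tanh derivative)
d1 = (W2.T @ d2) * (1.0 - h1 ** 2)    # first hidden layer

# Weight gradients are outer products of each delta with that layer's input.
gW3, gW2, gW1 = np.outer(d3, h2), np.outer(d2, h1), np.outer(d1, x)
```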
My final analysis was to compare the misclassification rate of each digit, to see whether any insight could be gained into the weaknesses of using a neural network. I found that some digits were classified with high accuracy even after the first epoch, while others were classified very poorly. After the 2nd epoch, however, these errors all dropped, and by the final epoch most digits had comparable misclassification rates. I could not draw firm conclusions from this, except to conjecture that digits with similar features, such as the loops in 2, 5, 8 and 9, have an initially high error because of those shared features. Figures 5 and 6 show these errors for my 2-layer and 3-layer neural networks respectively.
Figure 5: Error for each digit after each epoch for 2-layer neural network
Figure 6: Error for each digit after each epoch for 3-layer neural network
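For reference, per-digit error rates of this kind can be computed from the test-set predictions as in the following sketch (illustrative helper, not my analysis code):

```python
import numpy as np

def per_digit_error(y_true, y_pred, n_classes=10):
    """Misclassification rate for each digit class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rates = np.zeros(n_classes)
    for d in range(n_classes):
        mask = (y_true == d)
        rates[d] = np.mean(y_pred[mask] != d)   # fraction of digit d misclassified
    return rates

# Hypothetical usage with the predicted labels from a trained network:
# print(per_digit_error(y_test, predictions))
```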
While I could not successfully implement a neural network with convolutions, my 2- and 3-layer neural networks still produced meaningful results.
My networks achieved classification errors on par with the literature, indicating that my implementation was successful and accurate. My findings on error rate vs. number of hidden nodes match intuition, and further work could investigate this relationship over varying numbers of epochs.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems 2 (NIPS '89), D. Touretzky, Ed., Denver, CO: Morgan Kaufmann, 1990.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998. doi: 10.1109/5.726791.
C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.