CS34/134 Project Report
Multi-class Object Categorization in Images

Lu He (lu.he@dartmouth.edu)
Tuobin Wang (tuobin.wang@dartmouth.edu)
Advisor: Lorenzo Torresani
Computer Science Department
Dartmouth College

Abstract:
Object categorization is an interesting topic in the fields of Computer Vision and Machine Learning. Recently, the technique of modeling the appearance of an object by its distribution of visual words has proved successful. Our project aims to build an image categorization and recognition system: the user loads an image from the file system and interactively selects a region of the image, and the system recognizes the object in the selected region and outputs the object's name.

Introduction:
The object categorization problem is challenging because the appearance of an object may differ with changes in location, illumination, pose, size and viewpoint. One type of approach is based on the spatial arrangement of the image, using object fragments and their spatial arrangement to represent images. Some models of this kind have proved powerful, such as the fragment-based models of [2, 3]. However, they cannot handle variations and deformations of objects, and they perform poorly at recognizing shapeless objects such as grass, sky and water. The approach applied in our project is appearance-based. Appearance-based methods work well both on shape-free objects, such as sky, grass and trees, and on highly structured objects such as cars and faces.

Workflow:
In our project, we implemented the algorithm published in [1]. The workflow of our project is shown in the following figure:

[Figure: workflow of the system]
Filter bank:
In our project, each training image is converted to the CIE Lab color space and then convolved with a bank of 17 filters to generate features. The filter bank consists of 3 Gaussians, 4 Laplacians of Gaussian (LoG) and 4 first-order derivatives of Gaussian. The three Gaussian kernels (σ = 1, 2, 4) are applied to each of the L, a and b channels, giving 9 responses; the four LoG kernels (σ = 1, 2, 4, 8) are applied to the L channel only; and the four derivative-of-Gaussian kernels are divided into x- and y-aligned pairs (σ = 2, 4), also applied to the L channel. As a result, each pixel in the image is associated with a 17-dimensional feature vector.
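As an illustration, here is a minimal Python sketch of this feature-extraction step (our own rendering, assuming SciPy and scikit-image are available; the function name is a placeholder):

    import numpy as np
    from scipy.ndimage import gaussian_filter, gaussian_laplace
    from skimage import color

    def filter_bank_features(image_rgb):
        """Return an (H, W, 17) array of filter-bank responses."""
        lab = color.rgb2lab(image_rgb)                  # CIE Lab conversion
        L, a, b = lab[..., 0], lab[..., 1], lab[..., 2]
        responses = []
        # 3 Gaussians (sigma = 1, 2, 4) on each of L, a, b -> 9 responses
        for sigma in (1, 2, 4):
            for channel in (L, a, b):
                responses.append(gaussian_filter(channel, sigma))
        # 4 Laplacians of Gaussian (sigma = 1, 2, 4, 8) on L -> 4 responses
        for sigma in (1, 2, 4, 8):
            responses.append(gaussian_laplace(L, sigma))
        # x- and y-aligned first derivatives of Gaussian (sigma = 2, 4) on L
        for sigma in (2, 4):
            responses.append(gaussian_filter(L, sigma, order=(0, 1)))  # d/dx
            responses.append(gaussian_filter(L, sigma, order=(1, 0)))  # d/dy
        return np.stack(responses, axis=-1)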

Learn dictionary:
There are several steps to learn the final dictionary from the training data (see the following figure).

[Figure: steps of dictionary learning]

Initial Visual Dictionary:
After the image preprocessing step, each pixel in every training image is represented by a 17-dimensional feature vector. We pool these features together and run the unsupervised k-means clustering algorithm with a large number of centroids K, using the Euclidean distance. The K cluster centroids form our initial visual dictionary.
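A minimal sketch of this step, assuming scikit-learn (K = 50 here matches the initial dictionary size used in our experiments; the helper name is ours):

    import numpy as np
    from sklearn.cluster import KMeans

    def initial_dictionary(feature_maps, K=50):
        """feature_maps: list of (H, W, 17) response arrays, one per image."""
        # In practice the pixels may be subsampled to keep k-means tractable.
        X = np.concatenate([f.reshape(-1, 17) for f in feature_maps], axis=0)
        kmeans = KMeans(n_clusters=K, n_init=10).fit(X)
        return kmeans.cluster_centers_          # (K, 17) initial visual words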

Build Histograms on the Dictionary:

[Figure: a training image (lower left), its ground-truth labeling (upper left), and a histogram over the visual dictionary]
In the dataset we used, each image (the lower-left image) is associated with a ground-truth image (the upper-left image). All training images were labeled using a simple, interactive “paint” interface. Identical colors correspond to identical object classes (in the example shown above, blue represents cow and green represents grass).

For each training region (which contains only a single object), we want to build its histogram over the existing dictionary. Before doing this, we need to obtain the features of the region, which can be done by extracting the positions of the pixels with the region's color in the ground-truth image and using these positions to look up the corresponding feature vectors in the real image.
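A small sketch of this lookup, assuming the ground truth is stored as an (H, W, 3) color image and `responses` is the (H, W, 17) array from the filter-bank step (the helper name is ours):

    import numpy as np

    def region_features(responses, ground_truth, region_color):
        """Gather the feature vectors of all pixels painted with region_color."""
        mask = np.all(ground_truth == np.asarray(region_color), axis=-1)
        return responses[mask]                  # (N, 17) features of the region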

For each training region, we compute the corresponding histogram by simply counting the features that fall into each visual word. We say that feature i falls into visual word j if the Euclidean distance between feature i and visual word j is smaller than its distance to every other visual word. (A sample histogram is shown above.)
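A minimal sketch of the counting step, with `features` an (N, 17) array and `words` a (K, 17) array; normalizing the histogram to unit sum is our own assumption, so that regions of different sizes are comparable:

    import numpy as np

    def region_histogram(features, words):
        # Squared Euclidean distance from every feature to every visual word
        d2 = ((features[:, None, :] - words[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)             # index of the closest word
        hist = np.bincount(nearest, minlength=len(words)).astype(float)
        return hist / hist.sum()                # unit-sum histogram (our choice)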

Reduce visual dictionary size:
We build a histogram for each training region over the initial visual dictionary obtained in the previous steps. Then, we model the set of histograms for each class c using a Gaussian distribution:

    p(h | c) = N(h; μ_c, Σ_c),

where h is a region histogram and μ_c and Σ_c are the mean and covariance of the histograms of the training regions of class c.

We define the prior probability as:



Our model maximizes, over the choice of bin mapping, the probability of all the histograms given the ground-truth labels:

    ∏_r p(h_r | c_r),

where r ranges over the training regions and c_r is the ground-truth class of region r.

However, when maximizing this probability, we end up with all bins merged into a single bin, since then all histograms, for any object, become identical.
So, considering this from a different view by applying Bayes' rule, what we actually want to maximize is:

    ∏_r p(c_r | h_r),  where  p(c | h) = p(h | c) p(c) / p(h)  and  p(h) = Σ_c p(h | c) p(c).

In this equation, the numerator favours mappings which lead to small intra-class variance, while the denominator penalizes mappings which reduce discriminability between classes.

Algorithm for maximizing:

[Figure: the greedy bin-merging procedure of [1] used to maximize this posterior]
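The following Python sketch conveys the idea under our own simplifying assumptions (diagonal per-class Gaussians, empirical class priors, an exhaustive search over bin pairs, and merging until a target size is reached); it illustrates the technique rather than transcribing the exact algorithm of [1]:

    import numpy as np

    def log_posterior(hists, labels, classes, eps=1e-6):
        """Sum of log p(c_r | h_r) over regions, with per-class Gaussians.

        hists: (R, B) region histograms; labels: length-R NumPy array of classes.
        """
        mu, var, prior = {}, {}, {}
        for c in classes:
            H = hists[labels == c]
            mu[c], var[c] = H.mean(axis=0), H.var(axis=0) + eps
            prior[c] = len(H) / len(hists)      # empirical class prior (our choice)

        def loglik(h, c):                       # log N(h; mu_c, var_c), diagonal
            return -0.5 * (np.log(2 * np.pi * var[c])
                           + (h - mu[c]) ** 2 / var[c]).sum()

        total = 0.0
        for h, c in zip(hists, labels):
            num = loglik(h, c) + np.log(prior[c])
            den = np.logaddexp.reduce(
                [loglik(h, k) + np.log(prior[k]) for k in classes])
            total += num - den                  # log p(c | h) for this region
        return total

    def compress_dictionary(hists, labels, target_size):
        """Greedily merge the pair of bins that most improves the posterior."""
        classes = np.unique(labels)
        while hists.shape[1] > target_size:
            best_score, best_hists = -np.inf, None
            # Exhaustive O(B^2) pair search; fine for small dictionaries.
            for i in range(hists.shape[1]):
                for j in range(i + 1, hists.shape[1]):
                    merged = np.delete(hists, j, axis=1)
                    merged[:, i] = hists[:, i] + hists[:, j]
                    score = log_posterior(merged, labels, classes)
                    if score > best_score:
                        best_score, best_hists = score, merged
            hists = best_hists
        return hists

    # Usage (hypothetical variables):
    # compressed = compress_dictionary(train_hists, train_labels, target_size=11)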
Experiment and Analysis:
We used the pixel-wise labelled image database v1 from Microsoft Research and tested our system on five object classes (building, car, face, grass and cow) using the k-nearest-neighbour algorithm with k = 1, 3 and 5. The results are as follows:
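Classification itself is a nearest-neighbour lookup over the region histograms; a sketch using scikit-learn (variable names are placeholders):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import confusion_matrix

    def classify_regions(train_hists, train_labels, test_hists, k=1):
        """Predict a class label for each test-region histogram."""
        clf = KNeighborsClassifier(n_neighbors=k).fit(train_hists, train_labels)
        return clf.predict(test_hists)

    # preds = classify_regions(train_hists, train_labels, test_hists, k=1)
    # cm = confusion_matrix(test_labels, preds)  # rows: true, cols: predicted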

Average trained histogram for each class:

[Figure: the five average histograms]
From left to right: Building, Car, Cow, Face and Grass.


Given the large differences among the histograms above, we confirm that our model is discriminative. On the other hand, the size of our final dictionary is only 11, so our model is also compact.

Accuracy:

[Table: classification accuracy for dictionary sizes 50 and 11, with k = 1, 3, 5]
Although we reduced the dictionary size from 50 to 11, the performance does not change much. One thing worth mentioning is that larger values of k in kNN perform worse than smaller ones. Our analysis is that, in consideration of running time, complexity and memory, we trained on only a small amount of data, so there are not enough training samples for some objects.

Confusion Matrix:

[Table: confusion matrix over the five classes]
Dictionary size = 11, k = 1.


This is the confusion matrix obtained using kNN with k = 1 and a dictionary of size 11. We tested 50 building, 66 grass, 45 cow, 30 face and 30 car cases, and obtained at least 82.2% accuracy in inferring the true label of each object.

GUI:
The following figures show the simple interface of our project.

[Screenshot: the main interface]
Load an image:

Select a region:

Output the result:


References:

  1. J. Winn, A. Criminisi, and T. Minka. Object Categorization by Learned Universal Visual Dictionary. In IEEE International Conference on Computer Vision (ICCV 2005), Beijing, China, 2005.
  2. J. Shotton, A. Blake, and R. Cipolla. Contour-based learning for object detection. In IEEE International Conference on Computer Vision (ICCV 2005), pages 503–510, 2005.
  3. S. Ullman, E. Sali, and M. Vidal-Naquet. A fragment-based approach to object representation and classification. In Proc. 4th International Workshop on Visual Form (IWVF4), Capri, Italy, 2001.