Abstract:
Object recognition is one of the hardest challenges facing computer vision systems today. Humans find the task trivial and can recognize objects even when they are rotated, translated, or partially obscured. Many approaches have been applied to object recognition in single still images, or in still images of an object taken from different perspectives and in different poses. We propose a method to extend object recognition to video sequences: the task is to recognize an object and then track it through the sequence. The challenges include recognizing the object when it is seen from a different perspective or pose, tracking the object while it is in motion, and recognizing the object even when it is partially or completely occluded. This project proposal is divided into six sections. In Section 2, we define the goal in more detail. In Section 3, we describe the method we intend to use. Section 4 describes the datasets we are going to use, Section 5 lists the references, and Section 6 gives a rough estimate of the progress timeline.
Goal:
The goal of the project is to recognize objects in video sequences and then track them. A video sequence is essentially a series of image frames played back at a rate high enough that the individual frames are imperceptible and the motion appears continuous to the human eye. In essence we are therefore still working with single still images, but the intuition is that a video sequence carries more information than any single still image.
For the purpose of the project, we intend to dangle objects from a revolving mobile, so the objects revolve in a circular plane. Since the objects are dangling rather than held rigidly, they are also free to rotate about their own axes and to exhibit oscillatory, pendulum-like motion. This unconstrained motion makes the task at hand both challenging and interesting.
Once an object is recognized, the next task is to track it as it moves through all of the motions described in the previous paragraph. Tracking is complicated by the fact that an object rotating about its own axis presents different poses and perspectives. Moreover, since multiple objects hang from the mobile, there is a high probability that they will occlude one another.
Method:
The implementation involves obtaining a training data set and then performing object detection and tracking on the test video sequences.
The following two steps are performed to obtain the training data set:
Feature Extraction
The first step is to extract feature descriptors from the images. The images are captured at thirty frames per second using Prosilica GC650 cameras, and both the videos and the training set are in grayscale. The images are passed to the SIFT [4] (Scale-Invariant Feature Transform) or SURF [6] (Speeded Up Robust Features) algorithm to extract feature descriptors.
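A minimal sketch of this step is given below, assuming OpenCV's SIFT implementation; the video path, frame sampling step, and function names are illustrative assumptions rather than part of our pipeline.

    # Sketch: extract SIFT descriptors from grayscale frames of a video with OpenCV.
    # The video path and frame sampling step are illustrative assumptions.
    import cv2

    def extract_sift_descriptors(video_path, frame_step=1):
        sift = cv2.SIFT_create()              # requires OpenCV >= 4.4
        cap = cv2.VideoCapture(video_path)
        all_descriptors = []
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if frame_idx % frame_step == 0:
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                keypoints, descriptors = sift.detectAndCompute(gray, None)
                if descriptors is not None:
                    all_descriptors.append(descriptors)   # (num_keypoints, 128) per frame
            frame_idx += 1
        cap.release()
        return all_descriptors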
Categorization
The features extracted via SIFT/SURF are then grouped into categories based on their spatial relations. The next, very important, step is to identify discriminating features; this allows us to distinguish between similar classes of objects such as cars and trucks, or even cars and aeroplanes. The last step is to find the geometric relations between the features so as to maximise the probability that the hypothesis that a particular object is present is correct.
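Since the timeline includes a k-means step to obtain a codebook, the sketch below shows one plausible way to cluster the extracted descriptors into visual words and represent each training image as a normalized word histogram; the codebook size and the use of scikit-learn are assumptions for illustration.

    # Sketch: build a visual-word codebook by k-means clustering of SIFT/SURF descriptors
    # and represent each image as a normalized histogram of visual words.
    # The codebook size k is an illustrative assumption.
    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(descriptor_list, k=200):
        # descriptor_list: list of (n_i, 128) descriptor arrays, one per training image
        stacked = np.vstack(descriptor_list)
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit(stacked)

    def bag_of_words(descriptors, codebook):
        # Quantize descriptors to their nearest cluster centre and histogram the counts.
        words = codebook.predict(descriptors)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)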
Once the training data is obtained, the following operations are performed to achieve object detection and tracking:
k-NN classification with Mahalanobis Distance
The test video sequence will be sampled for features matching the object of interest, and once these features are extracted, a bounding box will be drawn around the object. For matching the features, k-NN classification will be performed with the Mahalanobis distance metric [2] [3].
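A minimal sketch of this step, assuming the bag-of-words histograms above as feature vectors and the regularised inverse covariance of the training set as the Mahalanobis matrix, is shown below; a learned metric as in [2] [3] could be substituted for the plain inverse covariance.

    # Sketch: k-NN classification under a Mahalanobis metric with scikit-learn.
    # X_train and y_train are assumed to be training feature vectors and labels.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def fit_mahalanobis_knn(X_train, y_train, k=5):
        # A small ridge term keeps the covariance invertible for high-dimensional features.
        cov = np.cov(X_train, rowvar=False) + 1e-6 * np.eye(X_train.shape[1])
        VI = np.linalg.inv(cov)
        knn = KNeighborsClassifier(n_neighbors=k, algorithm='brute',
                                   metric='mahalanobis', metric_params={'VI': VI})
        return knn.fit(X_train, y_train)

    # Usage: predictions = fit_mahalanobis_knn(X_train, y_train).predict(X_test)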
Tracking
Once the bounding box is obtained, a tracking algorithm based on incremental Principal Component Analysis [1] will track the object. This algorithm has to be fine-tuned to track objects on the mobile while handling occlusion and drastic changes in their orientation. One of our future goals is also to make the pipeline more interactive, so that if the object is lost during tracking, the detection algorithm finds it again and hands the result back to the tracking algorithm.
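The sketch below illustrates a heavily simplified appearance model in the spirit of [1]: the tracked object's patches are folded into an incrementally updated PCA subspace, and candidate windows would be scored by reconstruction error. The particle-filter motion model of [1] is omitted, and the patch size and number of components are illustrative assumptions.

    # Sketch: a simplified incremental-PCA appearance model in the spirit of [1].
    # Candidate windows are scored by reconstruction error; lower error = better match.
    import numpy as np
    from sklearn.decomposition import IncrementalPCA

    class PCAAppearanceModel:
        def __init__(self, n_components=16):
            self.ipca = IncrementalPCA(n_components=n_components)

        def update(self, patches):
            # patches: (n_samples, h*w) flattened grayscale object patches;
            # each batch must contain at least n_components samples for partial_fit.
            self.ipca.partial_fit(patches)

        def score(self, patch):
            # Reconstruction error of one flattened candidate patch under the current subspace.
            x = patch.reshape(1, -1)
            recon = self.ipca.inverse_transform(self.ipca.transform(x))
            return float(np.linalg.norm(x - recon))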
The figure below graphically depicts the approach we intend to apply to the project.
Datasets:
We are going to use our own dataset consisting of grayscale video sequences of objects hanging from a mobile. The object classes are typically taken from the Caltech 101 dataset and include cars, motorcycles, and frogs.
References:
- D. A. Ross, J. Lim, R. Lin, and M. Yang. Incremental Learning for Robust Visual Tracking. In International Journal of Computer Vision (IJCV), 2008.
- A. Globerson and S. Roweis. Metric Learning by Collapsing Classes. In Advances in Neural Information Processing Systems (NIPS), 2005.
- J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood Components Analysis. In Advances in Neural Information Processing Systems (NIPS), 2004.
- D. G. Lowe. Object Recognition from Local Scale-Invariant Features. In International Conference on Computer Vision (ICCV), Corfu, Greece, September 1999, pp. 1150-1157. See also David Lowe's SIFT page: http://www.cs.ubc.ca/~lowe/keypoints/
- NIPS '06 Workshop on Learning to Compare Examples: http://bengio.abracadoudou.com/lce/
- H. Bay, T. Tuytelaars, and L. V. Gool. SURF: Speeded Up Robust Features. In 9th European Conference on Computer Vision (ECCV), Springer LNCS volume 3951, part 1, pp. 404-417, 2006.
Timeline:
- Apr 24 – Collect training datasets with different permutations of objects hanging from the mobile
- Apr 30 – Perform k-means clustering to obtain the codebook
- May 12 – Match features between training and testing data and implement basic tracking
- May 28 – Fine-tune the tracking algorithm and optimize code