Object Recognition and Tracking in Video Sequences

Geethmala Sridaran & Nimit Dhulekar

The document is divided into two sections:
Section 1: Work Progress & Challenges

As per the initial project proposal, our dataset consists of video sequences of different objects suspended from a mobile. The objects hang loosely and can both rotate on their own axes and exhibit oscillatory motion while revolving around the mobile's axis in a circular plane. The video sequences thus collected contain different objects such as frogs, reptiles, amphibians, cars, bikes, trucks, and various types of balls. The sequences are collections of PGM images captured at 30 frames per second. Since the mobile moves at a slow pace, we save only every third frame, giving an effective frame rate of 10 frames per second.
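The frame-subsampling step described above can be sketched as follows; the file-name pattern is illustrative, not from the project.

```python
# Keep every third frame of a 30 fps sequence, for an effective
# rate of 10 fps. File names here are hypothetical examples.

def subsample_frames(frame_names, keep_every=3):
    """Return every `keep_every`-th frame name, starting with the first."""
    return frame_names[::keep_every]

frames = ["frame%04d.pgm" % i for i in range(30)]  # one second at 30 fps
kept = subsample_frames(frames)
print(len(kept))  # 10 frames kept, i.e. 10 fps effective
```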

As part of our training dataset, we also took images of these objects from different perspectives (front view, side view, and top view). The first step was extracting interesting features from the training images. We used a supervised approach to segment the images into interesting features. Examples of these features include the windscreen, windows, and wheels in the case of cars, and the tail, body, arms, and legs in the case of reptiles. The next step was applying the Scale-Invariant Feature Transform (SIFT) detector to these features to obtain a set of feature descriptors that we could use as distinguishing features for the particular object. We also tried the Speeded Up Robust Features (SURF) detector, but it required JPEG input, and since its performance was very similar to SIFT's, we decided to use SIFT. We provide a brief overview of SIFT here to make the following sections clear.

SIFT identifies local features in an image; the algorithm was published by David G. Lowe. Briefly, it consists of the following steps: detecting scale-space extrema with a difference-of-Gaussians pyramid, localizing and filtering the candidate keypoints, assigning each keypoint a dominant orientation, and computing a 128-dimensional descriptor from local gradient histograms. For the test image, we use a sliding window that moves across the entire frame of the given video sequence. We typically use windows of different sizes depending on the size of the features of the object we are searching for. For each such window of the test image, we apply SIFT to find features in that window, and these features are then compared to the SIFT features obtained from the interesting features in our training dataset. Since we know which object we are looking for, this matching only needs to be done against the small set of training features for that object. Once we find a match, we look in nearby spatial locations for other features belonging to that object, and thus arrive at an estimate of the bounding box around the object.

Finding the matches has been the most challenging part of the project so far, and we have tried many approaches; some of them are detailed below.

Section 2: Pending Work

The most important remaining task is to combine the benefits of the histogram and Mahalanobis-distance approaches so as to get the best performance. We are not yet able to track the object completely, that is, to draw a bounding box around it. But once we are able to reduce the number of false positives using the histogram and Mahalanobis distance, we expect that recognizing the object will not be very hard. We had mentioned in the initial proposal that we would perform some basic tracking by the milestone, and we are behind on that. Time permitting, we will fine-tune the tracking algorithm we plan to use.
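One way the Mahalanobis-distance filter might look is sketched below, under the assumption that each object's training feature vectors are modelled by a mean and covariance; the acceptance threshold is an illustrative guess, not a tuned project value.

```python
import numpy as np

# Hedged sketch: reject false positives by requiring a candidate
# feature vector to lie close (in Mahalanobis distance) to the
# distribution of that object's training vectors.

def fit(train_vectors):
    """Mean and (regularised) inverse covariance of training vectors."""
    X = np.asarray(train_vectors, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, cov_inv):
    d = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(d @ cov_inv @ d))

def is_match(x, mean, cov_inv, threshold=3.0):
    """Accept a candidate only if it is within `threshold`
    Mahalanobis units of the training distribution."""
    return mahalanobis(x, mean, cov_inv) < threshold
```

A histogram-based score could then be combined with this test, e.g. by requiring a candidate to pass both, which is the combination discussed above.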