Object Recognition and Tracking in Video Sequences
The document is divided into two sections:
Section 1: Work Progress & Challenges
As per the initial project proposal, our dataset consists of video sequences of different objects hanging from a mobile. The objects hang loosely and can both rotate on their own axes and exhibit oscillatory motion while revolving in a circular plane around the mobile's axis. The video sequences collected contain objects such as frogs, other amphibians, reptiles, cars, bikes, trucks, and various types of balls. Each sequence is a collection of PGM images captured at 30 frames per second. Since the mobile moves at a slow pace, we save only every third frame, giving an effective frame rate of 10 frames per second.
As part of our training dataset, we also took images of these objects from different perspectives (front, side, and top views). The first step was extracting interesting features from the training images. We used a supervised approach to split the images into regions containing interesting features; examples include the windscreen, windows, and wheels in the case of cars, and the tail, body, arms, and legs in the case of reptiles. The next step was applying the Scale-Invariant Feature Transform (SIFT) detector to these regions to obtain a set of feature descriptors that we could use as distinguishing features for the particular object. We also tried the Speeded Up Robust Features (SURF) detector, but it required JPEG input, and since its performance was very similar to SIFT's, we decided to use SIFT. We provide a brief overview of SIFT here to make the following sections clear.
SIFT identifies local features in an image. The algorithm was published by David G. Lowe and consists of the following steps.
- Scale-space extrema detection:
In this step, the interest points, called keypoints, are identified. The image is convolved with Gaussian filters at successive scales, and the differences of successive Gaussian-blurred images are taken. Keypoints are then taken as the maxima or minima of these difference-of-Gaussians images.
- Keypoint localization:
Step 1 yields many keypoints, some of which are unstable. These are eliminated by performing a detailed fit to the nearby data: the keypoint position is refined by interpolating nearby data, low-contrast keypoints are discarded, and responses along edges are eliminated.
- Orientation assignment:
Each keypoint is assigned one or more orientations based on image gradient directions.
- Finding the keypoint descriptor:
The feature descriptor is computed as a set of orientation histograms over 4 x 4 pixel neighbourhoods. The histograms are expressed relative to the keypoint orientation, and the gradient data comes from the Gaussian image closest in scale to the keypoint's scale. Each histogram contains 8 bins, and each descriptor contains a 4 x 4 array of 16 histograms around the keypoint, giving a SIFT feature vector of 4 x 4 x 8 = 128 elements. This vector is normalized to enhance invariance to changes in illumination.
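To make the output of these steps concrete, here is a minimal sketch of descriptor extraction using OpenCV's SIFT implementation; the library choice and the file name are assumptions for illustration, and our actual extraction code may differ.

    import cv2

    # Load a training image in grayscale (file name is hypothetical);
    # SIFT operates on intensity values.
    img = cv2.imread("car_front_view.pgm", cv2.IMREAD_GRAYSCALE)

    # Detect keypoints and compute their descriptors in one call.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # descriptors is an N x 128 array: one row per keypoint, each row the
    # normalized 4 x 4 x 8 orientation-histogram vector described above.
    print(descriptors.shape)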
For the test image, we use a sliding-window approach to scan each frame of the given video sequence, typically with windows of different sizes depending on the size of the object features we are searching for. For each window, we apply SIFT to extract features, and these are compared against the SIFT features obtained from the interesting features in our training dataset. Since we know which object we are looking for, this matching only needs to be done against the small set of training features for that object. Once we find a match, we look in nearby spatial locations for other features belonging to the object, which lets us arrive at an estimate of the bounding box around it. A sketch of the scanning loop follows.
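The scanning loop itself is straightforward. Below is one way it could be written; the window size, stride, and use of a generator are illustrative choices rather than our final parameters.

    import cv2

    def scan_frame(frame, win=64, stride=16):
        # Slide a square window across the frame and yield per-window SIFT
        # descriptors. The window size and stride are illustrative values.
        sift = cv2.SIFT_create()
        h, w = frame.shape[:2]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                patch = frame[y:y + win, x:x + win]
                kps, descs = sift.detectAndCompute(patch, None)
                if descs is not None:        # skip windows with no keypoints
                    yield (x, y), descs

    # Each window's descriptors would then be matched against the training
    # descriptors for the object being searched for.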
Finding the matches has been the most challenging part of the project so far, and we have tried several approaches, detailed below:
- k-nearest neighbour with Best Bin First Search
Once the descriptors are extracted, k-nearest-neighbour matching with Euclidean distance is used to find the desired object of interest in the test data. Since this is an expensive process, a k-d tree with the Best Bin First search algorithm [1] was used to speed up finding the nearest neighbours (we tried k = 2 and k = 3, and k = 2 seemed to perform better).
To increase robustness, matches are rejected for keypoints where the ratio of the nearest-neighbour distance to the second-nearest-neighbour distance is greater than a threshold (we tried values from 0.5 to 0.8: 0.5 yielded very few matches and 0.8 a huge number, and we have yet to settle on an optimal threshold). A sketch of this matching scheme follows.
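As an illustration, OpenCV's FLANN matcher builds randomized k-d trees and bounds the number of leaf checks during search, which approximates a Best Bin First-style scheme; the sketch below shows the k = 2 match with the distance-ratio rejection test (the 0.7 ratio is one value in the 0.5 to 0.8 range we are still exploring).

    import cv2

    def ratio_test_matches(train_desc, test_desc, ratio=0.7):
        # Build a k-d tree index over the training descriptors (algorithm=1
        # selects FLANN's randomized k-d trees) and search with a bounded
        # number of leaf checks, in the spirit of Best Bin First.
        matcher = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),
                                        dict(checks=50))
        matches = matcher.knnMatch(test_desc, train_desc, k=2)
        # Keep a match only when the nearest neighbour is sufficiently
        # closer than the second nearest (the distance-ratio test).
        return [pair[0] for pair in matches
                if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]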
- Histogram computation
This idea was inspired by a discussion with Professor Torresani. We tried four approaches to calculating the histograms.
We took all the SIFT descriptors from the training dataset and created 128 vectors corresponding to the distribution of each dimension of the feature vector. That is, if we got 76 feature vectors from our training set, each of 128 dimensions, we treated them as 128 vectors of 76 dimensions and plotted an 8-bin histogram for each. The idea here is that the same dimension tends to be correlated across feature vectors, whereas the different dimensions within a single feature vector might not be. The same process was applied to the test-set feature vectors, and we then computed the ratio of bins between the training and test images to see whether the ratios remain similar (a sketch of this variant appears after this list).
We took all the SIFT descriptors from the training dataset and took the mean over the dimensions of each feature vector. Thus if there were 76 feature vectors, each of dimension 128, we averaged the 128 dimensions of each vector, resulting in 76 mean values, which were plotted on an 8-bin histogram. As above, the bin ratios were computed for the training and test sets. This idea proved worse than the first because correlation across the dimensions of a feature vector is poor, whereas correlation of the same dimension across feature vectors is high.
We applied the imhist function in Matlab. This function takes an image as input, so we gave it the image itself, i.e., the raw pixel values. As above, the ratio of bins was calculated for the training and test images. This approach performs well when there is a larger number of features.
We tried applying a kernel density estimate to the training and test images so as to model the data as Gaussian. However, we soon found that the resulting Gaussians were very similar, peaking at the same points, so a useful comparison could not be made.
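As referenced above, here is a minimal sketch of the first (per-dimension) histogram variant; the bin count matches the 8 bins described, while the value range and the ratio comparison shown are illustrative assumptions.

    import numpy as np

    def per_dimension_histograms(desc, bins=8, lo=0.0, hi=255.0):
        # desc is an N x 128 descriptor matrix. Treat each of the 128
        # columns as one sample set and build an 8-bin histogram per
        # dimension. The [0, 255] value range assumes OpenCV-style scaled
        # SIFT descriptors and is an illustrative choice.
        hists = np.empty((desc.shape[1], bins))
        for d in range(desc.shape[1]):
            hists[d], _ = np.histogram(desc[:, d], bins=bins, range=(lo, hi))
        # Normalize by the number of descriptors so histograms from
        # sets of different sizes are comparable.
        return hists / desc.shape[0]

    # Bin-wise ratios between training and test histograms; ratios close
    # to 1 across bins suggest similar per-dimension distributions.
    # ratios = per_dimension_histograms(train) / (per_dimension_histograms(test) + 1e-9)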
- Mahalanobis distance metric
As proposed in the initial project proposal, this is the main technique we would like to use. We applied the Mahalanobis distance in two ways.
The first technique involved calculating the Mahalanobis distance between each dimension of a test feature vector and that dimension across the entire training dataset. The test feature vectors were then sorted by how many of their dimensions had the lowest distance from the training dataset. Although the idea seemed promising, it did not perform well and returned many false positives.
The second technique involved taking the mean of the Mahalanobis distances for each feature vector. We then set a threshold and counted how many such vectors had a mean distance of less than 0.6. This technique proved much better than the first and gave fewer false positives (a sketch of this computation follows the list).
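Under one reading of the second technique, consistent with the per-dimension framing of the first: each dimension's Mahalanobis distance reduces to |x - mu| / sigma, and these are averaged over the 128 dimensions before thresholding at 0.6. The sketch below encodes that interpretation, which is our assumption rather than a definitive statement of the method.

    import numpy as np

    def mean_mahalanobis_filter(train_desc, test_desc, thresh=0.6):
        # Per-dimension statistics of the training descriptors.
        mu = train_desc.mean(axis=0)
        sigma = train_desc.std(axis=0) + 1e-9      # avoid division by zero
        # In one dimension the Mahalanobis distance reduces to |x - mu| / sigma.
        dists = np.abs(test_desc - mu) / sigma
        mean_dist = dists.mean(axis=1)             # one value per test vector
        # Keep test descriptors whose mean distance falls below the
        # 0.6 threshold mentioned above.
        return test_desc[mean_dist < thresh], mean_dist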
Section 2: Pending Work
The most important remaining part of the project is to combine the benefits of the histogram and Mahalanobis distance approaches so as to get the best performance. We have not yet been able to track the object completely, i.e., to draw a bounding box around it across frames. However, once we reduce the number of false positives using the histogram and Mahalanobis distance approaches, we feel that recognizing the object will not be very hard. We had mentioned in the initial proposal that we would perform some basic tracking by the milestone, and we are behind on that. Thus, time permitting, we will try to fine-tune the tracking algorithm we are going to use.