Object Categorization Using Spatial-Temporal Features

Chrisil Arackaparambil and Ashok Chandrashekhar

Project Milestone

Task

Objects from different semantic categories often have distinct motion signatures: humans, animals, birds, cars, and airplanes all move in characteristic ways. Therefore, in addition to spatially salient features, the temporal patterns of movement can be used as features to determine the category of an unlabeled object.

In this project, we explore the suitability of spatio-temporal features for unsupervised classification of a set of objects belonging to different semantic categories, using videos. We also evaluate the predictive capability of the learned model on unseen examples.

Milestone Accomplishments

In the three weeks since the project proposal, we have constructed our dataset and obtained an off-the-shelf spatio-temporal feature extractor. We have also obtained implementations of two unsupervised, generative, bag-of-features classifiers: probabilistic latent semantic analysis (pLSA) and Latent Dirichlet Allocation (LDA).
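As a concrete illustration of the bag-of-features representation that these classifiers consume, the following sketch (illustrative Python of our own, not part of the project code; the function name and parameters are hypothetical) quantizes descriptor vectors against a k-means codebook and counts word occurrences per video, yielding the document-word count matrix that pLSA and LDA operate on.

```python
import numpy as np

def bag_of_features(descriptors_per_video, n_words=50, n_iters=20, seed=0):
    """Quantize feature descriptors into a bag of features: build a
    codebook with plain k-means (Lloyd iterations), then count codeword
    occurrences per video. Parameters here are illustrative defaults."""
    rng = np.random.default_rng(seed)
    all_desc = np.vstack(descriptors_per_video)
    # Initialize codewords with randomly chosen descriptors.
    centers = all_desc[rng.choice(len(all_desc), n_words, replace=False)]
    for _ in range(n_iters):
        dists = ((all_desc[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(n_words):
            members = all_desc[labels == k]
            if len(members):
                centers[k] = members.mean(0)
    # One histogram of codeword counts per video.
    counts = np.zeros((len(descriptors_per_video), n_words), dtype=int)
    for i, desc in enumerate(descriptors_per_video):
        dists = ((desc[:, None, :] - centers[None]) ** 2).sum(-1)
        np.add.at(counts[i], dists.argmin(1), 1)
    return counts
```

Each row of the returned matrix is one video's histogram over the visual vocabulary; each row sums to that video's number of detected features.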

Data Set

Our data set consists of videos of three categories of objects in motion: sprinting cheetahs, walking cows, and crawling human babies. Most of these videos were obtained from YouTube and were cropped from longer videos by retaining only the sections that show the exemplars in their characteristic motion. If time permits, we will add more semantic categories to the dataset.

Feature Extraction

In this project, we employ the spatio-temporal feature extraction technique outlined in [1]. In this technique, interest points are detected in a video by applying a Gaussian smoothing filter in the spatial plane and a 1D Gabor filter along the temporal axis; regions where this filter response is high are taken as interest points. In particular, visually salient regions exhibiting complex motion patterns evoke a strong response, while non-salient regions undergoing simple translation evoke only a weak response.
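The detector response can be sketched as follows. This is an illustrative Python approximation of the filtering described in [1], which applies the temporal Gabor filter as a quadrature (even/odd) pair; the function name, the frequency coupling omega = 4/tau, and the parameter defaults are our own assumptions, not the authors' MATLAB code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def detector_response(video, sigma=2.0, tau=2.5):
    """Sketch of the interest-point response of [1]: Gaussian smoothing
    in space plus a quadrature pair of 1-D temporal Gabor filters.
    `video` is a (T, H, W) float array; sigma and tau are illustrative."""
    # Smooth each frame spatially; no smoothing along the time axis.
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))

    # Quadrature pair of 1-D Gabor filters along the temporal axis.
    omega = 4.0 / tau                     # frequency tied to tau (assumption)
    t = np.arange(-2 * int(tau), 2 * int(tau) + 1)
    h_even = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_odd = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)

    even = convolve1d(smoothed, h_even, axis=0)
    odd = convolve1d(smoothed, h_odd, axis=0)
    return even**2 + odd**2               # high values mark interest points
```

Local maxima of this response volume would then be taken as the detected interest points.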

Cuboids in X-Y-T space are extracted around these interest points, and each cuboid is converted into a feature descriptor vector using one of several methods. The dimensionality of these descriptors is then reduced using principal component analysis (PCA). Please refer to [1] for details.
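The PCA step can be sketched in a few lines. This pure-NumPy version (illustrative, not the package's implementation) stacks the flattened cuboid descriptors as rows and projects each onto the top principal components via the SVD.

```python
import numpy as np

def pca_reduce(descriptors, n_components=100):
    """Reduce cuboid descriptor vectors (one per row) to n_components
    dimensions with PCA. Pure-NumPy sketch via the SVD; the number of
    retained components is an illustrative choice."""
    X = descriptors - descriptors.mean(axis=0)   # center the data
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T               # project onto top components
```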

[2] provides a MATLAB implementation of the feature extraction algorithm. With the permission of the original authors, we have obtained the implementation and intend to use it in our project. The feature extraction package uses a MATLAB toolkit, also made available by the authors.

pLSA implementation

We intend to use a C++ implementation of the pLSA algorithm, written previously by Ashok for his research. The implementation follows the technical specification of the algorithm in [4]. However, the EM formulation is naive and may be prone to overfitting. [4] suggests tempered EM, based on deterministic annealing, as a means of overcoming overfitting, but we do not intend to implement this modification.
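The naive EM iteration for pLSA can be sketched as follows. This illustrative Python version (not the C++ implementation itself) alternates an E-step computing the responsibilities P(z|d,w) with M-step re-estimates of P(z|d) and P(w|z), with no tempering, so it is prone to the overfitting noted above.

```python
import numpy as np

def plsa(counts, n_topics, n_iters=50, seed=0):
    """Naive EM for pLSA [4] on a document-word count matrix of shape
    (n_docs, n_words). Returns P(z|d) and P(w|z). Illustrative sketch."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))       # P(z|d), rows sum to 1
    p_z_d /= p_z_d.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))      # P(w|z), rows sum to 1
    p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w), shape (docs, topics, words).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: re-estimate, weighting by the observed counts n(d,w).
        weighted = counts[:, None, :] * resp
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```

For classification, the learned P(z|d) of a video can be compared across videos, or an unseen video can be folded in by running EM with P(w|z) held fixed.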

LDA implementation

Following a suggestion from the course instructor, we will use the implementation of Latent Dirichlet Allocation [5] provided by the authors in [3] and compare its results with those of the pLSA implementation.

Future timeline

Our plan for the remaining three weeks until the final project submission is as follows.

References

  1. Piotr Dollár, Vincent Rabaud, Garrison Cottrell and Serge Belongie. Behavior Recognition via Sparse Spatio-Temporal Features. ICCV VS-PETS 2005.
  2. http://vision.ucsd.edu/~pdollar/
  3. http://www.cs.princeton.edu/~blei/lda-c/index.html
  4. Thomas Hofmann. Probabilistic Latent Semantic Indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.
  5. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 2003.
  6. Juan Carlos Niebles, Hongcheng Wang, and Li Fei-Fei. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. In Proceedings of the British Machine Vision Conference, 2006.