Object Categorization Using Spatial-Temporal Features

Chrisil Arackaparambil and Ashok Chandrashekhar

Task

Objects from different semantic categories often have different motion signatures: humans, animals, birds, cars, and airplanes all exhibit distinct patterns of motion. Therefore, along with spatially salient features, temporal patterns of movement can be used as features to determine the category of an unlabeled object.

In this project, we explore the suitability of spatial-temporal features for unsupervised classification of objects belonging to different semantic categories, using videos. We also evaluate the predictive capability of the learned model on unseen examples.

Method

We will study different feature extraction methods that detect spatial-temporal salient features in videos. To determine their suitability, we will employ a bag-of-features learning technique called probabilistic latent semantic analysis (PLSA) [1]. This technique ignores structure among features and represents each scene as a histogram over a fixed vocabulary of quantized features. Because PLSA is a generative model, it prescribes a method by which new scenes can be constructed from features.
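To make the bag-of-features representation concrete, the following is a minimal sketch of PLSA fitted by expectation-maximization on a scenes-by-words count matrix. It assumes numpy, and the histogram matrix, topic count, and function name are illustrative choices, not part of the proposal; the model is the standard asymmetric PLSA parameterization P(w, d) = P(d) Σ_z P(z|d) P(w|z).

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSA by EM on a (scenes x vocabulary) count matrix.

    counts  : (D, W) histogram of quantized spatial-temporal features per scene.
    Returns : p_w_z (W, Z) word distributions per topic,
              p_z_d (D, Z) topic mixture per scene.
    """
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    # Random (normalized) initialization of both parameter sets.
    p_w_z = rng.random((W, n_topics))
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = rng.random((D, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (D, W, Z).
        joint = p_z_d[:, None, :] * p_w_z[None, :, :]
        joint /= joint.sum(axis=2, keepdims=True) + 1e-12
        # Weight responsibilities by the observed counts n(d, w).
        weighted = counts[:, :, None] * joint
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts.
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

In this sketch the latent topics play the role of the semantic categories to be discovered; each scene's learned topic mixture P(z|d) summarizes which categories its features support.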

Thus, to classify an existing set of visual scenes (videos), we must estimate the parameters of the model through a model-fitting process. The learned parameters can then be used to predict the category of a previously unseen scene.
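Prediction for an unseen scene can be sketched as the standard PLSA "fold-in" procedure: the learned word-given-topic distributions P(w|z) are held fixed, and EM is run only over the new scene's topic mixture. This is a minimal illustration assuming numpy; the function name and the hand-built example distributions are hypothetical.

```python
import numpy as np

def fold_in(counts_new, p_w_z, n_iter=50, seed=0):
    """Estimate the topic mixture P(z | d_new) for an unseen scene.

    counts_new : (W,) feature histogram of the new scene.
    p_w_z      : (W, Z) word distributions learned during model fitting (held fixed).
    Returns    : (Z,) topic mixture for the new scene.
    """
    rng = np.random.default_rng(seed)
    _, n_topics = p_w_z.shape
    p_z = rng.random(n_topics)
    p_z /= p_z.sum()
    for _ in range(n_iter):
        # E-step: responsibilities P(z | w) under the current mixture.
        resp = p_w_z * p_z[None, :]
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12
        # M-step: update only the new scene's mixture; P(w|z) stays fixed.
        p_z = (counts_new[:, None] * resp).sum(axis=0)
        p_z /= p_z.sum() + 1e-12
    return p_z

# The predicted category is then the dominant topic:
#   category = int(np.argmax(fold_in(histogram, p_w_z)))
```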

Similar work has been done before for text classification [1], image classification [2], and human activity recognition [3].

Dataset

We will construct a small dataset of videos of different objects in motion. Several sample videos will be made for each of the semantic categories that we intend to learn. We will also investigate existing vision datasets for suitability.

Timeline

References

  1. Thomas Hofmann. Probabilistic Latent Semantic Indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.
  2. J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering Objects and Their Location in Images. In Proceedings of the IEEE International Conference on Computer Vision, October 2005.
  3. Juan Carlos Niebles, Hongcheng Wang, and Li Fei-Fei. Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. In Proceedings of the British Machine Vision Conference, 2006.