Objects from different semantic categories often have distinct motion signatures: humans, animals, birds, cars, and airplanes all move in characteristic ways. Therefore, in addition to spatially salient features, temporal patterns of movement can be used as features to determine the category of an unlabeled object.
In this project, we explore the suitability of spatio-temporal features for unsupervised classification of objects from different semantic categories using video. We also evaluate the predictive capability of the learned model on unseen examples.
We will study different methods for extracting spatio-temporal salient features from videos. To determine their suitability, we will employ a bag-of-features representation together with a learning technique called probabilistic latent semantic analysis (PLSA) [1]. The bag-of-features representation ignores structure among features and describes each scene as a histogram over a fixed vocabulary of quantized features. PLSA, being a generative model, then explains these histograms through a small set of latent topics and prescribes a method by which new scenes can be generated from features.
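Concretely, PLSA models the probability of observing a visual word w in a scene (document) d as a mixture over latent topics z:

\[ P(w, d) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z), \]

where P(w | z) is a topic-specific distribution over the visual vocabulary and P(z | d) is the topic mixture of scene d.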
Thus, to classify an existing set of visual scenes (videos), we must estimate the model parameters through a fitting process, typically expectation maximization (EM). The learned parameters can then be used to predict the category of a previously unseen scene.
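As a rough illustration of this fitting and prediction step (a sketch, not the exact pipeline we will use), the following code runs EM on a matrix of visual-word counts and then "folds in" an unseen histogram by re-estimating only its topic mixture while keeping P(w | z) fixed; it assumes the videos have already been quantized into bag-of-features histograms, and all function and variable names are illustrative:

import numpy as np

def fit_plsa(counts, n_topics, n_iters=100, seed=0, eps=1e-12):
    """EM for PLSA. counts: (n_docs, n_words) visual-word count matrix.
    Returns P(w|z) with shape (n_topics, n_words) and P(z|d) with shape
    (n_docs, n_topics)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialisation of the conditional distributions.
    p_w_given_z = rng.random((n_topics, n_words))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((n_docs, n_topics))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|d,w); joint[d, z, w] = P(z|d) * P(w|z).
        joint = p_z_given_d[:, :, None] * p_w_given_z[None, :, :]
        joint /= joint.sum(axis=1, keepdims=True) + eps
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * joint
        p_w_given_z = expected.sum(axis=0)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + eps
        p_z_given_d = expected.sum(axis=2)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + eps
    return p_w_given_z, p_z_given_d

def classify_unseen(hist, p_w_given_z, n_iters=50, seed=0, eps=1e-12):
    """Fold-in: hist is the (n_words,) visual-word histogram of a new video.
    Only its topic mixture P(z|d_new) is re-estimated; P(w|z) stays fixed."""
    rng = np.random.default_rng(seed)
    n_topics = p_w_given_z.shape[0]
    p_z = rng.random(n_topics)
    p_z /= p_z.sum()
    for _ in range(n_iters):
        # E-step restricted to the new document.
        resp = p_z[:, None] * p_w_given_z
        resp /= resp.sum(axis=0, keepdims=True) + eps
        # M-step for P(z|d_new) only.
        p_z = (resp * hist[None, :]).sum(axis=1)
        p_z /= p_z.sum() + eps
    # Report the most probable topic as the predicted category (the mapping
    # from topics to semantic categories is an assumption of this sketch).
    return int(np.argmax(p_z)), p_z

Keeping P(w | z) fixed during fold-in is what makes prediction on unseen scenes cheap: only the low-dimensional topic mixture of the new video has to be estimated.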
Similar work has been done for text classification [1], image classification [2], and human activity recognition [3].
We will construct a small dataset of videos of different objects in motion, with several sample videos for each of the semantic categories we intend to learn. We will also investigate existing vision datasets for suitability.