Dating photos by decade

Dan Allen
CS 134 project proposal
dsallen@cs.dartmouth.edu

Introduction

Image recognition and classification are fields that have been advancing rapidly. Machine-learning techniques now make it possible for computers to "learn" what an object looks like and to decide whether an object belongs to a given class. Unlike much previous work on this problem, I intend to focus not on single objects or classes of objects present in photos, but on other qualities that may carry information about the photo. For example, a type of film that was popular at one time but later became obsolete may leave clues that suggest when a photo was taken. In other words, I propose to see whether a computer can "learn" what a decade looks like. One problem I expect to encounter is that the lines between decades are considerably blurry: photos are taken continuously, not in one discrete batch per decade. A photograph taken in 1949, for example, might look more similar to one from 1951 than to one from 1940. My goal is to take an arbitrary photo known to be from a specific decade and have it classified as belonging to that decade. This is easy for most humans; even an untrained eye can usually date a photo fairly accurately. I want to see how difficult this problem is for machine learning.

Methods

In light of the likely problems outlined above, I have chosen a method based on Support Vector Machines (SVMs). Feature vectors will first be extracted from the photos; this will require additional research, as the set of features I want to extract is quite specific. For feature extraction I have decided to use Principal Component Analysis (3). The resulting vectors will then be mapped to a high-dimensional space (roughly 1000 dimensions) and classified following the approach of Chapelle et al. (1), which was shown to give good linear separability between training data in high dimensions. To implement this project, I plan to use a combination of Python with the mlpy library (2) and MATLAB. After reviewing the literature (5), I will also implement k-Nearest Neighbor (k-NN) classification to supplement my findings.
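To make the pipeline concrete, here is a minimal sketch of the two pieces described above: PCA feature extraction followed by a k-NN classifier. It uses only NumPy (not mlpy), image loading and the SVM stage are omitted, and all names and dimensions are illustrative rather than final.

```python
import numpy as np

def pca_features(X, n_components):
    """Project row-vector images X (n_samples x n_pixels) onto the top
    principal components, found via eigen-decomposition of the covariance."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # center the data
    cov = Xc.T @ Xc / (len(X) - 1)                  # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top, mean, top                      # features + projection info

def knn_predict(train_feats, train_labels, query, k=3):
    """Label a query feature vector by majority vote among its k nearest
    training vectors under Euclidean distance."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = train_labels[np.argsort(dists)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```

A test photo would be centered with the stored `mean`, projected with `top`, and then passed to `knn_predict`; the SVM classifier would consume the same PCA features.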

Dataset

The dataset will be collected from the Google LIFE image archive (4), which contains millions of photographs spanning roughly 250 years and is browsable by decade. I may also include photos from another historical archive, Shorpy (6). I will create training and test data sets for six decades: the 1860s, 1900s, 1920s, 1940s, 1960s, and 1970s. By the time I do my final testing, I intend to be using a training set of about 1,000 examples and a test set of about 100; these numbers may fluctuate somewhat before the project is complete. The datasets will include a wide range of photos with many differences and variations, including both color and monochrome.

Timeline

4-12 -- 4-19: Gather and process data, begin coding
4-19 -- 4-26: Implement SVM and k-NN
4-26 -- 5-10: Begin first test runs
5-11 -- 5-12: Milestone; present preliminary results
5-13 -- 5-31: More testing, compare both results
6-1: Final Presentation

References

1) Chapelle O., Haffner P., Vapnik V. N. Support Vector Machines for Histogram-Based Image Classification. IEEE Transactions on Neural Networks, 1999;10(5):1055-64.

2) mlpy homepage.

3) Shlens J. A Tutorial on Principal Component Analysis. Institute for Nonlinear Science, UCSD, 2005.

4) Google LIFE image archive.

5) Boiman O., Shechtman E., Irani M. In Defense of Nearest-Neighbor Based Image Classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

6) Shorpy image archive.