Collaborative Filtering Algorithm Applied to MovieLens Data

CS134 Machine Learning Project Spring 2012
Sandeep Nuckchady

Introduction

Many e-commerce applications use recommender systems to predict which products existing or new users might like. The predictions may be based on the ratings stored in a database and/or on other side information. Two well-known approaches are content-based filtering and collaborative filtering. In content-based filtering, new items are recommended based on their similarity (e.g. shared characteristics or attributes) to items the user has previously favored. Collaborative filtering, on the other hand, bases its predictions only on what users previously liked: if a user x gives a high rating to items i1 and i2 (where a high rating means the user likes the item very much), it is likely that another user u who gave a high rating to item i1 would also rate item i2 highly.

Problem Definition

We have a matrix defined over a list of n users {u_1,u_2,...,u_n} and a list of m items {i_1,i_2,...,i_m}, whose entries are the ratings given by a user to an item. The aim is to predict the unobserved ratings from the observed ones. Note that the matrix is usually very sparse.
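
To make the setup concrete, the following is a minimal sketch of such a matrix in Python with NumPy. The toy ratings, the 0-based indexing, and the use of NaN to mark unobserved entries are illustrative assumptions, not part of the dataset or of any particular algorithm.

```python
import numpy as np

# Hypothetical toy data: (user index, item index, rating) triples.
ratings = [(0, 0, 5), (0, 2, 3), (1, 0, 4), (2, 1, 2), (2, 2, 5)]
n_users, n_items = 3, 4

# NaN marks an unobserved rating, i.e. an entry we would like to predict.
R = np.full((n_users, n_items), np.nan)
for u, i, r in ratings:
    R[u, i] = r

observed = ~np.isnan(R)                      # boolean mask of known ratings
sparsity = 1.0 - observed.sum() / R.size     # fraction of missing entries
```

Here 5 of the 12 entries are observed, so 7/12 of the matrix is missing. For the dataset used in this project, 100,000 ratings over 943 × 1682 entries means that only about 6% of the matrix is observed.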

Algorithms

Several state-of-the-art algorithms are listed below:

  • K nearest neighbor: Each item is represented by a vector of the ratings that users have given it. Similarity (e.g. Pearson correlation coefficient or cosine similarity) is then computed between this item and the other items. The K most similar items, or those within some similarity threshold, are found, and the prediction for the queried user on an item he has not yet rated is a weighted average of his ratings on those neighbors [4].
  • Matrix Factorization:
    • Low dimensional matrix factorization: Assume that the user-item matrix is denoted by M.

      Let $Z \approx M$

      Then, $Z = U^{T} L$

      The item feature vector, L, is usually called the input and the user feature vector, U, is called the weight vector.
    • Singular Value Decomposition

      Many variants of the SVD were used in the 2009 Netflix Prize. The reader is referred to [2] for a summary of the different algorithms. It has also been shown in [6] that applying the SVD without taking missing ratings into account (the matrix being very sparse) would invariably give poor, inaccurate predictions.

  • Restricted Boltzmann Machine

    This is based on a neural network. It consists of one layer of visible units and another layer of hidden units. Nodes in different layers are connected, whereas nodes in the same layer are not. The authors of [5] describe the Contrastive Divergence algorithm, a learning method for approximately maximizing a likelihood function.
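
As an illustration of the K nearest neighbor approach in the list above, here is a minimal item-based sketch in Python. It assumes cosine similarity computed over co-rated users, a rating of 0 as the marker for "unrated", and a made-up 4 × 4 matrix; the function names are hypothetical and this is not the method of [4].

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity over the users who rated both items (0 = unrated)."""
    both = (a > 0) & (b > 0)
    if not both.any():
        return 0.0
    denom = np.linalg.norm(a[both]) * np.linalg.norm(b[both])
    return float(a[both] @ b[both] / denom)

def predict_knn(R, user, item, k=2):
    """Predict R[user, item] from the k most similar items the user has rated."""
    rated = [j for j in range(R.shape[1]) if j != item and R[user, j] > 0]
    sims = [(cosine_sim(R[:, item], R[:, j]), j) for j in rated]
    top = sorted(sims, reverse=True)[:k]
    total = sum(s for s, _ in top)
    if total <= 0:
        return None  # no usable neighbours for this user/item pair
    # Similarity-weighted average of the user's ratings on the neighbours.
    return sum(s * R[user, j] for s, j in top) / total

# Made-up ratings: rows are users, columns are items, 0 means unrated.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

pred = predict_knn(R, user=0, item=2, k=2)
```

In this toy matrix item 2 is rated most like item 3, which user 0 rated low, so the prediction for user 0 on item 2 lands in the lower half of the rating scale.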

Project Goal

The goal of this project is to implement some of the matrix factorization algorithms, with a focus on an incremental variation of the SVD. To gain an appreciation of the work done in this field, the aim is to start with the basics and increase the difficulty and complexity as more understanding is acquired.
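
As a rough idea of what such an implementation could look like, the sketch below fits the factorization Z = U^T L by stochastic gradient descent over the observed ratings only, with an L2 penalty on both factors. This is one common reading of "incremental SVD with regularization" and is an assumption here, not necessarily the algorithm of [6]; the rank, learning rate, regularization weight, and toy ratings are all illustrative.

```python
import numpy as np

def train_mf(triples, n_users, n_items, rank=2, lr=0.01, reg=0.05,
             epochs=500, seed=0):
    """Fit Z = U^T L by SGD over observed (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((rank, n_users))   # user feature vectors
    L = 0.1 * rng.standard_normal((rank, n_items))   # item feature vectors
    for _ in range(epochs):
        for u, i, r in triples:
            err = r - U[:, u] @ L[:, i]              # error on one rating
            u_old = U[:, u].copy()
            # Gradient step on squared error plus L2 regularization.
            U[:, u] += lr * (err * L[:, i] - reg * U[:, u])
            L[:, i] += lr * (err * u_old - reg * L[:, i])
    return U, L

def rmse(triples, U, L):
    """Root mean squared error of U^T L on the given ratings."""
    errs = [r - U[:, u] @ L[:, i] for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errs))))

# Made-up ratings for 3 users and 3 items.
triples = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
U0, L0 = train_mf(triples, 3, 3, epochs=0)   # untrained starting point
U, L = train_mf(triples, 3, 3)               # after training
```

Comparing `rmse(triples, U0, L0)` with `rmse(triples, U, L)` shows how much the training error drops. Because only observed entries enter the loop, missing ratings are never treated as zeros.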

Dataset

MovieLens data will be used [3]. This data was collected by the GroupLens Research group at the University of Minnesota from 1997 to 1998. There are three different versions, each larger than the last, and the one chosen has 943 users, 1682 movies and 100,000 ratings (ranging from 1 to 5). It does not include users who have provided fewer than 20 ratings. The extracted files include U1.base and U1.test, split in a ratio of 4:1, which will be used for the training and testing parts of this project.
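
A small sketch of how the training and test files could be read and scored is given below. It assumes each line holds a user id, item id, rating, and timestamp separated by tabs, which is the layout of the 100,000-rating release as far as we know; the sample lines are made up so that the example runs without the real files, and the global-mean predictor is only a trivial baseline.

```python
import io
import math

def read_ratings(handle):
    """Parse lines of the form 'user<TAB>item<TAB>rating<TAB>timestamp'."""
    rows = []
    for line in handle:
        if not line.strip():
            continue                      # skip blank lines
        user, item, rating, _timestamp = line.split("\t")
        rows.append((int(user), int(item), float(rating)))
    return rows

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    se = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(se) / len(se))

# Made-up sample lines standing in for the training and test files.
train = read_ratings(io.StringIO(
    "1\t1\t5\t874965758\n1\t2\t3\t876893171\n2\t1\t4\t888550871\n"))
test = read_ratings(io.StringIO("2\t2\t3\t888551853\n"))

# Trivial baseline: predict the training mean for every test rating.
global_mean = sum(r for _, _, r in train) / len(train)
baseline = rmse([global_mean] * len(test), [r for _, _, r in test])
```

With the real data, the same function can be given an open file handle instead of the in-memory strings. The RMSE of the global-mean baseline gives a reference point that any factorization model should beat.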

Timeline

On April 12: Project Proposal

By April 19: Implement a basic matrix factorization algorithm

By April 26: Implement Incremental SVD with regularization

By May 3: Start analysis of data and collect results

By May 8: Project Milestone; Write report/presentation and show results

By May 17: Catch-up time; writing project; compare RMSE with M3F and latent vector

By May 24: Project should be almost complete, design poster and finish project write-up

On May 28: Poster Presentation

References

[1] http://www.cs.toronto.edu/~hinton/csc2515/notes/pmf_tutorial.pdf

[2] The BigChaos Solution to the Netflix Grand Prize, Andreas Toscher, Michael Jahrer, Robert M. Bell, 2009

[3] http://www.grouplens.org/node/73

[4] Clustering Items for Collaborative Filtering, Mark O'Connor & Jon Herlocker, ACM SIGIR Workshop, 1999

[5] Restricted Boltzmann Machines for Collaborative Filtering, Ruslan Salakhutdinov, Andriy Mnih, Geoffrey Hinton, ICML, 2007

[6] Incremental Singular Value Decomposition Algorithms for Highly Scalable Recommender Systems, Badrul Sarwar, George Karypis, Joseph Konstan, John Riedl, GroupLens Research Group