CS 34 Project Proposal - Personalized RSS Feed Aggregation using Support Vector Machines

CS 34 Project Proposal -- John Eikens

Concept

One problem inherent to RSS syndication is the "information overload" a user experiences after subscribing to many feeds. With thousands of RSS news items published every day, it is becoming increasingly difficult to separate useful or interesting stories from junk. To address this problem, some RSS aggregators have begun including filters that automatically hide irrelevant stories using machine learning techniques such as naive Bayes classification (1).

I propose using support vector machines, trained on user ratings, to filter out irrelevant stories, as support vector machines have been shown to be efficient and accurate in text classification tasks (2). During an initial training period, the user will classify RSS items as "Interesting" or "Junk". The program will use these classified items as a training set for a support vector machine (SVM), which will then classify all future RSS items. After the training phase, the user will still be able to recategorize items; each correction will be added to the training set and the SVM retrained, which will affect future categorization.
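As a rough illustration of the train-then-classify loop described above, here is a minimal sketch of a linear SVM trained by per-example subgradient descent on the L2-regularized hinge loss. The proposal does not fix a particular SVM formulation, so the linear kernel, the training procedure, the class name LinearSvmSketch, and the parameters lambda, rate, and epochs are all assumptions made for illustration.

```java
/** Minimal linear SVM sketch: hinge loss with L2 regularization (illustrative only). */
final class LinearSvmSketch {
    private final double[] weights;
    private double bias;

    LinearSvmSketch(int dimensions) {
        weights = new double[dimensions];
    }

    /** Raw decision value w.x + b; its sign is the predicted class. */
    double score(double[] x) {
        double s = bias;
        for (int j = 0; j < weights.length; j++) {
            s += weights[j] * x[j];
        }
        return s;
    }

    /** Returns +1 ("Interesting") or -1 ("Junk"). */
    int classify(double[] x) {
        return score(x) >= 0 ? 1 : -1;
    }

    /**
     * Repeated passes over the labeled items. Labels must be +1 or -1.
     * lambda is the regularization strength and rate the step size
     * (both hypothetical tuning parameters).
     */
    void train(double[][] xs, int[] ys, double lambda, double rate, int epochs) {
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < xs.length; i++) {
                // An example contributes to the hinge loss when its margin is below 1.
                boolean violated = ys[i] * score(xs[i]) < 1.0;
                for (int j = 0; j < weights.length; j++) {
                    weights[j] *= 1.0 - rate * lambda;           // regularization shrink
                    if (violated) {
                        weights[j] += rate * ys[i] * xs[i][j];   // hinge subgradient step
                    }
                }
                if (violated) {
                    bias += rate * ys[i];
                }
            }
        }
    }
}
```

In this sketch, retraining after the user recategorizes an item could be as simple as calling train again on the updated labeled set, since training continues from the current weights. A real implementation might instead use an established SVM library or a kernelized formulation.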

Details

Developed in: Java.
Dataset: Because this project classifies stories according to individual user preferences, I will evaluate its performance on data I collect myself from a group of testers.

Goals/Timeline

By the milestone, I would like to have:

A basic working version of the RSS aggregator with a user rating system so I can collect trial data from a group of testers.
Code that converts an RSS item (.xml file) into a feature vector suitable as input to an SVM.
(Hopefully) A working implementation of the specific form of support vector machine I choose to use.
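
The conversion from an RSS item to a feature vector could be sketched as follows. This is only one possible representation, a bag of words with raw term counts over the item's title and description; the proposal does not specify the features, so the choice of fields, the tokenization rule, the fixed vocabulary list, and the class name RssItemFeatures are assumptions. XML parsing uses the JDK's built-in DOM parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Sketch: turn the first <item> of an RSS document into a term-count vector. */
final class RssItemFeatures {

    /** Returns the title and description text of the first <item>. */
    static String itemText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds are untrusted input, so refuse DOCTYPE declarations (guards against XXE).
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            if (items.getLength() == 0) {
                throw new IllegalArgumentException("no <item> element found");
            }
            Element item = (Element) items.item(0);
            return childText(item, "title") + " " + childText(item, "description");
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse RSS item", e);
        }
    }

    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    /** Lowercases the text, splits on non-alphanumeric runs, and counts each term. */
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Maps term counts onto a fixed vocabulary; position i holds the count of vocabulary.get(i). */
    static double[] toVector(Map<String, Integer> counts, List<String> vocabulary) {
        double[] vector = new double[vocabulary.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = counts.getOrDefault(vocabulary.get(i), 0);
        }
        return vector;
    }
}
```

The vocabulary would presumably be built from the words seen in the training items. Weighting schemes such as TF-IDF, stop-word removal, or stemming are common refinements in text categorization (3) but are left out of this sketch.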

By the end of the project, I would like to have:

A finished, working RSS aggregator, which allows the user to subscribe to a number of feeds and classify news items as "Interesting" or "Junk".
Code that converts individual news items into feature vectors and uses them either to train a support vector machine or to classify new items with it.
A display of "Interesting" stories based on the classifications of the user-trained SVM.
An evaluation of the effectiveness of my approach (from the testers' results).
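
One way the evaluation in the last goal could be carried out is to compare the filter's predictions against the testers' own labels on held-out items and report accuracy, precision, and recall for the "Interesting" class. The proposal does not name its metrics, so these three measures and the class name FilterEvaluation are assumptions for illustration.

```java
/** Sketch: confusion-matrix metrics, treating "Interesting" as the positive class. */
final class FilterEvaluation {
    private int truePos;
    private int falsePos;
    private int trueNeg;
    private int falseNeg;

    /** actual[i] is the tester's label, predicted[i] the filter's; true means "Interesting". */
    FilterEvaluation(boolean[] actual, boolean[] predicted) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("label arrays differ in length");
        }
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i]) {
                if (actual[i]) truePos++; else falsePos++;
            } else {
                if (actual[i]) falseNeg++; else trueNeg++;
            }
        }
    }

    /** Fraction of all items labeled correctly. */
    double accuracy() {
        int total = truePos + falsePos + trueNeg + falseNeg;
        return total == 0 ? 0.0 : (double) (truePos + trueNeg) / total;
    }

    /** Of the items shown as "Interesting", the fraction the tester agreed with. */
    double precision() {
        int shown = truePos + falsePos;
        return shown == 0 ? 0.0 : (double) truePos / shown;
    }

    /** Of the items the tester found interesting, the fraction the filter showed. */
    double recall() {
        int interesting = truePos + falseNeg;
        return interesting == 0 ? 0.0 : (double) truePos / interesting;
    }
}
```

For a news filter, recall may matter as much as accuracy, since a hidden story the user wanted is a costlier mistake than a junk item that slips through.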

Some References

(1) Banos, E., et al. "PersoNews: A personalized news reader enhanced by machine learning and semantic filtering." Lecture Notes in Computer Science 4275 (2006): 975-982.
(2) Joachims, Thorsten. "Text categorization with support vector machines: Learning with many relevant features." Lecture Notes in Computer Science 1398 (1998): 137-142.
(3) Sebastiani, Fabrizio. "Machine Learning in Automated Text Categorization." ACM Computing Surveys 34.1 (2002): 1-47.