CS 34 Project Proposal -- John Eikens

Personalized RSS Feed Aggregation using Support Vector Machines

Concept

One of the problems inherent to RSS syndication is the "information overload" users experience once they have subscribed to many feeds. With thousands of RSS news items published every day, it is becoming increasingly difficult to separate useful or interesting stories from junk. To address this problem, some RSS aggregators have begun including filters that automatically hide irrelevant stories using machine learning techniques such as Naive Bayesian classification (1).

I propose using support vector machines to filter out irrelevant stories based on user ratings, as support vector machines have been shown to be efficient and accurate in text classification tasks (2). During an initial training period, the user will classify RSS items as "Interesting" or "Junk". The program will use these classified items as the training set for a support vector machine (SVM), which will then classify all future RSS items. After the training phase, the user will still be able to recategorize items; each correction will retrain the SVM and affect future categorization.
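As a rough illustration of the rate, classify, and retrain loop described above, the following Python sketch may help. It is not the proposed implementation: the class name FeedFilter, its methods, the bag-of-words features, and the hyperparameter values are all assumptions made for this example. The hand-rolled trainer only approximates a linear soft-margin SVM (subgradient descent on the regularized hinge loss, with no bias term) and stands in for whatever SVM library the project ends up using.

```python
import re


def features(text):
    """Bag-of-words term counts for one RSS item's title or description."""
    counts = {}
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        counts[token] = counts.get(token, 0.0) + 1.0
    return counts


class FeedFilter:
    """Hypothetical filter: stores user ratings and retrains on every change."""

    def __init__(self, lam=0.01, eta=0.1, epochs=100):
        # lam: regularization strength, eta: step size (assumed values).
        self.lam, self.eta, self.epochs = lam, eta, epochs
        self.ratings = {}  # item text -> +1 (Interesting) or -1 (Junk)
        self.weights = {}  # token -> learned weight

    def _score(self, feats):
        return sum(self.weights.get(tok, 0.0) * val for tok, val in feats.items())

    def _train(self):
        # Subgradient descent on the L2-regularized hinge loss, i.e. the
        # primal objective of a linear soft-margin SVM without a bias term.
        self.weights = {}
        data = [(features(text), y) for text, y in self.ratings.items()]
        for _ in range(self.epochs):
            for feats, y in data:
                violated = y * self._score(feats) < 1.0
                # Regularization step: shrink every weight slightly.
                for tok in self.weights:
                    self.weights[tok] *= 1.0 - self.eta * self.lam
                # Hinge-loss step: only items inside the margin push weights.
                if violated:
                    for tok, val in feats.items():
                        self.weights[tok] = self.weights.get(tok, 0.0) + self.eta * y * val

    def rate(self, text, label):
        """Record or change a user rating, then retrain from all ratings."""
        self.ratings[text] = 1 if label == "Interesting" else -1
        self._train()

    def classify(self, text):
        # Items with no evidence either way (score 0) are shown, not hidden.
        return "Interesting" if self._score(features(text)) >= 0.0 else "Junk"


feed = FeedFilter()
feed.rate("New Python release adds pattern matching", "Interesting")
feed.rate("Machine learning library speeds up training", "Interesting")
feed.rate("Celebrity gossip: star spotted at beach", "Junk")
feed.rate("Win a free prize, click now", "Junk")
print(feed.classify("Python machine learning tutorial"))  # Interesting
print(feed.classify("Celebrity gossip roundup"))          # Junk

# The user recategorizes an item; the model is retrained from all ratings.
feed.rate("Celebrity gossip: star spotted at beach", "Interesting")
print(feed.classify("Celebrity gossip roundup"))          # Interesting
```

Retraining from scratch on every rating keeps the sketch simple and is cheap for a few hundred rated items; a larger training set would probably call for incremental updates or batched retraining.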

Details

Goals/Timeline

By the milestone, I would like to have:

By the end of the project, I would like to have:

Some References

  1. Banos, E., et al. "PersoNews: A personalized news reader enhanced by machine learning and semantic filtering." Lecture Notes in Computer Science 4275 (2006): 975-982.
  2. Joachims, Thorsten. "Text categorization with support vector machines: Learning with many relevant features." Lecture Notes in Computer Science 1398 (1998): 137-142.
  3. Sebastiani, Fabrizio. "Machine learning in automated text categorization." ACM Computing Surveys 34.1 (2002): 1-47.