Project Proposal

Introduction

   Digg is a social news website where users decide which stories are news. Articles are submitted to an "upcoming" section, where users who like them can "digg" them. Those that get enough diggs are promoted to the site's front page, where they are often rewarded with massive volumes of traffic. Websites ranging from the respectable to the esoteric have reached the front page of Digg.
   In my project, I will examine newly submitted stories in the "upcoming" section of Digg and try to predict whether they will reach the front page. This problem is well-suited to a machine learning approach, since hand-coding a function to solve it would be extremely difficult.
   I hope to learn some interesting things about this unusual website, and perhaps gain some insight into the factors that affect an article getting to the front page.

Data Set

   Digg has a comprehensive public API, which allows easy access to important information like which stories have been recently submitted, which stories recently made the front page, and which users have had the most success getting front-page stories. I plan to collect about a week's worth of data to use for training and prediction.
Some features of the stories that I plan to use for learning are:
1. Website the article points to
   - number of articles from that site that have reached the front page
   - percentage of articles from that site that have reached the front page
2. Keywords in the article title
   - percentage of articles containing those keywords that have reached the front page
3. User submitting the article
   - rank of the user
   - number of friends
   - number/percentage of stories the user has submitted that have reached the front page
4. Diggs on the story
   - number of diggs so far, collected at regular intervals
   - stats/rankings of the most popular users who dugg the story
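
   As a rough illustration of how the per-site features in item 1 might be computed, here is a minimal sketch. It assumes each collected story is a dict with the hypothetical keys "link" (the article URL) and "promoted" (whether it reached the front page); these names are placeholders, not fields of the Digg API.

```python
from collections import defaultdict
from urllib.parse import urlparse


def domain_front_page_stats(stories):
    """Return {domain: (promoted_count, promoted_fraction)}.

    `stories` is an iterable of dicts with the hypothetical keys
    "link" (article URL) and "promoted" (True if it reached the front page).
    """
    submitted = defaultdict(int)
    promoted = defaultdict(int)
    for story in stories:
        # Group stories by the host name of the site they point to.
        domain = urlparse(story["link"]).netloc.lower()
        submitted[domain] += 1
        if story["promoted"]:
            promoted[domain] += 1
    return {d: (promoted[d], promoted[d] / submitted[d]) for d in submitted}


# Made-up records, for illustration only.
sample = [
    {"link": "http://example.com/a", "promoted": True},
    {"link": "http://example.com/b", "promoted": False},
    {"link": "http://example.org/c", "promoted": False},
]
stats = domain_front_page_stats(sample)
```

   The same counting pattern would presumably carry over to the keyword and submitter features. To avoid leaking the label into the features, these statistics should probably be computed only from stories submitted before the one being classified.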

Algorithms

   This project is essentially a binary classification problem: each story must be assigned to one of two classes, depending on whether it is predicted to reach the front page. I suspect this particular problem may be quite tractable, since a story's success seems to be very strongly correlated with some of the features mentioned above. For example, more than half of the front-page stories come from just 10 users, suggesting a strong link between a story's success and its submitter. And of course, the number of diggs on a story has an even more direct relationship to success, since a story gets to the front page only when it has a certain number of votes.
   Taking this into consideration, I will first attempt a simple classification algorithm such as logistic regression. Depending on the results, I may move on to a support vector machine (SVM) or another, more complex classification algorithm.
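
   To make the first step concrete, here is a minimal sketch of logistic regression trained by batch gradient descent, using only the Python standard library. The two features, the toy data, the learning rate, and the number of epochs are all made up for illustration; they are not results or settings from this project.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w[0] is the bias term; w[1:] are the per-feature weights.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Toy rows: [diggs in the first hour / 100, submitter's past front-page rate]
X = [[0.05, 0.0], [0.10, 0.1], [0.20, 0.0], [0.80, 0.6], [0.90, 0.4], [1.20, 0.7]]
y = [0, 0, 0, 1, 1, 1]  # 1 = reached the front page
w = train_logistic(X, y)
```

   Calling predict_proba(w, story_features) then gives an estimated probability of promotion, which can be thresholded at 0.5 (or another cutoff) to produce the class label.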
   I also plan to analyze the relative importance of each feature in determining the story's class.
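
   One possible way to compare features, assuming a linear model such as logistic regression, is to scale each weight by the standard deviation of its feature so that features measured in different units become comparable. This is only a common heuristic, not a method this proposal commits to, and the feature names and weights below are invented for illustration.

```python
import statistics


def rank_features(names, weights, X):
    """Rank features by |weight| * standard deviation of the feature column.

    `weights` holds one weight per feature (no bias term); `X` is a list of
    feature rows in the same column order as `names`.
    """
    scores = []
    for j, (name, wj) in enumerate(zip(names, weights)):
        column = [row[j] for row in X]
        scores.append((name, abs(wj) * statistics.pstdev(column)))
    # Largest scaled weight first.
    return sorted(scores, key=lambda item: item[1], reverse=True)


names = ["diggs_first_hour", "submitter_rate"]   # hypothetical feature names
weights = [0.02, 3.0]                            # illustrative, not fitted
X = [[10, 0.0], [50, 0.1], [200, 0.5], [400, 0.9]]
ranking = rank_features(names, weights, X)
```

   In this toy example the digg count ranks first even though its raw weight is much smaller, because it varies over a far wider range than the submitter's success rate.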

Timeline

- Collect data set (April 28th)
- Implement learning algorithm and analyze preliminary results (May 3rd)
- Create Milestone Report (May 10th)
- Modify algorithm and data set features, as required
- Create graphs and charts describing results
- Create Poster and write Final Report

References

1. Digg API Documentation: http://apidoc.digg.com/