Milestone Report
What follows is a summary of my progress up to and including May 12th.
Data Collection:
Data collection was slower than expected. The Digg API itself is slow: it is limited to 100 items per query, with a one-second delay requested between queries. Nevertheless, over the course of a weekend I managed to download 3 million stories, amounting to 1.26 gigabytes. I noticed, however, that while this included the story descriptions, story links, basic information on the submitting users, and the total number of diggs, it was missing other information I needed, such as the total number of friends a user has or a story's history of diggs. These had to be obtained through separate queries.
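For reference, the collection amounts to paginated queries with a forced pause between them. The sketch below shows the idea in Python; the endpoint URL, parameter names, and response keys are placeholders rather than the actual Digg API specification.

    import json
    import time
    import urllib.request

    # Placeholder endpoint and parameters; the real Digg API endpoint,
    # parameter names, and response layout may differ.
    BASE_URL = "http://services.digg.com/stories"
    PAGE_SIZE = 100       # the API returns at most 100 items per query
    DELAY_SECONDS = 1.0   # the API asks for a one-second pause between queries

    def fetch_stories(max_stories):
        """Page through the story listing while respecting the rate limit."""
        stories = []
        offset = 0
        while offset < max_stories:
            url = f"{BASE_URL}?count={PAGE_SIZE}&offset={offset}&type=json"
            with urllib.request.urlopen(url) as response:
                page = json.loads(response.read().decode("utf-8"))
            batch = page.get("stories", [])
            if not batch:
                break                  # no more stories to fetch
            stories.extend(batch)
            offset += PAGE_SIZE
            time.sleep(DELAY_SECONDS)  # stay within the requested rate limit
        return stories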
I began these queries, but soon realized that completing them would be infeasible in both time and space. Many users had friends and fans numbering in the thousands, and these were returned to me 100 at a time. Organizing all this data and converting the stories into feature-vector matrices for MATLAB was also tricky, because I was using built-in Python hash tables to store the user and link data; these refused to fit in main memory and were causing lots of page-outs to disk. So, until I can migrate to proper database software, I decided:
- I would limit myself to 200,000 stories, and just the links, users, and keywords appearing in those 200,000 stories.
- I would drop the features relating to diggs.
Even with these limitations, I didn't manage to finish creating the data set and making the feature-vector matrices for MATLAB until May 9th.
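As a sketch of what the planned move away from in-memory hash tables could look like, the per-user counts could live in an on-disk SQLite table instead of a Python dictionary. The table and column names here are illustrative, not an actual schema; the same idea would apply to the link and keyword tables.

    import sqlite3

    # Illustrative schema; table and column names are placeholders.
    conn = sqlite3.connect("digg.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS users (
            name        TEXT PRIMARY KEY,
            submissions INTEGER NOT NULL DEFAULT 0,
            front_page  INTEGER NOT NULL DEFAULT 0
        )
    """)

    def record_submission(username, reached_front_page):
        """Update a user's counts on disk instead of in main memory."""
        conn.execute("INSERT OR IGNORE INTO users (name) VALUES (?)", (username,))
        conn.execute(
            "UPDATE users SET submissions = submissions + 1, "
            "front_page = front_page + ? WHERE name = ?",
            (1 if reached_front_page else 0, username),
        )

    # Example usage: record one story, committing in batches for speed.
    record_submission("example_user", reached_front_page=False)
    conn.commit()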
Training:
I used logistic regression as a first attempt at classifying the stories. It took approximately 36 hours for the algorithm to converge on a training set of 100,000 stories with 98 features each.
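The training itself was done in MATLAB; purely as an illustration of the setup, an equivalent fit in Python with scikit-learn would look roughly like the following. The file names and the library choice are assumptions, not what was actually run.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical file names; the real matrices were exported for MATLAB.
    X_train = np.load("train_features.npy")   # shape (100000, 98)
    y_train = np.load("train_labels.npy")     # 1 = reached the front page, 0 = did not

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)

    # model.coef_ plays the role of the weight vector theta discussed below.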
Each story had the following features:
1. User submitting the story:
- Total number of submissions from that user (limited to the training set)
- Number of submissions reaching the front page (limited to training set)
- Percentage of articles reaching the front page (limited to training set)
- Number of friends
- Number of fans
2. Story Link:
- Total number of stories linking to the same domain (limited to training set)
- Number of stories on the front page with the same domain (limited to training set)
- Percentage of stories with that domain that reached the front page (limited to training set)
3. Title and Description keywords:
- Total number of stories containing that keyword (limited to training set)
- Total number of front page stories containing that keyword (limited to training set)
- Percentage of stories with that keyword that reached the front page (limited to training set)
- (keywords were limited to the first 10 words of the title and the first 20 words of the description)
My user, link, and keyword tables were built from only the training set, so the test set often contained links and users that had never been encountered before. Initially, every previously unencountered link or user in the test set had its corresponding link or user features set to zero. This caused problems in the regression results, though: a large number of never-before-seen spam websites in the test set were incorrectly classified as popular. This was probably because every story in the training set had its user and link appear at least once, so the feature counting the number of submissions from a website was always > 0; that feature was assigned a negative weight during logistic regression, and the weight was effectively ignored for new test-set examples with the feature set to 0. To counter this problem, I assigned every previously unencountered user or link in the test set a single failed submission. This got rid of the spam websites.
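Concretely, each user, link, or keyword contributes the same three-number pattern (total count, front-page count, front-page rate), with unseen entries in the test set treated as one failed submission. A small sketch for the user features follows; the names are illustrative, not taken from the actual pipeline, and the domain and keyword features follow the same pattern.

    from collections import defaultdict

    # Counts built from the training set only:
    # user -> [total submissions, front-page submissions]
    user_counts = defaultdict(lambda: [0, 0])

    def count_training_story(user, reached_front_page):
        user_counts[user][0] += 1
        if reached_front_page:
            user_counts[user][1] += 1

    def user_features(user):
        """Total submissions, front-page submissions, and front-page rate.

        A user never seen in training is treated as having one failed
        submission (1 total, 0 front page) instead of all-zero features."""
        total, front = user_counts.get(user, (1, 0))
        return [total, front, front / total]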
Results:
The results of the logistic regression were mixed. The error on the test set was just 1.12%, but the percentage of popular stories in the test set was only 0.62%, which means the regression would have had a smaller percentage error if it had simply predicted "unpopular" for every story.
It is more interesting, though, to see how many of the popular stories among the 50,000 test-set stories were correctly recognized or missed. Out of all the stories:
- 37 of the 313 popular stories were correctly classified as popular (11.8%).
- The remaining 276 popular stories were incorrectly classified as unpopular.
- In addition, 92 unpopular stories were incorrectly classified as popular, so out of the 129 stories guessed to be popular, 28.7% were correct guesses.
- The remaining 49,595 test-set stories were correctly classified as unpopular.
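Restating this breakdown as a confusion matrix, using only the counts reported above, gives the derived rates directly:

    # Counts reported above for the 50,000-story test set.
    true_pos  = 37       # popular, predicted popular
    false_neg = 276      # popular, predicted unpopular
    false_pos = 92       # unpopular, predicted popular
    true_neg  = 49595    # unpopular, predicted unpopular

    precision = true_pos / (true_pos + false_pos)   # share of "popular" guesses that were right
    recall    = true_pos / (true_pos + false_neg)   # share of popular stories that were caught

    print(f"precision = {precision:.1%}, recall = {recall:.1%}")
    # precision = 28.7%, recall = 11.8%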
I created a small file containing my results, including the top 15 users, websites, and keywords by number of popular stories, by number of correct guesses at popular stories, by number of incorrect guesses at popular stories, and by number of stories that were popular in reality but missed by the logistic regression. The file is attached at the bottom of the page.
By examining this file, and also the weights in the vector theta obtained after logistic regression, I made the following observations:
1. The highest-weighted feature, by far, is the percentage of popular submissions from a user.
2. Almost half of the 129 stories classified as popular by the logistic regression were from just 5 users. It classified 20 out of 22 stories by "MrBabyMan" and 13 out of 14 stories from "msaleem" to be popular. The actual number of popular stories from these users was 6 and 4 respectively.
3. Several popular websites (including youtube.com and nytimes.com) had their stories uniformly classified as unpopular, in spite of there being many front-page submissions from these websites. This could be because there are also huge numbers of unpopular stories from these sites.
4. Some lesser-known websites like apod.nasa.gov and fora.tv have very high front-page rates in both the training set and the test set, despite there being relatively few submissions from each of them.
5. Some interesting keywords that appeared to be relevant to correct prediction were "cloud", "pirate", "space", and "photography".
The logistic regression seems to have given a lot of weight to the user features, but this is as expected.
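The weight inspection mentioned above is straightforward to script. The sketch below assumes the fitted weight vector theta and labels for the 98 features have been saved; the file name and label names are placeholders.

    import numpy as np

    # Placeholders: theta is the fitted weight vector, feature_names its 98 labels.
    theta = np.load("theta.npy")
    feature_names = [f"feature_{i}" for i in range(theta.size)]

    # Rank features by the magnitude of their learned weight.
    order = np.argsort(-np.abs(theta))
    for idx in order[:10]:
        print(f"{feature_names[idx]:>12s}  weight = {theta[idx]:+.3f}")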
Planned Improvements/Modifications:
- I plan to complete collecting the data set, perhaps using multiple machines, and also to move to a relational database system to mitigate the memory issues.
- I plan to try more complex feature vectors, perhaps a quadratic function of these features. In addition, I plan to experiment with other classification methods, such as SVMs.
- I would like to add some more interesting features, including more detailed information on the title and description, as well as information about the top users who dugg a story. I would also like to include time-to-front-page as a feature.
- Adding some hidden variables would offer many interesting possibilities, including perhaps predicting total diggs or time to the front page.
References:
1. Digg API documentation and forum: apidoc.digg.com
2. Some others who have tried similar things:
- SocialBlade
- Dugg Analytics
3. Kurt Wilms, Digg Staff

Attachment: test_set_analysis.txt (10 KB, plain text)