Information Classification
CS 34/134 Project Milestone

Mu Lin, Shaohan Hu

May 12, 2009

Progress Summary

We have finished most of the coding work and implemented the majority of the classifiers we intend to use in the project. We also ran experiments on the 20 Newsgroups dataset and on the small spam/ham dataset included in homework 2. The results are shown below.

Progress Detail

Classifiers

We have by now implemented the following classifiers:

Combination Strategy

We have so far implemented three classifier combination approaches:

Experimental Results

The 20 Newsgroups dataset is quite large, containing about 20,000 examples and more than 60,000 features, so training and testing all classifiers and combination strategies takes a long time. We therefore ran our first experiments on the a1spam dataset included in homework 2. Figure 1 plots the test errors of the different classifiers and combination strategies.



Figure 1: Test errors on the spam/ham dataset

Figure 2 shows the test errors obtained from the experiments on the 20 Newsgroups dataset.



Figure 2: Test errors on the 20 Newsgroups dataset

As can be seen from Figure 2, the results for the SVM classifier are missing. Because of the large size of the 20 Newsgroups data, the SVM experiment is still running as of this writing, which gives rise to one of the TO-DOs described in the next section. It is also worth noting that highest confidence voting performs better than all the other approaches in these results, which is encouraging.
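
To make the idea of highest confidence voting concrete, here is a minimal Python sketch. It is an illustration only, not our project code. It assumes that each base classifier outputs a probability estimate for every class and that the combined prediction is the label chosen by whichever classifier is most confident. The function names highest_confidence_vote and error_rate, and the example numbers, are hypothetical.

```python
from typing import Sequence


def highest_confidence_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Return the label predicted by the single most confident classifier.

    probabilities[i][c] is classifier i's estimated probability of class c
    (an assumed interface; real classifiers may expose scores differently).
    """
    best_label = -1
    best_confidence = float("-inf")
    for class_probs in probabilities:
        # This classifier's own prediction and its confidence in it.
        label = max(range(len(class_probs)), key=lambda c: class_probs[c])
        confidence = class_probs[label]
        if confidence > best_confidence:
            best_label, best_confidence = label, confidence
    return best_label


def error_rate(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of examples whose predicted label differs from the true label."""
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)


# Made-up outputs of three classifiers on one two-class example.
example = [
    [0.55, 0.45],  # weakly prefers class 0
    [0.60, 0.40],  # weakly prefers class 0
    [0.05, 0.95],  # strongly prefers class 1
]
combined = highest_confidence_vote(example)
```

In this example a simple majority vote would pick class 0, while highest confidence voting follows the third classifier and picks class 1. The approach depends on the classifiers' confidence values being comparable with one another, which is itself an assumption.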

Next Step