1 Introduction

Our project targets the task of information classification. We use the 20Newsgroups data set [3]; its basic statistics are shown in Table 1.




Number of Classes      20
Size of Vocabulary     61188
Number of Documents    18774



Table 1: 20 Newsgroups Data Set Statistics

The main challenge in classifying the 20Newsgroups data set comes from the number of classes and the large number of words in the original vocabulary. Our preliminary results show that applying classification techniques directly to the original data set yields rather low performance in terms of both accuracy and speed. Table 2 compares the results of applying kNN, Naïve Bayes, and Multi-SVM to a small Spam/Ham data set and to our 20Newsgroups data set.





               Spam/Ham    20Newsgroups
kNN            89.7%       22.4%
Naïve Bayes    93.7%       49.3%
Multi-SVM      93.1%       (Taking too long)




Table 2: Test Accuracy Comparison on Spam/Ham and 20Newsgroups data sets

As Table 2 shows, all three classifiers work well on the small data set but perform poorly on the 20Newsgroups data set, in accuracy or in running time, owing to its large number of classes and large vocabulary. To address this, our project explores dimensionality reduction and classifier combination.
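For concreteness, the following is a minimal sketch of one of the baseline classifiers named above, multinomial Naïve Bayes with Laplace smoothing, written in plain Python. It is an illustration only and not the implementation used for Table 2: the function names `train_nb` and `predict_nb`, the smoothing parameter `alpha`, and the four-document toy corpus are all assumptions introduced here. The sketch shows why vocabulary size matters: every class stores one smoothed log-probability per vocabulary word, so both memory and training cost grow with the number of classes times the vocabulary size.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes; alpha is the Laplace smoothing constant."""
    vocab = {w for d in docs for w in d.split()}
    class_docs = Counter(labels)            # number of documents per class
    word_counts = defaultdict(Counter)      # word frequencies per class
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())

    model = {}
    for y, n_y in class_docs.items():
        # Smoothed denominator: total words in class plus alpha per vocab word.
        denom = sum(word_counts[y].values()) + alpha * len(vocab)
        log_prior = math.log(n_y / len(docs))
        log_lik = {w: math.log((word_counts[y][w] + alpha) / denom) for w in vocab}
        model[y] = (log_prior, log_lik)
    return model, vocab


def predict_nb(model, vocab, doc):
    """Return the class with the highest posterior log-score for doc."""
    tokens = [w for w in doc.split() if w in vocab]  # ignore unseen words
    scores = {
        y: log_prior + sum(log_lik[w] for w in tokens)
        for y, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus (not taken from either data set in Table 2).
docs = [
    "win cash prize now",
    "cheap prize offer now",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

model, vocab = train_nb(docs, labels)
print(predict_nb(model, vocab, "cash prize offer"))
print(predict_nb(model, vocab, "monday project meeting"))
```

With two classes and a twelve-word vocabulary the model holds only 24 likelihood entries; at the scale of Table 1 (20 classes, 61188 words) the same table has over 1.2 million entries, which is one motivation for reducing the dimensionality before classification.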