Our project addresses the task of text classification. The data set we use is the
20Newsgroups data set [3]. Basic statistics of the data set are shown in Table 1.
| Number of Classes   | 20    |
| Size of Vocabulary  | 61188 |
| Number of Documents | 18774 |

Table 1: 20 Newsgroups Data Set Statistics
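These statistics can be reproduced approximately from a public copy of the corpus. The sketch below uses scikit-learn's `fetch_20newsgroups` loader, which is an assumption about tooling on our part; the exact document and vocabulary counts depend on the subset and preprocessing chosen, so they may differ slightly from Table 1.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Load both the train and test splits of 20Newsgroups.
data = fetch_20newsgroups(subset="all")

# Build the raw bag-of-words vocabulary over all documents.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data.data)

print("Number of classes:  ", len(data.target_names))
print("Size of vocabulary: ", len(vectorizer.vocabulary_))
print("Number of documents:", X.shape[0])
```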
The main challenge of carrying out classification on the 20Newsgroups data set
comes from its large number of classes as well as the large size of the original
vocabulary. Our preliminary results show that applying classification techniques directly
to the original data set yields rather low performance, in terms of both accuracy
and speed. A comparison of the results of applying kNN, Naïve Bayes, and Multi-SVM to
a small Spam/Ham data set and to our 20Newsgroups data set is shown in Table 2.
|             | Spam/Ham | 20Newsgroups      |
| kNN         | 89.7%    | 22.4%             |
| Naïve Bayes | 93.7%    | 49.3%             |
| Multi-SVM   | 93.1%    | (taking too long) |

Table 2: Test Accuracy Comparison on the Spam/Ham and 20Newsgroups Data Sets
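The 20Newsgroups half of this baseline is straightforward to approximate. The sketch below is a hedged reconstruction using scikit-learn; the TF-IDF features and the choice of k = 5 for kNN are illustrative assumptions, not necessarily the exact settings behind Table 2, and the Spam/Ham corpus is not reproduced here.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# TF-IDF features over the full ~61k-word vocabulary (an assumption;
# Table 2 may have been produced with different features).
vec = TfidfVectorizer()
X_train = vec.fit_transform(train.data)
X_test = vec.transform(test.data)

for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=5)),
                  ("Naïve Bayes", MultinomialNB())]:
    clf.fit(X_train, train.target)
    acc = clf.score(X_test, test.target)  # mean test accuracy
    print(f"{name}: {acc:.1%}")
```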
As Table 2 shows, all of these classifiers work well on the small data set but perform
rather poorly on the 20Newsgroups data set, owing to its large number of classes and
vocabulary size. To address this, our project explores dimensionality reduction and
classifier combination.
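As one illustration of the dimensionality-reduction direction, the sketch below projects the TF-IDF features onto a low-dimensional subspace with truncated SVD (LSA) before training a linear SVM. The 100-component target and the choice of `LinearSVC` are assumptions made for illustration, not the project's final pipeline.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Project the ~61k-dimensional TF-IDF space down to 100 dense LSA
# components before classifying; 100 is an illustrative choice.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=100),
    LinearSVC(),
)
pipeline.fit(train.data, train.target)
print(f"Test accuracy: {pipeline.score(test.data, test.target):.1%}")
```

Reducing the 61,188-dimensional bag-of-words space to a few hundred dense dimensions makes distance-based classifiers such as kNN, and margin-based ones such as SVMs, far cheaper to train and query.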