Our project addresses the task of text classification. The data set we use is the
20Newsgroups data set [3]. Basic statistics of the data set are shown in Table 1.
| Number of Classes   | 20    |
| Size of Vocabulary  | 61188 |
| Number of Documents | 18774 |

Table 1: 20 Newsgroups Data Set Statistics
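These statistics can be reproduced approximately from a public copy of the corpus. The sketch below uses scikit-learn's `fetch_20newsgroups` loader, which is an assumption about tooling on our part; the exact document and vocabulary counts depend on the subset and preprocessing chosen, so they may differ slightly from Table 1.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Load both the train and test splits of 20Newsgroups.
data = fetch_20newsgroups(subset="all")

# Build the raw bag-of-words vocabulary over all documents.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data.data)

print("Number of classes:  ", len(data.target_names))
print("Size of vocabulary: ", len(vectorizer.vocabulary_))
print("Number of documents:", X.shape[0])
```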
The main challenge of carrying out classification on the 20Newsgroups data set
comes from its large number of classes as well as the large size of the original
vocabulary. Our preliminary results show that applying classification techniques directly
to the original data set yields rather low performance, in terms of both accuracy
and speed. A comparison of the results of applying kNN, Naïve Bayes, and Multi-SVM to
a small Spam/Ham data set and to our 20Newsgroups data set is shown in Table 2.
|             | Spam/Ham | 20Newsgroups      |
| kNN         | 89.7%    | 22.4%             |
| Naïve Bayes | 93.7%    | 49.3%             |
| Multi-SVM   | 93.1%    | (taking too long) |

Table 2: Test Accuracy Comparison on the Spam/Ham and 20Newsgroups Data Sets
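The 20Newsgroups half of this baseline is straightforward to approximate. The sketch below is a hedged reconstruction using scikit-learn; the TF-IDF features and the choice of k = 5 for kNN are illustrative assumptions, not necessarily the exact settings behind Table 2, and the Spam/Ham corpus is not reproduced here.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# TF-IDF features over the full ~61k-word vocabulary (an assumption;
# Table 2 may have been produced with different features).
vec = TfidfVectorizer()
X_train = vec.fit_transform(train.data)
X_test = vec.transform(test.data)

for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=5)),
                  ("Naïve Bayes", MultinomialNB())]:
    clf.fit(X_train, train.target)
    acc = clf.score(X_test, test.target)  # mean test accuracy
    print(f"{name}: {acc:.1%}")
```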
As Table 2 shows, all of these classifiers work well on the small data set but perform
rather poorly on the 20Newsgroups data set, owing to its large number of classes and
vocabulary size. To address this, our project explores dimensionality reduction and
classifier combination.
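As one illustration of the dimensionality-reduction direction, the sketch below projects the TF-IDF features onto a low-dimensional subspace with truncated SVD (LSA) before training a linear SVM. The 100-component target and the choice of `LinearSVC` are assumptions made for illustration, not the project's final pipeline.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

# Project the ~61k-dimensional TF-IDF space down to 100 dense LSA
# components before classifying; 100 is an illustrative choice.
pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=100),
    LinearSVC(),
)
pipeline.fit(train.data, train.target)
print(f"Test accuracy: {pipeline.score(test.data, test.target):.1%}")
```

Reducing the 61,188-dimensional bag-of-words space to a few hundred dense dimensions makes distance-based classifiers such as kNN, and margin-based ones such as SVMs, far cheaper to train and query.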