Information Classification CS 34/134 Project Proposal

Mu Lin, Shaohan Hu

April 20, 2009

Background

The amount of information available to people is growing at an enormous rate, which makes it increasingly difficult to find the information one is actually interested in, especially when a piece of information, such as a plain text document, comes without a pre-assigned topic summary. Labeling the topics of documents by hand is time-consuming and tedious, so there has been growing interest in efficient methods that label document topics automatically. In this project, we plan to study several existing algorithms and apply them to the document classification problem. We would also like to explore the potential gains and losses of different ways of combining these methods.

Methodology and Data Set

We will carry out our experiments on the 20 Newsgroups data set [3]. First, we will study how several commonly used machine learning methods that are potentially suitable for our task perform on this data set:

Naïve Bayes
KNN
Decision Tree
SVM
Neural Networks
MaxEnt
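
As an illustration of the kind of baseline listed above, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. It is not the implementation planned for the project: the function names, the whitespace tokenization, and the four toy documents are all assumptions made for the example, and the toy documents merely stand in for real newsgroup posts.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count class frequencies and per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(count / n_docs)  # log prior
        for word in tokens:
            if word in vocab:  # words never seen in training are ignored
                smoothed = (word_counts[label][word] + 1) / (total + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy documents, not taken from the 20 Newsgroups corpus.
train_docs = [
    "the team won the hockey game".split(),
    "goalie saves win the playoff game".split(),
    "the shuttle reached orbit".split(),
    "nasa launched a new orbit probe".split(),
]
train_labels = ["rec.sport.hockey", "rec.sport.hockey", "sci.space", "sci.space"]

model = train_naive_bayes(train_docs, train_labels)
prediction = predict_naive_bayes(model, "a probe in orbit".split())
print(prediction)
```

On the real data set, the same two functions would be applied to tokenized newsgroup posts, with the newsgroup name serving as the label.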

Then we would like to study four possible classifier combination approaches:

Random Selection - We train each individual classifier on the training data. Then, for each test document, we randomly select one classifier to label it.
Voting - We train each individual classifier on the training data. Then, for each test document, we assign the label chosen by majority vote or by the vote with the highest confidence.
Weighted Linear Combination - We train each individual classifier on the training data. Then, for each test document, we label it using a weighted linear combination of the outputs of all classifiers, where each classifier’s weight is determined by its individual test accuracy.
Dynamic Classifier Selection [2] - We train each individual classifier on the training data. Then, for each test document, we label it with the single classifier that has the highest local accuracy in a small neighborhood surrounding the document in the feature space.
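
The four combination approaches above can be sketched in a few lines each. The code below is only an illustration under several assumptions: the three toy classifiers, the two-dimensional feature vectors, and all function names are hypothetical; the weights and the local accuracies are both estimated on a small labeled held-out set; and the neighborhood is taken to be the k nearest held-out points under Euclidean distance. The actual experiments may choose differently on each of these points.

```python
import random
from collections import Counter

def random_selection(predictions, rng):
    """Return the label of one classifier chosen uniformly at random."""
    return rng.choice(predictions)

def majority_vote(predictions):
    """Return the most common label (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Add each classifier's weight to its label; return the heaviest label."""
    scores = Counter()
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)

def accuracy(clf, data):
    """Fraction of (features, label) pairs that clf labels correctly."""
    return sum(clf(features) == label for features, label in data) / len(data)

def dynamic_selection(x, classifiers, held_out, k=3):
    """Label x with the classifier most accurate on x's k nearest held-out points."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbors = sorted(held_out, key=lambda item: sq_dist(item[0], x))[:k]
    best = max(classifiers, key=lambda clf: accuracy(clf, neighbors))
    return best(x)

# Hypothetical toy classifiers over 2-D feature vectors.
def clf_a(x):
    return "space" if x[0] > 0.5 else "hockey"

def clf_b(x):
    return "space" if x[1] > 0.5 else "hockey"

def clf_c(x):
    return "hockey"

classifiers = [clf_a, clf_b, clf_c]

# Hypothetical labeled held-out points used to estimate accuracies.
held_out = [
    ((0.9, 0.1), "space"), ((0.8, 0.2), "space"), ((0.7, 0.3), "space"),
    ((0.1, 0.2), "hockey"), ((0.2, 0.1), "hockey"),
]

x = (0.85, 0.15)  # the test document's feature vector
predictions = [clf(x) for clf in classifiers]
weights = [accuracy(clf, held_out) for clf in classifiers]

random_label = random_selection(predictions, random.Random(0))
vote_label = majority_vote(predictions)
weighted_label = weighted_vote(predictions, weights)
dynamic_label = dynamic_selection(x, classifiers, held_out)
```

In this toy setting the two weaker classifiers outvote the accurate one under majority voting, while the weighted and dynamic schemes both follow the classifier that performs best on the held-out points. This is the kind of gain or loss the combination experiments are meant to measure.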

Since our data set is well labelled, evaluation is quite straightforward, and thus will not be described in detail here.

Timeline

April 21: Project proposal;
April 21 - April 30: Study the textbook [1] and related papers, and choose the final set of classifiers;
April 30 - May 12: Finish coding and obtain results for at least half of the intended experiments;
May 12 - May 22: Finish all the intended experiments;
May 22 - June 2: Analyze and discuss the experimental results; finish the final write-up and poster.

References

[1] Christopher M. Bishop, “Pattern Recognition and Machine Learning”. Springer, 2006.

[2] Y. H. Li and A. K. Jain, “Classification of Text Documents”. The Computer Journal, 41(8):537-546, 1998.

[3] Jason Rennie, “Home Page for 20 Newsgroups Data Set”. MIT CSAIL, http://people.csail.mit.edu/jrennie/20Newsgroups/. April 20, 2009.