Information Classification
CS 34/134 Project Proposal

Mu Lin, Shaohan Hu

April 20, 2009

Background

In today’s world, the amount of information available to people grows at an enormous rate, making it increasingly difficult for people to identify the information they are really interested in. The difficulty is greatest when a piece of information, such as a simple text document, comes without a pre-assigned topic summary. Manually labeling the topics of documents is time-consuming and tedious. There has therefore been growing interest in efficient methods that can label the topics of documents automatically. In this project, we plan to study several existing algorithms and apply them to the document classification problem. We would also like to explore the potential gains and losses of different ways of combining these methods.

Methodology and Data Set

We will carry out our experiments on the 20 Newsgroups data set [3]. First, we will study how several commonly used machine learning methods that are potentially suitable for our task perform on the data set:

Then we would like to study four possible classifier combination approaches:

Since our data set is well labelled, evaluation is straightforward and will not be described in detail here.
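To make the intended workflow concrete, the following is a minimal sketch of training one classifier, scoring it against known labels, and combining the predictions of several classifiers. It is an illustration under stated assumptions rather than the method we commit to: multinomial naive Bayes and majority voting are stand-ins that we chose for the example, the function names are hypothetical, and the toy documents are invented (only the two newsgroup names come from the 20 Newsgroups data set).

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model on whitespace-tokenized text."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(predicted, actual):
    """Fraction of documents whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_vote(votes):
    """Combine per-classifier labels; ties go to the earliest-listed vote."""
    counts = Counter(votes)
    top = max(counts.values())
    return next(v for v in votes if counts[v] == top)


# Invented toy documents, for illustration only.
train_docs = [
    "the team won the hockey game",
    "hockey players skate on ice",
    "the rocket reached orbit",
    "nasa launched a new satellite into orbit",
]
train_labels = ["rec.sport.hockey", "rec.sport.hockey", "sci.space", "sci.space"]
test_docs = ["the hockey team skates", "a satellite in orbit"]
test_labels = ["rec.sport.hockey", "sci.space"]

model = train_nb(train_docs, train_labels)
predictions = [predict_nb(model, doc) for doc in test_docs]
test_accuracy = accuracy(predictions, test_labels)
```

In the actual experiments, each of the methods under study would supply its own prediction per document, and `majority_vote` would then be one simple baseline for combining them, for example `majority_vote(["sci.space", "rec.sport.hockey", "sci.space"])`.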

Timeline

References

[1]   Christopher M. Bishop, “Pattern Recognition and Machine Learning”. Springer, 2006.

[2]   Y. H. Li and A. K. Jain, “Classification of text documents”. The Computer Journal, 41(8):537-546, 1998.

[3]   Jason Rennie, “Home Page for 20 Newsgroups Data Set”. MIT CSAIL, http://people.csail.mit.edu/jrennie/20Newsgroups/. April 20, 2009.