The amount of information available to people is growing at an enormous rate, which makes it
increasingly difficult for them to identify the information they are really interested in. The
difficulty is greatest when a piece of information, for example a simple text document, comes
without a pre-assigned topic summary. Manually labeling the topics of documents is
time-consuming and tedious, so there has been growing interest in efficient methods that can
label the topics of documents automatically. In this project, we plan to study several
existing algorithms and apply them to the document classification problem. We would also
like to explore the potential gains and losses of different ways of combining these methods
to approach the problem.
We will carry out our experiments on the 20 Newsgroups data set [3]. First, we study how several commonly used machine learning methods that are potentially suitable for our task perform on the data set:
Then we would like to study four possible classifier combination approaches:
Since our data set is well labelled, evaluation is straightforward: the predicted labels can be compared
directly against the true labels. It will therefore not be described in detail here.
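As a rough illustration of what such an evaluation could look like, the sketch below compares predicted labels against true labels and reports accuracy and confusion counts. It is a minimal sketch under our own assumptions, not a description of the final evaluation protocol: the function names `accuracy` and `confusion_counts` are hypothetical, and the toy labels are invented for illustration (they are newsgroup names, not experimental results).

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Hypothetical toy data: four documents, one of them misclassified.
y_true = ["sci.space", "rec.autos", "sci.space", "talk.politics.misc"]
y_pred = ["sci.space", "rec.autos", "rec.autos", "talk.politics.misc"]

print("accuracy:", accuracy(y_true, y_pred))
for (true_label, predicted_label), count in sorted(confusion_counts(y_true, y_pred).items()):
    print(f"true={true_label} predicted={predicted_label} count={count}")
```

The confusion counts may be more informative than accuracy alone, since they show which newsgroups are mistaken for one another.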
[1] Christopher M. Bishop, “Pattern Recognition and Machine Learning”, Springer 2006.
[2] Y. H. Li and A. K. Jain, “Classification of Text Documents”, The Computer Journal, 41(8):537-546, 1998.
[3] Jason Rennie, “Home Page for 20 Newsgroups Data Set”. MIT CSAIL, http://people.csail.mit.edu/jrennie/20Newsgroups/. April 20, 2009.