In the digital age, online forums have become a popular venue for people to share interesting content and hold engaging discussions. Unfortunately, some users choose to post material that others consider unacceptable. The goal of this project is to teach a computer to flag such posts automatically.
The text classification algorithms we choose to implement require that the data be in one of two formats.
1) A matrix in which each row represents a post, each column represents a word, and the (i,j) element is the number of times word j appears in post i.
2) A matrix in which each row represents a post and lists the indices of the words that appear in that post.
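As a toy illustration (with a made-up four-word vocabulary and counts, not our actual data), the two formats look like this in C:

```c
/* Toy illustration: a 4-word vocabulary, where column j stands for word j.
   Format #1: the (i,j) entry counts how often word j appears in post i. */
static const int freq[2][4] = {
    {2, 0, 1, 0},   /* post 0 contains word 0 twice and word 2 once */
    {0, 1, 1, 1}    /* post 1 contains words 1, 2 and 3 once each   */
};

/* Format #2: each row lists the indices of the words present in the post,
   padded here with -1 so that both rows have the same length. */
static const int present[2][3] = {
    {0, 2, -1},
    {1, 2,  3}
};
```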
Unfortunately, the data sets available to us are not in these formats. The most processed data set, the Kaggle data set, is in raw text format, e.g. "My name is John Doe". The data must be converted to one of the above formats, which requires keeping a list of every word that occurs across all the posts and counting the number of times each word appears in each post. Teamliquid posts require even more processing, since they come in HTML format.
Parsing the Kaggle data was done with a program written in C (source here). The data appears in CSV (comma-separated values) format. The first few lines of a file are as follows:
The date column is irrelevant for this project, so it was deleted using Microsoft Excel. Then the labels and posts were read into a labels array and a posts array. The posts were normalized by replacing all non-alphabetic characters with spaces and converting all uppercase characters to lowercase. The final and hardest step was creating a specialized data structure to keep track of all the words found so far and to count the number of occurrences in each post.
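The normalization step can be sketched in C roughly as follows (a minimal version of the idea; the function name is ours, not the actual parser's):

```c
#include <ctype.h>
#include <string.h>

/* Normalize a post in place: map every non-alphabetic character to a
   space and every uppercase letter to lowercase. */
void normalize(char *s) {
    for (; *s; s++) {
        if (isalpha((unsigned char)*s))
            *s = (char)tolower((unsigned char)*s);
        else
            *s = ' ';
    }
}
```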
Each word that is found is hashed and, if it has not been seen before, added to the hash table as a word node. To deal with collisions, the word nodes in each slot form a linked list with all the other words that hash to the same value. The post in which the word is found is attached to the word node as a post node, with other posts containing the same word linked in another list. Each post node also records how many times the word has been found in that post.
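A minimal C sketch of this structure, assuming a fixed-size table and simplified memory handling (type and function names are illustrative, not the actual source):

```c
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 4096

typedef struct PostNode {         /* one post in which a word occurs */
    int post_id;
    int count;                    /* occurrences of the word in this post */
    struct PostNode *next;
} PostNode;

typedef struct WordNode {         /* one distinct word */
    char *word;
    PostNode *posts;              /* posts containing the word */
    struct WordNode *next;        /* collision chain in this hash slot */
} WordNode;

static WordNode *table[TABLE_SIZE];

static unsigned hash_word(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31u + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Record one occurrence of `word` in post `post_id`. */
void add_occurrence(const char *word, int post_id) {
    unsigned h = hash_word(word);
    WordNode *w;
    for (w = table[h]; w; w = w->next)
        if (strcmp(w->word, word) == 0)
            break;
    if (!w) {                     /* word not seen before: new word node */
        w = malloc(sizeof *w);
        w->word = malloc(strlen(word) + 1);
        strcpy(w->word, word);
        w->posts = NULL;
        w->next = table[h];
        table[h] = w;
    }
    for (PostNode *p = w->posts; p; p = p->next)
        if (p->post_id == post_id) {
            p->count++;           /* word already seen in this post */
            return;
        }
    PostNode *p = malloc(sizeof *p);  /* first occurrence in this post */
    p->post_id = post_id;
    p->count = 1;
    p->next = w->posts;
    w->posts = p;
}

/* Look up how many times `word` was recorded in `post_id` (0 if never). */
int occurrence_count(const char *word, int post_id) {
    for (WordNode *w = table[hash_word(word)]; w; w = w->next)
        if (strcmp(w->word, word) == 0)
            for (PostNode *p = w->posts; p; p = p->next)
                if (p->post_id == post_id)
                    return p->count;
    return 0;
}
```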
After all of the words in all of the posts have been added to the data structure, a word list is compiled by iterating through the word nodes in every hash slot. Each word is assigned a number, and a matrix in format #1 is created by iterating through the post nodes. Converting the matrix to format #2 is then a simple matter of scanning each row for the words that occur in that post.
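The per-row conversion to format #2 amounts to collecting the indices of nonzero counts; a minimal sketch (names are illustrative):

```c
/* Convert one row of a format #1 frequency matrix into format #2:
   write the indices of the words with nonzero counts into `out` and
   return how many there are. */
int row_to_format2(const int *counts, int n_words, int *out) {
    int k = 0;
    for (int j = 0; j < n_words; j++)
        if (counts[j] > 0)
            out[k++] = j;         /* word j appears in this post */
    return k;
}
```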
Due to the huge number of distinct words found (on the order of 10,000), not all of them can be used during computation. Therefore, for convenience, any word that occurred fewer than 10 times across all posts was excluded from the matrix.
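Assuming per-word totals across all posts are available, the cutoff can be applied as in this sketch (the threshold of 10 is the one described above; the function name is ours):

```c
/* Given each word's total count across all posts, collect the column
   indices of the words that meet the minimum-frequency cutoff. */
int select_frequent(const int *totals, int n_words, int min_count, int *keep) {
    int k = 0;
    for (int j = 0; j < n_words; j++)
        if (totals[j] >= min_count)
            keep[k++] = j;
    return k;
}
```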
Parsing Teamliquid involved all the work needed to parse Kaggle data, with the additional steps of downloading HTML pages and extracting posts from those pages. The HTML pages were downloaded by looping through Teamliquid thread URLs, which are always in the format http://www.teamliquid.net/forum/viewmessage.php?topic_id=[thread id number], where [thread id number] is the thread's unique identifier. These URLs were fed to the Linux command line utility wget. (source here)
After the web pages were downloaded, they had to be parsed for posts. Posts on Teamliquid are delimited by the HTML tags <td class='forumPost' width='100%'> and </td>, so the parser we wrote looks for these tags and writes the content between them to separate files (one file per post). Afterwards, the posts were converted to matrices with the same procedure as the Kaggle data (source here).
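The tag-delimited extraction can be sketched with standard strstr calls; this simplified version pulls out a single post (the real parser loops over every post in a page):

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of the text between the first `open`
   tag in `html` and the next `close` tag, or NULL if either is missing. */
char *extract_between(const char *html, const char *open, const char *close) {
    const char *start = strstr(html, open);
    if (!start)
        return NULL;
    start += strlen(open);
    const char *end = strstr(start, close);
    if (!end)
        return NULL;
    size_t len = (size_t)(end - start);
    char *out = malloc(len + 1);
    memcpy(out, start, len);
    out[len] = '\0';
    return out;
}
```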
The first models we used were naive Bayes and decision trees, because they are simple models that we have already implemented for class homework. We applied them to the Kaggle data, and the results along with the top performance in the Kaggle competition (12% classification error[1]) provide a benchmark for our future work. The following is a summary of our results.
In the above graph, the parameter C refers to the minimum size a node in the decision tree can have, as a percentage of the training set. Information gain was the metric used for feature selection in the decision tree, and the decision rule tested whether the feature selected at the branch is greater than 0. As you can see, the performance of these basic algorithms has a long way to go to approach the Kaggle-best 12% error rate.
An issue with our project proposal was that we did not include a method that was not covered in class. We addressed this problem by researching other algorithms shown to build efficient and accurate text classifiers from large and noisy data sets. We settled on the RIPPER algorithm as our method of choice to try to approach the 12% error rate of the winning entry in the Kaggle competition.[1][2]
While we were aware of an open-source implementation of RIPPER ("JRip") available in the WEKA machine learning software, we decided to write our own implementation of the algorithm based on the pseudocode provided by Cohen in his original description [2]. We used several other sources to fill in details of the IREP portion of the algorithm and of our minimum description length (MDL) estimating equation [2][3][4]. We decided to program the algorithm from scratch for a few reasons: 1) we wanted the experience of programming a complex and lengthy algorithm, 2) we felt it was more in line with the term project's guidelines, and 3) we wanted to implement the algorithm in MATLAB and thus have access to MATLAB's high-level functions and figure-generating abilities.
RIPPER works in three stages: a rule-growth stage (operating on 2/3 of the training data), a pruning stage (operating on the remaining 1/3), and an optimization stage, which is used to prevent both overfitting and premature termination of the algorithm [2]. Rules are grown greedily and are pruned before being added to the rule set. The optimization stage makes use of an MDL heuristic that allows the algorithm to choose between competing rules and hypothesized rules. In this stage, the algorithm attempts to compress the rule set by identifying and eliminating rules that increase its description length. RIPPER then grows two competing hypotheses for each rule and selects the one with the lowest description length. The optimization phase is run twice in order to reprocess and grow rules on portions of the data set that were left uncovered by the pruning and optimization processes [2].
This repeated rule growth and optimization requires many passes over the matrix containing the data. To reduce the computational cost, it is recommended that the data for RIPPER be formatted as a matrix of "strings," each corresponding to one forum post in our data set [2]. Each string contains the "words" that appeared in the post, with each word represented by a unique number (e.g. "loser" = 119, "dumb" = 34, etc.). To standardize the length of the row vectors in the matrix, we pad each row with zeros after the post's words.
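Assuming word IDs start at 1 so that 0 can unambiguously mean padding (an assumption on our part), padding a post's word list to a fixed row width looks like:

```c
#include <string.h>

/* Copy a post's word IDs into a row of fixed width, filling the unused
   tail with zeros. IDs are assumed to start at 1 so that 0 can only
   mean padding. */
void pad_row(const int *ids, int n_ids, int *row, int width) {
    memset(row, 0, (size_t)width * sizeof *row);
    memcpy(row, ids, (size_t)n_ids * sizeof *ids);
}
```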
We ran into problems implementing the algorithm due to the difficulty of manipulating this awkward data structure. Our MATLAB code currently contains 12 different functions and is quite lengthy, and we are working to pin down the specific bugs that must be fixed before we can begin testing our RIPPER model. We have identified several problems with how the algorithm processes and removes rows from the data set and rule set during the rule-growth and pruning phases. We have set a deadline of this Sunday to achieve full functionality of the model and feel we are in a good position to do so.
We are currently behind our milestone goal of having three algorithms implemented and run on the data. Although most of the coding is done, there is a lot of debugging needed to make the code actually run, and then a lot of refining to get good results. Since unacceptable forum posts are relatively rare, we are currently looking into better feature-selection methods than frequency-based selection, such as chi-squared and mutual information. There are also a few adjustments to the RIPPER algorithm we want to experiment with, such as different interpretations of the MDL heuristic and utilizing ensemble methods.