In the digital age, online forums have become a popular venue for people to share interesting content and hold engaging discussions. Unfortunately, some users choose to post material that others consider unacceptable. The goal of this project is to teach a computer to identify insulting posts.
The work of converting the raw CSV data from the Kaggle website into a matrix structure was done by my initial project partner. The text classification algorithms I chose to implement require the data to be in one of two formats.
1) A matrix such that each row represents a different post, each column represents a different word, and the (i,j) element is the frequency with which word j appears in post i.
2) A matrix such that each row represents a different post and contains the indices of the words that appear in that post.
After all of the words in all of the posts have been added to the data structure, a word list is compiled by iterating through all of the word nodes in all of the hash slots. Each word is assigned a number, and a matrix in format #1 is created by iterating through all of the post nodes. Converting the matrix to format #2 is then a simple matter of iterating through the matrix and recording which words occur in each post.
Due to the huge number of distinct words found (on the order of 10,000), not all words can be used during computation. Therefore, any word that occurs fewer than 10 times across all posts was excluded from the matrix.
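To make this preprocessing concrete, the following is a minimal Python sketch of the steps described above, assuming the posts have already been read from the CSV file into a list of strings. The names (posts, MIN_COUNT, build_matrices) are illustrative and are not taken from the actual project code.

```python
from collections import Counter

MIN_COUNT = 10  # words occurring fewer than 10 times across all posts are discarded

def build_matrices(posts):
    """posts: a list of strings, one per forum post (illustrative input format)."""
    # Count every word across all posts and keep only the frequent ones.
    counts = Counter(word for post in posts for word in post.lower().split())
    vocab = sorted(w for w, c in counts.items() if c >= MIN_COUNT)
    index = {w: j for j, w in enumerate(vocab)}

    # Format 1: rows are posts, columns are words, and entry (i, j) is the
    # frequency of word j in post i.
    freq_matrix = []
    for post in posts:
        row = [0] * len(vocab)
        for word in post.lower().split():
            if word in index:
                row[index[word]] += 1
        freq_matrix.append(row)

    # Format 2: each row lists the indices of the words present in that post.
    index_lists = [[j for j, c in enumerate(row) if c > 0] for row in freq_matrix]
    return vocab, freq_matrix, index_lists
```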
The first models I used were Naive Bayes and decision trees, because they are simple models that I had already implemented for class homework. I applied them to the Kaggle data, and the results, along with the top performance in the Kaggle competition (12% classification error [1]), provide a benchmark for my subsequent work.
An issue with my initial project proposal was that it did not include a method that was not covered in class. I addressed this problem by researching other algorithms that have been shown to build efficient and accurate text classifiers on large, noisy data sets. I settled on the RIPPER2 algorithm as the method with which to try to approach the performance of the winning entry in the Kaggle competition [1][2].
While I was aware of an open-source implementation of RIPPER2 ("JRip") available in the WEKA machine learning software, I decided to write my own implementation based on the pseudocode provided by Cohen in his original description of the algorithm [2]. I used several other sources to fill in details of the IREP* portion of the algorithm and to develop my minimum description length (MDL) estimating equation [2][3][4].
RIPPER2 works in three stages: a rule-growth stage (operating on 2/3 of the training data), a pruning stage (operating on the remaining 1/3 of the training data), and an optimization stage, which is used to prevent both overfitting and premature termination of the algorithm [2]. Rules are grown greedily to maximize a measure of information gain called FOIL information gain. Each grown rule is then pruned: every candidate pruning of the rule is generated and evaluated on the pruning set, and the version with the best performance on the pruning set is added to the rule set. The optimization stage makes use of an MDL heuristic that allows the algorithm to choose between competing rules and hypothesized rules. In this stage, the algorithm attempts to compress the rule set by identifying and eliminating rules that increase the description length of the rule set. Finally, RIPPER2 grows two competing hypotheses for each rule and selects the one with the lowest description length. The optimization phase is run twice in order to reprocess and grow rules on portions of the data set that were left uncovered by the pruning and optimization processes [2].
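As a concrete illustration of the rule-growth criterion, below is a small sketch of FOIL information gain following its standard definition; it is not the project code itself, and the function name and the example numbers are illustrative.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL information gain for adding one condition to a rule.

    p0, n0: positive/negative growing-set examples covered before the condition
    p1, n1: positive/negative growing-set examples covered after the condition
    """
    if p0 == 0 or p1 == 0:
        return 0.0  # a rule covering no positives gains nothing
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)

# Example: a rule covering 40 positives and 60 negatives is extended by a
# condition so that it covers 30 positives and only 5 negatives.
print(foil_gain(40, 60, 30, 5))  # positive gain, so the condition would be kept
```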
Area under the curve (AUC) is a measure designed to test the usefulness or power of a predictive model. AUC is computed by determining the area under the Receiver Operating Characteristic (ROC) curve, which is generated by plotting the true positive rate (true positives / total positives) against the false positive rate (false positives / total negatives). A test with an AUC of .50 or less is functionally useless. The winning Kaggle entry had an AUC of .83. The graph below shows the ROC curves for each test; the AUC and error values are tabulated below. [1][5]
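For reference, here is a rough sketch of how the ROC points and the corresponding AUC can be computed from scored predictions. It is a generic trapezoidal-rule implementation, not the code used to generate the figure, and it assumes real-valued scores and 0/1 labels.

```python
def roc_auc(scores, labels):
    """Compute ROC points and AUC from real-valued scores and 0/1 labels."""
    total_pos = sum(labels)
    total_neg = len(labels) - total_pos
    # Sweep the decision threshold from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / total_neg, tp / total_pos))
    # AUC is the area under the piecewise-linear ROC curve (trapezoidal rule).
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

# Toy example: an AUC near 1.0 indicates a strong ranking, while an AUC of
# .50 corresponds to random guessing.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])[1])  # prints 1.0
```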
The results differ from those I presented at the project presentation. I discovered errors in tabulating the error rates for the RIPPER2, IREP*, and Decision Tree models: the RIPPER2 and IREP* models were functioning properly, but the error counts were being tabulated incorrectly. I felt that this was a necessary correction to make for the project writeup, and the change in error tabulation certainly impacted the results seen in the table and graph above. First, Naive Bayes is clearly the best method that I tested. The Decision Tree and RIPPER2 models are competitive in terms of AUC, but the Decision Tree model is far superior in terms of error rate. Finally, the IREP* method is essentially useless as a test, with an AUC of around .5 for all values of C, the minimum description length constant.
At this point it is important to discuss the minimum description length weighting constant. When I first obtained my results from RIPPER2, I was not pleased with them and wanted to look for ways to improve the model. In his 1995 paper, Cohen indicates that false positives and false negatives can be given different weights by adding a weighting factor to the corresponding component of the minimum description length equation. I experimented with four values of a constant C that weights the false positives in the minimum description length function: C = .5, 1, 1.5, and 2.0. The baseline is C = 1, and it is the value for which I ultimately attained the best results, as seen in the image below.
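To illustrate how such a constant can enter the cost, the sketch below shows one way a false-positive weight might be attached to the exception-coding portion of an MDL-style description length. The exact equation I used follows [2][3][4] and is not reproduced here; the function weighted_exception_bits and the numbers in the example are hypothetical approximations of the idea, not the project's formula.

```python
import math

def log2_binomial(n, k):
    """Bits needed to identify which k of n examples are exceptions: log2(n choose k)."""
    return math.log2(math.comb(n, k))

def weighted_exception_bits(covered, fp, uncovered, fn, c=1.0):
    """Exception portion of an MDL-style cost, with false positives weighted by c.

    covered / uncovered: number of examples the rule set does / does not cover
    fp: covered examples that are actually negative (false positives)
    fn: uncovered examples that are actually positive (false negatives)
    c:  hypothetical weighting constant; c > 1 penalizes false positives more
    """
    return c * log2_binomial(covered, fp) + log2_binomial(uncovered, fn)

# With c = 2.0, a rule set producing many false positives is charged a longer
# description length, pushing pruning and optimization toward more conservative rules.
print(weighted_exception_bits(covered=500, fp=40, uncovered=1474, fn=120, c=2.0))
```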
I did not include the results for IREP* because the AUC values for this method were approximately .50 for all values of C. My conclusions from the different weighting values of C are as follows: as expected, there is a serious decrease in performance when the weighting treats false negatives as worse than false positives (i.e., C = .5). This is likely due to the dominance of negative examples in the data set; the positive instances (insults) make up only .2796 of the test set. Accordingly, performance generally improves as the weighting factor treats false positives as more damaging. However, the performance gains are mild when the AUC results are compared to Naive Bayes.
Finally, it is necessary to consider why the overall results favored Naive Bayes over the rule learners (RIPPER2 and IREP*) and the Decision Tree model. After following the performance of the RIPPER2 model closely, I determined that the data set itself favors the Naive Bayes model. The training set is small, with only 1974 examples and 480 positive examples. The RIPPER2 (and IREP*) models break this already small training set into growth and pruning sets, and the pruning set tended to contain only around 180 positive examples. What I determined was happening is that the RIPPER2 and IREP* models were essentially over-pruning, whereas the Decision Tree was overfitting. This is supported by the very similar training and test-set errors for the RIPPER2 and IREP* models, while the Decision Tree model had a greater disparity between training and test error, indicating overfitting. The over-pruning of the RIPPER2 and IREP* models was due to exceedingly general rules being selected during the pruning stage, because there were not enough positive examples for more specific rules to be chosen. Therefore, the size of the data set and the requirement that the RIPPER2 and IREP* models split the training set into a very small pruning set rendered these models very poor choices for this particular challenge. Naive Bayes, on the other hand, with its independence assumptions and its ability to factor many different words into the decision for each example, did much better because it was not forced to commit to a rule set or decision tree in order to make decisions. Thus, with small, very irregular, high-dimensional data sets such as this one, straightforward rule-learning models are very poor choices, and Naive Bayes should be considered along with other models such as SVMs, AdaBoost, and Random Forests, all of which allow for more general decision rules that are not as rigidly stated as a three-word rule.
The challenge of completing this project on my own was significant, and I am proud of the work that I did despite the lackluster results of my models of choice (RIPPER2 and IREP*). I demonstrated the weaknesses of rule learners on small data sets and showed that over-pruning can be as damaging as overfitting in some cases.