In the digital age, online forums have become a popular venue for people to share interesting content and hold engaging discussions. Some forums, such as reddit.com, are so popular that they attract millions of visitors each day and host hundreds of millions of user posts [1]. Although forums rely on posts to attract users, some posts detract from a forum's appeal and are therefore considered unacceptable. Posting standards vary from forum to forum, but some traits of unacceptable posts include [2][3]:
To maintain their integrity, many forums rely on moderators who spend countless hours reading through every new post and manually determining which posts are unacceptable. This is a significant drain on human effort that should be avoided if possible.
The aim of this project is to train the computer, via a learning algorithm, to estimate the probability that a given post is unacceptable, and then to automatically flag posts deemed unacceptable.
We quickly realized that the challenges associated with our project are similar to those involved in spam email filtering systems. Both involve text classification with the goal of automatically removing "positive" results, in our case unacceptable forum posts. Research into text classification indicated that naive Bayes and logistic regression are two "classic" methods for building spam filtering systems. We will therefore begin by developing naive Bayes and logistic regression models for our automated forum moderator, employing smoothing terms where necessary. Because the naive Bayes model is relatively simple and its maximum likelihood parameter estimates can be computed in closed form, we will develop it first. We will then move on to a logistic regression model, a slightly more difficult task because its maximum likelihood estimates have no closed form and must be found by gradient ascent on the log-likelihood. Once we have developed these two models, we will analyze their absolute and comparative accuracy and efficiency. Sketches of both approaches appear below.

We also plan to adapt more novel methods of spam filtering to the problem of automated forum moderation. Recent research indicates that methods such as random forests, support vector machines, and boosted decision trees may lead to better model accuracy, speed, and portability. With this in mind, we will choose one of these methods to attempt to improve on the naive Bayes and logistic regression models. At present we are interested in using random forests, both because research trends point toward increasing use of this model and because random forest algorithms are really cool.
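To make the naive Bayes stage concrete, the following is a minimal sketch of a multinomial naive Bayes classifier with Laplace (add-one) smoothing, one common choice of smoothing term. The whitespace tokenizer, the toy training data, and the flagging threshold are illustrative assumptions, not details fixed by the project.

```python
import math
from collections import Counter

def tokenize(text):
    # Crude whitespace tokenizer; a real moderator would use something richer.
    return text.lower().split()

def train_naive_bayes(posts, labels, alpha=1.0):
    """Fit multinomial naive Bayes with Laplace smoothing (alpha = 1)."""
    classes = set(labels)
    priors = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for post, label in zip(posts, labels):
        word_counts[label].update(tokenize(post))
    vocab = {w for counts in word_counts.values() for w in counts}
    # Smoothed log-probability of each word given each class.
    log_likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values())
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return priors, log_likelihood, vocab

def prob_unacceptable(post, priors, log_likelihood, vocab):
    """Posterior probability that a post belongs to the 'unacceptable' class."""
    scores = {}
    for c in priors:
        score = priors[c]
        for w in tokenize(post):
            if w in vocab:  # unseen words are ignored
                score += log_likelihood[c][w]
        scores[c] = score
    # Normalize the log scores into a posterior probability.
    m = max(scores.values())
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    return exp["unacceptable"] / sum(exp.values())

# Toy example with hypothetical posts and labels.
posts = ["buy cheap pills now", "great discussion thread", "cheap pills cheap"]
labels = ["unacceptable", "acceptable", "unacceptable"]
model = train_naive_bayes(posts, labels)
p = prob_unacceptable("cheap pills here", *model)
print(f"P(unacceptable) = {p:.3f}")  # flag the post if p exceeds a chosen threshold
```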
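Similarly, the sketch below shows one way the logistic regression parameters could be fit by batch gradient ascent on the log-likelihood. The bag-of-words feature matrix, the learning rate, and the iteration count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Maximize the logistic regression log-likelihood by batch gradient ascent.

    X: (n_samples, n_features) feature matrix (e.g., bag-of-words counts)
    y: (n_samples,) labels, 1 = unacceptable, 0 = acceptable
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)     # predicted P(unacceptable)
        grad_w = X.T @ (y - p)     # gradient of the log-likelihood w.r.t. w
        grad_b = np.sum(y - p)     # gradient w.r.t. the bias term
        w += lr * grad_w / len(y)  # step uphill on the log-likelihood surface
        b += lr * grad_b / len(y)
    return w, b

# Toy example on hypothetical two-feature data
# (feature 0: count of spammy words, feature 1: count of ordinary words).
X = np.array([[3.0, 0.0], [0.0, 4.0], [2.0, 1.0], [0.0, 3.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w, b = fit_logistic_regression(X, y)
print(sigmoid(X @ w + b))  # per-post probabilities of being unacceptable
```

Unlike naive Bayes, there is no closed-form solution here, which is why the loop repeatedly steps in the direction of the log-likelihood gradient until the parameters converge.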
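If we do pursue random forests, an off-the-shelf implementation would likely suffice. The sketch below, assuming scikit-learn and simple bag-of-words features, shows roughly what that stage could look like; the data and hyperparameters are again purely illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training data; real training would use labeled forum posts.
posts = ["buy cheap pills now", "great discussion thread",
         "cheap pills cheap", "interesting article, thanks"]
labels = [1, 0, 1, 0]  # 1 = unacceptable, 0 = acceptable

# Turn raw posts into bag-of-words count vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(posts)

# 100 trees is an arbitrary illustrative choice.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, labels)

new_posts = ["cheap pills here"]
probs = forest.predict_proba(vectorizer.transform(new_posts))[:, 1]
print(probs)  # estimated P(unacceptable) for each new post
```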