In the digital age, online forums have become a popular venue for people to share interesting content and hold engaging discussions. Some forums, such as reddit.com, are so popular that they attract millions of visitors each day and host hundreds of millions of user posts [1]. Although forums rely on posts to attract users, some posts detract from a forum's appeal and are therefore considered unacceptable. Posting standards vary from forum to forum, but some traits of unacceptable posts include [2][3]:
To maintain their integrity, many forums rely on moderators who spend countless hours reading through every new post and manually determining which posts are unacceptable. This is a significant drain on human effort that should be avoided if possible.
The aim of this project is to train the computer, via a learning algorithm, to estimate the probability that a given post is unacceptable, and then to automatically flag posts deemed unacceptable.
We quickly realized that the challenges associated with our project are similar to those involved in spam email filtering systems. Both involve text classification with the goal of automatically removing "positive" results, in our case unacceptable forum posts. Research into text classification indicated that naive Bayes and logistic regression are two "classic" methods for building spam filtering systems. We will therefore begin by developing naive Bayes and logistic regression models for our automated forum moderator, employing smoothing terms where necessary. Because the naive Bayes model is relatively simple and its maximum likelihood parameter estimates can be computed in closed form, we will develop it first. We will then move on to a logistic regression model, a slightly more difficult task because its maximum likelihood estimates have no closed form and must be found by gradient ascent on the log-likelihood. Once we have developed these two models, we will analyze their absolute and comparative accuracy and efficiency. Sketches of both approaches appear below.

We also plan to adapt more novel methods of spam filtering to the problem of automated forum moderation. Recent research indicates that methods such as random forests, support vector machines, and boosted decision trees may lead to better model accuracy, speed, and portability. With this in mind, we will choose one of these methods to attempt to improve on the naive Bayes and logistic regression models. At present we are interested in using random forests, both because research trends point toward increasing use of this model and because random forest algorithms are really cool.
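To make the naive Bayes stage concrete, the following is a minimal sketch of a multinomial naive Bayes classifier with Laplace (add-one) smoothing, one common choice of smoothing term. The whitespace tokenizer, the toy training data, and the flagging threshold are illustrative assumptions, not details fixed by the project.

```python
import math
from collections import Counter

def tokenize(text):
    # Crude whitespace tokenizer; a real moderator would use something richer.
    return text.lower().split()

def train_naive_bayes(posts, labels, alpha=1.0):
    """Fit multinomial naive Bayes with Laplace smoothing (alpha = 1)."""
    classes = set(labels)
    priors = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for post, label in zip(posts, labels):
        word_counts[label].update(tokenize(post))
    vocab = {w for counts in word_counts.values() for w in counts}
    # Smoothed log-probability of each word given each class.
    log_likelihood = {}
    for c in classes:
        total = sum(word_counts[c].values())
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return priors, log_likelihood, vocab

def prob_unacceptable(post, priors, log_likelihood, vocab):
    """Posterior probability that a post belongs to the 'unacceptable' class."""
    scores = {}
    for c in priors:
        score = priors[c]
        for w in tokenize(post):
            if w in vocab:  # unseen words are ignored
                score += log_likelihood[c][w]
        scores[c] = score
    # Normalize the log scores into a posterior probability.
    m = max(scores.values())
    exp = {c: math.exp(s - m) for c, s in scores.items()}
    return exp["unacceptable"] / sum(exp.values())

# Toy example with hypothetical posts and labels.
posts = ["buy cheap pills now", "great discussion thread", "cheap pills cheap"]
labels = ["unacceptable", "acceptable", "unacceptable"]
model = train_naive_bayes(posts, labels)
p = prob_unacceptable("cheap pills here", *model)
print(f"P(unacceptable) = {p:.3f}")  # flag the post if p exceeds a chosen threshold
```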
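Similarly, the sketch below shows one way the logistic regression parameters could be fit by batch gradient ascent on the log-likelihood. The bag-of-words feature matrix, the learning rate, and the iteration count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Maximize the logistic regression log-likelihood by batch gradient ascent.

    X: (n_samples, n_features) feature matrix (e.g., bag-of-words counts)
    y: (n_samples,) labels, 1 = unacceptable, 0 = acceptable
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)     # predicted P(unacceptable)
        grad_w = X.T @ (y - p)     # gradient of the log-likelihood w.r.t. w
        grad_b = np.sum(y - p)     # gradient w.r.t. the bias term
        w += lr * grad_w / len(y)  # step uphill on the log-likelihood surface
        b += lr * grad_b / len(y)
    return w, b

# Toy example on hypothetical two-feature data
# (feature 0: count of spammy words, feature 1: count of ordinary words).
X = np.array([[3.0, 0.0], [0.0, 4.0], [2.0, 1.0], [0.0, 3.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w, b = fit_logistic_regression(X, y)
print(sigmoid(X @ w + b))  # per-post probabilities of being unacceptable
```

Unlike naive Bayes, there is no closed-form solution here, which is why the loop repeatedly steps in the direction of the log-likelihood gradient until the parameters converge.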
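If we do pursue random forests, an off-the-shelf implementation would likely suffice. The sketch below, assuming scikit-learn and simple bag-of-words features, shows roughly what that stage could look like; the data and hyperparameters are again purely illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training data; real training would use labeled forum posts.
posts = ["buy cheap pills now", "great discussion thread",
         "cheap pills cheap", "interesting article, thanks"]
labels = [1, 0, 1, 0]  # 1 = unacceptable, 0 = acceptable

# Turn raw posts into bag-of-words count vectors.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(posts)

# 100 trees is an arbitrary illustrative choice.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, labels)

new_posts = ["cheap pills here"]
probs = forest.predict_proba(vectorizer.transform(new_posts))[:, 1]
print(probs)  # estimated P(unacceptable) for each new post
```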