Predicting Stocks Based on Twitter Sentiment

Robert Denz

Introduction

The United States stock market is hard to predict. Some of this unpredictability relates to investor sentiment, which can be affected by any number of factors [1]. However, the growth of social networks in recent years has produced an abundance of public information about a population's sentiment.

One of these social networks is Twitter. Twitter allows its users to post messages of 140 characters or fewer that can be seen by anyone on the internet. Many users post personal information pertaining to their mood. By collecting enough information on a large group of users, a population's day-to-day sentiment can be modeled.

This sentiment model, along with historical stock market data, can then be used to train a machine learning model. Specifically, the model can be trained on the relationship between Twitter sentiment and stock market data. With enough training, it should be able to make predictions about the future of the stock market based on Twitter sentiment.

Method

The first step in this process will be to parse out information pertaining to a user's mood from the Twitter dataset. Once the data is sorted, it can then be tallied to find an estimate of the population's sentiment. Next, an advancement of the Support Vector Machine, known as a Support Vector Regression Machine (SVRM), will be implemented [2]. An SVRM is used because predicting the stock market is a regression problem.
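As a rough illustration of the regression step, the sketch below fits a linear model under the epsilon-insensitive loss that characterizes support vector regression. It is not the formulation of [2], which solves a constrained quadratic program; plain subgradient descent on the primal objective is substituted here to keep the example short. The function names, hyperparameter values, and the toy data are all assumptions made for illustration.

```python
def epsilon_insensitive_loss(residual, epsilon):
    """Errors smaller than epsilon cost nothing; larger ones cost linearly."""
    return max(0.0, abs(residual) - epsilon)


def predict(weights, bias, features):
    """Linear prediction: dot(weights, features) + bias."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def fit_linear_svr(X, y, epsilon=0.1, reg=0.001, lr=0.01, epochs=5000):
    """Minimize (reg/2)*||w||^2 + mean epsilon-insensitive loss.

    Uses full-batch subgradient descent (an assumption of this sketch,
    not the quadratic-programming solver described in [2]).
    """
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [reg * w for w in weights]  # gradient of the regularizer
        grad_b = 0.0
        for features, target in zip(X, y):
            residual = predict(weights, bias, features) - target
            if abs(residual) <= epsilon:
                continue  # inside the tube: contributes no gradient
            sign = 1.0 if residual > 0 else -1.0
            for j in range(d):
                grad_w[j] += sign * features[j] / n
            grad_b += sign / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Synthetic toy data (NOT market data): targets follow y = 2x + 1.
toy_X = [[-1.0], [-0.5], [0.0], [0.5], [1.0]]
toy_y = [2.0 * row[0] + 1.0 for row in toy_X]
toy_weights, toy_bias = fit_linear_svr(toy_X, toy_y)
print(predict(toy_weights, toy_bias, [0.5]))
```

In the project itself, each feature vector would hold sentiment measurements and each target a stock market value. A kernelized SVRM, as in [2], would additionally allow nonlinear relationships.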

Datasets

Twitter data will come from over 470 million tweets that were collected by the Stanford Network Analysis Project [3].
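The parsing and tallying step described in the Method section might look roughly like the sketch below. It assumes each tweet is stored as three tab-separated lines tagged T (timestamp), U (user), and W (text); that layout should be checked against the actual dataset files. The word lists are hypothetical placeholders, not a validated mood lexicon, and the sample records are invented.

```python
from collections import defaultdict

# Hypothetical mood lexicons; a real study would need a validated word list.
POSITIVE_WORDS = {"happy", "great", "good", "excited"}
NEGATIVE_WORDS = {"sad", "angry", "bad", "worried"}


def parse_records(lines):
    """Yield (date, text) pairs from T/U/W-tagged lines (assumed layout)."""
    date = None
    for line in lines:
        tag, _, value = line.rstrip("\n").partition("\t")
        if tag == "T":
            date = value.split(" ")[0]  # keep only the YYYY-MM-DD part
        elif tag == "W" and date is not None:
            yield date, value


def score_tweet(text):
    """Return +1, 0, or -1 depending on which word list dominates."""
    words = [w.strip(".,!?;:\"'()").lower() for w in text.split()]
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    return (pos > neg) - (pos < neg)


def daily_sentiment(lines):
    """Average the per-tweet scores for each calendar day."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for date, text in parse_records(lines):
        totals[date] += score_tweet(text)
        counts[date] += 1
    return {day: totals[day] / counts[day] for day in counts}


# Invented sample records, for illustration only.
sample_lines = [
    "T\t2009-06-11 16:56:42",
    "U\thttp://twitter.com/user_a",
    "W\tSo happy today, great news!",
    "T\t2009-06-11 17:01:10",
    "U\thttp://twitter.com/user_b",
    "W\tFeeling sad and worried.",
    "T\t2009-06-12 09:00:00",
    "U\thttp://twitter.com/user_c",
    "W\tGood morning everyone",
]
print(daily_sentiment(sample_lines))
```

Averaging rather than summing keeps days with different tweet volumes comparable. Whether that is the right normalization for this project is an open design choice.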

Historical stock market data will be obtained from Yahoo Finance.
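To train on the two datasets together, daily sentiment has to be aligned with daily price movements. The sketch below shows one possible alignment, assuming the price history is a CSV export with at least `Date` (YYYY-MM-DD) and `Close` columns. It pairs each day's sentiment with the return of the next trading day, which is only one of several reasonable choices. The column names, function names, and sample values are assumptions for illustration, not real market data.

```python
import csv
import io


def daily_returns(csv_text):
    """Map each date to its close-to-close return versus the prior row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda r: r["Date"])  # ISO dates sort chronologically
    returns = {}
    for prev, cur in zip(rows, rows[1:]):
        prev_close = float(prev["Close"])
        cur_close = float(cur["Close"])
        returns[cur["Date"]] = (cur_close - prev_close) / prev_close
    return returns


def build_training_pairs(sentiment_by_day, returns_by_day):
    """Pair each day's sentiment with the next trading day's return."""
    trading_days = sorted(returns_by_day)
    pairs = []
    for day in sorted(sentiment_by_day):
        later = [d for d in trading_days if d > day]
        if later:  # skip days with no subsequent trading day in the data
            pairs.append((sentiment_by_day[day], returns_by_day[later[0]]))
    return pairs


# Invented prices and sentiment values, for illustration only.
sample_csv = """Date,Open,High,Low,Close,Volume
2009-06-11,100,102,99,100.0,1000
2009-06-12,100,103,99,102.0,1000
2009-06-15,102,104,100,99.96,1000
"""
sample_sentiment = {"2009-06-11": 0.0, "2009-06-12": 1.0}
sample_pairs = build_training_pairs(sample_sentiment, daily_returns(sample_csv))
print(sample_pairs)
```

Looking up the next trading day, rather than the next calendar day, lets sentiment from Fridays and weekends map onto Monday's return.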

Timeline

4/12 - 4/19 Process Twitter and stock data

4/19 - 5/1 Develop SVRM code

5/1 - 5/5 Begin testing

5/5 - 5/8 Project milestone report

5/9 - 5/22 Optimize datasets and SVRM code

5/22 - 5/29 Final Report: Document the methodology and results

References

1. W. Lee, C. Jiang, and D. Indro. Stock Market Volatility, Excess Returns, and the Role of Investor Sentiment. In Journal of Banking & Finance, Volume 26, Issue 12, 2002, pages 2277-2299.

2. H. Drucker, C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support Vector Regression Machines. In Advances in Neural Information Processing Systems 9, NIPS 1996, pages 155-161, MIT Press.

3. J. Yang and J. Leskovec. Patterns of Temporal Variation in Online Media. In ACM International Conference on Web Search and Data Mining (WSDM '11), 2011.