Smart Stock Prediction

Chanjuan Wen

1 Goal

This project has two goals:

Using stock price data from the previous week, we apply supervised machine learning algorithms to predict each stock's price for the next day.
After defining a set of features that measure stock performance, we apply unsupervised machine learning algorithms to identify a set of well-performing stocks.

2 Methods

2.1 Supervised Learning Algorithm

For the preliminary analysis, I mainly use the Support Vector Machine (SVM) for prediction. SVM is widely regarded as one of the most effective supervised learning algorithms for data analysis and pattern recognition [1]. In this project, we use SVM to predict the next day's stock price. Using the price data of the previous week, we train SVMs with the following four kernels:

Linear kernel: K(x^(i), x^(j)) = (x^(i))^T x^(j) + c
Quadratic kernel: K(x^(i), x^(j)) = 1 − ||x^(i) − x^(j)|| / (||x^(i) − x^(j)|| + c)
Polynomial kernel: K(x^(i), x^(j)) = (a (x^(i))^T x^(j))^d
Gaussian kernel: K(x^(i), x^(j)) = exp(−||x^(i) − x^(j)||² / (2σ²))
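The four kernels listed above can be written as plain functions. This is a minimal sketch, not the code used for the experiments: the function names and the default values of c, a, d and sigma are assumptions chosen for illustration, and the quadratic kernel follows the formula exactly as written in this report.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def dist(x, y):
    """Euclidean distance ||x - y||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linear_kernel(x, y, c=0.0):
    # K(x, y) = x^T y + c  (default c is an assumption)
    return dot(x, y) + c


def quadratic_kernel(x, y, c=1.0):
    # K(x, y) = 1 - ||x - y|| / (||x - y|| + c), as written in Section 2.1
    d = dist(x, y)
    return 1.0 - d / (d + c)


def polynomial_kernel(x, y, a=1.0, d=2):
    # K(x, y) = (a * x^T y)^d  (defaults a=1, d=2 are assumptions)
    return (a * dot(x, y)) ** d


def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    return math.exp(-dist(x, y) ** 2 / (2.0 * sigma ** 2))
```

Each function takes two feature vectors, for example two windows of daily prices, and returns a single similarity value, so any of them could in principle be passed to an SVM implementation that accepts a user-defined kernel.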

2.2 Unsupervised Learning Algorithm

To identify stocks with good performance, I plan to use unsupervised learning algorithms such as k-means and mixture of Gaussians. The k-means clustering algorithm partitions n observations into k clusters, assigning each observation to the cluster with the nearest mean [2]. Mixture of Gaussians is another unsupervised learning algorithm that can be applied to clustering problems. In this project, we will use these two algorithms to partition a large set of stocks into clusters according to the performance measures we define.
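As an illustration of the clustering step, here is a minimal k-means sketch in plain Python. It is not the planned implementation: the function name kmeans, the random initialization from the data points, and the fixed seed are all assumptions, and each point stands for a hypothetical feature vector describing one stock.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Partition points into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct data points (an assumed choice).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids
```

For example, calling kmeans(points, 2) on two well-separated groups of feature vectors should place each group in its own cluster. A mixture of Gaussians replaces the hard assignment step with per-cluster membership probabilities.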

3 Dataset and Features

We retrieve historical stock data from Yahoo Finance through the Yahoo stock API. The dataset we collected consists of two parts:

Historical stock prices. We collected the historical price data of seven stocks in the information technology sector of Yahoo Finance: Apple, Autodesk, Cisco, Intel, Microsoft, Oracle, and Yahoo. The features of these stocks include Date, Symbol, Open Price, High Price, Low Price, Close Price, and Volume. The data set contains 1,650 records. Figure 1 displays part of the data.

Figure 1 Historic Stock Price
General stock information. We also collected general information about the stocks in the S&P 500 index, including current price, total trading volume, P/E ratio, 200-day moving average, percent change, and market capitalization. The data set contains 500 records. Figure 2 displays part of the data.

Figure 2 General Stock Information
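Two of the general features, the 200-day moving average and the percent change, are simple to compute from a price series. The sketch below shows one common definition of each; the function names are hypothetical, and the values reported by Yahoo Finance may be computed slightly differently.

```python
def moving_average(prices, window=200):
    """Mean of the most recent `window` prices (e.g. daily closes)."""
    if len(prices) < window:
        raise ValueError("not enough prices for the requested window")
    return sum(prices[-window:]) / window


def change_percent(previous_close, current_price):
    """Percent change of the current price relative to the previous close."""
    return (current_price - previous_close) * 100.0 / previous_close
```

Features on very different scales, such as market capitalization and P/E ratio, would typically be normalized before being passed to a distance-based clustering algorithm.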

4 Preliminary Results

Using the four SVM kernels, we conducted the following tests on the price prediction of the seven stocks:

Price prediction of the next day based on the previous day's price data. As illustrated in Figure 3, the rate of wrong predictions is high for all four kernels, at about 30% to 50%.

Figure 3 Price prediction of the next day based on the previous day's price data
Price prediction of the next day based on the previous week's price data. As illustrated in Figure 4, when we use the data of the previous week to predict the next day's price, the number of errors is much smaller than in the previous case. The quadratic kernel achieved the best performance, with a rate of wrong predictions of about 13%.

Figure 4 Price prediction of the next day based on the previous week's price data
Price prediction of the next week based on one day's price data. This prediction has little practical value, and our experimental results show that the rate of wrong predictions is high, at nearly 50%.

Figure 5 Price prediction of the next week based on one day's price data
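The experiments above differ mainly in how many past days feed each prediction. The sketch below shows one way to build such training examples and to compute the rate of wrong predictions. It rests on assumptions not stated in this report: that a week means five trading days, and that each prediction is scored as a direction (+1 if the next day's price is higher than the last price in the window, −1 otherwise). The function names are hypothetical.

```python
def make_windows(prices, window=5):
    """Build (features, labels) from a chronological list of prices.

    Each feature vector holds `window` consecutive prices. The label is
    +1 if the following day's price is higher than the last price in
    the window, and -1 otherwise (an assumed labeling scheme).
    """
    features, labels = [], []
    for t in range(window, len(prices)):
        features.append(prices[t - window:t])
        labels.append(1 if prices[t] > prices[t - 1] else -1)
    return features, labels


def error_rate(predicted, actual):
    """Fraction of predictions that do not match the actual labels."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)
```

Setting window=1 corresponds to predicting from the previous day's data only, and window=5 to predicting from the previous week.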

5 Remaining Tasks

My remaining tasks are as follows:

Optimize the stock price prediction. On the one hand, I will try to collect more data with more valuable features, which should result in more precise predictions. On the other hand, I will try to implement other classification algorithms, such as decision trees, and compare their results with those of the SVM.
Identify a set of stocks with good performance. I will analyze the data and define features that characterize well-performing stocks. Then I will implement at least two unsupervised learning algorithms, such as k-means and mixture of Gaussians, and compare their results.

6 Timeline

Period	Task
May 10 - May 20	Implement the classifier for optimal set of stocks
May 21 - May 30	Optimization, testing and polishing results
May 31	Final report

References

[1] http://en.wikipedia.org/wiki/Support_vector_machine

[2] http://en.wikipedia.org/wiki/K-means_clustering

[3] Christopher King, Christophe Vandrot, John Weng. A SVM Approach to Stock Trading, 2009.

[4] Hung Pham, Andrew Chien, Youngwhan Lim. A Framework for Stock Prediction, 2009.