Smart Stock Selection
Final Report

Chanjuan Wen

May 31, 2011

1 Introduction

A stock represents a share of ownership in a company. Distinct from the property and assets of a business, stock can fluctuate in quantity and value[1]. For stock traders, fluctuations in a stock's price directly determine profit or loss, so choosing stocks with high returns is of great value. Stock market prediction is the attempt to determine the future price of a company's stock or of another financial product traded on a financial exchange[2]. Successful stock prediction can clearly yield significant profit. However, accurately predicting the performance of a stock is very difficult for the following reasons:
Nevertheless, many past studies have shown that stock returns are predictable from a stock's historical information[3], and many researchers have applied machine learning techniques to stock prediction. Avramov and Chordia showed that returns on individual stocks are predictable[4]. Hamid and Iqbal[5] used neural networks to forecast the volatility of S&P 500 Index futures prices. Compared with statistical methods, machine learning methods make no assumptions about the distribution of the dataset and are therefore well suited to stock prediction[3].

Long-term investment (months or years) requires fundamental analysis of a company's performance and financial records, while very short-term investment (1-2 days) is practical only for large financial corporations because of the high commission fees and the need for high-quality financial data. In this project, I therefore focus on short-term stock trading (1-2 weeks).

The goal of this project is therefore to help investors make short-term stock investments by constructing a profitable stock portfolio using machine learning techniques. A portfolio is the set of stocks an investor holds over a period of time. An investor makes money by buying stocks at a low price and selling them at a higher price. To maximize the profit of a portfolio, investors need to select stocks with a high return rate. Let Vi denote the initial value of an investment and Vf its final value; then the return rate is[6]:
$$ r_{\text{arith}} = \frac{V_f - V_i}{V_i} $$
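As a quick illustration of the return-rate formula, the sketch below evaluates it for a made-up investment. The function name and the prices are hypothetical and are not taken from the project.

```python
def arithmetic_return(v_initial, v_final):
    """Arithmetic return rate (Vf - Vi) / Vi of a single investment."""
    if v_initial == 0:
        raise ValueError("initial value must be non-zero")
    return (v_final - v_initial) / v_initial


# Hypothetical example: a position bought at 50.0 and sold at 53.0.
rate = arithmetic_return(50.0, 53.0)
print(f"return rate: {rate:.2%}")
```

A negative result simply means the final value is below the initial value, that is, a loss.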

2 Previous Work

Previous work on stock prediction mainly falls into several categories as follows:

3 Objective

As analyzed previously, the return of a stock directly reflects the amount of profit and thus has significant economic value. Assuming no commission cost for trading stocks, my objective in this project is to select stocks that form a portfolio with high return. For each week t, the return rate of each stock is predicted from the historical trading information prior to week t, and each stock is ranked by its predicted return. The N highest-ranked stocks then form the portfolio of size N for week t.
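The selection step can be sketched as follows, assuming the predicted returns for one week are already available as a mapping from ticker to predicted return. The tickers, values, and function name are placeholders chosen for illustration.

```python
def select_portfolio(predicted_returns, n):
    """Rank stocks by predicted return and keep the n highest-ranked ones."""
    ranked = sorted(predicted_returns, key=predicted_returns.get, reverse=True)
    return ranked[:n]


# Hypothetical predictions for one week (ticker -> predicted weekly return).
predictions = {"AAA": 0.012, "BBB": -0.004, "CCC": 0.031, "DDD": 0.007}
portfolio = select_portfolio(predictions, n=2)
print(portfolio)
```

How the predictions themselves are produced is left to the model; this sketch only covers the ranking and the cut at N.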

4 Approach

4.1 Competitive Learning Based Non-parametric Model

Since the dataset is very large and grows every trading day, I want my model to grow with the data. I therefore used a non-parametric model based on a competitive learning method[9]. As illustrated in the figure below, my algorithm consists of three steps:
  1. Cluster Initialization. In the 3-D feature space, I randomly choose k samples as the initial cluster centers and then run K-means. Each feature sample is labeled with the cluster whose center is closest to it; then, with the labels fixed, each cluster center is updated to the average of all feature vectors belonging to that cluster. Repeating these two steps until convergence yields k clusters. For each cluster, the average feature vector and the average weekly return are also recorded. Finally, I use the feature vectors of the k cluster centers to construct a Kd-Tree for nearest-neighbor search in the prediction stage.
  2. Return Prediction. For each stock i in week t, with feature vector denoted by $x_t^{(i)}$, I use the Kd-Tree to find its nearest cluster c and take c's average return as the predicted return for $x_t^{(i)}$. After predicting the weekly return of all stocks, I choose the N stocks with the highest predicted return to form the portfolio.
  3. Cluster Update. At the end of week t, the actual returns of all stocks for week t are known, so the clusters can be updated. Two quantities need to be updated:
    • The cluster centers. For each stock i, I use its feature vector to update the center of the cluster to which i belongs. Let $x_j^{avg}$ denote the average feature vector of cluster j and n the number of feature samples belonging to j; then the update equation is: $x_j^{avg} = (n \, x_j^{avg} + x_t^{(i)})/(n + 1)$.
    • The average return of each cluster. For each stock i, I use i's actual return in week t to update the average return of the cluster to which i belongs. Let $r_j^{avg}$ denote the average return of cluster j and n the number of feature samples belonging to j; then the update equation is: $r_j^{avg} = (n \, r_j^{avg} + r_t^{(i)})/(n + 1)$.
figure figures/algorithm.png

Figure 1 Three steps of my algorithm
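The three steps can be sketched in plain Python as below. This is a minimal illustration rather than the project's code: a brute-force nearest-center search stands in for the Kd-Tree, the class and method names are hypothetical, the sample count n is assumed to increase by one after each update, and the toy data are invented.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, x):
    """Index of the closest center (brute force; stands in for the Kd-Tree)."""
    return min(range(len(centers)), key=lambda j: sq_dist(centers[j], x))


class ClusterModel:
    def __init__(self, features, returns, k, seed=0, max_iter=100):
        # Step 1: pick k random samples as centers, then run K-means.
        rng = random.Random(seed)
        self.centers = [list(c) for c in rng.sample(features, k)]
        labels = None
        for _ in range(max_iter):
            new_labels = [nearest(self.centers, x) for x in features]
            if new_labels == labels:  # assignments stopped changing
                break
            labels = new_labels
            for j in range(k):
                members = [x for x, lab in zip(features, labels) if lab == j]
                if members:  # an empty cluster keeps its old center
                    self.centers[j] = [sum(c) / len(members) for c in zip(*members)]
        # Record the size and the average weekly return of each cluster.
        self.counts = [0] * k
        sums = [0.0] * k
        for r, j in zip(returns, labels):
            self.counts[j] += 1
            sums[j] += r
        self.avg_return = [s / c if c else 0.0 for s, c in zip(sums, self.counts)]

    def predict(self, x):
        # Step 2: the predicted return is the nearest cluster's average return.
        return self.avg_return[nearest(self.centers, x)]

    def update(self, x, actual_return):
        # Step 3: running-average update of the center and the average return.
        j = nearest(self.centers, x)
        n = self.counts[j]
        self.centers[j] = [(c * n + xi) / (n + 1) for c, xi in zip(self.centers[j], x)]
        self.avg_return[j] = (self.avg_return[j] * n + actual_return) / (n + 1)
        self.counts[j] = n + 1  # assumption: n grows with each new sample


# Toy data: two well-separated groups of 3-D feature vectors.
features = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
returns = [0.01, 0.03, -0.02, -0.04]
model = ClusterModel(features, returns, k=2)
predicted = model.predict((0.05, 0.0, 0.0))  # made before the week's return is known
model.update((0.05, 0.0, 0.0), 0.05)         # made once the actual return is known
```

With only k centers to search, brute force is adequate for a sketch; a Kd-Tree pays off as k and the number of weekly queries grow.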

4.2 Other Models for Comparison

To measure the economic profit of my method for weekly return prediction, I also implemented two other models as a control group: Linear Regression and Nonlinear Regression. They are described in detail in the following two subsections.

4.2.1 Linear Regression

As a supervised learning algorithm, Linear Regression uses a linear function to model the relationship between a scalar variable y and multiple variables X[10]. The linear function usually takes the following form:
$$ f_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n $$

Here, the θi are the parameters that parameterize the space of linear functions mapping from X to Y[11]. Given a training set of m examples (x(i), y(i)), we choose the θi that minimize the sum of squared errors:
$$ E(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( f_\theta(x^{(i)}) - y^{(i)} \right)^2 $$
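The report does not say how the minimization was carried out. One standard option is to solve the normal equations, sketched below in plain Python with Gauss-Jordan elimination. The function names and the toy data are hypothetical.

```python
def fit_linear(X, y):
    """Least-squares fit of y ~ theta0 + theta1*x1 + ... via the normal equations."""
    rows = [[1.0] + list(x) for x in X]  # prepend 1 for the intercept theta0
    p = len(rows[0])
    # Augmented system [A^T A | A^T y].
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    # Assumes A^T A is non-singular (no perfectly collinear features).
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]


def predict_linear(theta, x):
    """Evaluate f_theta(x) = theta0 + sum_i theta_i * x_i."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


# Toy data generated from y = 1 + 2*x1 - 3*x2 (no noise).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, -2, 0, 2]
theta = fit_linear(X, y)
print([round(t, 6) for t in theta])
```

Gradient descent on E(θ) would reach the same minimizer; the closed form is simply shorter when the number of features is small.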

4.2.2 Nonlinear Regression

To avoid the underfitting that Linear Regression may cause, Nonlinear Regression uses a nonlinear combination of the model parameters and depends on multiple independent variables X[12]. The nonlinear function I chose adds all second-order terms of the features to the linear model:
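$$ f_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n + \theta_{n+1} x_1^2 + \cdots + \theta_{2n} x_1 x_n + \theta_{2n+1} x_2^2 + \cdots + \theta_{3n-1} x_2 x_n + \cdots + \theta_{\frac{(n+1)(n+2)}{2} - 1} x_n^2 $$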

As with Linear Regression, we choose the θi that minimize the sum of squared errors. Since some nonlinear functions may cause overfitting while others may cause underfitting, choosing a good nonlinear function is very important.
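Because the chosen function is a sum of parameters times fixed terms, it can be fitted by expanding each feature vector into its second-order terms and then applying ordinary least squares to the expanded vector. The sketch below shows only the expansion, in the same term order as the formula above; the function name is hypothetical.

```python
def quadratic_features(x):
    """Expand x into [1, x1, ..., xn, x1*x1, x1*x2, ..., x1*xn, x2*x2, ..., xn*xn]."""
    n = len(x)
    feats = [1.0] + list(x)
    for i in range(n):
        for j in range(i, n):  # j >= i so each product appears exactly once
            feats.append(x[i] * x[j])
    return feats


# For n features the expansion has (n + 1)(n + 2) / 2 terms in total.
example = quadratic_features([2.0, 3.0])
print(example)
```

The number of terms grows quadratically with n, which is one reason such a model can overfit when the training set is small.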

5 Experiment

5.1 Data Retrieving

Using the Yahoo Stock API, I collected the historical data of all stocks in the S&P 500 index from April 2001 to April 2011. For each trading day in that period, I retrieved the daily price and trading volume of each stock. After omitting stocks with missing data, the final dataset contains the daily price and trading volume of 462 stocks for 2515 days.

5.2 Data Preprocessing

In this project, I predict the weekly return of each stock rather than the daily return, so I group the trading days into weeks of five trading days each. Let R(t) denote the return of week t, d the last trading day before week t, p(d) the price on day d, v(j) the trading volume on day j, and V(t) the total trading volume of week t. Then we have:
$$ R(t) = \frac{p(d+5) - p(d)}{p(d)}, \qquad V(t) = \sum_{j=d+1}^{d+5} v(j) $$
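The aggregation can be sketched as follows, assuming the daily prices and volumes of one stock are stored in two equal-length lists whose first entry is the last trading day before week 1. The function name and the toy numbers are illustrative only.

```python
def to_weekly(prices, volumes, days_per_week=5):
    """Turn daily prices and volumes into weekly returns R(t) and volumes V(t)."""
    weekly_returns, weekly_volumes = [], []
    d = 0  # index of the last trading day before the current week
    while d + days_per_week < len(prices):
        last = d + days_per_week
        weekly_returns.append((prices[last] - prices[d]) / prices[d])
        weekly_volumes.append(sum(volumes[d + 1 : last + 1]))
        d = last
    return weekly_returns, weekly_volumes


# Toy series: one reference day followed by two full weeks.
prices = [100, 101, 102, 103, 104, 110, 111, 112, 113, 114, 121]
volumes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
weekly_r, weekly_v = to_weekly(prices, volumes)
print(weekly_r, weekly_v)
```

Trailing days that do not fill a complete week are dropped in this sketch.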

After preprocessing, my dataset therefore contains 500 weeks of data for 462 stocks.

5.3 Feature Selection

As discussed in the previous section, the prediction of the return for week t is based on the actual returns and trading volumes of weeks t − 1 and t − 2. For the following three methods, I choose different feature vectors.

5.4 Parameter Configuration

Of the 500 weeks of historical data for 462 stocks, I use the first 149 weeks as the training set and the remaining 351 weeks as the test set. For the portfolio size, experiments showed that N = 10 is a good choice that yields a fairly high return. Similarly, I ran several tests on the number of clusters and found that k = 50 is a good choice.

6 Results

To show the performance of my algorithm, I plot the cumulative return against the number of weeks. In the figure below, the x-axis represents the number of weeks and the y-axis represents the cumulative weekly return rate. The red curve shows the performance of my algorithm, and the yellow and green curves show that of the linear and nonlinear models, respectively. As the figure shows, my algorithm performs much better than both the linear regression model and the nonlinear regression model.
figure figures/results.png

Figure 2 Cumulative Return with Respect to the Increase of Weeks
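The report does not state whether the cumulative return in the plot is a running sum of weekly returns or a compounded value. The sketch below computes both under that stated uncertainty; the function name and the sample weekly returns are hypothetical.

```python
def cumulative_returns(weekly_returns, compound=False):
    """Cumulative return after each week, as a running sum or compounded."""
    total, growth, curve = 0.0, 1.0, []
    for r in weekly_returns:
        total += r          # simple running sum of weekly return rates
        growth *= 1.0 + r   # value of 1 unit reinvested every week
        curve.append(growth - 1.0 if compound else total)
    return curve


# Hypothetical weekly portfolio returns.
weekly = [0.5, -0.5]
summed = cumulative_returns(weekly)
compounded = cumulative_returns(weekly, compound=True)
print(summed, compounded)
```

The two conventions diverge as the horizon grows, so a reported cumulative figure is best read together with the convention used to compute it.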

7 Conclusion

On the dataset tested, the portfolio constructed by my non-parametric model consistently yields a higher cumulative return than those constructed by the linear and nonlinear regression models. Over a period of 7 years, using my model to select stocks each week achieves a 250% cumulative return.

References

[1] http://en.wikipedia.org/wiki/Stock

[2] http://en.wikipedia.org/wiki/Stock_market_prediction

[3] Robert J. Y, John N., Charles X. L. Application of Machine Learning to Short-term Equity Return Prediction. 2006.

[4] Avramov D., Chordia T. Predicting stock returns, Journal of Financial Economics 82, 387-415. 2006.

[5] Hamid S.A., Iqbal Z. Using neural networks for forecasting volatility of S&P 500 Index futures prices, Journal of Business Research 57, 1116-1125. 2004.

[6] http://en.wikipedia.org/wiki/Rate_of_return

[7] Hung P., Andrew C., Youngwhan L. A Framework for Stock Prediction. 2009

[8] Theodore B. T., Huseyin I. Support vector machine for regression and applications to financial forecasting. IJCNN2000, 348-353.

[9] Robert J. Y., Charles X. L. Machine Learning for Stock Selection. The 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA, 2007: 1038-1042.

[10] http://en.wikipedia.org/wiki/Linear_regression

[11] http://www.stanford.edu/class/cs229/notes

[12] http://en.wikipedia.org/wiki/Nonlinear_regression

[13] Cooper M. Filter rules based on price and volume in individual security overreaction. Review of Financial Studies, 1999(2): 901-935.