CS174 Project Milestone

Smart NBA Outcome Prediction System

Fanghao Chen


    Introduction:

    The NBA is a core part of modern professional sports, with around $4.1 billion in revenue in the 2010-2011 season [1]. Professional betting, another billion-dollar industry, depends heavily on accurate game outcome prediction [2]. The goal of this project is to implement machine learning algorithms that predict the outcome of a game or series between two NBA teams from the teams' statistics.


    Dataset:

    The NBA dataset was downloaded from www.databasebasketball.com. The raw data contains yearly NBA statistics for players, teams and coaches, covering both regular seasons and playoffs. I use only the 2008-2009 through 2009-2010 seasons for consistency, because the composition of the league has kept changing since the 2000s as old teams left and new teams joined.


    In this project, the regular season team statistics are used as X. There are 32 features in total, including offensive and defensive statistics, pace, and wins for that season. Intuitively, a team's offensive stats are the stats it "earned" against its opponents, while its defensive stats are the stats its opponents "earned" against it. In addition, I found several dominant features for prediction accuracy: each team's wins, defensive assists, defensive points, defensive field goals made, and offensive rebounds in the previous season.

    Table 1. Top 5 dominant features of team statistics.
    Features   wins     d_ast    d_pts    d_fgm    o_reb
    Accuracy   0.6939   0.6293   0.6199   0.613    0.611
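
    The report does not state how the per-feature accuracies in Table 1 were computed. One plausible reading, shown here only as a sketch, is to score the rule "the team with the larger feature value wins the series" for each feature separately. The function name and the input layout (one pair of values per series) are assumptions made for illustration, as is the choice to report whichever direction of the rule scores better.

```python
def single_feature_accuracy(pairs, labels):
    """Accuracy of predicting a series from one feature alone.

    pairs  : list of (value_team_a, value_team_b) for a single feature
             (hypothetical layout, not taken from the dataset)
    labels : list of 1 (team A won the series) or 0 (team A lost)
    """
    # Rule: predict that team A wins when its feature value is larger.
    hits = sum(1 for (a, b), y in zip(pairs, labels) if int(a > b) == y)
    acc = hits / len(labels)
    # For "smaller is better" features (e.g. points allowed) the rule above
    # scores below 0.5; the inverted rule then scores 1 - acc.
    return max(acc, 1.0 - acc)


# Toy numbers for illustration only (not from the NBA data).
prev_wins = [(50, 40), (30, 45), (60, 20), (41, 42)]
outcomes = [1, 0, 1, 1]
print(single_feature_accuracy(prev_wins, outcomes))
```

    Ranking all 32 features by this score would give a list of the same shape as Table 1, although the actual numbers depend on how the original evaluation was done.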

    Y is the series outcome between two teams. For example, suppose the Boston Celtics play two games against the Los Angeles Lakers in their series. If the Celtics win both games, the series outcome for the Celtics is 1; if they lose both, it is 0. If the series is drawn, the outcome is determined by the most dominant feature, i.e. each team's wins in the previous season: the team with the higher win rate gets series outcome 1. Because the database lacks cumulative in-season data, only the series outcome, not the individual game outcome, can be used as Y. In other words, the same X cannot predict the two different game results of a series that ends with 1 win and 1 loss. This is a limitation of the dataset, which I will discuss in a later part.
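
    The labeling rule above can be sketched as a small function. This is an illustration rather than the project's actual code: the function name and arguments are hypothetical, and the case where both teams also have equal previous-season wins is not covered in the text, so returning 0 there is an assumption.

```python
def series_label(wins_a, wins_b, prev_wins_a, prev_wins_b):
    """Return the series outcome Y for team A (1 = wins the series, 0 = loses).

    wins_a, wins_b           : games won by each team in the series
    prev_wins_a, prev_wins_b : each team's wins in the previous season
    """
    if wins_a != wins_b:
        # Clear series winner.
        return 1 if wins_a > wins_b else 0
    # Drawn series: fall back to the most dominant feature, previous-season wins.
    # Assumption: if those are equal as well, team A is labeled 0.
    return 1 if prev_wins_a > prev_wins_b else 0


# Example with made-up records: a 1-1 series decided by previous-season wins.
print(series_label(1, 1, 62, 57))
```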


    Figure 1. Training and test set of the prediction system.

    Algorithms:

    To get a better sense of achievable prediction accuracy, I examined several related works on game outcome prediction. Papamichael et al. [3] reported up to 73% accuracy in predicting NBA games using linear regression. Hamadani [4] used logistic regression to predict NFL games with an accuracy of 64.8%. Balla [5] predicted soccer matches with an accuracy of 65.5% using neural networks. In this project, I plan to implement 5 machine learning binary classification methods.


    Results:

    The table below gives the prediction results of the four algorithms implemented so far. All methods except linear regression achieve a satisfactory series prediction accuracy. For these methods, however, the game prediction accuracy is considerably lower because of the limitation of the collected data. Overall, AdaBoost appears to be the best classification algorithm for both game and series prediction.



    Table 2. Accuracy comparison on test set.
    Algorithms                   Linear   Logistic   AdaBoost   KNN
    Games Prediction Accuracy    0.661    0.6797     0.7016     0.6894
    Series Prediction Accuracy   0.5655   0.8368     0.8966     0.8506

    In this project I assume that if a team wins a series, it wins all the games in that series. Clearly, the bad case is a drawn series, which can only yield a fixed 50% game prediction accuracy. Therefore, I tried using KNN to classify series results as win/lose/draw instead of the original win/lose. As before, if a team wins or loses a series, it is assumed to win or lose all the games; but if the series is drawn, the team wins and loses an equal number of games. The figure below shows the result of this proposed series model in place of the binary classification model. Although the game prediction accuracy increased a little, the series accuracy drops approximately 60% compared to the win/lose model. Because I have not implemented neural networks yet, it is hard to determine at this time whether to use the win/lose/draw model or the win/lose model to predict game outcomes.
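
    To make the link between series predictions and game accuracy concrete, the sketch below converts a series prediction into predicted wins and losses and counts how many games match the actual record. The function names, the "win"/"lose"/"draw" strings, and the matching rule (matched wins plus matched losses) are assumptions for illustration; for a series with an odd number of games, a "draw" prediction is rounded down to n // 2 wins, which is also an assumption.

```python
def matched_games(pred, wins, losses):
    """Number of games in one series that a series-level prediction gets right.

    pred         : "win", "lose" or "draw" (predicted series result for team A)
    wins, losses : team A's actual record in the series
    """
    n = wins + losses
    if pred == "win":
        pred_wins = n          # a predicted series win means winning every game
    elif pred == "lose":
        pred_wins = 0          # a predicted series loss means losing every game
    else:
        pred_wins = n // 2     # a predicted draw means an even split
    pred_losses = n - pred_wins
    return min(pred_wins, wins) + min(pred_losses, losses)


def game_accuracy(preds, records):
    """Game-level accuracy over many series; records are (wins, losses) pairs."""
    total = sum(w + l for w, l in records)
    correct = sum(matched_games(p, w, l) for p, (w, l) in zip(preds, records))
    return correct / total


# Toy records for illustration only: a 2-0 series, a 1-1 series, a 0-2 series.
records = [(2, 0), (1, 1), (0, 2)]
print(game_accuracy(["win", "win", "lose"], records))   # binary win/lose model
print(game_accuracy(["win", "draw", "lose"], records))  # win/lose/draw model
```

    Under these assumptions, a binary prediction on a 1-1 series always matches exactly one of the two games, which is the fixed 50% described above, while a correct "draw" prediction matches both.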



    Limitations:


    Future Work:


    References:

    [1] Sports Industry Overview
    [2] McMurray, S. (1995). Basketball's new high-tech guru. U.S. News and World Report, December 11, 1995, pp. 79-80.
    [3] Michael Papamichael, Matthew Beckler and Hongfei Wang, NBA Oracle, 2009.
    [4] Hamadani, Babak. Predicting the outcome of NFL games using machine learning. Project Report for CS229, Stanford University.
    [5] Balla, Radha-Krishna. Soccer Match Result Prediction using Neural Networks. Project report for CS534.
    [6] Alan McCabe, An Artificially Intelligent Sports Tipper, in Proceedings: 15th Australian Joint Conference on Artificial Intelligence, 2002.
    [7] Yoav Freund, Robert E. Schapire. "A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting", 1995.
    [8] Paul A. Viola, Michael J. Jones, "Robust Real-Time Face Detection", ICCV 2001, Vol. 2, pp. 747.