Harrison Hall and John Sigman
CS 174 Project Proposal
Thayer School of Engineering
Dartmouth College
Harrison.K.Hall.TH, John.B.Sigman.TH [at] Dartmouth.edu
The men's NCAA Basketball Tournament is a 64-team, single-elimination tournament held every year that determines the national champion. Even with 6.45 million brackets filled out on ESPN.com last year[1], the winner still failed to predict 12 of the games correctly[2]. While it is astronomically unlikely that anyone has picked or ever will pick a perfect bracket (a chance of 1 in 2^63 if each of the 63 games is treated as a coin flip), it is clear that current augmented human predictions are far from perfect. Some machine learning algorithms exist that are competitive with the brackets of professional sports analysts, but these algorithms are designed to take into account only team-level statistics. Although single-game, individual player statistics are available[3], current published approaches tend not to evaluate the importance of individual players or potential player match-ups. We will explore whether including data about personnel match-ups enables improved predictions in NCAA tournament play by combining these current classifiers with individual player evaluation and player match-up classifiers in AdaBoost.
Despite the significant recreational and gambling interest in bracket outcome prediction, there are only a few academic investigations of bracket prediction algorithms. These algorithms fall into two classes: those that generate rankings using season aggregates and those that compare game-by-game team-level statistics. One of the simplest and least computationally intensive algorithms for generating an ordering on the teams in the NCAA is the Rating Percentage Index (RPI)[4]. RPI attempts to account for the strength of schedule of every team by taking an arbitrarily weighted sum of its winning percentage, its opponents' average winning percentage, and its opponents' opponents' average winning percentage. The higher the value of the sum, the better the team. This metric has clear faults, as there are edge cases where winning a game against a poor opponent actually makes a team's rating decrease; however, we mention RPI here because it is routinely used by the NCAA to determine tournament seeding. Ordinal logistic regression and expectation (OLRE)[4] is another simple method: it uses season winning percentage and total point differential combined with some proprietary metrics, namely the strength of schedule and the number of wins against top-30 teams as computed by Jeff Sagarin. All of these are easily read off in the fashion of a box score, with the understanding that we treat the proprietary metrics as a black box, since they come at no computational cost to the algorithm. It is worth noting that additional classification metrics can be added here without significant additional cost. Logistic regression is then used to calculate, for each team, the probability of winning 0, 1, 2, 3, 4, 5, or 6 games in the tournament. These probabilities are assembled into a 64-by-7 table.
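As a concrete illustration of the RPI weighted sum described above, the following is a minimal sketch rather than the NCAA's exact procedure. The weights of 0.25, 0.50, and 0.25 are the commonly cited values and are an assumption here, since the text above only says the weights are arbitrary. The function name and the (winner, loser) input layout are hypothetical, and the sketch omits refinements such as excluding head-to-head games from the opponents' winning percentages.

```python
from collections import defaultdict

# Commonly cited RPI weights; an assumption for illustration only.
W_WP, W_OWP, W_OOWP = 0.25, 0.50, 0.25


def rpi(games):
    """Return {team: rating} from a list of (winner, loser) pairs.

    Simplified: opponents' winning percentages are not adjusted to exclude
    games against the team being rated, and game location is ignored.
    """
    wins = defaultdict(int)
    played = defaultdict(int)
    opponents = defaultdict(list)
    for winner, loser in games:
        wins[winner] += 1
        played[winner] += 1
        played[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    def mean(values):
        return sum(values) / len(values)

    # Winning percentage, opponents' average, opponents' opponents' average.
    wp = {t: wins[t] / played[t] for t in played}
    owp = {t: mean([wp[o] for o in opponents[t]]) for t in played}
    oowp = {t: mean([owp[o] for o in opponents[t]]) for t in played}
    return {t: W_WP * wp[t] + W_OWP * owp[t] + W_OOWP * oowp[t] for t in played}
```

Sorting the returned dictionary by value in descending order gives the team ordering; because half of the weight sits on the opponents' winning percentage, a team's rating depends heavily on who it played, not only on whether it won.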
Since we have a priori knowledge of the number of teams that will have each outcome, we can use a maximum likelihood estimator to fit multinomial-Poisson homogeneous models to contingency tables[5] such that the columns sum to 32, 16, 8, 4, 2, 1, and 1 respectively while each row sum is held at 1. From this, the expected number of wins for each team can be easily calculated and used to judge its probable progression in the tournament.
The second class of algorithms, those that do game-by-game comparison of teams, is significantly more complex to calculate because information about each matchup must be maintained. The logistic regression/Markov chain (LRMC)[6][7][8] model is a prominent example of this style of algorithm and has met with significant success. It uses logistic regression to calculate the transition probabilities in a Markov chain using the basic individual game metrics of location, winner, and score. It differs from other methods in that it attempts to quantify the effect home court advantage has on each team's performance; this matters because the NCAA tournament is played at neutral sites, which diverges significantly from most regular season play. This algorithm has been the most successful tournament predictor, routinely outperforming analyst polls, NCAA tournament seeding, RPI, Massey, Sagarin, and Las Vegas gambling predictions.
Common metrics that are used by some classifiers but are not discussed in depth in this study include the Sagarin[9], Pomeroy[10], and Massey[11][12] ratings. While these are used by the NCAA Tournament Selection Committee and as black boxes in certain learning models, the data used in the derivative papers was calculated using proprietary methods. Thus we cannot accurately recreate their results and must treat their data as an oracle.
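The row and column constraints on the 64-by-7 OLRE table, and the expected-wins calculation that follows from them, can be illustrated with a small sketch. Note that the code below uses iterative proportional fitting, a simple rescaling procedure swapped in for illustration; it is not the maximum likelihood multinomial-Poisson fit of [5]. The function names and iteration count are hypothetical, and the input is assumed to be a table of strictly positive raw probabilities.

```python
# Number of teams that finish with exactly 0, 1, ..., 6 tournament wins.
COLUMN_TOTALS = [32, 16, 8, 4, 2, 1, 1]


def fit_table(table, column_totals=COLUMN_TOTALS, iterations=500):
    """Rescale a positive teams-by-outcomes table so that every row sums to 1
    and column k sums to column_totals[k] (iterative proportional fitting)."""
    table = [list(row) for row in table]  # work on a copy
    for _ in range(iterations):
        # Scale each column to its known total.
        for k, target in enumerate(column_totals):
            col_sum = sum(row[k] for row in table)
            for row in table:
                row[k] *= target / col_sum
        # Scale each row back to a probability distribution.
        for row in table:
            row_sum = sum(row)
            for k in range(len(row)):
                row[k] /= row_sum
    return table


def expected_wins(table):
    """Expected number of tournament wins for each team (one value per row)."""
    return [sum(k * p for k, p in enumerate(row)) for row in table]
```

Once the constraints hold, the expected wins across all 64 teams should sum to approximately 63, the number of games in the tournament, which gives a quick sanity check on any fitted table.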
The first step in this process is acquiring a dataset with team and individual statistics. ESPN[3] keeps standard player and team statistics; however, they are not readily packaged for machine learning tasks, so we will write a script to scrape the data from the 2008-2009 (possibly 2000-2001) season through the 2011-2012 season. Once we have both game outcomes and player statistics, we will begin developing metrics for comparing the impact and importance of each player on the floor. Our ideal would be a calculation similar to the NBA's +/- impact rating, which is the points scored by a player's team minus the points scored by the opponent while that player is on the floor, but we do not currently have access to second-by-second accounting of scoring and on-court players. We are investigating simulating a plus-minus rating in the aggregate using each player's minutes on the floor and total points; however, we are not yet convinced of this method. We are also looking at comparing each player against the set of players at the same position on the opposing team to see whether he is likely to overmatch them and effectively generate more points, although there is evidence that the gross classifications of guard, forward, and center will not be enough[15]. We are actively considering other features for win classification at the individual player level. We plan to compare the traditional team-level classifiers with our individual player-based win classifier for a direct comparison of efficacy. Additionally, we are going to implement AdaBoost[13] and use the entire set of trained classifiers as weak classifier input to the boosting algorithm. We suspect that this will be the most effective method of win prediction. We also plan to use AdaBoost on RPI, OLRE, and LRMC alone as a point of comparison.
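To make the boosting step concrete, here is a minimal sketch of discrete AdaBoost[13] over a fixed pool of already-trained classifiers, each reduced to a vector of +1/-1 votes on a set of games. It is one possible reading of using trained classifiers as weak classifier input, not a settled design; the function names, the +1/-1 encoding, and the number of rounds are assumptions made for illustration.

```python
import math


def adaboost(predictions, labels, rounds=10):
    """Combine fixed classifiers with discrete AdaBoost.

    predictions[j][i] is classifier j's vote (+1 or -1) on game i, and
    labels[i] is the true outcome (+1 or -1). Returns one weight per classifier.
    """
    n = len(labels)
    sample_w = [1.0 / n] * n
    alphas = [0.0] * len(predictions)
    for _ in range(rounds):
        # Weighted error of every classifier under the current game weights.
        errors = [
            sum(w for w, p, y in zip(sample_w, votes, labels) if p != y)
            for votes in predictions
        ]
        best = min(range(len(predictions)), key=errors.__getitem__)
        err = min(max(errors[best], 1e-10), 1 - 1e-10)  # avoid log(0)
        if err >= 0.5:
            break  # no classifier beats chance on the reweighted games
        alpha = 0.5 * math.log((1 - err) / err)
        alphas[best] += alpha
        # Up-weight the games the chosen classifier got wrong.
        sample_w = [
            w * math.exp(-alpha * y * p)
            for w, p, y in zip(sample_w, predictions[best], labels)
        ]
        total = sum(sample_w)
        sample_w = [w / total for w in sample_w]
    return alphas


def predict(alphas, votes):
    """Weighted majority vote for one game; votes[j] is classifier j's +1/-1."""
    score = sum(a * v for a, v in zip(alphas, votes))
    return 1 if score >= 0 else -1
```

In this setup the rows of `predictions` might hold the picks of RPI, OLRE, LRMC, and a player-based classifier for the same list of games. Running the same function on only the first three rows would give the team-level-only comparison point.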
We are scraping data from ESPN that gives individual player statistics in addition to game outcomes. This data goes back to at least the mid-2000s. For the Massey, Sagarin, and Pomeroy ratings we will scrape their respective sites. We are actively looking for a legal source of plus-minus player rating data. To ensure that our predictions generalize, we plan to cross-validate our classifier training using a leave-one-out methodology in which one year's tournament outcomes are held out at a time.
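The leave-one-out scheme can be sketched as follows, holding out one season at a time and training on the rest. This is a minimal sketch under assumed conventions: the record layout (dicts with "season" and "winner" keys) and the `fit` and `predict` callables are hypothetical placeholders for whichever classifier is being evaluated.

```python
def leave_one_season_out(games):
    """Yield (held_out_season, train, test) splits.

    games is a list of dicts with at least a "season" key; the key names
    and record layout are hypothetical.
    """
    seasons = sorted({g["season"] for g in games})
    for held_out in seasons:
        train = [g for g in games if g["season"] != held_out]
        test = [g for g in games if g["season"] == held_out]
        yield held_out, train, test


def cross_validate(games, fit, predict):
    """Average held-out accuracy across seasons.

    fit(train) returns a model; predict(model, game) returns the predicted
    winner, which is compared against game["winner"].
    """
    accuracies = []
    for _, train, test in leave_one_season_out(games):
        model = fit(train)
        correct = sum(predict(model, g) == g["winner"] for g in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Splitting by season rather than by individual game keeps every game from the held-out tournament out of training, which is closer to how the classifier would actually be used on an upcoming bracket.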
[1] Kristie Chond-Adler. “ESPN’s tournament challenge sets bracket record with 6.45 million entries.” ESPN Front Row, March 2012.
[2] CincyFan007. “1.” http://games.espn.go.com/tournament-challenge-bracket/en/entry?entryID=970323
[3] ESPN. “NCAA-Men’s College Basketball.” http://espn.go.com/mens-college-basketball/
[4] West, Brady T. “A Simple and Flexible Rating Method for Predicting Success in the NCAA Basketball Tournament: Updated Results from 2007,” Journal of Quantitative Analysis in Sports, Volume 4, Issue 2, April 2008, ISSN (Online) 1559-0410, DOI: 10.2202/1559-0410.1099.
[5] Lang, J.B., “Homogeneous Linear Predictor Models for Contingency Tables,” Journal of the American Statistical Association (Theory and Methods), 2005, 100, 121-134.
[6] Kvam, P. and J.S. Sokol, “A logistic regression/Markov chain model for NCAA basketball,” Naval Research Logistics, 2006.
[7] M. Brown and J. Sokol, “An Improved LRMC Method for NCAA Basketball Prediction,” Journal of Quantitative Analysis in Sports, Volume 6, Number 3, Article 4, May 2010.
[8] Brown, M., P. Kvam, G. Nemhauser, and J. Sokol, “Insights from the LRMC Method for NCAA Tournament Prediction,” MIT Sloan Sports Analytics Conference, 2012.
[9] Sagarin, Jeff. “Jeff Sagarin's™ College Basketball Ratings.” http://sagarin.com/sports/cbsend.htm
[10] Pomeroy, Ken. “2013 Pomeroy College Basketball Ratings.” http://kenpom.com/
[11] Massey, Kenneth. “Massey Ratings-CB.” http://www.masseyratings.com/rate.php?lg=cb&sub=NCAA
[12] K. Massey, “Statistical Models Applied to the Rating of Sports Teams.” Honors project in mathematics, Bluefield College, 1997.
[13] Yoav Freund, Robert E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” Journal of Computer and System Sciences, Volume 55, Issue 1, August 1997, Pages 119-139, ISSN 0022-0000, DOI: 10.1006/jcss.1997.1504.
[14] Wikipedia. Plus-Minus. http://en.wikipedia.org/wiki/Plus-minus
[15] Alagappan, Muthu. “From 5 to 13: Redefining the Positions in Basketball,” MIT Sloan Sports Analytics Conference, 2012. http://www.sloansportsconference.com/?p=5431