Major League Baseball Performance Prediction: Milestone Report

Jason Reeves
CS 134 Project


Problem Recap

Our goal is to predict the performance of a major league player based on the past performance of "similar" players. In our proposal, we planned to predict a small subset of batting data (batting average, home runs, runs batted in) by using a k-nearest neighbor algorithm to determine the identity of the k most similar players to our predictee, then use the neighbors' data to make our prediction.

We laid out two goals for the project milestone:
  1. Obtain our data, and process it into a database for easy access.
  2. Implement our machine-learning algorithm in our code.
For the most part, we have achieved these goals, though there is still some work left in the way of data verification.

Our Data

Our main source of data is the set of spreadsheets provided by The Baseball Guru[4], which contain batting data spanning 1997 to 2009. Where we found the provided data to be incomplete, we used Baseball-Reference.com[1] to fill in the gaps.

Putting the data together in a database took much longer than anticipated, due to the inconsistency and poor setup of the spreadsheets themselves. While the spreadsheets all contained the relevant statistical data, some of the important metadata (years, ages, unique IDs) was not always included. (For example, the 2004 data did not include ages, the 2003 data lacked ages and IDs, and the 2005 data had neither the ages nor a column for the year.) In addition, the unique IDs provided appeared to have been pulled from the Baseball-Reference site[1] and could not always be determined from the spreadsheet alone. To work around these issues, we did the following: in each of the above cases, we manually examined the data and altered names so that each individual player was referred to within the data by a single variation of their name. (This was done on a case-by-case basis, without a standard criterion. For the most part, player names were changed to the most common variation of their name already in the data, to minimize the number of changes.)

We also encountered issues with full names being placed together (separated only by a comma) in a single column in some spreadsheets, and in the case of our 2000 data, the first names were left out entirely, save for the first initial. The former issue was quickly resolved by splitting the name cells on the comma, while the latter was fixed by matching the names there with the list of names in the 1999 data and using Baseball-Reference[1] to fill in any names not on the older list.
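As an illustration of the comma split described above, here is a minimal Java sketch. It assumes the combined cells are in "Last, First" order, which the spreadsheets may or may not follow consistently; the class and method names are hypothetical and are not taken from our project code.

```java
final class NameSplitter {
    /**
     * Splits a combined name cell into {first, last}.
     * Assumption: the cell is written as "Last, First".
     */
    static String[] splitName(String cell) {
        // Split on the first comma only, so a suffix after a second comma stays with the first name.
        String[] parts = cell.split(",", 2);
        if (parts.length < 2) {
            // No comma present: treat the whole cell as the last name.
            return new String[] {"", cell.trim()};
        }
        return new String[] {parts[1].trim(), parts[0].trim()};
    }

    public static void main(String[] args) {
        String[] name = splitName("Uggla, Dan");
        System.out.println(name[0] + " " + name[1]);
    }
}
```

Cells that lack a comma are passed through as a last name only, so they can be flagged for the manual lookup described above.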

Fortunately, in the process of manually cleaning up our name data, we verified the data being changed with Baseball-Reference[1] and found the statistical data within the spreadsheets to be mostly accurate. Some isolated examples of inconsistencies were discovered (some players had empty entries in years they did not play in, and in one or two instances a statistic line was attributed to the wrong player), but these errors were few and far between.

Finally, we were able to take our cleaned-up data and store it in a MySQL database, making it easier to access and work with when making our predictions. We declared the data spanning 1997-2008 to be our training set, and held out the 2009 data as a test set for our predictions.

Once the data had been transferred into MySQL, we set a threshold for our data to eliminate outliers (in this case, players with limited batting data, such as pitchers or players who saw limited action during their careers). Our threshold was set as follows:

To be considered for our algorithm, a player must have averaged at least 100 at-bats per year over the course of their career.

This threshold may be low (according to our 2009 batting data spreadsheet, the average number of at-bats per player that year was roughly 143.5), but it still removed a fair number of limited-data outliers while allowing players to be included despite having a few limited-data years due to injuries or other factors.
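The threshold can be sketched in Java as follows. This is only an illustration under one reading of the rule (a career average of at least 100 at-bats per season, so that a single injury-shortened year does not disqualify a player); the class name, method name, and sample numbers are hypothetical.

```java
import java.util.List;

final class AtBatThreshold {
    // Minimum career average of at-bats per season, as stated in the report.
    static final double MIN_AVG_AT_BATS = 100.0;

    /** Returns true if the player's mean at-bats per season meets the threshold. */
    static boolean qualifies(List<Integer> atBatsPerSeason) {
        if (atBatsPerSeason.isEmpty()) {
            return false; // no seasons on record, nothing to compare against
        }
        double total = 0;
        for (int atBats : atBatsPerSeason) {
            total += atBats;
        }
        return total / atBatsPerSeason.size() >= MIN_AVG_AT_BATS;
    }

    public static void main(String[] args) {
        // Hypothetical players: a regular with one short season, and a pitcher.
        System.out.println(qualifies(List.of(450, 30, 500)));
        System.out.println(qualifies(List.of(12, 40, 3)));
    }
}
```

Under this reading, the regular with one 30 at-bat season still qualifies, while the pitcher does not.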

Our Algorithm: K-Nearest Neighbor

As stated in our proposal, we used a k-nearest neighbor algorithm to determine the players most similar to the one whose performance we are predicting. Normally, k-NN is used as a classification algorithm, where an object is assigned a label based on the majority label of the nearest k data points[3]. Our approach differs from this idea slightly, as our labels are drawn from either the set of nonnegative integers (for home runs (HR) and runs batted in (RBI)) or the set of nonnegative real numbers (for batting average (BA)). Our algorithm works in the following way:
  1. Retrieve player x's year-by-year statistics up to the prediction point.

  2. For every player y in the training set, calculate the year-by-year "difference" between the statistic vectors of x and y up to the prediction point. Note that we compare players by age rather than by calendar year: for example, we compare what players did at age 25 rather than what they did in 2004.

  3. Average the yearly differences between x and each y. The k nearest neighbors are the players with the smallest average difference from x.

  4. Average the k neighbors' statistics at the prediction point to get the prediction for player x.

To determine the yearly difference between two players, we take the differences between the two players' batting averages, home runs, and RBIs, then sum the absolute values of these quantities to obtain our difference. Currently, in terms of parameter weighting, we multiply the batting average difference by 1000, since these averages are always computed to the thousandth of a point, and we want to assign one-thousandth of a batting average point the same weight as a home run or RBI. No additional weight parameters were given to home runs or RBIs. Therefore, our equation for the yearly difference between two players x and y is the following:

Diff(x, y) = (1000 * |x.BA - y.BA|) + |x.HR - y.HR| + |x.RBI - y.RBI|
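The equation above translates directly into a few lines of Java. The following is a minimal sketch rather than our actual implementation; the Season class and its field names are hypothetical, and the two seasons in main are made-up numbers.

```java
final class YearlyDiff {
    /** One season of the three statistics being compared. */
    static final class Season {
        final double ba;
        final int hr;
        final int rbi;

        Season(double ba, int hr, int rbi) {
            this.ba = ba;
            this.hr = hr;
            this.rbi = rbi;
        }
    }

    /** Diff(x, y) = 1000 * |x.BA - y.BA| + |x.HR - y.HR| + |x.RBI - y.RBI| */
    static double diff(Season x, Season y) {
        return 1000.0 * Math.abs(x.ba - y.ba)
                + Math.abs(x.hr - y.hr)
                + Math.abs(x.rbi - y.rbi);
    }

    public static void main(String[] args) {
        // Hypothetical seasons, not drawn from the data set.
        Season x = new Season(0.300, 20, 80);
        Season y = new Season(0.280, 25, 70);
        System.out.println(diff(x, y));
    }
}
```

For these two seasons, the batting average term contributes 20, the home run term 5, and the RBI term 10, for a difference of about 35 (up to floating-point rounding).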


Taking this year-by-year difference between x and every other player y for every year in x's career, x's nearest neighbors are the players with the k smallest average differences from x.
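The four steps can be put together in a short Java sketch. This is an illustration under stated assumptions, not our actual code: careers are represented as maps from age to season, candidates with no overlapping ages or no season at the prediction age are skipped, and averaged home runs and RBIs are rounded to the nearest integer. None of these choices is fixed by the description above, and all names and sample numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class KnnSketch {
    static final class Season {
        final double ba;
        final int hr;
        final int rbi;

        Season(double ba, int hr, int rbi) {
            this.ba = ba;
            this.hr = hr;
            this.rbi = rbi;
        }
    }

    /** Yearly difference: batting average scaled by 1000, plus HR and RBI gaps. */
    static double diff(Season x, Season y) {
        return 1000.0 * Math.abs(x.ba - y.ba) + Math.abs(x.hr - y.hr) + Math.abs(x.rbi - y.rbi);
    }

    /** Mean yearly difference over ages both players have before predictionAge. */
    static double averageDiff(Map<Integer, Season> x, Map<Integer, Season> y, int predictionAge) {
        double total = 0;
        int years = 0;
        for (Map.Entry<Integer, Season> e : x.entrySet()) {
            Season other = y.get(e.getKey());
            if (e.getKey() < predictionAge && other != null) {
                total += diff(e.getValue(), other);
                years++;
            }
        }
        // Assumption: players with no shared ages are treated as infinitely far apart.
        return years == 0 ? Double.POSITIVE_INFINITY : total / years;
    }

    /** Predicts x's season at predictionAge; candidates must not include x himself. */
    static Season predict(Map<Integer, Season> x, List<Map<Integer, Season>> candidates,
                          int predictionAge, int k) {
        List<Map<Integer, Season>> usable = new ArrayList<>();
        for (Map<Integer, Season> c : candidates) {
            boolean comparable = averageDiff(x, c, predictionAge) < Double.POSITIVE_INFINITY;
            if (c.containsKey(predictionAge) && comparable) {
                usable.add(c);
            }
        }
        // Smallest average difference first.
        usable.sort(Comparator.comparingDouble(c -> averageDiff(x, c, predictionAge)));
        int n = Math.min(k, usable.size());
        if (n == 0) {
            throw new IllegalArgumentException("no comparable players");
        }
        double ba = 0, hr = 0, rbi = 0;
        for (Map<Integer, Season> c : usable.subList(0, n)) {
            Season s = c.get(predictionAge);
            ba += s.ba;
            hr += s.hr;
            rbi += s.rbi;
        }
        return new Season(ba / n, (int) Math.round(hr / n), (int) Math.round(rbi / n));
    }

    public static void main(String[] args) {
        // Hypothetical careers keyed by age; these numbers are not from the data set.
        Map<Integer, Season> x = Map.of(24, new Season(.300, 20, 80), 25, new Season(.310, 25, 90));
        List<Map<Integer, Season>> pool = List.of(
            Map.of(24, new Season(.290, 22, 78), 25, new Season(.300, 24, 88),
                   26, new Season(.300, 26, 90)),
            Map.of(24, new Season(.250, 10, 50), 25, new Season(.260, 12, 55),
                   26, new Season(.270, 14, 60)));
        Season p = predict(x, pool, 26, 1);
        System.out.println(p.ba + " " + p.hr + " " + p.rbi);
    }
}
```

Recomputing the average difference inside the comparator keeps the sketch short; a real implementation over the full training set would likely compute each distance once and cache it.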

Results

We currently have a k-nearest neighbor algorithm written in Java, which will take a single player within our data set, find that player's k nearest neighbors (excluding the player himself), then make a prediction of that player's future performance based on the performance of the nearest neighbors. Currently, we have set aside our 2008 data as our test set within the training data, and our predictor is hard-coded to predict a player's performance for this year. We are, however, able to run our code using a range of values of k, and compare our predictions with a player's actual 2008 statistics to determine our error values. Below are example predictions for two players as k increases, together with each player's actual 2008 statistics:


(Image From Yahoo Sports[5])
Ryan Doumit (C-PIT)

2008 Statistic Predictions
k         BA    HR   RBI
1       .125     0     1
3       .215     2    18
5       .223     5    22
7       .201     3    17
9       .215     5    25
11      .221     4    21
ACTUAL  .318    15    69



(Image From Yahoo Sports[5])
Dan Uggla (2B-FLA)

2008 Statistic Predictions
k         BA    HR   RBI
1       .271    23    81
3       .271    35   102
5       .264    28    89
7       .263    27    92
9       .263    25    89
11      .263    26    90
ACTUAL  .260    32    92



(Note that because of the unresolved age issues mentioned earlier, these predictions may change as data corrections are made.)
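To make the comparison between predictions and actual statistics concrete, the following Java sketch computes per-statistic absolute errors for the Dan Uggla predictions listed above. The use of absolute error here is an assumption for illustration, since the exact error measure is not fixed by the description above; the class and method names are hypothetical.

```java
final class ErrorByK {
    /** Absolute error for each statistic, in the order {BA, HR, RBI}. */
    static double[] absoluteErrors(double[] predicted, double[] actual) {
        double[] errors = new double[predicted.length];
        for (int i = 0; i < predicted.length; i++) {
            errors[i] = Math.abs(predicted[i] - actual[i]);
        }
        return errors;
    }

    public static void main(String[] args) {
        // Dan Uggla's 2008 predictions and actual line, copied from the table above.
        int[] ks = {1, 3, 5, 7, 9, 11};
        double[][] predictions = {
            {.271, 23, 81}, {.271, 35, 102}, {.264, 28, 89},
            {.263, 27, 92}, {.263, 25, 89}, {.263, 26, 90}
        };
        double[] actual = {.260, 32, 92};
        for (int i = 0; i < ks.length; i++) {
            double[] e = absoluteErrors(predictions[i], actual);
            System.out.printf("k=%d BA=%.3f HR=%.0f RBI=%.0f%n", ks[i], e[0], e[1], e[2]);
        }
    }
}
```

For k = 5, for instance, the prediction misses by .004 in batting average, 4 home runs, and 3 RBIs.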

Next Steps

For the remainder of the project, we plan to do the following:

References

1. Baseball-Reference.com. Sports Reference LLC, 2000. Web. 12 April 2010. http://www.baseball-reference.com/.
2. Goldberger, Jacob, et al. "Neighbourhood Components Analysis." Conference on Neural Information Processing Systems. 5-10 Dec. 2005, Vancouver, Canada. Cambridge, MA: MIT Press, 2006. Web. CiteSeerX. 12 May 2010. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.6605&rep=rep1&type=pdf.
3. "K-nearest neighbor algorithm." Wikipedia.com. Wikipedia, n.d. Web. 12 April 2010. http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.
4. "The Baseball Guru's Private Data Archive." BaseballGuru.com. The Baseball Guru, n.d. Web. 12 April 2010. http://baseballguru.com/bbdata1.html.
5. Yahoo! Sports. Yahoo!, n.d. Web. http://sports.yahoo.com/.