Predicting Tennis Match Outcomes

Through Classification

Shuyang Fang '14 sfang@cs.dartmouth.edu

Computer Science 074/174: Machine Learning and Statistical Data Analysis - 2012S

 

Background and Objective

 

The governing body of men's professional tennis is the Association of Tennis Professionals, or ATP for short. It organizes the year-to-year tournaments and other events that male professionals participate in. It is perhaps most notable for publishing a weekly ranking of players, the ATP rankings. A player's ranking is based on the total points he has accrued in a specific set of tournaments within the last 52 weeks; more points equate to a higher ranking (under the standard convention, where #1 is the top rank, #2 is second, and so on). [1]

The question of who wins a given tennis match is of both recreational (fans) and professional (Vegas) interest. Conventional wisdom suggests that given two tennis players, the one with the higher ranking will win more often than not. This may be true; however, tennis is a game that revolves around matchups. A multitude of factors play into a match: surface, play styles, and even the players themselves are just a few examples. I expect that combining these factors with player ranking will give a stronger predictor of match outcome than ranking alone.

The goal of this project is to instruct the computer to predict the winner of a tennis match with greater accuracy than both random guessing and ranking comparison.
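The ranking-comparison baseline mentioned above can be sketched in a few lines. This is only a minimal illustration: the `Match` tuple layout, the function name `ranking_baseline_accuracy`, and the toy rows are assumptions made for the example, not real ATP results.

```python
from typing import List, Tuple

# Hypothetical layout: (rank of player 1, rank of player 2, did player 1 win?)
Match = Tuple[int, int, bool]


def ranking_baseline_accuracy(matches: List[Match]) -> float:
    """Fraction of matches won by the better-ranked (numerically lower) player."""
    if not matches:
        raise ValueError("need at least one match")
    correct = 0
    for rank1, rank2, player1_won in matches:
        predicted_player1_wins = rank1 < rank2  # rank #1 beats rank #2, etc.
        if predicted_player1_wins == player1_won:
            correct += 1
    return correct / len(matches)


# Made-up rows, for illustration only.
toy_matches = [
    (1, 5, True),     # favorite wins
    (10, 3, False),   # favorite wins
    (7, 20, False),   # upset: the better-ranked player loses
    (50, 2, False),   # favorite wins
]
baseline = ranking_baseline_accuracy(toy_matches)
print(f"ranking baseline accuracy: {baseline:.2f}")
```

A learned classifier would need to beat both this number and the 0.50 expected from random guessing on the same matches.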

 

Methods

 

This is inherently a classification problem, and a binary categorization of tennis matches fits it naturally. One class would consist of matches that the first player won, and the other would be its complement, i.e. matches that the first player lost. The feature set would include a large number of variables that factor into a tennis match, easily more than 20. Consequently, supervised statistical classification algorithms should serve well here. Likely candidates are:

1.    k-nearest neighbor (laziest, would simply classify by similarity)

2.    AdaBoost (probably ideal, combines weak classifiers to produce a strong classifier)

3.    relevance vector machines (probabilistic classification, would also assign odds, implement only if time allows)

4.    artificial neural networks (find overall connection) [2]
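As a rough illustration of the first candidate in the list, the sketch below classifies a matchup by majority vote among its k nearest training examples under Euclidean distance. The function name `knn_predict`, the two-feature vectors, and the toy data are all assumptions made for the example; a real feature set would be much larger and would likely need scaling.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(train_x: Sequence[Sequence[float]],
                train_y: Sequence[int],
                query: Sequence[float],
                k: int = 3) -> int:
    """Return the majority label among the k training points closest to query."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each training point's distance to the query with its label, then sort.
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up feature vectors: [rank difference, difference in some other statistic].
# Label 1 means the first player won, 0 means the first player lost.
toy_x: List[List[float]] = [[-10, 0], [-8, 1], [9, 0], [12, -1]]
toy_y = [1, 1, 0, 0]

print(knn_predict(toy_x, toy_y, [-7, 0]))  # query resembles the "first player won" examples
print(knn_predict(toy_x, toy_y, [10, 0]))  # query resembles the "first player lost" examples
```

Using an odd k with two classes avoids tied votes.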

The core of my work will be the successful application of one or more of these machine learning algorithms to the problem at hand. From there, I may implement another. Key to this project is evaluation: results should reveal how effective each implementation is at predicting tennis outcomes.

Some prior work has been done in this field, largely involving football, basketball, and baseball, sports that are more in the spotlight. The general classification approach is effectively the same, though. I will harvest n features from my data sets and then create an n-dimensional vector representing a matchup. Each dimension will represent either some difference between the two players in that feature or an individual player's feature (in which case the vector would have 2n dimensions, n per player). The vectors, together with their classifications, would then be fed into the appropriate algorithm. [3][4][5]
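The two matchup representations described above might look like the following sketch. The feature names in `FEATURES`, the helper names, and the player values are hypothetical placeholders; the actual features would depend on what the data sets provide.

```python
from typing import Dict, List

# Hypothetical per-player features; n = 3 here.
FEATURES = ["rank", "height_cm", "aces_per_match"]


def difference_vector(p1: Dict[str, float], p2: Dict[str, float]) -> List[float]:
    """n-dimensional matchup vector: player 1's value minus player 2's, per feature."""
    return [p1[f] - p2[f] for f in FEATURES]


def concatenated_vector(p1: Dict[str, float], p2: Dict[str, float]) -> List[float]:
    """2n-dimensional matchup vector: player 1's features followed by player 2's."""
    return [p1[f] for f in FEATURES] + [p2[f] for f in FEATURES]


# Made-up players, for illustration only.
player_a = {"rank": 3, "height_cm": 185, "aces_per_match": 7.5}
player_b = {"rank": 10, "height_cm": 190, "aces_per_match": 5.0}

diff = difference_vector(player_a, player_b)
concat = concatenated_vector(player_a, player_b)
print(diff)    # n values
print(concat)  # 2n values
```

One property of the difference form is that swapping the two players negates the vector, which mirrors flipping the label; the concatenated form keeps each player's absolute values at the cost of twice the dimensionality.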

 

Data Sets

 

I intend to use the following data sets:

1.    http://tennis.wettpoint.com/en/

2.    http://www.tennisexplorer.com/

3.    http://tennis.com/

All offer in-depth player detail as well as tournament and matchup history.

 

Timeline

 

4/12 - 5/08 (milestone): Successfully implement at least one of the learning algorithms

5/08 - 5/30 (final): Possibly implement another algorithm, demonstrate relative effectiveness of each, juxtapose with conventional methods of outcome prediction, compare with algorithm implemented in class

 

References

 

[1] https://en.wikipedia.org/wiki/Tennis

[2] https://en.wikipedia.org/wiki/Machine_learning_algorithms

[3] B. Hamadani. Predicting the outcome of NFL games using machine learning. Project report for Machine Learning, Stanford University.

[4] R.P. Adams, G.E. Dahl, and I. Murray. Incorporating Side Information in Probabilistic Matrix Factorization with Gaussian Processes. Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

[5] G. Donaker. Applying Machine Learning to MLB Prediction & Analysis. Project report for CS229, Stanford University, 2005.