Predicting Tennis Match Outcomes Through Classification
Shuyang Fang '14 sfang@cs.dartmouth.edu
Computer Science 074/174: Machine Learning and Statistical Data Analysis - 2012S
The governing body of men's professional tennis is the Association of Tennis Professionals, or ATP for short. It organizes the annual tournaments and other events that male professionals participate in. It is perhaps most notable for publishing a weekly ranking of players, the ATP rankings. A player's ranking is based on the total points he has accrued in a specific set of tournaments within the last 52 weeks; more points equate to a higher ranking (a standard ranking system, where #1 is the top rank, #2 is second, and so on). [1]
The question of who wins a given tennis match is of both recreational (fans) and professional (Vegas) interest. Conventional wisdom suggests that, given two tennis players, the one with the higher ranking will win more often than not. This may be true; however, tennis is a game that revolves around matchups. A multitude of factors play into a match: surface, play styles, and even the players themselves are just a few examples. Combining these factors with player ranking should therefore serve as a stronger predictor of match outcome than ranking alone.
The goal of this project is
to instruct the computer to predict the winner of a tennis match with greater
accuracy than both random guessing and ranking comparison.
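The two baselines named above can be made concrete with a short sketch. Everything here is illustrative: the match records are made-up values rather than real ATP data, and the names `predict_by_ranking` and `accuracy` are hypothetical helpers, not part of any existing library.

```python
import random

# Hypothetical match records: (rank_a, rank_b, a_won).
# A numerically lower rank means a better-ranked player (#1 is the top rank).
MATCHES = [
    (1, 15, True),
    (8, 3, False),
    (20, 45, False),  # upset: the worse-ranked player won
    (5, 2, True),     # upset: the worse-ranked player won
    (12, 30, True),
]

def predict_by_ranking(rank_a, rank_b):
    """Ranking comparison: predict that player A wins if A is better ranked."""
    return rank_a < rank_b

def accuracy(predictions, outcomes):
    """Fraction of predictions that agree with the recorded outcomes."""
    correct = sum(p == o for p, o in zip(predictions, outcomes))
    return correct / len(outcomes)

outcomes = [a_won for _, _, a_won in MATCHES]

# Baseline 1: always pick the better-ranked player.
ranking_preds = [predict_by_ranking(a, b) for a, b, _ in MATCHES]
ranking_acc = accuracy(ranking_preds, outcomes)

# Baseline 2: flip a fair coin for every match (seeded for repeatability).
rng = random.Random(0)
random_preds = [rng.random() < 0.5 for _ in MATCHES]
random_acc = accuracy(random_preds, outcomes)

print(f"ranking baseline: {ranking_acc:.2f}, random baseline: {random_acc:.2f}")
```

On this toy sample the ranking baseline gets three of the five matches right. A learned classifier would be judged by the same `accuracy` function on matches it did not see during training.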
This is inherently a classification problem, and a binary categorization of tennis matches fits it naturally. One class would consist of matches that the first player won, and the other would be its complement, i.e. matches that the first player lost. The feature set would include a large number of variables that factor into a tennis match, easily 20 or more. Consequently, supervised statistical classification algorithms should serve well here. Likely candidates are:
1. k-nearest neighbor (laziest, would simply classify by similarity)
2. AdaBoost (probably ideal, synthesizes weak classifiers to produce a strong classifier)
3. relevance vector machines (probabilistic classification, would also assign odds; implement only if time allows)
4. artificial neural networks (find overall connection) [2]
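As an illustration of the simplest candidate, k-nearest neighbor, the following is a minimal sketch using only the Python standard library. The two features (ranking difference and head-to-head win difference), the training values, and the function name `knn_predict` are assumptions made for this example, not the project's actual feature set or data.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    # Pair each training vector's Euclidean distance to the query with its label.
    scored = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    scored.sort(key=lambda pair: pair[0])
    # Count the labels of the k closest matchups and return the most common one.
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical matchup vectors: (rank_a - rank_b, head-to-head wins of A minus B).
# A negative ranking difference means the first player is better ranked.
train_X = [(-20, 2), (-10, 1), (-5, 3), (15, -1), (30, -2), (8, -3)]
# Label 1: the first player won. Label 0: the first player lost.
train_y = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_X, train_y, (-12, 1)))  # matchup resembling past wins
print(knn_predict(train_X, train_y, (20, -2)))  # matchup resembling past losses
```

With real data the features would likely need to be rescaled first, since Euclidean distance is otherwise dominated by whichever feature has the largest numeric range.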
The core of my work will be the successful application of one or more of these machine learning algorithms to the problem at hand. From there, I may implement another. Evaluation is key to this project: the results should reveal how effective each implementation is at predicting tennis outcomes.
Some prior work has been done in this field, largely involving football, basketball, and baseball, sports more in the spotlight. The general classification approach is effectively the same, though. I will harvest n features from my data sets and then create an n-dimensional vector representing a matchup. Each dimension will represent either the difference between the two players in that feature or an individual player's feature (in which case 2n dimensions would be more appropriate). The vectors, together with their classifications, would then be fed into the appropriate algorithm. [3][4][5]
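The two encodings described above, an n-dimensional difference vector and a 2n-dimensional vector of individual player features, might look like the following sketch. The feature names and player values are hypothetical placeholders; the real features would come from the data sets.

```python
# Hypothetical feature names; the real set would be harvested from the data.
FEATURES = ["rank", "age", "aces_per_match"]

def difference_vector(player_a, player_b, feature_names):
    """n-dimensional encoding: each entry is A's value minus B's value."""
    return [player_a[name] - player_b[name] for name in feature_names]

def concatenated_vector(player_a, player_b, feature_names):
    """2n-dimensional encoding: A's features followed by B's features."""
    return ([player_a[name] for name in feature_names]
            + [player_b[name] for name in feature_names])

# Made-up player records, for illustration only.
player_a = {"rank": 4, "age": 25, "aces_per_match": 7}
player_b = {"rank": 19, "age": 29, "aces_per_match": 11}

diff = difference_vector(player_a, player_b, FEATURES)
concat = concatenated_vector(player_a, player_b, FEATURES)
label = 1  # 1 if the first player won the match, 0 otherwise

print(diff)    # [-15, -4, -4]
print(concat)  # [4, 25, 7, 19, 29, 11]
```

One property of the difference encoding is that swapping the two players negates the vector and flips the label, so every recorded match can supply two consistent training examples. The concatenated encoding keeps each player's absolute values at the cost of twice the dimensionality.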
I intend to use the following data sets:
1. http://tennis.wettpoint.com/en/
2. http://www.tennisexplorer.com/
Both offer in-depth player detail as well as tournament and matchup history.
4/12 - 5/08 (milestone): Successfully implement at least one of the learning algorithms
5/08 - 5/30 (final): Possibly implement another algorithm, demonstrate the relative effectiveness of each, juxtapose with conventional methods of outcome prediction, compare with the algorithm implemented in class
[1] https://en.wikipedia.org/wiki/Tennis
[2] https://en.wikipedia.org/wiki/Machine_learning_algorithms
[3] B. Hamadani. Predicting the outcome of NFL games using machine learning. Project report for Machine Learning, Stanford University.
[4] R.P. Adams, G.E. Dahl, and I. Murray. Incorporating Side Information in Probabilistic Matrix Factorization with Gaussian Processes. Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.
[5] G. Donaker. Applying Machine Learning to MLB Prediction & Analysis. Project report for CS229 - Stanford Learning 2005.