Learning Strategies for the Board Game Risk
Christian Pitera
Introduction:
For this project, I planned to create an application that calculates heuristics for game states in the board game Risk. The end goal is an application that can effectively learn strategies for the game. Very little work has gone into artificial intelligence for Risk, and the work that has been done has relied on human-made strategies [1] or human/machine-learned hybrids [2]. No work has gone into allowing the computer to learn heuristics and moves entirely on its own.
Methods:
This project required a form of reinforcement learning. Since there is no way to directly measure the utility of every move taken in the game, the project uses temporal difference learning to estimate the utility of each board state [3]. Another problem is that the state space of Risk is intractably large. Because the game has a theoretically unbounded number of states, evaluating each state as a whole is infeasible. Instead, I split each state into many different parameters, and the program learns a function that maps those parameters to the value estimated by the temporal difference method. Because the game is so complex and the parameters interact with one another, ordinary regression methods did not seem very useful. Instead, a neural network is used to discover the connections between the parameters and to map them to the utility of the game state. Move selection at every decision point is greedy, i.e., a one-ply search over the heuristic.
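As an illustration of this setup, here is a minimal sketch, not the project's actual code: a small one-hidden-layer network maps a vector of hand-picked board parameters to an estimated utility, a TD(0) backup pulls that estimate toward the value of the next sampled state, and moves are chosen greedily with a one-ply search. The game-engine hooks (extract_features, legal_moves, apply_move) and the network size are hypothetical placeholders.

    # Minimal sketch of the value network, TD backup, and greedy one-ply search.
    # extract_features, legal_moves, and apply_move are hypothetical hooks into
    # the game engine; they are not part of the project's actual code.
    import numpy as np

    class ValueNetwork:
        """One-hidden-layer network mapping board parameters to a utility estimate."""
        def __init__(self, n_features, n_hidden=16, lr=0.01, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_features))
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.normal(0.0, 0.1, n_hidden)
            self.b2 = 0.0
            self.lr = lr

        def value(self, x):
            h = np.tanh(self.W1 @ x + self.b1)
            return float(self.W2 @ h + self.b2)

        def td_update(self, x, target):
            """One gradient step pulling V(x) toward a TD target."""
            h = np.tanh(self.W1 @ x + self.b1)
            v = self.W2 @ h + self.b2
            err = v - target                     # derivative of 0.5 * err^2 w.r.t. v
            dh = err * self.W2 * (1.0 - h ** 2)  # backpropagate through tanh
            self.W2 -= self.lr * err * h
            self.b2 -= self.lr * err
            self.W1 -= self.lr * np.outer(dh, x)
            self.b1 -= self.lr * dh

    def td_step(net, x_t, x_next, reward=0.0, gamma=1.0, terminal=False):
        """TD(0) backup: pull V(x_t) toward reward + gamma * V(x_next)."""
        target = reward if terminal else reward + gamma * net.value(x_next)
        net.td_update(x_t, target)

    def greedy_move(state, net, extract_features, legal_moves, apply_move):
        """One-ply search: score the state reached by each legal move, pick the best."""
        return max(legal_moves(state),
                   key=lambda m: net.value(extract_features(apply_move(state, m))))

The single hidden layer here is only for illustration; the description above does not commit to a particular network architecture.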
Data Gathering:
The data for this project was generated through self-play. However, since a computer randomly playing itself can lead to games that seemingly never end, the first few games were played by a human against the computer. I ran into a snag at this point, as not even a basic strategy could be learned from only a few human-played games. As such, I allowed self-play but put a limit on the total number of turns that could be taken. After choosing a winner based on the number of territories owned, the algorithm applies the learning rule normally. The next question was which states should be used as data for the neural network. Using only the state at the end of each turn is not very effective, because it is not a representative state: an algorithm learning from such states, for example, would have no evidence that the reinforcement phase had even occurred, and would not make proper decisions when faced with that phase. The easy solution is to take a state at the end of each phase. One possible improvement would be to train three different networks, one for each of the three major phases of the game, in case their strategies are not comparable. Currently, each player in the game has their turns saved separately, though they are analyzed all together. As for standardization of the data, a very simple method was used: values are scaled down so that they all fit within the same range, allowing faster convergence for the network. The scaling constants are somewhat arbitrary, but should not affect the output in any way.
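The following sketch shows this data-gathering loop under the same assumptions as the Methods sketch. The game-engine hooks (new_game, play_phase, raw_features, territories_owned), the turn cap, and the scaling constants are illustrative placeholders, and the 1/0 win-loss reward in learn_from_game is an assumed scheme that the description above does not spell out.

    # Sketch of self-play data gathering: a turn cap so games always end, a
    # sample of the board at the end of each phase, per-player sample lists,
    # and simple scaling of the parameters into a common range.
    import numpy as np

    PHASES = ["reinforcement", "attack", "fortification"]
    MAX_TURNS = 200                                # cap so random self-play terminates
    FEATURE_SCALES = np.array([42.0, 300.0, 6.0])  # e.g. territories, armies, continents

    def scale(raw):
        """Scale raw parameters into a comparable range for faster convergence."""
        return np.asarray(raw, dtype=float) / FEATURE_SCALES

    def self_play_game(players):
        game = new_game(players)
        samples = {p: [] for p in players}   # each player's turns saved separately
        for _ in range(MAX_TURNS):
            for player in players:
                for phase in PHASES:
                    play_phase(game, player, phase)
                    samples[player].append(scale(raw_features(game, player)))
                if game.is_over():
                    return samples, game.winner()
        # Turn cap reached: award the win to whoever owns the most territories.
        winner = max(players, key=lambda p: territories_owned(game, p))
        return samples, winner

    def learn_from_game(net, samples, winner):
        """Apply the learning rule (td_step from the Methods sketch) to every
        sampled state, treating the final outcome as a 1/0 terminal reward."""
        for player, states in samples.items():
            for x_t, x_next in zip(states, states[1:]):
                td_step(net, x_t, x_next)
            td_step(net, states[-1], None,
                    reward=1.0 if player == winner else 0.0, terminal=True)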
Problems:
The strategies generated through the heuristic are still very basic. After nearly 100 games of self-play, the algorithm produces strategies equivalent to those of a novice player. The reason, I feel, is that the strategies evolve too slowly. Self-play takes a notoriously long time to produce good strategies because, for the first few games, the algorithm is essentially playing randomly, so even basic concepts take a while to emerge [4]. It may be better to first train the algorithm against programs that already have working strategies, so that the "random" stage is essentially skipped; afterwards, reinforcement self-play can be used to fine-tune the strategies. As briefly mentioned in the data-gathering section, choosing when to sample a state is very important. If I sampled after every change in the game state, rather than at the end of every phase, I would get a much better picture of how games evolve, and this might even make multiple neural networks unnecessary. This creates its own problem, however: it means far more data per game, and the algorithm is slow. For reinforcement learning through self-play, the heuristic must evolve after every match or there is no improvement, but training a neural network on 100,000+ states after every game greatly slows the self-play process. One way to handle this may be to find another learning method, but for now I am sticking with the one I have.
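One possible way to keep the finer-grained sampling manageable, sketched below rather than implemented in the project, is to record a sample after every change to the game state but keep only every k-th one. With no intermediate reward and an undiscounted backup, thinning the chain this way still points each TD target at the final outcome, while keeping the per-game training cost bounded.

    class ChangeSampler:
        """Hypothetical helper: records board features after every change to the
        game state, but keeps only every k-th sample so per-game training stays
        cheap while the temporal ordering needed for the TD backup is preserved."""
        def __init__(self, every_k=10):
            self.every_k = every_k
            self.samples = []
            self._count = 0

        def record(self, features):
            self._count += 1
            if self._count % self.every_k == 0:
                self.samples.append(features)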
References:
1. http://www.unimaas.nl/games/files/bsc/Hahn_Bsc-paper.pdf
2. http://www.cs.cornell.edu/boom/2001sp/choi/473repo.html
3. http://www.scholarpedia.org/article/Temporal_Difference_Learning