MILESTONE
Group Members: Peng Ding and Tao Mao
1. What We Have Achieved

We have successfully implemented an algorithm that trains the computer to play Tic-Tac-Toe against a human player. After training, the human cannot beat the computer as long as the computer moves first. Below is how the game-play looks in the console interface.
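As an illustration of that console interface, here is a minimal sketch of how the 3x3 board might be rendered as text. The print_board helper and the row-by-row 1-9 cell encoding are assumptions of this sketch, not the project's actual interface code.

    # Minimal sketch of a console board display (illustrative only, not the
    # project's actual interface code).  A board is a dict mapping squares
    # 1-9, numbered row by row, to 'X', 'O', or ' ' for an empty square.

    def print_board(board):
        rows = []
        for r in range(3):
            cells = [board[3 * r + c + 1] for c in range(3)]
            rows.append(" " + " | ".join(cells))
        print("\n---+---+---\n".join(rows))

    if __name__ == "__main__":
        board = {i: " " for i in range(1, 10)}
        board[5] = "X"               # e.g., the computer opens in the center
        print_board(board)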
Here is how the algorithm works to update the values of possible states based on reinforcement learning [1]. First, we set up a table of these values. Each value is the current estimate of the probability that a player wins the game from that particular state: the higher a state value is, the more likely that state leads to a final win for the player. For each action, a player chooses an available position using the state values as selection weights; those state values are learned during the preceding learning phases. The two players alternately choose an available position to fill in until one player wins the game (three "O"s or "X"s in a line) or the game reaches a draw. In our implementation, Player "O" denotes the human player and Player "X" denotes the computer agent; the advantages of this setting are explained below. For initialization, we set the value of each possible state V(s) to 0 and choose a fixed learning rate.

The algorithm's framework mainly consists of two parts: a learning phase and a game-play phase. Here is a brief description of the algorithm's structure:

1. The learning phase currently involves 100,000 learning episodes. In each episode, the two computer agents play a complete game against each other and the state values are updated from the outcome.

2. The game-play phase makes a "greedy" decision based on the learned state values: every time the computer makes its next move, it selects the available position with the highest state value, and may additionally keep updating the values online.*

*These online, after-move updates may be excluded from the game-play phase once the learned game strategies are considered solid.
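To make the two-phase structure above concrete, here is a compact Python sketch of the learning phase written from that description. It is only an illustration: the learning-rate and exploration values, the 0.5 value assigned to a draw, and all of the function names are assumptions of the sketch, not the project's actual implementation.

    import random
    from collections import defaultdict

    # Compact sketch of the learning phase (a simplified stand-in, not the
    # project's exact code).  A state is a 9-character string of 'X', 'O',
    # and ' ' cells, read row by row.  Every V(s) starts at 0; terminal
    # states are valued from Player X's point of view (win = 1, loss = 0,
    # draw assumed to be 0.5); and each self-play episode backs the outcome
    # up through the visited states with V(s) <- V(s) + alpha*(V(s') - V(s)).

    ALPHA = 0.1      # learning rate (its actual value is not stated above)
    EPSILON = 0.1    # assumed exploration rate during self-play
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

    def winner(state):
        """Return 'X' or 'O' if that player has three in a line, else None."""
        for a, b, c in LINES:
            if state[a] != " " and state[a] == state[b] == state[c]:
                return state[a]
        return None

    def moves(state):
        """Indices (0-8) of the empty squares."""
        return [i for i, cell in enumerate(state) if cell == " "]

    def play(state, pos, mark):
        """Return the new state after `mark` fills square `pos`."""
        return state[:pos] + mark + state[pos + 1:]

    def value(V, state):
        """Current estimate of X's winning chance from `state`."""
        w = winner(state)
        if w == "X":
            return 1.0
        if w == "O":
            return 0.0
        if not moves(state):   # draw; the 0.5 here is an assumption of the sketch
            return 0.5
        return V[state]

    def choose(V, state, mark, greedy=False):
        """Pick a square, usually the one whose afterstate value is best for `mark`."""
        if not greedy and random.random() < EPSILON:
            return random.choice(moves(state))   # occasional exploratory move
        best = max if mark == "X" else min       # X maximises V, O minimises it
        return best(moves(state), key=lambda p: value(V, play(state, p, mark)))

    def train(episodes=100_000, V=None):
        """Self-play learning phase: play `episodes` games, updating V along the way."""
        if V is None:
            V = defaultdict(float)               # V(s) initialised to 0
        for _ in range(episodes):
            state, mark = " " * 9, "X"           # Player X (the computer) moves first
            visited = [state]
            while winner(state) is None and moves(state):
                state = play(state, choose(V, state, mark), mark)
                visited.append(state)
                mark = "O" if mark == "X" else "X"
            # Back the final outcome up through the visited states.
            for s, s_next in zip(reversed(visited[:-1]), reversed(visited)):
                V[s] += ALPHA * (value(V, s_next) - V[s])
        return V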
2. Convergence of "Afterstate" Value

Player "X" represents the computer and Player "O" represents the human; Player "X" goes first. We will show the convergence of "afterstate" values using the following example. Which position is chosen as the opening move is crucial in Tic-Tac-Toe: choosing the central position gives a "no-loss" guarantee, which is easily verified. Therefore, the computer should be able to discover the high value of opening in the center. Below is the figure showing the convergence of the values of the nine opening positions, including the center (Figure 1).

Assume that the computer chooses the center as its opening move and the game has reached state S1, as shown in Figure 2. Now it is again Player "X"'s turn. Obviously, the chance of winning the game is greater if Player "X" takes position 1, 3, 7, or 9 (Figure 2).
We assume that the computer has been trained enough to choose the action/position leading to a higher state value; accordingly, Player "X" now chooses position 1, as shown in Figure 3.
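Reusing train() and choose() from the learning-phase sketch above, the greedy game-play decision at a state such as S1 could be made as follows. Note that the code indexes squares 0-8 while the figures number them 1-9, and the human's reply square used below is an assumption for illustration only.

    # Greedy game-play choice from a mid-game state, reusing the sketch above.
    V = train(episodes=100_000)

    s1 = list(" " * 9)
    s1[4] = "X"              # the computer's center opening (square 5 in the figures)
    s1[7] = "O"              # an assumed reply by the human (square 8 in the figures)
    s1 = "".join(s1)

    best = choose(V, s1, "X", greedy=True)
    print("greedy move for X:", best + 1)    # converted back to the 1-9 numbering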
The plot below shows the afterstate value reached from state S1 (i.e., state S2) versus learning time (Figure 3). The value converges to approximately one, which shows that, after enough learning, Player "X" knows that winning from state S2 is almost guaranteed.
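A convergence curve of this kind could be produced by training in chunks and recording the tracked afterstate's value after each chunk, as in the following sketch. It reuses train() and play() from the earlier sketch; the chunk size and the assumed human reply square are illustrative choices, not the project's plotting code.

    import matplotlib.pyplot as plt
    from collections import defaultdict

    # Track the learned value of afterstate S2 while training continues.
    V = defaultdict(float)
    s1 = play(play(" " * 9, 4, "X"), 7, "O")   # S1: center opening plus an assumed human reply
    s2 = play(s1, 0, "X")                      # S2: Player X also takes square 1

    episodes_seen, values = [], []
    for chunk in range(200):                   # 200 chunks x 500 episodes = 100,000
        V = train(episodes=500, V=V)
        episodes_seen.append((chunk + 1) * 500)
        values.append(V[s2])

    plt.plot(episodes_seen, values)
    plt.xlabel("learning episodes")
    plt.ylabel("estimated value of afterstate S2")
    plt.show()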
Then Player "O" chooses position 9 to avoid losing the game (Figure 4). In the next step, a well-trained Player "X" (the computer) will choose position 3 (Figure 5).
Up to this point, a well-trained (after sufficient learning time) computer, Player "X", will win the game no matter which position Player "O" takes. More specifically, if Player "O" chooses position 2, Player "X" will take position 7 to win the game; if Player "O" chooses position 7, Player "X" will take position 2 to win the game; and if Player "O" chooses position 4 or position 8, Player "X" will win the game by taking either position 2 or position 7.
Figure 6 shows how the afterstate value of S3 converges over time. We notice that the value of this crucial state does not quickly converge to its true value (i.e., one) because of insufficient learning time. However, this does not prevent the algorithm from making the right decision during game-play, because the value of the crucial state stands out compared with those of the others.

3. Self-Play Learning Phase

As the project proposal states, the "self-play training method" has two computer agents play against each other and learn game strategies from the simulated games. This training method has several advantages; for example, the agent learns general strategies rather than strategies tailored to a fixed opponent. Self-play training can have a slow convergence rate of state values in some algorithms, especially in the early learning stage [2]. However, it works well in our algorithm because the scale of the problem is not too large. In other words, the computer player gains rich learning experience and updates its state values from a wide variety of successes and failures, even though we initialize the value of every possible state to zero. On the other hand, training the computer agent ourselves through human-computer interaction would take far longer, so we avoid doing so.
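One way to check that the self-play-trained values are solid, in line with the claim above, is to play the greedy Player "X" against a purely random opponent and count the outcomes. This sketch reuses the functions from the earlier learning-phase sketch; the random opponent and the game count are illustrative choices, not part of the project's code.

    import random

    def evaluate(V, games=1_000):
        """Play greedy X (the computer, moving first) against a random O."""
        results = {"X": 0, "O": 0, "draw": 0}
        for _ in range(games):
            state, mark = " " * 9, "X"
            while winner(state) is None and moves(state):
                if mark == "X":
                    pos = choose(V, state, "X", greedy=True)
                else:
                    pos = random.choice(moves(state))    # stand-in for an untrained human
                state = play(state, pos, mark)
                mark = "O" if mark == "X" else "X"
            results[winner(state) or "draw"] += 1
        return results

    V = train(episodes=100_000)
    print(evaluate(V))    # after enough self-play, "O" wins should be zero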
References

[1] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998, pp. 10-15, 156.