Background

The problem domain that I have chosen for this project is enabling an Artificial Intelligence to play the game of Go, or any similar problem that requires an AI to search through a large search space to find the best solution. For this project, I would like to implement a version of Monte Carlo tree search so that I can create an AI that is able to efficiently solve a board state in the game of Go. I would also like to test my application against a simpler algorithm such as MiniMax, to compare how much more quickly it can solve a position, and against a stronger AI, to see the difference between using Monte Carlo tree search on its own and combining it with other algorithms that help the tree search find even better moves.

Go is a two-player board game in which the aim is to surround more territory than the opponent. The game of Go is usually played on a board that is 19x19 in size, but can also be played on smaller boards such as 9x9 or 13x13. On a 19x19 board there are approximately 250 moves available to either player on each turn. If a game continues for 150 turns (an approximate turn count), then there would be 250^150 possible move sequences. Given the number of moves that each player can make per turn, the search space quickly becomes too large to search through exhaustively. If we use a game tree to describe how a normal game of Go could develop, we would use the 250 possible moves as the branching factor (how many child nodes each current state has), and use the formula b^d, where b is the branching factor and d is the required depth (how many moves deep we are into the search), to find how many leaf nodes there are, i.e. how many different possible board states there could be.
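To get a feel for this growth, the short Python sketch below evaluates b^d for a few depths using the approximate figures above (250 moves per turn, a 150-move game); the function name is purely illustrative.

```python
# Rough illustration of game-tree growth using the b^d formula,
# with b = approximate moves per turn and d = depth in moves.

def leaf_nodes(b: int, d: int) -> int:
    """Number of leaf nodes in a game tree with branching factor b and depth d."""
    return b ** d

if __name__ == "__main__":
    b = 250  # approximate branching factor on a 19x19 board
    for d in (1, 2, 10, 150):
        n = leaf_nodes(b, d)
        print(f"depth {d}: {n:,} ({len(str(n))} digits)")
```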

Basics of Monte Carlo Tree Search

In 2006, Rémi Coulom described the application of the Monte Carlo method, which uses random sampling for deterministic problems that are difficult to solve with other approaches, to game-tree search, and coined the term Monte Carlo tree search. Before this, Go-playing programs used techniques more similar to those used in games such as chess, which are less suitable for the more complex game of Go. AlphaGo was then introduced, which uses two different neural networks and was able to beat the world Go champion Lee Sedol in 2016 in South Korea with a lead of 4 games to 1. It was first introduced in October 2015, when it played its first match against European champion Fan Hui, winning with a score of 5-0. The documentary covering this story can be found here. AlphaGo first learned how to play Go through supervised learning on amateur games, so that it knew how human players thought about the game. It was then retrained on games of self-play through reinforcement learning, which led to the creation of AlphaZero. The number of possible game sequences in Go increases exponentially as the game goes on: for example, on move 1 there are 250 possible moves, by move 2 there are 62,500 possible sequences, and by move 10 there are 953,674,316,406,250,000,000,000.
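The search itself repeats four phases: selection, expansion, simulation, and backpropagation. As a rough sketch of that loop (not the implementation used in this project; the Node class and the game-state interface with legal_moves(), play(), is_terminal(), winner(), and last_player are assumed placeholders rather than a real Go engine API), it might look like this:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state = state
        self.parent = parent
        self.move = move                      # move that led to this node
        self.children = []
        self.visits = 0
        self.wins = 0.0
        self.untried = list(state.legal_moves())

    def ucb1(self, c=1.41):
        # Upper Confidence Bound used during the selection phase.
        return (self.wins / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))


def mcts(root_state, iterations=1000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root

        # 1. Selection: descend while the current node is fully expanded.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ch.ucb1())

        # 2. Expansion: add one previously untried child, if any remain.
        if node.untried:
            move = node.untried.pop(random.randrange(len(node.untried)))
            node.children.append(Node(node.state.play(move), node, move))
            node = node.children[-1]

        # 3. Simulation: play random moves until the game ends.
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(list(state.legal_moves())))
        winner = state.winner()

        # 4. Backpropagation: update visit/win counts back up to the root.
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.state.last_player:
                node.wins += 1
            node = node.parent

    # The recommended move is the most visited child of the root.
    if not root.children:
        return None
    return max(root.children, key=lambda ch: ch.visits).move
```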

Q Learning

Q-learning is a model-free reinforcement learning algorithm for learning the value of an action in a particular state. Q-learning can identify an optimal action-selection policy for any given Markov decision process, given enough exploration time and a partly random policy. When Q-learning is performed, we create what is called a Q-table, a matrix of shape [state, action], and initialise its values to zero. We then update and store the Q-values after each episode, where an episode is one complete run through the environment. This table becomes a reference table for the agent to select the best action based on the Q-value. The next step is for the agent to interact with the environment and make updates to the state-action pairs in the Q-table, Q[state, action]. The agent interacts with the environment using a balance of exploration and exploitation, so that it keeps exploring the environment but also makes proper use of a good action once it finds one. The Q-table is updated after each action until a given episode is done. The final state for the game of Go is reached either when both players decide to pass or when there are no legal moves left to play; I have decided to use the no-legal-moves condition as a cutoff, along with a hard cutoff after a certain number of moves to simulate both players passing at the end of a game.
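As a minimal sketch of this loop (assuming a small, enumerable state space and a Gym-style environment interface with reset() and step(); alpha, gamma, and epsilon are the usual learning rate, discount factor, and exploration rate, and none of this is the actual Go environment used in this project), tabular Q-learning might look like the following:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-table of shape [state, action], initialised to zero.
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Exploration vs. exploitation (epsilon-greedy).
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = env.step(action)
            # Standard Q-learning update:
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            q[state, action] += alpha * (
                reward + gamma * np.max(q[next_state]) - q[state, action])
            state = next_state
    return q
```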

PPO

PPO (Proximal Policy Optimization) is a policy gradient method that works with either discrete or continuous action spaces. It trains a stochastic policy in an on-policy way. The idea behind PPO is to avoid making too large a policy update. To do that, a ratio is used that tells us the difference between the new policy and the old policy, and this ratio is usually clipped to be between 0.8 and 1.2. PPO also introduces a new way of training an agent by running K epochs of gradient descent over sampled mini-batches.
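As an illustration of the clipping idea (a sketch only: the array names and the NumPy formulation are assumptions, and epsilon=0.2 gives the 0.8 to 1.2 clipping range mentioned above; this is not a full PPO training loop), the clipped surrogate objective can be written as:

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    # Probability ratio between the new policy and the old policy.
    ratio = np.exp(new_log_probs - old_log_probs)
    # Clip the ratio so a single update cannot move the policy too far.
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # PPO maximises the minimum of the clipped and unclipped terms.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

In practice this objective is maximised over K epochs of mini-batch gradient updates on the same batch of collected rollouts, which is the training scheme described above.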