How AlphaZero Works

Recently I posted about the phenomenal performance of the AlphaZero algorithm in computer chess. For the first time in history, an algorithm displayed human-like understanding of chess. AlphaZero seemed to understand what moves were best and spent its time focusing only on them. It didn’t mechanically crunch through millions of possible positions, run out of time, and then select the best move. The best moves emerged from its computer neural network, like a human grandmaster. How? Here’s the paper.

Here’s the TL;DR explanation

  • Input. AlphaZero receives the board position and the known rules of chess – that’s it.
  • Neural network. AlphaZero’s neural network generates a list of pretty good looking moves. THIS IS BLACK MAGIC.
  • Search. AlphaZero then spends most of its time searching pretty much at random through the deep combinations that begin with each of those “good looking” moves.
    • The random search is influenced by trying to get good coverage of all the suggested moves and focusing on moves that are starting to look good.
  • Output. When AlphaZero runs out of time it picks the best looking move.

So This is Crazy

It’s hard to overstate AlphaZero’s performance. It was given just the rules of chess and nine hours to play itself 44 million games, and then it learned something so deep about chess that it crushed the world champion computer chess program, Stockfish, 155 games to 6. (They played 1,000 total games; at this level most games are draws.)

Now first, you might say, wow 44 million games is a lot of chess. And it is. I wonder how good I might be at chess if I was able to play (and somehow remember) that many games. But it’s worth noting the enormous complexity of the game of chess. In the first two moves (white moves then black moves), there are 400 possible board positions. In the second two moves, there are ~200k possible board positions. In the third two moves, there are ~121 million possible board positions. The descent into insanity accelerates from there.

There are more possible games of chess than there are atoms in the universe multiplied by grains of sand on all the world’s beaches. There are so many possible games of chess that we don’t know how many games of chess there are: it’s literally too hard to compute. There are so many games of chess that the next time you play a game, it is quite probably the only time anyone in history has ever played that game and the only time in history anyone will ever play that game.

So, yes, 44 million games is a lot. But it’s also nothing. Like really, you’re not even scratching the surface.

AlphaZero Played Smarter

AlphaZero learned something deep about chess in those 44 million games, and that knowledge, encoded within its neural network, allowed it to play very, very efficiently. AlphaZero did not out-compute Stockfish, at least not in a traditional sense. In fact, AlphaZero evaluated fewer positions per second than Stockfish, crunching through around 60,000 positions per second compared to 60 million per second for Stockfish. Somehow, AlphaZero was focusing on moves that were so much better than Stockfish that it could crush the world champion computer program while operating on 1/1000 of the raw horsepower.

And here’s the crazy part: I’m burying the lede. AlphaZero shattered our understanding of the upper limits of chess performance with just nine hours of self-play. The same algorithm crushed world champion programs and people in the games of Shogi and Go. It is a generalized, intelligent game playing beast the likes of which we have never seen.

The Rough Details of How AlphaZero Works

Essentially AlphaZero has two components: (1) a neural network trained by playing itself chess; and (2) a Monte Carlo Tree Search algorithm to explore the moves suggested by the neural network. You can think of AlphaZero as a deep learning neural network that does an amazing job pruning the search space to focus only on the moves that matter.

What does AlphaZero take as input? AlphaZero takes the board position as input – that’s it. It understands the basic rules of chess, but does not use any domain-specific features or heuristics (e.g., material point values, pawn structure, king safety, etc.).

The input data representation for AlphaZero’s neural network is a set of planes representing the position of each type of piece (separate for each player), duplicated for a history of eight moves. So that’s 8 historical moves x 2 players x 6 kinds of pieces. In addition, a set of constant-valued planes denoting the player’s color, move number, and the state of special rules (e.g., legality of castling) is provided.

For chess playing, that’s 119 total planes provided in an image stack. Below is a table summary from the paper. Note that the 119 total planes is broken down as (6 planes for each of player one’s piece types + 6 planes for each of player two’s piece types + 2 repetition counts) * 8 historical moves + the 7 constant-valued planes = 119.

What does AlphaZero’s neural network output? AlphaZero’s neural network generates (1) a list of possible moves along with the probability that each of those moves will be made in this position; and (2) a simple win-lose-or-draw estimate of the outcome of the game starting from the given board position.

The output data representation is a bit unusual, at least from a human point of view. The list of possible “good moves” is provided in the form of a stack of 73 planes of size 8 x 8 that represent a probability distribution over 4,672 possible moves. These are “possible moves” simply because 8 x 8 x 73 = 4,672, but obviously most of these moves will be illegal under the rules of chess. Illegal move probabilities are set to 0%.

The first 56 planes encode “queen moves,” which are not necessarily moves by a queen piece; they are moves that a queen could make (horizontal, vertical, diagonal). The next 8 planes encode knight moves, and the final 9 planes encode underpromotions such as a pawn being promoted to a rook, bishop, or knight.

The paper doesn’t say expressly, but the “move probabilities” are likely determined simply by the magnitude of the network response thrown into a softmax function. The softmax function is just a way of converting a bunch of values into probabilities. For example, if a move is identified with magnitude 4 and two others are given magnitude 2’s (4, 2, 2), the move probabilities will be assigned such that the 4 is most probable and the 2’s are not, and they all sum to 100%. In this example, the softmax would be 78%, 11%, 11%.

How does AlphaZero choose a move from the output of the neural network? AlphaZero evaluates the move candidates from its neural network roughly in order of probability to find the best possible move from the given position. Each of these candidates may have position trees that extend into billions or trillions of subsequent possible combinations, and AlphaZero doesn’t have time to exhaustively evaluate all possible end results. So it uses a “Monte Carlo” search, which is a fancy way of saying that it searches more or less at random while trying to get good coverage of all the possibilities and then focusing on moves that are starting to look good.

This “sort-of-random” Monte Carlo search is unusual for chess-playing algorithms. Traditionally, computers have used an “alpha-beta” search, which focuses on the best-looking moves. Monte Carlo search strategies have faired poorly, as you might expect, because they spend precious time exploring moves that already seem like bad moves.

So why does the combination of AlphaZero’s neural network and a Monte Carlo search strategy work so well? The researchers have a theory:

AlphaZero evaluates positions non-linearly using deep neural networks, rather than the linear evaluation function used in typical chess programs. This provides a more powerful evaluation function, but may also introduce larger worst-case generalization errors. When combined with alpha-beta search, which computes an explicit minimax, the biggest errors are typically propagated directly to the root of the subtree. By contrast, AlphaZero’s MCTS [Monte Carlo Tree Search] averages over the position evaluations within a subtree, rather than computing the minimax evaluation of that subtree. We speculate that the approximation errors introduced by neural networks therefore tend to cancel out when evaluating a large subtree.

Preprint at 16.

In other words, the neural network may make big mistakes when evaluating some positions. If the searching function focuses on exactly what the neural network is saying, it could end up focusing on a mistake. If the searching function instead takes all the suggestions with a grain of salt and keeps its options open, the mistakes made by the neural network tend to cancel themselves out. This is akin to saying, “I’m going to point you in the general direction, but this is a gut feeling. You’re going to want to check the specifics.”

Sound familiar? To me this sounds like the grandmaster who immediately considers three possible moves in a position and spends the rest of his time double-checking the tactics of making these moves.

How does AlphaZero train its neural network? AlphaZero’s neural network starts with randomized parameters and then learns new parameters by playing games with itself. It plays each game by outputting a list of moves from its neural network (randomly at first) and then using the Monte Carlo search to try to find the best move. It plays the game all the way to completion and then takes the result (win, loss, or draw) and compares that against the predictions it had made during the game. The neural network parameters are updated to minimize the errors it made during play. Neural network parameters are updated by backpropagation.

AlphaZero played itself 44 million times in 9 hours and then crushed anything that had ever come before it. It learned something deep about the game of chess, and it did so without leveraging any human insight.

But AlphaZero can’t tell us what it learned any more than a grandmaster can tell us why a move feels right. That learning is embedded in the millions and millions of parameters of AlphaZero’s neural network. We can only stare at those numbers and hazard guesses about what they mean. How normal will that become? Are we on our way to an “incomprehensible future“?