Understanding AlphaZero: A Basic Chess Neural Network
Initially, AlphaZero was something of a mystery to me. Like everyone else, I knew it made use of a neural network, but to me that didn't mean much. Add in the fact that it learned from self-play and you might as well have called it magic.
But since it was humans that designed AlphaZero, we shouldn't despair of trying to understand it. This post won't be about the full AlphaZero design (that's a lot to handle at once); instead we'll start with the simplest of neural-network designs.
There are two core questions a chess engine tries to answer:
- How good is this position (an evaluation)?
- What is the best move in this position?
This just means that a neural network is given a chess position and is designed to output a move and an evaluation. Math-folk might recognize that what we require is a function whose domain is chess positions and whose range is moves and evaluations.
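In code terms, a minimal sketch of that function's shape might look like this (just the signature; the name `chess_net` and everything else here are placeholders we'll fill in as the post goes on):

```python
import numpy as np

def chess_net(position: np.ndarray) -> tuple[np.ndarray, float]:
    """Domain: an encoded chess position. Range: per-move scores plus one evaluation."""
    raise NotImplementedError  # the body is what the rest of this post builds up
```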
Obviously, a physical chessboard can't be used in an engine. For that reason we represent positions as bitboards, which I explained in a previous post. Those are what goes 'in' to the network.
In yet another post (gosh, if you haven't been following my blog this could get confusing!) I calculated that there are 1882 possible moves in chess, depending on the counting method. So our neural network should give 1883 outputs: one for each move, plus one for an evaluation.
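Concretely, you can think of the network's output as one flat vector of 1883 numbers that we slice apart afterwards. Here's a sketch (the ordering of the move slots is whatever scheme you used when counting the 1882 moves):

```python
import numpy as np

NUM_MOVES = 1882              # from the move-counting post
NUM_OUTPUTS = NUM_MOVES + 1   # plus one slot for the evaluation

raw = np.zeros(NUM_OUTPUTS)   # stand-in for what the network will eventually produce
move_scores = raw[:NUM_MOVES] # one score per possible move
evaluation = raw[NUM_MOVES]   # the final slot is the position evaluation
```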
I must emphasize: what follows is not how AlphaZero, Lc0, or any other chess network is actually designed. This is just the super simple, probably-not-very-effective way one might do it.
And now, the pictures. The first one is just the overall structure, with no information in it yet:
The "Input Layer", is, of course, where the position is inputted into the network. Since we're using bitboards you can put them in as 64x12 stacks (8x8 squares, times 12 pieces) or as a long string of 1s and 0s (12*64=764 inputs) The second way is how I do it here. So let's put those in for a random position!
Moving downward from the inputs, we next encounter rows of purple pentagons called Hidden Layers, because they are neither inputs nor outputs and don't have much chess-specific meaning. Each purple pentagon starts out without a value, but will be assigned one as we evaluate the network.
This step, calculating the pentagon values, is where things get calculation-heavy ("math-tastic"). We work down the network, calculating each hidden layer in order. To do this we need network weights, which we obtain by training our network. (That's another topic, though, so for now I'll just assume we have them.) Often those weights are pictured as lines of various thickness, so that's what I'll do too.
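Since training is out of scope here, a sketch can simply fake the weights with random numbers; the hidden-layer size below is made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

INPUT_SIZE = 768    # the flattened bitboards
HIDDEN_SIZE = 64    # made-up size for this toy network
OUTPUT_SIZE = 1883  # 1882 move scores + 1 evaluation

# Row i of each matrix holds the weights feeding node i of the next layer.
W1 = rng.standard_normal((HIDDEN_SIZE, INPUT_SIZE))   # inputs -> Hidden Layer 1
W2 = rng.standard_normal((HIDDEN_SIZE, HIDDEN_SIZE))  # Hidden Layer 1 -> Hidden Layer 2
W3 = rng.standard_normal((OUTPUT_SIZE, HIDDEN_SIZE))  # Hidden Layer 2 -> outputs
```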
[For those curious, the calculation is traditionally a scalar product of the values in the layer above and the relevant weights:

$$v_{x,n} = \sum_{i=1}^{m} w_{i \to x,\,n-1}\, v_{i,\,n-1}$$

where $v_{x,n}$ is the value assigned to the $x$th node on layer $n$, $m$ is the number of nodes on the previous layer, and $w_{i \to x,\,n-1}$ is the weight to node $x$ from node $i$ on layer $n-1$.]
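In code, that scalar product for a single pentagon is just a loop, or equivalently one `np.dot`. A sketch, where `prev_values` is the layer above and `weights_to_x` is the set of weights feeding node x:

```python
def node_value(prev_values, weights_to_x):
    """One pentagon's value: scalar product of the layer above with its weights."""
    total = 0.0
    for v, w in zip(prev_values, weights_to_x):
        total += w * v
    return total  # same as np.dot(weights_to_x, prev_values)
```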
Having one value (shown by coloring a pentagon gold) in the first hidden layer is nice, but useless on its own. We need to fill in the rest of Hidden Layer 1! We can do this in any order, but each time we do it we need to use a different set of weights (represented by a new set of line thicknesses). Here's me finishing Hidden Layer 1:
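Repeating that once per pentagon, each with its own row of weights, is exactly a matrix-vector product, so all of Hidden Layer 1 can be filled in at once (reusing the sketches above; here `bitboards` stands for the 12 integers describing our random position):

```python
x = bitboards_to_input(bitboards)  # the 768 input values for our random position
hidden1 = W1 @ x                   # every Hidden Layer 1 value in one step
```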
Now that all the Hidden Layer 1 pentagons are gold, we can start working on Hidden Layer 2! Instead of using the input layer again, we're going to use our new Hidden Layer 1 values (along with more weights) to calculate a single value in Hidden Layer 2:
There's not really anything special about the next picture: we just need to continue the process and calculate values for each location in the output layer. But since these next values have chess meaning (unlike the hidden layer ones), I'm going to finish by giving example values rather than just coloring them gold:
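Stringing the layers together, evaluating the whole network is just a few matrix-vector products in a row. A sketch using the toy weights from above:

```python
def forward(x):
    """Run the toy network: 768 inputs -> 1883 raw outputs."""
    hidden1 = W1 @ x         # Hidden Layer 1
    hidden2 = W2 @ hidden1   # Hidden Layer 2
    out = W3 @ hidden2       # output layer
    return out[:NUM_MOVES], out[NUM_MOVES]  # (move scores, evaluation)
```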
Success! We have a position evaluation (white will win 80% of the time) and a ranking of move strengths (higher is better here). Of the moves shown, the network's first choice is a1c3, then a1b2, and it thinks h7h8N is terrible. The funny part is that the network outputs values for moves even when they're illegal. Your engine must separately determine which of these moves is actually possible to play, as sketched below.
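Here's one way that legality filter could look, using the python-chess library for move generation; `move_to_index` is a hypothetical lookup from a move's UCI string (like "a1c3") to its slot in the output vector:

```python
import chess

def best_legal_move(board, move_scores, move_to_index):
    """Return the legal move with the highest network score."""
    return max(board.legal_moves,
               key=lambda m: move_scores[move_to_index[m.uci()]])
```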
What are the differences between the example I've given and the AlphaZero/Lc0 networks? There are a large number, but these might be the most important:
- They are much (much!) bigger.
- They are 3-dimensional rather than flat.
- They include position history as part of the input, and use residual connections in their structure.
- They have two 'heads' (policy and value) as part of their large-scale structure.
- They are used for tree search, rather than just evaluation.
- The math is more complicated: they have biases, and apply nonlinear functions like tanh between layers (see the sketch below for why that matters).
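As a taste of that last point, here's the toy forward pass again with biases and tanh added. The bias vectors would come out of training just like the weights, and the tanh matters because without a nonlinearity, stacked layers collapse into a single linear map:

```python
import numpy as np

def forward_with_bias(x, W1, b1, W2, b2, W3, b3):
    """Toy forward pass with biases and tanh nonlinearities between layers."""
    hidden1 = np.tanh(W1 @ x + b1)
    hidden2 = np.tanh(W2 @ hidden1 + b2)
    return W3 @ hidden2 + b3  # raw outputs; real networks squash these further too
```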
And... that's kind of it! Hopefully this all made sense. Someday I'll make a post about why this method actually works. Make sure to follow my blog, and let me know what you think with a comment!