11.2 Advanced Monte Carlo Tree Search
Motivation
Problem
We need an algorithm that leverages neural networks to make MCTS action selection, expansion, and value updating more efficient.
Solution
AlphaGo (Google DeepMind, 2016)
MuZero (Google DeepMind, 2020)
Advanced MCTS: Illustration
- Policy Network: Guides expansion by prioritizing actions with high probabilities, reducing the search space:
\[ P(a|s) \propto \pi(a|s; \theta) \]
- Value Network: Replaces random rollouts with a learned estimate of the value function:
\[ Q(s,a) \approx V(s; \theta) \]
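The two roles above can be made concrete with a small sketch. The code below is illustrative only: `policy_net`, `value_net`, and `next_state_fn` are hypothetical callables standing in for trained networks and the environment model, and the `Node` class is a minimal bookkeeping structure, not the actual AlphaGo implementation.

```python
# Minimal sketch: the policy network supplies prior probabilities that bias
# which children get explored, and the value network replaces a random rollout
# with a single learned evaluation of the leaf state.
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object
    prior: float = 1.0                      # P(a|s) from the policy network
    visit_count: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)

    def q_value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def expand(node, legal_actions, policy_net, next_state_fn):
    """Create children weighted by the policy network instead of uniformly."""
    priors = policy_net(node.state)          # dict: action -> pi(a|s; theta)
    total = sum(priors[a] for a in legal_actions) or 1.0
    for a in legal_actions:
        node.children[a] = Node(state=next_state_fn(node.state, a),
                                prior=priors[a] / total)

def evaluate(node, value_net) -> float:
    """Learned leaf evaluation V(s; theta) replaces a random rollout."""
    return value_net(node.state)
```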
Advanced MCTS: Steps
Selecting:
\[ A_t = \arg\max_a \left[ Q(s, a) + C \cdot \pi(a | s; \theta) \frac{\sqrt{\sum_b N(s, b)}}{1 + N(s, a)} \right] \]
Expanding:
If the selected node has unvisited children, expand it by adding them to the tree.
Simulating:
\[ V(s) = (1 - \lambda)\, V(s; \theta) + \lambda R \]
Updating:
\[ N(s, a) = N(s, a) + 1 \]
\[ Q(s, a) = Q(s, a) + \frac{1}{N(s, a)} \left( V(s; \theta) - Q(s, a) \right) \]
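The four steps fit together as in the sketch below, which reuses the `Node` class from the earlier sketch. The constant `C`, the mixing weight `lam`, and `rollout_fn` are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Sketch of one AlphaGo-style search iteration (selection, leaf evaluation,
# and backup) using the Node class defined above.
import math

def select_child(node, c: float = 1.5):
    """PUCT rule: Q(s,a) + C * pi(a|s) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    total_visits = sum(child.visit_count for child in node.children.values())
    def score(item):
        _action, child = item
        u = c * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.q_value() + u
    return max(node.children.items(), key=score)

def evaluate_leaf(node, value_net, rollout_fn, lam: float = 0.5) -> float:
    """Mix the value-network estimate with a rollout return R, as in AlphaGo."""
    return (1 - lam) * value_net(node.state) + lam * rollout_fn(node.state)

def backup(path, leaf_value: float):
    """Update N(s,a) and the running mean Q(s,a) along the visited path."""
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += leaf_value      # Q(s,a) = value_sum / visit_count
```

Note that incrementing the visit count and adding the leaf value to a running sum is exactly the incremental-mean update \( Q \leftarrow Q + \frac{1}{N}(V - Q) \) written above.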
Advanced MCTS: Results
Results of the 5-game match against Fan Hui (Elo in 2016: 3036).
AlphaGo won all 5 games.
MuZero MCTS with Neural Networks
- Selecting:
\[ A_t = \arg\max_a \left[ Q(s,a) + C \frac{\pi(a|s)}{1 + N(s,a)} \right] \]
The Prediction Network \(f(s)\) provides the prior \(\pi\) and the value estimate used to initialize \(Q\)
- Expanding:
Use the Representation Network \(h(o)\) to create a latent state from the observation
The Prediction Network \(f(s)\) generates the node's initial policy and value
- Simulating:
The Dynamics Network \(g(s,a)\) predicts the next latent state and the reward
- Updating:
\[ N(s,a) = N(s,a) + 1 \]
\[ Q(s,a) = \frac{1}{N(s,a)} \sum_i V_i(s) \]
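The sketch below shows how the three networks plug into one simulation. It reuses `Node`, `select_child`, and `backup` from the earlier sketches; `represent`, `predict`, and `dynamics` are hypothetical callables standing in for \(h\), \(f\), and \(g\). Reward discounting and the value normalization used in MuZero are omitted for brevity, so this is a simplified sketch rather than the paper's algorithm.

```python
# One MuZero-style simulation with three hypothetical callables:
#   represent(obs)      -> latent root state            (h)
#   predict(state)      -> (policy_dict, value)         (f)
#   dynamics(state, a)  -> (next_state, reward)         (g)

def run_simulation(root, predict, dynamics, c: float = 1.25):
    path = [root]
    node = root

    # Selection: descend while children exist, scoring them with priors from f.
    while node.children:
        _action, node = select_child(node, c)
        path.append(node)

    # Expansion + simulation: the dynamics network replaces a real environment
    # step, and the prediction network supplies the new node's policy and value.
    policy, value = predict(node.state)
    for action, prior in policy.items():
        next_state, _reward = dynamics(node.state, action)   # reward ignored here
        node.children[action] = Node(state=next_state, prior=prior)

    # Updating: back the predicted value up the path, so Q(s,a) is the mean of
    # the values backed up through that edge.
    backup(path, value)
    return value

# Usage sketch: root = Node(state=represent(observation)); call run_simulation
# repeatedly, then act on the child of root with the highest visit count.
```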
Advanced MCTS: AlphaZero Performance
Orange line indicates the best result of AlphaZero (AlphaGo trained purely through self-play, without human game data).