11.1 Monte Carlo Tree Search (MCTS)
Model-based Reinforcement Learning
Problem:
Agents need to sample many environment interactions just to learn the environment dynamics
\[ P(s', r \mid s, a). \]
Without a model of these dynamics, exploration is blind: model-free methods focus on the immediate, observed rewards and cannot plan ahead.
Solution:
By leveraging the environment dynamics, agents can plan over future rewards, which enables safer, more informed exploration.
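As a concrete (hypothetical) illustration of planning with a known model, the sketch below assumes the dynamics are stored as a tabular mapping `P[s][a]` of `(prob, next_state, reward)` tuples and uses a one-step lookahead over a state-value table `V`; none of these names come from the slides.

```python
# One-step lookahead with a known dynamics model (a minimal planning sketch).
# P[s][a] is assumed to be a list of (prob, next_state, reward) tuples,
# i.e. a tabular representation of P(s', r | s, a); V maps states to value estimates.

def one_step_lookahead(state, P, V, actions, gamma=0.99):
    """Return the action with the highest expected one-step return under the model."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        expected = sum(prob * (reward + gamma * V.get(next_state, 0.0))
                       for prob, next_state, reward in P[state][a])
        if expected > best_value:
            best_action, best_value = a, expected
    return best_action
```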
Model-based Reinforcement Learning: Illustration
MCTS: Motivation
Problem:
We need an algorithm that plans effectively using the environment dynamics and shows strong empirical performance.
Solution:
Monte Carlo Tree Search, introduced by Rémi Coulom (Université Charles de Gaulle, 2006).
Link to Research Paper
MCTS: Selection
As in UCB for bandit problems, we select actions according to an upper confidence bound that balances exploration and exploitation:
\[ A_t = \arg\max_a \left[ Q(s,a) + C \sqrt{\frac{\ln N(s_{\text{parent}})}{N(s)}} \right] \]
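A minimal sketch of this selection rule in Python, assuming a hypothetical tree node that stores its visit count (`visits`), per-action value estimates (`q`), and its children keyed by action (these names are illustrative, not from the slides):

```python
import math

def uct_select(node, c=1.4):
    """Select the action maximizing Q(s,a) + C * sqrt(ln N(parent) / N(child))."""
    best_action, best_score = None, float("-inf")
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # always try an unvisited child first
        score = node.q[action] + c * math.sqrt(math.log(node.visits) / child.visits)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```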
MCTS: Expansion
- After selecting an action \(A_t\), if the corresponding child node does not exist, we expand the search tree by creating a new node for the resulting state \(S^{A_t}_{t+1}\).
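Below is a sketch of the expansion step, assuming a hypothetical `Node` class (per-node visit count, per-action value estimates) and an assumed `env_model.step(state, action)` interface that returns the resulting state, reward, and done flag:

```python
class Node:
    """Search-tree node: per-node visit count and per-action value estimates."""
    def __init__(self, state, parent=None, action_from_parent=None):
        self.state = state
        self.parent = parent
        self.action_from_parent = action_from_parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.q = {}          # action -> value estimate

def expand(node, action, env_model):
    """Create the child node for `action` if it does not exist yet."""
    if action not in node.children:
        next_state, _, _ = env_model.step(node.state, action)  # assumed model interface
        node.children[action] = Node(next_state, parent=node, action_from_parent=action)
        node.q.setdefault(action, 0.0)
    return node.children[action]
```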
MCTS: Simulation
From the expanded node, we run \(n\) simulations acting randomly (random rollouts) and compute the average return over all simulations:
\[\bar{R} = \frac{1}{n} \sum_{j=1}^{n} R_j\]
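A sketch of the simulation step under the same assumed `env_model.step` interface, plus a hypothetical `legal_actions(state)` helper for the random rollout policy:

```python
import random

def simulate(node, env_model, n=10, max_depth=50, gamma=1.0):
    """Run n random rollouts from `node` and return the average return (R-bar)."""
    total = 0.0
    for _ in range(n):
        state, ret, discount = node.state, 0.0, 1.0
        for _ in range(max_depth):
            action = random.choice(env_model.legal_actions(state))  # uniform random policy
            state, reward, done = env_model.step(state, action)     # assumed model interface
            ret += discount * reward
            discount *= gamma
            if done:
                break
        total += ret
    return total / n
```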
MCTS: Backpropagation
- After obtaining the average return \(\bar{R}\), we update the statistics of every node on the path from the current node up to the root:
\[ N(s) = N(s) + 1 \]
\[ Q(s,a) = Q(s,a) + \frac{\bar{R} - Q(s,a)}{N(s)} \]
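Continuing the same hypothetical `Node` layout from the expansion sketch, a sketch of the backpropagation step; the value update is the incremental mean written above:

```python
def backpropagate(leaf, avg_return):
    """Walk from the expanded node up to the root, updating
    N(s) <- N(s) + 1 and Q(s,a) <- Q(s,a) + (R_bar - Q(s,a)) / N(s)."""
    node = leaf
    node.visits += 1                      # count the visit to the new leaf itself
    while node.parent is not None:
        parent, action = node.parent, node.action_from_parent
        parent.visits += 1
        q = parent.q.get(action, 0.0)
        parent.q[action] = q + (avg_return - q) / parent.visits
        node = parent
```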
MCTS: Summary Illustration
Pseudocode
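The pseudocode itself is not reproduced in this text version of the slides; the sketch below is one plausible way to tie the four phases together, reusing the `uct_select`, `expand`, `simulate`, and `backpropagate` helpers (and the `random` import) from the earlier sketches.

```python
def mcts(root, env_model, actions, iterations=1000, c=1.4, n_rollouts=10):
    """Plain MCTS loop: selection -> expansion -> simulation -> backpropagation."""
    for _ in range(iterations):
        # 1. Selection: descend with the UCT rule while the node is fully expanded.
        node = root
        while node.children and len(node.children) == len(actions):
            node = node.children[uct_select(node, c)]
        # 2. Expansion: add one child for an action that has not been tried yet.
        untried = [a for a in actions if a not in node.children]
        if untried:
            node = expand(node, random.choice(untried), env_model)
        # 3. Simulation: estimate the new node's value with random rollouts.
        avg_return = simulate(node, env_model, n=n_rollouts)
        # 4. Backpropagation: push the estimate back up to the root.
        backpropagate(node, avg_return)
    # Recommend the most-visited root action (a common, robust choice).
    return max(root.children, key=lambda a: root.children[a].visits)
```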
Exercise
How does the Upper Confidence Bound for Trees (UCT) algorithm balance exploration and exploitation in MCTS?
\[ A_t = \arg\max_a \left[ Q(s,a) + C \sqrt{\frac{\ln N(s_{\text{parent}})}{N(s)}} \right] \]