11.1 Monte Carlo Tree Search (MCTS)
Model-based Reinforcement Learning
Problem:
Agents need to sample many environment interactions just to learn the environment dynamics
\[ P(s', r \mid s, a). \]
Without a model of these dynamics, exploration is blind: model-free methods focus on the immediate, observed rewards and cannot plan ahead.
Solution:
By leveraging the environment dynamics, agents can plan over future rewards, which enables safer, more informed exploration.
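As a concrete (hypothetical) illustration of planning with a known model, the sketch below assumes the dynamics are stored as a tabular mapping `P[s][a]` of `(prob, next_state, reward)` tuples and uses a one-step lookahead over a state-value table `V`; none of these names come from the slides.

```python
# One-step lookahead with a known dynamics model (a minimal planning sketch).
# P[s][a] is assumed to be a list of (prob, next_state, reward) tuples,
# i.e. a tabular representation of P(s', r | s, a); V maps states to value estimates.

def one_step_lookahead(state, P, V, actions, gamma=0.99):
    """Return the action with the highest expected one-step return under the model."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        expected = sum(prob * (reward + gamma * V.get(next_state, 0.0))
                       for prob, next_state, reward in P[state][a])
        if expected > best_value:
            best_action, best_value = a, expected
    return best_action
```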
Model-based Reinforcement Learning: Illustration
MCTS: Motivation
Problem:
We need an algorithm that plans effectively using the environment dynamics and shows strong empirical performance.
Solution:
Monte Carlo Tree Search, introduced by Rémi Coulom (Université Charles de Gaulle, 2006).
Link to Research Paper
MCTS: Selection
As in UCB for bandit problems, we select actions according to an upper confidence bound that balances exploration and exploitation:
\[ A_t = \arg\max_a \left[ Q(s,a) + C \sqrt{\frac{\ln N(s_{\text{parent}})}{N(s)}} \right] \]
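A minimal sketch of this selection rule in Python, assuming a hypothetical tree node that stores its visit count (`visits`), per-action value estimates (`q`), and its children keyed by action (these names are illustrative, not from the slides):

```python
import math

def uct_select(node, c=1.4):
    """Select the action maximizing Q(s,a) + C * sqrt(ln N(parent) / N(child))."""
    best_action, best_score = None, float("-inf")
    for action, child in node.children.items():
        if child.visits == 0:
            return action  # always try an unvisited child first
        score = node.q[action] + c * math.sqrt(math.log(node.visits) / child.visits)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```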
MCTS: Expansion
- After selecting an action \(A_t\), if the corresponding child node does not exist, we expand the search tree by creating a new node for the resulting state \(S^{A_t}_{t+1}\).
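Below is a sketch of the expansion step, assuming a hypothetical `Node` class (per-node visit count, per-action value estimates) and an assumed `env_model.step(state, action)` interface that returns the resulting state, reward, and done flag:

```python
class Node:
    """Search-tree node: per-node visit count and per-action value estimates."""
    def __init__(self, state, parent=None, action_from_parent=None):
        self.state = state
        self.parent = parent
        self.action_from_parent = action_from_parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.q = {}          # action -> value estimate

def expand(node, action, env_model):
    """Create the child node for `action` if it does not exist yet."""
    if action not in node.children:
        next_state, _, _ = env_model.step(node.state, action)  # assumed model interface
        node.children[action] = Node(next_state, parent=node, action_from_parent=action)
        node.q.setdefault(action, 0.0)
    return node.children[action]
```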
MCTS: Simulation
From the expanded node, we run \(n\) simulations acting randomly (random rollouts) and compute the average return over all simulations:
\[\bar{R} = \frac{1}{n} \sum_{j=1}^{n} R_j\]
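A sketch of the simulation step under the same assumed `env_model.step` interface, plus a hypothetical `legal_actions(state)` helper for the random rollout policy:

```python
import random

def simulate(node, env_model, n=10, max_depth=50, gamma=1.0):
    """Run n random rollouts from `node` and return the average return (R-bar)."""
    total = 0.0
    for _ in range(n):
        state, ret, discount = node.state, 0.0, 1.0
        for _ in range(max_depth):
            action = random.choice(env_model.legal_actions(state))  # uniform random policy
            state, reward, done = env_model.step(state, action)     # assumed model interface
            ret += discount * reward
            discount *= gamma
            if done:
                break
        total += ret
    return total / n
```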
MCTS: Backpropagation
- After obtaining the average return \(\bar{R}\), we update the statistics of every node on the path from the current node up to the root:
\[ N(s) = N(s) + 1 \]
\[ Q(s,a) = Q(s,a) + \frac{\bar{R} - Q(s,a)}{N(s)} \]
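Continuing the same hypothetical `Node` layout from the expansion sketch, a sketch of the backpropagation step; the value update is the incremental mean written above:

```python
def backpropagate(leaf, avg_return):
    """Walk from the expanded node up to the root, updating
    N(s) <- N(s) + 1 and Q(s,a) <- Q(s,a) + (R_bar - Q(s,a)) / N(s)."""
    node = leaf
    node.visits += 1                      # count the visit to the new leaf itself
    while node.parent is not None:
        parent, action = node.parent, node.action_from_parent
        parent.visits += 1
        q = parent.q.get(action, 0.0)
        parent.q[action] = q + (avg_return - q) / parent.visits
        node = parent
```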
MCTS: Summary Illustration
Pseudocode
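The pseudocode itself is not reproduced in this text version of the slides; the sketch below is one plausible way to tie the four phases together, reusing the `uct_select`, `expand`, `simulate`, and `backpropagate` helpers (and the `random` import) from the earlier sketches.

```python
def mcts(root, env_model, actions, iterations=1000, c=1.4, n_rollouts=10):
    """Plain MCTS loop: selection -> expansion -> simulation -> backpropagation."""
    for _ in range(iterations):
        # 1. Selection: descend with the UCT rule while the node is fully expanded.
        node = root
        while node.children and len(node.children) == len(actions):
            node = node.children[uct_select(node, c)]
        # 2. Expansion: add one child for an action that has not been tried yet.
        untried = [a for a in actions if a not in node.children]
        if untried:
            node = expand(node, random.choice(untried), env_model)
        # 3. Simulation: estimate the new node's value with random rollouts.
        avg_return = simulate(node, env_model, n=n_rollouts)
        # 4. Backpropagation: push the estimate back up to the root.
        backpropagate(node, avg_return)
    # Recommend the most-visited root action (a common, robust choice).
    return max(root.children, key=lambda a: root.children[a].visits)
```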
Exercise
How does the Upper Confidence Bound for Trees (UCT) algorithm balance exploration and exploitation in MCTS?
\[ A_t = \arg\max_a \left[ Q(s,a) + C \sqrt{\frac{\ln N(s_{\text{parent}})}{N(s)}} \right] \]