5.2 Monte Carlo Exploring Starts
Without a model, state values \(v_{\pi}(s)\) alone are not sufficient: we cannot derive action preferences from state values without knowing the environment's dynamics.
\[ v_{\pi}(s) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s] \]
\(v_{\pi}(s)\) - how good it is to be in state \(s\) while following policy \(\pi\).
To make action selections, we must explicitly estimate the value of each action \(q_{\pi}(s,a)\).
\[ q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s, \ A_{t} = a] \]
\(q_{\pi}(s,a)\) - how good it is to take action \(a\) in state \(s\) and follow \(\pi\) thereafter.
Monte Carlo estimation of action values works much like state-value estimation, except that we now track visits to state-action pairs \((s,a)\) rather than visits to states alone.
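As a rough sketch of this idea, the snippet below averages the returns that follow the first visit to each pair \((s,a)\). The episode format (a list of `(state, action, reward)` tuples) and the names `first_visit_mc_q` and `gamma` are assumptions made for illustration, not part of any particular library.

```python
from collections import defaultdict

def first_visit_mc_q(episodes, gamma=1.0):
    """Estimate q(s, a) by averaging the returns observed after the
    first visit to each state-action pair.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, action, reward) tuples.
    """
    returns_sum = defaultdict(float)   # total return accumulated per (s, a)
    returns_count = defaultdict(int)   # number of first visits per (s, a)
    q = {}

    for episode in episodes:
        g = 0.0
        # Work backwards so g accumulates the discounted return G_t.
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            g = gamma * g + reward
            # First-visit check: only credit the earliest occurrence of (s, a).
            if all((s, a) != (state, action) for s, a, _ in episode[:t]):
                returns_sum[(state, action)] += g
                returns_count[(state, action)] += 1
                q[(state, action)] = (returns_sum[(state, action)]
                                      / returns_count[(state, action)])
    return q
```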
This is crucial for Control, which refers to finding approximately optimal policies \(\pi \approx \pi_{*}\).
Following the idea of Generalized Policy Iteration (GPI), we alternate between evaluating the action values \(q_{\pi}(s,a)\) of the current policy and improving the policy with respect to them, gradually approaching an optimal policy.
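One common way to write the improvement step is to make the policy greedy with respect to the current action-value estimates:

\[ \pi(s) \leftarrow \arg\max_{a} q_{\pi}(s, a) \]

This \(\arg\max\) requires an estimate for every action available in \(s\), which is why coverage of all state-action pairs matters.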
However, one major issue with estimating action values \(q_{\pi}(s,a)\) is that some state-action pairs \((s,a)\) may never be visited during an episode.
How can we leverage Monte Carlo’s learning rule to approximate the optimal policy \(\pi_{*}\), while ensuring that each state-action pair \((s,a)\) is visited?
This brings us back to the same dilemma we faced in the Multi-Armed Bandit chapter: balancing exploration and exploitation.
One “quick-fix” is to start each episode from a randomly selected state-action pair \((s,a)\), with every pair having a probability greater than \(0\) of being chosen as the start, and to follow the current policy thereafter.
This “quick-fix” is referred to as Exploring Starts.
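A minimal sketch of Monte Carlo control with Exploring Starts is shown below. It assumes a small tabular environment object with finite `states` and `actions` lists and a `step(state, action)` method returning `(next_state, reward, done)`; these names, like `mc_exploring_starts` itself, are illustrative assumptions rather than a specific library's API.

```python
import random
from collections import defaultdict

def mc_exploring_starts(env, num_episodes=10_000, gamma=1.0, max_steps=100):
    """Monte Carlo control with Exploring Starts (sketch).

    Assumes env.states and env.actions are finite lists and
    env.step(state, action) returns (next_state, reward, done).
    """
    q = defaultdict(float)    # action-value estimates Q(s, a)
    counts = defaultdict(int) # visit counts for incremental averaging
    policy = {s: random.choice(env.actions) for s in env.states}

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has nonzero probability
        # of beginning an episode.
        state = random.choice(env.states)
        action = random.choice(env.actions)

        # Generate an episode, following the current policy after the start.
        episode = []
        for _ in range(max_steps):
            next_state, reward, done = env.step(state, action)
            episode.append((state, action, reward))
            if done:
                break
            state = next_state
            action = policy[state]

        # First-visit updates, working backwards to accumulate returns.
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = gamma * g + r
            if all((x, y) != (s, a) for x, y, _ in episode[:t]):
                counts[(s, a)] += 1
                q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]  # incremental mean
                # Improvement: act greedily w.r.t. the updated estimates.
                policy[s] = max(env.actions, key=lambda act: q[(s, act)])
    return policy, q
```

The random start supplies the exploration, so the policy itself can remain greedy during the episode; this is exactly the trade-off the exploring-starts assumption is meant to sidestep.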