5.2 Monte Carlo Exploring Starts
Without a model, state values \(v_{\pi}(s)\) alone are not sufficient: we cannot derive action preferences from state values without knowing the environment's dynamics.
\[ v_{\pi}(s) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s] \]
\(v_{\pi}(s)\) - how good it is to be in state \(s\) while following policy \(\pi\).
To make action selections, we must explicitly estimate the value of each action \(q_{\pi}(s,a)\).
\[ q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_{t} \mid S_{t} = s, \ A_{t} = a] \]
\(q_{\pi}(s,a)\) - how good it is to take action \(a\) in state \(s\) and follow \(\pi\) thereafter.
Monte Carlo estimation of action values works much like state-value estimation, except that we now track visits to state-action pairs \((s,a)\) rather than visits to states alone.
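As a rough sketch of this idea, the snippet below averages the returns that follow the first visit to each pair \((s,a)\). The episode format (a list of `(state, action, reward)` tuples) and the names `first_visit_mc_q` and `gamma` are assumptions made for illustration, not part of any particular library.

```python
from collections import defaultdict

def first_visit_mc_q(episodes, gamma=1.0):
    """Estimate q(s, a) by averaging the returns observed after the
    first visit to each state-action pair.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, action, reward) tuples.
    """
    returns_sum = defaultdict(float)   # total return accumulated per (s, a)
    returns_count = defaultdict(int)   # number of first visits per (s, a)
    q = {}

    for episode in episodes:
        g = 0.0
        # Work backwards so g accumulates the discounted return G_t.
        for t in reversed(range(len(episode))):
            state, action, reward = episode[t]
            g = gamma * g + reward
            # First-visit check: only credit the earliest occurrence of (s, a).
            if all((s, a) != (state, action) for s, a, _ in episode[:t]):
                returns_sum[(state, action)] += g
                returns_count[(state, action)] += 1
                q[(state, action)] = (returns_sum[(state, action)]
                                      / returns_count[(state, action)])
    return q
```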
This is crucial for Control, which refers to finding approximately optimal policies \(\pi \approx \pi_{*}\).
Following the idea of Generalized Policy Iteration (GPI), we alternate between evaluating the action values \(q_{\pi}(s,a)\) of the current policy and improving the policy with respect to them, gradually approaching an optimal policy.
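One common way to write the improvement step is to make the policy greedy with respect to the current action-value estimates:

\[ \pi(s) \leftarrow \arg\max_{a} q_{\pi}(s, a) \]

This \(\arg\max\) requires an estimate for every action available in \(s\), which is why coverage of all state-action pairs matters.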
However, one major issue with estimating action values \(q_{\pi}(s,a)\) is that some state-action pairs \((s,a)\) may never be visited during an episode.
How can we leverage Monte Carlo’s learning rule to approximate the optimal policy \(\pi_{*}\), while ensuring that each state-action pair \((s,a)\) is visited?
This brings us back to the same dilemma we faced in the Multi-Armed Bandit chapter: balancing exploration and exploitation.
One “quick-fix” is to start each episode from a randomly selected state-action pair \((s,a)\), with every pair having a probability greater than \(0\) of being chosen as the start, and to follow the current policy thereafter.
This “quick-fix” is referred to as Exploring Starts.
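A minimal sketch of Monte Carlo control with Exploring Starts is shown below. It assumes a small tabular environment object with finite `states` and `actions` lists and a `step(state, action)` method returning `(next_state, reward, done)`; these names, like `mc_exploring_starts` itself, are illustrative assumptions rather than a specific library's API.

```python
import random
from collections import defaultdict

def mc_exploring_starts(env, num_episodes=10_000, gamma=1.0, max_steps=100):
    """Monte Carlo control with Exploring Starts (sketch).

    Assumes env.states and env.actions are finite lists and
    env.step(state, action) returns (next_state, reward, done).
    """
    q = defaultdict(float)    # action-value estimates Q(s, a)
    counts = defaultdict(int) # visit counts for incremental averaging
    policy = {s: random.choice(env.actions) for s in env.states}

    for _ in range(num_episodes):
        # Exploring start: every (s, a) pair has nonzero probability
        # of beginning an episode.
        state = random.choice(env.states)
        action = random.choice(env.actions)

        # Generate an episode, following the current policy after the start.
        episode = []
        for _ in range(max_steps):
            next_state, reward, done = env.step(state, action)
            episode.append((state, action, reward))
            if done:
                break
            state = next_state
            action = policy[state]

        # First-visit updates, working backwards to accumulate returns.
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = gamma * g + r
            if all((x, y) != (s, a) for x, y, _ in episode[:t]):
                counts[(s, a)] += 1
                q[(s, a)] += (g - q[(s, a)]) / counts[(s, a)]  # incremental mean
                # Improvement: act greedily w.r.t. the updated estimates.
                policy[s] = max(env.actions, key=lambda act: q[(s, act)])
    return policy, q
```

The random start supplies the exploration, so the policy itself can remain greedy during the episode; this is exactly the trade-off the exploring-starts assumption is meant to sidestep.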