5.2 On-Policy Monte Carlo
Monte Carlo Control
Without a model, state values \(v_{\pi}(s)\) alone are not sufficient: choosing an action from \(v_{\pi}(s)\) requires a one-step lookahead over the transition dynamics, which we do not have.
We must therefore explicitly estimate the value of each action, \(q_{\pi}(s,a)\).
The Monte Carlo methods for this are essentially the same as for state values, except that we now average the returns following visits to state-action pairs \((s,a)\) rather than visits to states.
The main reason for estimating action values \(q_{\pi}(s,a)\) with Monte Carlo methods is Control: finding approximately optimal policies \(\pi_{*}\).
Following the idea of Generalized Policy Iteration (GPI), we alternate between evaluating \(q_{\pi}(s,a)\) under the current policy and improving the policy greedily with respect to those action values.
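As a concrete illustration of the evaluation half, here is a minimal sketch of first-visit Monte Carlo estimation of \(q_{\pi}(s,a)\), which averages the returns following the first visit to each state-action pair. The helper `generate_episode(policy)` and the discount factor `gamma` are assumptions made for this sketch (an episode is taken to be a list of \((s, a, r)\) tuples), not definitions from the text.

```python
# Sketch: first-visit Monte Carlo evaluation of q_pi(s, a).
# Assumes generate_episode(policy) returns [(s0, a0, r1), (s1, a1, r2), ...]
# produced by following `policy`; gamma is the discount factor.
from collections import defaultdict

def mc_q_evaluation(generate_episode, policy, num_episodes, gamma=1.0):
    returns_sum = defaultdict(float)   # total return observed for each (s, a)
    returns_count = defaultdict(int)   # number of first visits to each (s, a)
    Q = defaultdict(float)             # current estimate of q_pi(s, a)

    for _ in range(num_episodes):
        episode = generate_episode(policy)
        G = 0.0
        # Work backwards so G accumulates the discounted return from each step.
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check: only the earliest occurrence of (s, a) counts.
            if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```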
Exploring Starts
The only complication with estimating action values \(q_{\pi}(s,a)\) is that some state-action pairs \((s,a)\) may never be visited during an episode; in particular, a deterministic policy records returns for only one action in each state.
This brings us back to the same dilemma we faced in the Multi-Armed Bandit chapter:
balancing exploration and exploitation.
One “quick-fix” is to start each episode from a randomly chosen state-action pair \((s,a)\), where every pair has a probability greater than \(0\) of being selected as the start.
This “quick-fix” is referred to as Exploring Starts.
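Below is a minimal sketch of first-visit Monte Carlo control with Exploring Starts under stated assumptions. The helpers `random_start()` and `run_episode_from(s, a, policy)` are hypothetical: `random_start()` must give every state-action pair a nonzero chance of being the start, and `run_episode_from` follows the current policy after the forced first action (falling back to a random action for states the policy has not assigned yet).

```python
# Sketch: Monte Carlo control with Exploring Starts (first-visit).
# Assumes:
#   - random_start() samples any (s, a) pair with nonzero probability,
#   - run_episode_from(s, a, policy) returns [(s0, a0, r1), (s1, a1, r2), ...]
#     starting from the forced pair and following `policy` afterwards.
from collections import defaultdict

def mc_exploring_starts(actions, random_start, run_episode_from,
                        num_episodes, gamma=1.0):
    Q = defaultdict(float)             # estimates of q(s, a)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {}                        # deterministic policy: state -> action

    for _ in range(num_episodes):
        s0, a0 = random_start()                    # exploring start
        episode = run_episode_from(s0, a0, policy)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:  # first visit
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
                # Policy improvement: act greedily with respect to the current Q.
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return policy, Q
```

The averaging of returns is the evaluation half of GPI, and the greedy update inside the loop is the improvement half; the random start is what guarantees every \((s,a)\) pair keeps being visited.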