5.3 On-Policy Monte Carlo
On-policy learning evaluates or improves the policy that is used to make decisions.
Exploration without an Initial Random State and Action
How can we explore without having to rely on the unrealistic assumption of an initial random state and action?
Recall the \(\epsilon\)-greedy methods used to balance exploration and exploitation in Multi-Armed Bandits.
These policies are usually referred to as \(\epsilon\)-soft policies, as they require that every action has a non-zero probability of being selected in every state, that is:
\[ \pi(a|s) > 0 \quad \text{for all} \quad s \in S \quad \text{and} \quad a \in A(s) \]
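More precisely, an \(\epsilon\)-soft policy assigns every action a probability of at least \(\frac{\epsilon}{|A(s)|}\). As a quick illustration, here is a minimal sketch that checks this condition for a tabular policy stored as a \(|S| \times |A|\) array of probabilities; the function name, the example probabilities, and the choice of \(\epsilon\) are made up for the example.

```python
import numpy as np

def is_eps_soft(policy_probs, epsilon):
    """Check the epsilon-soft condition: every action in every state has
    probability at least epsilon / |A(s)| (and therefore pi(a|s) > 0)."""
    n_actions = policy_probs.shape[1]
    return bool(np.all(policy_probs >= epsilon / n_actions))

# Hypothetical policy table over 3 states and 4 actions; each row sums to 1.
probs = np.array([
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],
])
print(is_eps_soft(probs, epsilon=0.2))  # True: every entry >= 0.2 / 4 = 0.05
```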
\(\epsilon\)-Greedy
To compute the probability of selecting each action under the \(\epsilon\)-greedy policy \(\pi(a|s)\), where \(A_{t}\) denotes the greedy action for the current state \(S_{t}\), we use the following update rule:
\[ \pi(a|S_{t}) \gets \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A(S_{t})|} & \text{if} \quad a = A_{t} \quad \text{(exploitation)} \\ \frac{\epsilon}{|A(S_{t})|} & \text{if} \quad a \neq A_{t} \quad \text{(exploration)} \end{cases} \]
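A minimal sketch of this update in Python, assuming a tabular setting where the action values \(Q(S_{t}, \cdot)\) for the current state are available as a NumPy array; the function name and the example values are illustrative rather than taken from the text.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|S_t) for every action a, given the action values Q(S_t, .).

    The greedy action receives 1 - epsilon + epsilon/|A(S_t)|; every other
    action receives epsilon/|A(S_t)|, matching the update rule above.
    """
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    greedy_action = int(np.argmax(q_values))  # A_t: the exploiting action
    probs[greedy_action] += 1.0 - epsilon
    return probs

# Hypothetical action values for 4 actions; action 2 is greedy.
q = np.array([0.1, -0.5, 1.2, 0.3])
pi = epsilon_greedy_probs(q, epsilon=0.1)
print(pi)        # [0.025 0.025 0.925 0.025]
print(pi.sum())  # 1.0

# Sampling from pi mostly exploits but keeps every action's probability non-zero.
rng = np.random.default_rng(0)
action = rng.choice(len(q), p=pi)
```

Because every action keeps probability at least \(\frac{\epsilon}{|A(S_{t})|}\), the resulting policy is \(\epsilon\)-soft, so exploration continues without requiring an initial random state and action.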