5.4 Off-Policy Monte Carlo
Off-Policy methods evaluate or improve a policy different from that used to generate the data. Typically this is accomplished using two policies:
A target policy, denoted \(\pi\), is the policy being learned.
A behavior policy, denoted \(b\), is the policy used to generate behavior.
Importance Sampling
Importance Sampling is a general technique for estimating expected values under one distribution given samples from another.
We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio. The state-transition probabilities appear in both the numerator and the denominator of the two trajectory probabilities and cancel, so the ratio depends only on the two policies:
\[ \rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_{k} \mid S_{k})}{b(A_{k} \mid S_{k})} \]
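As a small illustration, the sketch below computes the importance-sampling ratio for a single episode and uses it to weight that episode's return. It is a hypothetical example: the dictionary-based policy representation, the state/action encoding, and all variable names are assumptions for illustration only.

```python
import numpy as np

def importance_sampling_ratio(episode, target_policy, behavior_policy):
    """Compute rho_{t:T-1} for t = 0 over a full episode.

    `episode` is a list of (state, action) pairs generated by the behavior
    policy; `target_policy` and `behavior_policy` are hypothetical dicts
    mapping a state to an array of action probabilities.
    """
    rho = 1.0
    for state, action in episode:
        rho *= target_policy[state][action] / behavior_policy[state][action]
    return rho

# Hypothetical two-state, two-action example.
target_policy = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0])}    # deterministic pi
behavior_policy = {0: np.array([0.5, 0.5]), 1: np.array([0.5, 0.5])}  # uniform b

episode = [(0, 0), (1, 1)]   # (state, action) pairs generated by b
G = 1.0                      # return observed for this episode
rho = importance_sampling_ratio(episode, target_policy, behavior_policy)
weighted_return = rho * G    # contributes to an estimate of v_pi(S_0)
print(rho, weighted_return)  # rho = (1.0/0.5) * (1.0/0.5) = 4.0
```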
Incremental Implementation
As in the Multi-Armed Bandits chapter, the action-value estimates \(Q(s,a) \approx q_{\pi}(s,a)\) can be computed incrementally.
Processing each episode backward from \(t = T-1\) down to \(t = 0\), with the weight \(W\) initialized to 1, we first update a cumulative sum of the weights for each state-action pair:
\[ C(S_{t},A_{t}) = C(S_{t},A_{t}) + W \]
Then we move the action-value estimate toward the observed return \(G\) by a weighted-average step:
\[ Q(S_{t},A_{t}) = Q(S_{t},A_{t}) + \frac{W}{C(S_{t},A_{t})}[G - Q(S_{t},A_{t})] \]
Finally, we update the weight according to our importance sampling ratio:
\[ W = W \frac{\pi(A_{t} \mid S_{t})}{b(A_{t} \mid S_{t})} \]
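A minimal sketch of these three updates, applied backward over a single episode for policy evaluation, is shown below. The data structures (`Q` and `C` as defaultdicts keyed by state-action pairs) and the dictionary-based policies are assumptions for illustration, not a definitive implementation.

```python
from collections import defaultdict

def off_policy_mc_evaluation_update(episode, Q, C, target_policy,
                                    behavior_policy, gamma=1.0):
    """Process one episode backward with weighted importance sampling.

    `episode` is a list of (state, action, reward) tuples generated by the
    behavior policy; `Q` and `C` are defaultdicts keyed by (state, action);
    the policies are hypothetical dicts mapping a state to action probabilities.
    """
    G = 0.0  # return accumulated from the end of the episode
    W = 1.0  # importance-sampling weight, initialized to 1
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        # Cumulative sum of the weights for this state-action pair.
        C[(state, action)] += W
        # Weighted-average update of the action-value estimate.
        Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
        # Update the weight with the per-step importance-sampling ratio.
        W *= target_policy[state][action] / behavior_policy[state][action]
        if W == 0.0:
            break  # all earlier steps would receive zero weight
    return Q, C

# Hypothetical usage with tabular estimates initialized to zero.
Q = defaultdict(float)
C = defaultdict(float)
```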
Off-Policy Control
Off-policy methods achieve control by making the target policy \(\pi\) greedy with respect to the current action-value estimates while choosing the behavior policy \(b\) to be \(\epsilon\)-soft, so that every action continues to be selected with nonzero probability.
The target policy \(\pi\) converges to the optimal policy at all encountered states even though actions are selected according to a different soft policy \(b\), which may change between or even within episodes.
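Putting the pieces together, here is a sketch of off-policy Monte Carlo control with weighted importance sampling, a greedy target policy, and an \(\epsilon\)-soft behavior policy. The environment interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and all names are assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(env, num_episodes, num_actions, gamma=1.0, epsilon=0.1):
    """Off-policy MC control with weighted importance sampling.

    Assumes a hypothetical `env` whose `reset()` returns a state and whose
    `step(action)` returns `(next_state, reward, done)`.
    """
    Q = defaultdict(lambda: np.zeros(num_actions))
    C = defaultdict(lambda: np.zeros(num_actions))
    greedy = {}  # target policy pi: state -> greedy action

    for _ in range(num_episodes):
        # Generate an episode with the epsilon-soft behavior policy b.
        episode = []
        state = env.reset()
        done = False
        while not done:
            probs = np.full(num_actions, epsilon / num_actions)
            probs[int(np.argmax(Q[state]))] += 1.0 - epsilon
            action = np.random.choice(num_actions, p=probs)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, probs[action]))
            state = next_state

        # Process the episode backward, updating C, Q, and W.
        G, W = 0.0, 1.0
        for state, action, reward, b_prob in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            greedy[state] = int(np.argmax(Q[state]))
            # With a greedy (deterministic) target policy, pi(a|s) is 1 for
            # the greedy action and 0 otherwise.
            if action != greedy[state]:
                break        # W would become zero; earlier steps get no weight
            W /= b_prob      # W *= pi(A_t|S_t) / b(A_t|S_t) with pi(A_t|S_t) = 1

    return Q, greedy
```

Because the target policy is greedy, \(\pi(A_{t} \mid S_{t})\) is 1 when \(A_{t}\) matches the greedy action and 0 otherwise, so the backward loop can stop as soon as a non-greedy action is encountered and the weight update simplifies to \(W \leftarrow W / b(A_{t} \mid S_{t})\).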