5.4 Off-Policy Monte Carlo
Learning on-policy, directly from experience generated by \(\pi\), is powerful, but it would be even better if we could learn about \(\pi\) while following a different policy, one that explores more broadly or lets us reuse past experience.
Off-policy methods evaluate or improve a policy \(\pi\) that is different from the policy \(b\) used to generate the data. Typically this is accomplished using two policies:
- A target policy, denoted \(\pi\), is the policy being learned.
- A behavior policy, denoted \(b\), is the policy used to generate behavior.

“If I have seen further, it is by standing on the shoulders of giants.” — Isaac Newton
Newton is considered one of the greatest scientists of all time. Among his many contributions:
- 🧲 Formulated the laws of motion and universal gravitation.
- 🌈 Demonstrated that white light is made of a spectrum of colors.
- 📐 Developed calculus (independently of Leibniz).
- 🔭 Improved the telescope and advanced optical theory.
He would not have been able to accomplish these things without learning from the work of others — Kepler, Galileo, Descartes, and more — whose insights formed the foundation for his own breakthroughs.



Just like in off-policy learning, Newton didn’t need to replicate others’ paths \(\pi_{\text{Kepler, Galileo, Descartes}}\) exactly. Instead, he learned from their trajectories \(b_{\text{Kepler, Galileo, Descartes}}\) to improve his own \(\pi_{\text{Newton}}\).
How can we use Monte Carlo’s learning rule to approximate the optimal policy \(\pi_{*}\) while learning from behavior generated by policies \(b\) other than our own \(\pi\)?
Importance Sampling is a general technique for estimating expected values under one distribution given samples from another.
We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.
Given a starting state \(S_t\), the probability of the subsequent state–action trajectory occurring under \(\pi\) is
\[ \text{Pr}\{A_{t}, S_{t+1}, A_{t+1}, \dots , S_{T} \mid S_{t}, A_{t:T-1} \sim \pi \} = \prod_{k=t}^{T-1} \pi(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k}), \]
so the importance-sampling ratio is
\[ \rho_{t:T-1} \doteq \frac{\prod_{k=t}^{T-1} \pi(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})}{\prod_{k=t}^{T-1} b(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})} = \prod_{k=t}^{T-1} \frac{\pi(A_{k} \mid S_{k})}{b(A_{k} \mid S_{k})}. \]
The state-transition probabilities \(p(S_{k+1} \mid S_{k}, A_{k})\) cancel, so the ratio depends only on the two policies and not on the (typically unknown) dynamics of the environment.
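As a concrete illustration of the general technique, here is a minimal Python sketch that computes the importance-sampling ratio for a single trajectory. The function and argument names (`importance_sampling_ratio`, `pi_probs`, `b_probs`) are illustrative assumptions, not from the text.

```python
import numpy as np

# Minimal sketch: the importance-sampling ratio rho_{t:T-1} for one trajectory.
# pi_probs[k] = pi(A_k | S_k) and b_probs[k] = b(A_k | S_k) are the
# probabilities each policy assigns to the actions that were actually taken.
def importance_sampling_ratio(pi_probs, b_probs):
    """Product of the per-step probability ratios along a trajectory."""
    return np.prod(np.asarray(pi_probs) / np.asarray(b_probs))

# A 3-step trajectory whose actions the target policy favors more than the
# behavior policy did: its return gets up-weighted (rho > 1).
print(importance_sampling_ratio([0.9, 0.8, 1.0], [0.5, 0.4, 0.5]))  # 7.2

# If pi would never take one of the observed actions, rho = 0 and the
# trajectory contributes nothing to estimates of pi's values.
print(importance_sampling_ratio([0.9, 0.0, 1.0], [0.5, 0.4, 0.5]))  # 0.0
```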
Incremental Method
As in the Multi-Armed Bandits chapter, the action-value estimates \(Q(s,a) \approx q_{\pi}(s,a)\) can be computed incrementally. Processing each episode backward from \(t = T-1\) down to \(t = 0\), with the weight initialized to \(W = 1\) and the return accumulated as \(G \leftarrow \gamma G + R_{t+1}\), we perform three updates at every step.
First, we update the cumulative sum of the weights for the visited state–action pair:
\[ C(S_{t},A_{t}) = C(S_{t},A_{t}) + W \]
Then we move the action-value estimate toward the return \(G\) by a weighted-average step:
\[ Q(S_{t},A_{t}) = Q(S_{t},A_{t}) + \frac{W}{C(S_{t},A_{t})}[G - Q(S_{t},A_{t})] \]
Finally, we multiply the weight by the per-step importance-sampling ratio:
\[ W = W \frac{\pi(A_{t} \mid S_{t})}{b(A_{t} \mid S_{t})} \]
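To make the three updates concrete, here is a minimal sketch of off-policy Monte Carlo prediction with weighted importance sampling, processing one episode backward as described above. The `(S, A, R)` episode format and the helpers `pi_prob` and `b_prob` are assumptions introduced for illustration.

```python
from collections import defaultdict

# Minimal sketch (not the text's own code): off-policy Monte Carlo prediction
# of q_pi with weighted importance sampling. `episode` is assumed to be a list
# of (S_t, A_t, R_{t+1}) tuples, and `pi_prob(s, a)` / `b_prob(s, a)` are
# assumed helpers returning pi(a|s) and b(a|s).
Q = defaultdict(float)   # action-value estimates Q(s, a)
C = defaultdict(float)   # cumulative sums of the weights C(s, a)

def update_from_episode(episode, pi_prob, b_prob, gamma=1.0):
    G = 0.0   # return accumulated backward: G <- gamma * G + R_{t+1}
    W = 1.0   # importance-sampling weight, initialized to 1
    for S, A, R in reversed(episode):       # process the episode backward
        G = gamma * G + R
        C[(S, A)] += W                      # C(S_t, A_t) <- C(S_t, A_t) + W
        Q[(S, A)] += (W / C[(S, A)]) * (G - Q[(S, A)])
        W *= pi_prob(S, A) / b_prob(S, A)   # multiply in the per-step ratio
        if W == 0.0:                        # pi would never reach this point:
            break                           # earlier steps carry zero weight
```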
We can ensure that off-policy methods achieve control by choosing the behavior policy \(b\) to be \(\epsilon\)-soft, so that every action continues to be selected with nonzero probability.
The target policy \(\pi\) then converges to the optimal policy at all encountered states, even though actions are selected according to a different soft policy \(b\), which may change between or even within episodes.
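Below is a sketch of how the same backward loop can be adapted for control, assuming the target policy \(\pi\) is kept greedy with respect to \(Q\) while the \(\epsilon\)-soft behavior policy \(b\) generates the episodes. The names `actions` and `b_prob` are again illustrative assumptions.

```python
# Sketch of the control variant, reusing the Q and C tables from the
# prediction sketch above. The target policy pi is kept greedy with respect
# to Q, so pi(A_t|S_t) is 1 for the greedy action and 0 otherwise, and the
# weight update simplifies to W <- W / b(A_t|S_t). `actions` (the action set)
# and `b_prob(s, a)` are assumed helpers, as before.
def control_update(episode, Q, C, actions, b_prob, gamma=1.0):
    G, W = 0.0, 1.0
    for S, A, R in reversed(episode):
        G = gamma * G + R
        C[(S, A)] += W
        Q[(S, A)] += (W / C[(S, A)]) * (G - Q[(S, A)])
        greedy = max(actions, key=lambda a: Q[(S, a)])
        if A != greedy:       # pi(A_t|S_t) = 0, so earlier steps get zero
            break             # weight and the backward loop can stop early
        W /= b_prob(S, A)     # pi(A_t|S_t) = 1 for the greedy action

# Usage (illustrative): for each episode generated by an epsilon-soft b,
#     control_update(episode, Q, C, actions, b_prob)
```

Note the early exit: once the behavior policy takes a nongreedy action, earlier steps of that episode can no longer contribute, so this method learns only from the tails of episodes that end with greedy behavior.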