5.4 Off-Policy Monte Carlo

Can we watch others and still learn something for ourselves? 👀

Learning directly from the policy \(\pi\) is powerful, but it would be even better if we could learn about \(\pi\) while observing or following a different policy, especially one that helps us explore more broadly or leverage past experiences.

Tip: Off-Policy Learning

Off-policy methods evaluate or improve a policy \(\pi\) different from the policy \(b\) that was used to generate the data. Typically this is accomplished using two policies (a minimal sketch of their roles follows this list):

  • A target policy, denoted \(\pi\), is the policy being learned.

  • A behavior policy, denoted \(b\), is the policy used to generate behavior.
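
To make the two roles concrete, here is a minimal sketch, assuming a small tabular setting with a hypothetical action-value table `Q`: the target policy is greedy with respect to `Q`, while the behavior policy is \(\epsilon\)-soft so that it keeps exploring.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_policy(Q, state):
    """Greedy target policy pi: puts all probability on the highest-valued action."""
    return int(np.argmax(Q[state]))

def behavior_policy(Q, state, epsilon=0.1):
    """Epsilon-soft behavior policy b: every action keeps probability at least epsilon / |A|."""
    n_actions = len(Q[state])
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # exploratory action
    return int(np.argmax(Q[state]))          # greedy action
```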

Note: Real Life Example 🧠

“If I have seen further, it is by standing on the shoulders of giants.” — Isaac Newton

Newton is considered one of the greatest scientists of all time. Yet for all of his contributions, he would not have been able to accomplish them without learning from the work of others, including Kepler, Galileo, and Descartes, whose insights formed the foundation for his own breakthroughs.

  • Kepler discovered the laws of planetary motion, showing that planets move in ellipses rather than circles.

  • Galileo studied motion and inertia, laying the groundwork for Newton’s first two laws of motion.

  • Descartes developed analytic geometry and early ideas of mechanistic physics, helping bridge mathematics and physical laws.

Just like in off-policy learning, Newton didn’t need to replicate the others’ paths exactly. Instead, he treated their trajectories as behavior policies \(b_{\text{Kepler}}, b_{\text{Galileo}}, b_{\text{Descartes}}\) and used them to improve his own target policy \(\pi_{\text{Newton}}\).

Warning: Problem

How can we leverage Monte Carlo’s learning rule to approximate the optimal policy \(\pi_{*}\) while following a behavior policy \(b\) different from our own target policy \(\pi\)?

Importance Sampling is a general technique for estimating expected values under one distribution given samples from another.
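
As a quick, self-contained illustration outside of reinforcement learning, the sketch below estimates an expectation under a distribution \(p\) using samples drawn from a different distribution \(q\), weighting each sample by \(p(x)/q(x)\). The specific distributions and sample size are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Goal: estimate E_p[X] for p = Normal(1, 1), using samples drawn from q = Normal(0, 1).
def p_pdf(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

x = rng.normal(0.0, 1.0, size=100_000)   # samples from q, not from p
weights = p_pdf(x) / q_pdf(x)            # importance-sampling ratios p(x)/q(x)

ordinary_estimate = np.mean(weights * x)                    # ordinary importance sampling
weighted_estimate = np.sum(weights * x) / np.sum(weights)   # weighted importance sampling

print(ordinary_estimate, weighted_estimate)  # both should be close to the true value 1.0
```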

We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio.

\[ \text{Pr}\{A_{t}, S_{t+1}, A_{t+1}, \dots , S_{T} \mid S_{t}, A_{t:T-1} \sim \pi \} = \prod_{k=t}^{T-1} \pi(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k}) \]

The importance-sampling ratio is the ratio of this trajectory probability under \(\pi\) to the same probability under \(b\); the state-transition probabilities \(p(S_{k+1} \mid S_{k}, A_{k})\) appear in both and cancel:

\[ \rho_{t:T-1} = \frac{\prod_{k=t}^{T-1} \pi(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})}{\prod_{k=t}^{T-1} b(A_{k} \mid S_{k})\, p(S_{k+1} \mid S_{k}, A_{k})} = \prod_{k=t}^{T-1} \frac{\pi(A_{k} \mid S_{k})}{b(A_{k} \mid S_{k})} \]
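
In code, the ratio for a single trajectory is just a running product of per-step ratios. The sketch below assumes the trajectory is stored as \((S_k, A_k)\) pairs and that `target_prob` and `behavior_prob` return \(\pi(a \mid s)\) and \(b(a \mid s)\); these names are illustrative, not part of any library.

```python
def importance_sampling_ratio(trajectory, target_prob, behavior_prob):
    """Compute rho_{t:T-1} = prod_k pi(A_k | S_k) / b(A_k | S_k) over one trajectory.

    trajectory: list of (state, action) pairs from time t to T-1, generated by b.
    target_prob(s, a): probability pi(a | s) under the target policy.
    behavior_prob(s, a): probability b(a | s) under the behavior policy.
    """
    rho = 1.0
    for state, action in trajectory:
        rho *= target_prob(state, action) / behavior_prob(state, action)
    return rho
```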


Incremental Method

As in the Multi-Armed Bandits chapter, the action values \(q_{\pi}(s,a)\) can be estimated incrementally.

To do so, we first accumulate a cumulative sum of the weights for each state-action pair:

\[ C(S_{t},A_{t}) = C(S_{t},A_{t}) + W \]

Then, we move the action-value estimate toward the return \(G\), with a step size proportional to the current weight:

\[ Q(S_{t},A_{t}) = Q(S_{t},A_{t}) + \frac{W}{C(S_{t},A_{t})}[G - Q(S_{t},A_{t})] \]

Finally, we update the weight by multiplying in the per-step importance-sampling ratio:

\[ W = W \frac{\pi(A_{t} \mid S_{t})}{b(A_{t} \mid S_{t})} \]
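
Putting the three updates together, the sketch below processes a single episode backwards in the style of incremental weighted importance sampling. The tables `Q` and `C`, the `(state, action, reward)` episode format, and the probability callables are assumptions for this illustration; the return \(G\) is accumulated as \(G \leftarrow \gamma G + R_{t+1}\) while stepping back through the episode.

```python
def update_from_episode(episode, Q, C, target_prob, behavior_prob, gamma=1.0):
    """Incremental weighted importance-sampling update from one episode generated by b.

    episode: list of (S_t, A_t, R_{t+1}) triples.
    Q[state][action]: current action-value estimates.
    C[state][action]: cumulative sum of weights for each (s, a) pair.
    """
    G = 0.0   # return accumulated from the end of the episode
    W = 1.0   # importance-sampling weight, built up step by step
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        C[state][action] += W                                             # C(S_t, A_t) += W
        Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
        W *= target_prob(state, action) / behavior_prob(state, action)    # multiply in pi/b
        if W == 0.0:
            break  # the target policy would never take A_t, so earlier steps get zero weight
    return Q, C
```

Because the target policy in off-policy Monte Carlo control is typically greedy, the weight \(W\) becomes zero as soon as the behavior policy takes an action the target policy would not, at which point the rest of the episode can be skipped.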

We can ensure that off-policy methods achieve control by choosing \(b\) to be \(\epsilon\)-soft, which guarantees that every action continues to be selected with nonzero probability.

The target policy \(\pi\) converges to the optimal policy at all encountered states, even though actions are selected according to a different soft policy \(b\), which may change between or even within episodes.


Pseudocode