10.2 Monotonic Improvement

How can we be sure each policy update makes things better, not worse? 📈

We defined a local approximation \(L_\pi\) to \(\eta(\tilde{\pi})\), the expected return of another policy \(\tilde{\pi}\), expressed in terms of the advantage of its actions over \(\pi\):

\[ L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a) \]
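For concreteness, here is a minimal tabular sketch of this approximation; the names `eta_pi`, `rho_pi`, `pi_new`, and `adv_pi` are illustrative placeholders for quantities assumed to be precomputed.

```python
import numpy as np

def local_approx(eta_pi, rho_pi, pi_new, adv_pi):
    """L_pi(pi_new) in a small tabular setting.

    eta_pi : float, expected return eta(pi) of the current policy
    rho_pi : (S,) discounted state visitation frequencies under pi
    pi_new : (S, A) action probabilities of the candidate policy pi_tilde
    adv_pi : (S, A) advantages A_pi(s, a) of the current policy
    """
    # eta(pi) + sum_s rho_pi(s) sum_a pi_new(a|s) A_pi(s, a)
    return eta_pi + np.sum(rho_pi[:, None] * pi_new * adv_pi)
```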

This approximation gave us a lower bound on how much the policy can improve after an update:

\[ \eta(\pi_{\text{new}}) \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2 \]

where

\[ \epsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(a|s)} \left[ A_\pi(s, a) \right] \right| \]
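A possible way to evaluate this bound numerically in the tabular case, assuming the mixed-in policy \(\pi'\) and the advantages \(A_\pi\) are available as arrays (function and argument names are illustrative):

```python
import numpy as np

def cpi_lower_bound(L_new, pi_prime, adv_pi, alpha, gamma):
    """Conservative-policy-iteration lower bound on eta(pi_new).

    L_new    : float, the local approximation L_{pi_old}(pi_new)
    pi_prime : (S, A) action probabilities of the policy pi' mixed into pi_old
    adv_pi   : (S, A) advantages A_pi(s, a) of the current policy
    """
    # epsilon = max_s | E_{a ~ pi'(.|s)}[ A_pi(s, a) ] |
    eps = np.max(np.abs(np.sum(pi_prime * adv_pi, axis=1)))
    return L_new - 2.0 * eps * gamma * alpha ** 2 / (1.0 - gamma) ** 2
```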

However, conservative policy iteration only provides this guarantee for mixture policies of the form:

\[ \pi_{\text{new}}(a|s) = (1 - \alpha)\,\pi_{\text{old}}(a|s) + \alpha\,\pi^{'}(a|s) \]
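Under the same tabular assumptions as above, the mixture can be formed as a row-wise blend of the two probability tables (a sketch):

```python
import numpy as np

def mixture_policy(pi_old, pi_prime, alpha):
    """Row-wise blend (1 - alpha) * pi_old + alpha * pi_prime.

    Because every row of pi_old and pi_prime sums to one, each row of the
    result is again a valid probability distribution over actions.
    """
    return (1.0 - alpha) * pi_old + alpha * pi_prime
```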

Problem

How can we apply this local bound found by Kakade and Langford without having to rely on a mixture of policies?

Monotonic Improvement

To guarantee monotonic improvement, we need to find a way to extend conservative policy iteration to general stochastic policies rather than mixture policies.

To do this, we replace \(\alpha\) with a distance measure between \(\pi\) and \(\tilde{\pi}\) and change the constant \(\epsilon\) appropriately.

1. Bounding with KL Divergence

The distance measure that ultimately appears in the algorithm is the KL divergence (as illustrated in the demo above). As an intermediate step, however, we work with the total variation distance between two discrete probability distributions:

\[ D_{TV}(p||q) = \frac{1}{2} \sum_{i}|p_i - q_i| \]

In particular, to remove the dependence on mixture policies, we define the distance between two policies as the maximum per-state total variation divergence:

\[ D^{\text{max}}_{TV}(\pi||\tilde{\pi}) = \max_s D_{TV}(\pi(\cdot | s) \,||\, \tilde{\pi}(\cdot | s)) \]
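A minimal sketch of both quantities for tabular policies stored as `(S, A)` probability arrays (names are illustrative):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.sum(np.abs(p - q))

def max_tv_divergence(pi, pi_tilde):
    """D_TV^max: the largest per-state total variation distance between two policies."""
    # pi, pi_tilde: (S, A) arrays of per-state action probabilities
    return np.max(0.5 * np.sum(np.abs(pi - pi_tilde), axis=1))
```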

Monotonic Improvement Theorem

Let \(\alpha = D^{\text{max}}_{TV}(\pi||\tilde{\pi})\). Then the following bound holds:

\[ \eta(\pi_{\text{new}}) \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2 \]

where

\[ \epsilon = \max_{s,a} \left| A_\pi(s, a) \right| \]
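Putting the pieces together, the theorem's lower bound could be evaluated as follows in the tabular setting, reusing the quantities defined earlier (again a sketch with illustrative names):

```python
import numpy as np

def monotonic_improvement_bound(L_new, pi_old, pi_new, adv_old, gamma):
    """Lower bound on eta(pi_new) given by the theorem above (tabular case)."""
    alpha = np.max(0.5 * np.sum(np.abs(pi_old - pi_new), axis=1))  # alpha = D_TV^max
    eps = np.max(np.abs(adv_old))                                  # eps = max_{s,a} |A_pi(s, a)|
    return L_new - 4.0 * eps * gamma * alpha ** 2 / (1.0 - gamma) ** 2
```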

2. Surrogate Objectives

Because the total variation divergence and the KL divergence are related by \(D_{TV}(p \| q)^2 \leq D_{KL}(p \| q)\), the bound above also holds with \(\alpha^2\) replaced by the maximum KL divergence between the old and new policies. All that is left to do now is to maximize our surrogate objective \(L_{\theta_{\text{old}}}\) subject to the trust region constraint:

\[ D_{\text{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot | s) \,\|\, \pi_{\theta}(\cdot | s)\right) \leq \delta \quad \text{for all } s, \]

which in practice is relaxed to an average over visited states.

This leads to the following optimization problem:

\[ \begin{aligned} \max_{\theta} \quad & \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, \, a \sim q} \left[ \frac{\pi_{\theta}(a|s)}{q(a|s)} Q_{\theta_{\text{old}}}(s,a) \right] \\ \text{subject to} \quad & \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}} \left[ D_{\text{KL}} \!\left( \pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_{\theta}(\cdot|s) \right) \right] \leq \delta. \end{aligned} \]
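Given a batch of sampled state–action pairs, both expectations can be estimated empirically. Below is a sketch under the same tabular assumptions; `q_probs` (the sampling distribution's probabilities for the chosen actions) and `q_values` (Monte Carlo Q estimates) are hypothetical argument names:

```python
import numpy as np

def surrogate_and_kl(pi_theta, pi_old, states, actions, q_probs, q_values):
    """Empirical estimates of the surrogate objective and the average KL term.

    pi_theta, pi_old : (S, A) action-probability tables (assumed strictly positive)
    states, actions  : (N,) integer indices of the sampled state-action pairs
    q_probs          : (N,) probabilities q(a|s) under the sampling distribution
    q_values         : (N,) Monte Carlo estimates of Q_{theta_old}(s, a)
    """
    # importance-sampled surrogate: E[ pi_theta(a|s) / q(a|s) * Q_{theta_old}(s, a) ]
    ratios = pi_theta[states, actions] / q_probs
    surrogate = np.mean(ratios * q_values)

    # average KL over sampled states: E_s[ KL(pi_old(.|s) || pi_theta(.|s)) ]
    p, r = pi_old[states], pi_theta[states]
    avg_kl = np.mean(np.sum(p * (np.log(p) - np.log(r)), axis=1))
    return surrogate, avg_kl
```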

The procedure for solving this optimization can be summarized as follows:

  1. Collect data: Sample state–action pairs \((s,a)\) together with Monte Carlo estimates of their Q-values.
  2. Estimate the objective: Use these samples to form empirical estimates of the surrogate objective.
  3. Optimize under constraints: Approximately solve the constrained optimization problem to update the policy parameters \(\theta\), typically using the conjugate gradient method followed by a line search (a simplified sketch follows this list).
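As a rough illustration of step 3, the sketch below replaces TRPO's conjugate gradient direction with a plain gradient step and enforces the trust region with a backtracking line search; `grad_fn` and `surrogate_and_kl_fn` are assumed, user-supplied callables rather than part of the algorithm as published.

```python
def trust_region_step(theta, batch, grad_fn, surrogate_and_kl_fn,
                      delta=0.01, step=1.0, backtrack=0.5, max_backtracks=10):
    """One simplified policy update: gradient ascent on the surrogate with a
    backtracking line search that keeps the average KL below delta.

    grad_fn(theta, batch)              -> gradient of the surrogate at theta
    surrogate_and_kl_fn(theta, batch)  -> (surrogate value, average KL to theta_old)
    """
    g = grad_fn(theta, batch)
    old_surrogate, _ = surrogate_and_kl_fn(theta, batch)
    for _ in range(max_backtracks):
        candidate = theta + step * g
        surrogate, kl = surrogate_and_kl_fn(candidate, batch)
        if kl <= delta and surrogate > old_surrogate:  # accept only improving, in-region steps
            return candidate
        step *= backtrack                              # otherwise shrink the step and retry
    return theta                                       # fall back to the old parameters
```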