10.2 Monotonic Improvement

How can we be sure each policy update makes things better, not worse? 📈

We defined a local approximation \(L_\pi\) to \(\eta(\tilde{\pi})\), the expected return of another policy \(\tilde{\pi}\), expressed in terms of the advantage of its actions over \(\pi\):

\[ L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a) \]
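For concreteness, here is a minimal tabular sketch of this approximation; the names `eta_pi`, `rho_pi`, `pi_new`, and `adv_pi` are illustrative placeholders for quantities assumed to be precomputed.

```python
import numpy as np

def local_approx(eta_pi, rho_pi, pi_new, adv_pi):
    """L_pi(pi_new) in a small tabular setting.

    eta_pi : float, expected return eta(pi) of the current policy
    rho_pi : (S,) discounted state visitation frequencies under pi
    pi_new : (S, A) action probabilities of the candidate policy pi_tilde
    adv_pi : (S, A) advantages A_pi(s, a) of the current policy
    """
    # eta(pi) + sum_s rho_pi(s) sum_a pi_new(a|s) A_pi(s, a)
    return eta_pi + np.sum(rho_pi[:, None] * pi_new * adv_pi)
```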

This approximation gave us a lower bound on how much the policy can improve after an update:

\[ \eta(\pi_{\text{new}}) \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2 \]

where

\[ \epsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(a|s)} \left[ A_\pi(s, a) \right] \right| \]
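A possible way to evaluate this bound numerically in the tabular case, assuming the mixed-in policy \(\pi'\) and the advantages \(A_\pi\) are available as arrays (function and argument names are illustrative):

```python
import numpy as np

def cpi_lower_bound(L_new, pi_prime, adv_pi, alpha, gamma):
    """Conservative-policy-iteration lower bound on eta(pi_new).

    L_new    : float, the local approximation L_{pi_old}(pi_new)
    pi_prime : (S, A) action probabilities of the policy pi' mixed into pi_old
    adv_pi   : (S, A) advantages A_pi(s, a) of the current policy
    """
    # epsilon = max_s | E_{a ~ pi'(.|s)}[ A_pi(s, a) ] |
    eps = np.max(np.abs(np.sum(pi_prime * adv_pi, axis=1)))
    return L_new - 2.0 * eps * gamma * alpha ** 2 / (1.0 - gamma) ** 2
```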

However, conservative policy iteration only provides this guarantee for mixture policies of the form:

\[ \pi_{\text{new}}(a|s) = (1 - \alpha)\,\pi_{\text{old}}(a|s) + \alpha\,\pi^{'}(a|s) \]
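Under the same tabular assumptions as above, the mixture can be formed as a row-wise blend of the two probability tables (a sketch):

```python
import numpy as np

def mixture_policy(pi_old, pi_prime, alpha):
    """Row-wise blend (1 - alpha) * pi_old + alpha * pi_prime.

    Because every row of pi_old and pi_prime sums to one, each row of the
    result is again a valid probability distribution over actions.
    """
    return (1.0 - alpha) * pi_old + alpha * pi_prime
```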

Problem

How can we apply this local bound found by Kakade and Langford without having to rely on a mixture of policies?

Monotonic Improvement

To guarantee monotonic improvement, we need to find a way to extend conservative policy iteration to general stochastic policies rather than mixture policies.

To do this, we replace \(\alpha\) with a distance measure between \(\pi\) and \(\tilde{\pi}\) and change the constant \(\epsilon\) appropriately.

1. Bounding with KL Divergence

The distance measure that ultimately appears in the algorithm is the KL divergence (as illustrated in the demo above). As an intermediate step, however, we work with the total variation distance between two discrete probability distributions:

\[ D_{TV}(p||q) = \frac{1}{2} \sum_{i}|p_i - q_i| \]

In particular, to remove the dependence on mixture policies, we define the distance between two policies as the maximum per-state total variation divergence:

\[ D^{\text{max}}_{TV}(\pi||\tilde{\pi}) = \max_s D_{TV}(\pi(\cdot | s) \,||\, \tilde{\pi}(\cdot | s)) \]
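A minimal sketch of both quantities for tabular policies stored as `(S, A)` probability arrays (names are illustrative):

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.sum(np.abs(p - q))

def max_tv_divergence(pi, pi_tilde):
    """D_TV^max: the largest per-state total variation distance between two policies."""
    # pi, pi_tilde: (S, A) arrays of per-state action probabilities
    return np.max(0.5 * np.sum(np.abs(pi - pi_tilde), axis=1))
```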

Monotonic Improvement Theorem

Let \(\alpha = D^{\text{max}}_{TV}(\pi||\tilde{\pi})\). Then the following bound holds:

\[ \eta(\pi_{\text{new}}) \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2 \]

where

\[ \epsilon = \max_{s,a} \left| A_\pi(s, a) \right| \]
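Putting the pieces together, the theorem's lower bound could be evaluated as follows in the tabular setting, reusing the quantities defined earlier (again a sketch with illustrative names):

```python
import numpy as np

def monotonic_improvement_bound(L_new, pi_old, pi_new, adv_old, gamma):
    """Lower bound on eta(pi_new) given by the theorem above (tabular case)."""
    alpha = np.max(0.5 * np.sum(np.abs(pi_old - pi_new), axis=1))  # alpha = D_TV^max
    eps = np.max(np.abs(adv_old))                                  # eps = max_{s,a} |A_pi(s, a)|
    return L_new - 4.0 * eps * gamma * alpha ** 2 / (1.0 - gamma) ** 2
```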

2. Surrogate Objectives

Because the total variation divergence and the KL divergence are related by \(D_{TV}(p \| q)^2 \leq D_{KL}(p \| q)\), the bound above also holds with \(\alpha^2\) replaced by the maximum KL divergence between the old and new policies. All that is left to do now is to maximize our surrogate objective \(L_{\theta_{\text{old}}}\) subject to the trust region constraint:

\[ D_{\text{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot | s) \,\|\, \pi_{\theta}(\cdot | s)\right) \leq \delta \quad \text{for all } s, \]

which in practice is relaxed to an average over visited states.

This leads to the following optimization problem:

\[ \begin{aligned} \max_{\theta} \quad & \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, \, a \sim q} \left[ \frac{\pi_{\theta}(a|s)}{q(a|s)} Q_{\theta_{\text{old}}}(s,a) \right] \\ \text{subject to} \quad & \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}} \left[ D_{\text{KL}} \!\left( \pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_{\theta}(\cdot|s) \right) \right] \leq \delta. \end{aligned} \]
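Given a batch of sampled state–action pairs, both expectations can be estimated empirically. Below is a sketch under the same tabular assumptions; `q_probs` (the sampling distribution's probabilities for the chosen actions) and `q_values` (Monte Carlo Q estimates) are hypothetical argument names:

```python
import numpy as np

def surrogate_and_kl(pi_theta, pi_old, states, actions, q_probs, q_values):
    """Empirical estimates of the surrogate objective and the average KL term.

    pi_theta, pi_old : (S, A) action-probability tables (assumed strictly positive)
    states, actions  : (N,) integer indices of the sampled state-action pairs
    q_probs          : (N,) probabilities q(a|s) under the sampling distribution
    q_values         : (N,) Monte Carlo estimates of Q_{theta_old}(s, a)
    """
    # importance-sampled surrogate: E[ pi_theta(a|s) / q(a|s) * Q_{theta_old}(s, a) ]
    ratios = pi_theta[states, actions] / q_probs
    surrogate = np.mean(ratios * q_values)

    # average KL over sampled states: E_s[ KL(pi_old(.|s) || pi_theta(.|s)) ]
    p, r = pi_old[states], pi_theta[states]
    avg_kl = np.mean(np.sum(p * (np.log(p) - np.log(r)), axis=1))
    return surrogate, avg_kl
```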

The procedure for solving this optimization can be summarized as follows:

  1. Collect data: Sample state–action pairs \((s,a)\) together with Monte Carlo estimates of their Q-values.
  2. Estimate the objective: Use these samples to form empirical estimates of the surrogate objective.
  3. Optimize under constraints: Approximately solve the constrained optimization problem to update the policy parameters \(\theta\), typically using the conjugate gradient method followed by a line search (a simplified sketch follows this list).
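As a rough illustration of step 3, the sketch below replaces TRPO's conjugate gradient direction with a plain gradient step and enforces the trust region with a backtracking line search; `grad_fn` and `surrogate_and_kl_fn` are assumed, user-supplied callables rather than part of the algorithm as published.

```python
def trust_region_step(theta, batch, grad_fn, surrogate_and_kl_fn,
                      delta=0.01, step=1.0, backtrack=0.5, max_backtracks=10):
    """One simplified policy update: gradient ascent on the surrogate with a
    backtracking line search that keeps the average KL below delta.

    grad_fn(theta, batch)              -> gradient of the surrogate at theta
    surrogate_and_kl_fn(theta, batch)  -> (surrogate value, average KL to theta_old)
    """
    g = grad_fn(theta, batch)
    old_surrogate, _ = surrogate_and_kl_fn(theta, batch)
    for _ in range(max_backtracks):
        candidate = theta + step * g
        surrogate, kl = surrogate_and_kl_fn(candidate, batch)
        if kl <= delta and surrogate > old_surrogate:  # accept only improving, in-region steps
            return candidate
        step *= backtrack                              # otherwise shrink the step and retry
    return theta                                       # fall back to the old parameters
```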