How can we be sure each policy update makes things better, not worse? 📈
We defined a local approximation \(L_\pi(\tilde{\pi})\) to \(\eta(\tilde{\pi})\), the expected return of another policy \(\tilde{\pi}\), expressed in terms of the advantage over \(\pi\):
\[
L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a)
\]
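To make this concrete, here is a small standalone JavaScript sketch (separate from the interactive cells below) that evaluates \(L_\pi(\tilde{\pi})\) for a toy two-state, two-action problem; the quantities etaPi, rhoPi, advPi, and piTilde are all invented illustrative values, not outputs of any real algorithm.
// Toy example (all numbers invented): evaluate
// L_pi(pi_tilde) = eta(pi) + sum_s rho_pi(s) * sum_a pi_tilde(a|s) * A_pi(s, a)
// for a problem with 2 states and 2 actions.
const etaPi = 1.0;                        // eta(pi): return of the current policy, assumed known
const rhoPi = [0.6, 0.4];                 // rho_pi(s): discounted state visitation frequencies under pi
const advPi = [[0.2, -0.1], [0.0, 0.3]];  // A_pi(s, a): advantages, indexed [state][action]
const piTilde = [[0.3, 0.7], [0.5, 0.5]]; // pi_tilde(a|s): candidate policy, indexed [state][action]

function surrogate(eta, rho, adv, pol) {
  let total = eta;
  for (let s = 0; s < rho.length; s++) {
    for (let a = 0; a < pol[s].length; a++) {
      total += rho[s] * pol[s][a] * adv[s][a];
    }
  }
  return total;
}

console.log(surrogate(etaPi, rhoPi, advPi, piTilde)); // L_pi(pi_tilde) for the toy numbers above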
This approximation helped us find a lower bound on how much \(\pi\) could improve after an update:
\[
\eta(\pi_{\text{new}}) \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2
\]
where
\[
\epsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(a|s)} \left[ A_\pi(s, a) \right] \right|
\]
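To get a feel for the penalty term \(\frac{2 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2\) in this bound, here is a minimal sketch with assumed values for \(\epsilon\) and \(\gamma\); note how the penalty shrinks quadratically as \(\alpha\) shrinks.
// Penalty term of the conservative policy iteration bound: 2*eps*gamma / (1 - gamma)^2 * alpha^2.
// eps and gamma are illustrative values, not taken from any experiment.
function cpiPenalty(eps, gamma, alpha) {
  return (2 * eps * gamma) / ((1 - gamma) ** 2) * alpha ** 2;
}

const eps = 0.5;    // assumed value of max_s |E_{a ~ pi'}[A_pi(s, a)]|
const gamma = 0.99; // discount factor

console.log(cpiPenalty(eps, gamma, 1.0)); // large alpha: large penalty, weak guarantee
console.log(cpiPenalty(eps, gamma, 0.1)); // small alpha: penalty shrinks by a factor of 100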
However, conservative policy iteration only provides this guarantee for mixture policies of the form
\[
\pi_{\text{new}}(a|s) = (1 - \alpha)\,\pi_{\text{old}}(a|s) + \alpha\,\pi'(a|s)
\]
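For a single state, this mixture update is just a convex combination of the two action distributions; a tiny sketch with hypothetical distributions:
// Mixture update of conservative policy iteration for a single state:
// pi_new(a|s) = (1 - alpha) * pi_old(a|s) + alpha * pi_prime(a|s).
// Both action distributions below are hypothetical.
function mixPolicies(piOld, piPrime, alpha) {
  return piOld.map((p, a) => (1 - alpha) * p + alpha * piPrime[a]);
}

const piOld = [0.7, 0.3];   // pi_old(.|s)
const piPrime = [0.1, 0.9]; // pi'(.|s)

console.log(mixPolicies(piOld, piPrime, 0.2)); // [0.58, 0.42], still a valid distribution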
How can we apply this local bound found by Kakade and Langford without having to rely on a mixture of policies?
viewof p_old = Inputs.range([0, 1], {step: 0.01, value: 0.5, label: tex`\pi_{\text{old}}`, width: 200})
viewof p_new = Inputs.range([0, 1], {step: 0.01, value: 0.7, label: tex`\pi_{\text{new}}`, width: 200})
viewof delta = Inputs.range([0, 1], {step: 0.01, value: 0.1, label: tex`\delta`, width: 200})

// Compute probabilities
policy_old = [
  {outcome: "0", probability: 1 - p_old},
  {outcome: "1", probability: p_old}
]
policy_new = [
  {outcome: "0", probability: 1 - p_new},
  {outcome: "1", probability: p_new}
]

// KL divergence D_KL(π_new || π_old)
kl = (p_new > 0 && p_old > 0 && p_new < 1 && p_old < 1)
  ? (p_new * Math.log(p_new / p_old) + (1 - p_new) * Math.log((1 - p_new) / (1 - p_old)))
  : NaN

// Check if the KL divergence is within the trust-region bound
within_bound = !isNaN(kl) && kl <= delta

// Plot both distributions
Plot.plot({
  style: "overflow: visible; display: block; margin: 0 auto;",
  width: 600,
  height: 400,
  y: {
    grid: true,
    label: "Probability",
    domain: [0, 1]
  },
  x: {
    label: "Outcome",
    padding: 0.2
  },
  marks: [
    Plot.barY(policy_old, {x: "outcome", y: "probability", fill: "steelblue", opacity: 0.6}),
    Plot.barY(policy_new, {x: "outcome", y: "probability", fill: within_bound ? "orange" : "red", opacity: 0.6}),
    Plot.ruleY([0])
  ]
})

html`<div style="text-align: center; margin-top: 1em;">
  ${tex.block`D_{\text{KL}}(\pi_{\text{new}} \,||\, \pi_{\text{old}})
    = \pi_{\text{new}} \log \frac{\pi_{\text{new}}}{\pi_{\text{old}}}
    + (1 - \pi_{\text{new}}) \log \frac{1 - \pi_{\text{new}}}{1 - \pi_{\text{old}}}`}
  ${tex.block`= (${p_new.toFixed(2)}) \log \frac{${p_new.toFixed(2)}}{${p_old.toFixed(2)}}
    + (1 - ${p_new.toFixed(2)}) \log \frac{${(1 - p_new).toFixed(2)}}{${(1 - p_old).toFixed(2)}}`}
  ${tex.block`= ${isNaN(kl) ? "\\text{undefined}" : kl.toFixed(3)}`}
  <p style="color: ${within_bound ? "green" : "red"};">
    ${within_bound ? "✅ Within trust region — update allowed" : "❌ Exceeds trust region — update rejected"}
  </p>
</div>`
Monotonic Improvement
To guarantee monotonic improvement, we need to find a way to extend conservative policy iteration to general stochastic policies rather than mixture policies.
To do this, we replace \(\alpha\) with a distance measure between \(\pi\) and \(\tilde{\pi}\) and change the constant \(\epsilon\) appropriately.
1. Bounding with KL Divergence
The distance measure we start from is the total variation divergence, which for discrete probability distributions \(p\) and \(q\) is defined as:
\[
D_{TV}(p||q) = \frac{1}{2} \sum_{i}|p_i - q_i|
\]
In particular, to remove the dependence on mixture policies, we measure the distance between two policies by the maximum total variation divergence over states:
\[
D^{\text{max}}_{TV}(\pi||\tilde{\pi}) = \max_s D_{TV}(\pi(\cdot | s) \,||\, \tilde{\pi}(\cdot | s))
\]
Let \(\alpha = D^{\text{max}}_{TV}(\pi||\tilde{\pi})\). Then the following bound holds:
\[
\eta(\pi_{\text{new}}) \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{4 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2
\]
where
\[
\epsilon = \max_{s,a} \left| A_\pi(s, a) \right|
\]
Since the total variation and KL divergences satisfy \(D_{TV}(p||q)^2 \leq D_{KL}(p||q)\), the same penalty can be expressed in terms of the KL divergence (the quantity visualized in the demo above), giving
\[
\eta(\pi_{\text{new}}) \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - C \, D^{\text{max}}_{KL}(\pi_{\text{old}}, \pi_{\text{new}}), \qquad C = \frac{4 \epsilon \gamma}{(1 - \gamma)^2}
\]
where \(D^{\text{max}}_{KL}(\pi_{\text{old}}, \pi_{\text{new}}) = \max_s D_{KL}(\pi_{\text{old}}(\cdot | s) \,||\, \pi_{\text{new}}(\cdot | s))\).
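As a quick numerical sanity check of the relation \(D_{TV}(p||q)^2 \leq D_{KL}(p||q)\) used above, the following sketch compares the two divergences for a pair of hypothetical three-action distributions:
// Compare total variation and KL divergence for two discrete distributions.
// The inequality D_TV(p||q)^2 <= D_KL(p||q) is what lets the penalty be stated in terms of KL.
function totalVariation(p, q) {
  return 0.5 * p.reduce((acc, pi, i) => acc + Math.abs(pi - q[i]), 0);
}

function klDivergence(p, q) {
  return p.reduce((acc, pi, i) => acc + (pi > 0 ? pi * Math.log(pi / q[i]) : 0), 0);
}

const p = [0.5, 0.3, 0.2]; // pi(.|s), hypothetical
const q = [0.4, 0.4, 0.2]; // pi_tilde(.|s), hypothetical

const tv = totalVariation(p, q);
const kl = klDivergence(p, q);
console.log(tv ** 2, kl, tv ** 2 <= kl); // squared TV should not exceed KL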
2. Surrogate Objectives
All that is left to do now is to maximize our surrogate objective \(L_{\theta_{\text{old}}}\) subject to the trust-region constraint on the KL divergence between the old and new policies:
\[
D_{\text{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot | s) \,\|\, \pi_{\theta}(\cdot | s) \right) \leq \delta \quad \text{for every } s
\]
Replacing this per-state constraint with the average KL divergence under \(\rho_{\theta_{\text{old}}}\), and writing the objective with importance sampling over actions drawn from a sampling distribution \(q\), leads to the following optimization problem:
\[
\begin{aligned}
\max_{\theta} \quad &
\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, \, a \sim q}
\left[
\frac{\pi_{\theta}(a|s)}{q(a|s)} Q_{\theta_{\text{old}}}(s,a)
\right] \\
\text{subject to} \quad &
\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}
\left[ D_{\text{KL}} \!\left(
\pi_{\theta_{\text{old}}}(\cdot|s) \,\|\, \pi_{\theta}(\cdot|s)
\right) \right] \leq \delta.
\end{aligned}
\]
The procedure for solving this optimization can be summarized as follows:
1. Collect data: sample state–action pairs \((s, a)\) together with Monte Carlo estimates of their Q-values.
2. Estimate the objective and constraint: use these samples to form empirical estimates of the surrogate objective and the average KL divergence (a small sketch follows this list).
3. Optimize under the constraint: approximately solve the constrained optimization problem to update the policy parameters \(\theta\), typically using the conjugate gradient method followed by a line search.
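The sketch below illustrates steps 1 and 2 for a toy discrete problem: policies are plain probability tables, the samples and Q-value estimates are invented, and the sampling distribution \(q\) is taken to be the old policy. The actual constrained update in TRPO (conjugate gradient plus line search) is not reproduced here.
// Toy sketch of steps 1 and 2: empirical estimates of the surrogate objective and the
// average KL constraint from sampled (state, action, Q-estimate) tuples.
// Policies are plain lookup tables pi[state][action]; all data below are invented.
function surrogateEstimate(samples, piTheta, q) {
  // (1/N) * sum_i  pi_theta(a_i|s_i) / q(a_i|s_i) * Qhat_i
  const total = samples.reduce(
    (acc, { s, a, qValue }) => acc + (piTheta[s][a] / q[s][a]) * qValue, 0);
  return total / samples.length;
}

function averageKL(samples, piOld, piTheta) {
  // (1/N) * sum_i  D_KL(pi_old(.|s_i) || pi_theta(.|s_i)), states weighted by visitation
  const klAtState = s => piOld[s].reduce(
    (acc, prob, a) => acc + (prob > 0 ? prob * Math.log(prob / piTheta[s][a]) : 0), 0);
  return samples.reduce((acc, { s }) => acc + klAtState(s), 0) / samples.length;
}

// Hypothetical data: 2 states, 2 actions, a handful of samples with invented Q estimates.
const piOld = [[0.7, 0.3], [0.5, 0.5]];     // current policy pi_theta_old
const piTheta = [[0.6, 0.4], [0.45, 0.55]]; // candidate update pi_theta
const q = piOld;                            // sampling distribution: actions drawn from the old policy
const samples = [
  { s: 0, a: 1, qValue: 1.2 },
  { s: 1, a: 0, qValue: 0.4 },
  { s: 0, a: 0, qValue: 0.9 }
];

const delta = 0.01;
const objective = surrogateEstimate(samples, piTheta, q);
const kl = averageKL(samples, piOld, piTheta);
console.log(objective, kl, kl <= delta ? "update allowed" : "update rejected");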