10.1 Trust Regions
VPG allowed us to work with an empirical estimate \(\hat{g}\) of the policy gradient:
\[ \hat{g} = \frac{1}{m} \sum^{m}_{i = 1} \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A_{t}|S_{t}, \theta) \, \hat{A}_{t} \]
We then update the policy's neural-network parameters \(\theta\) using stochastic gradient ascent (SGA):
\[ \theta_{t+1} = \theta_{t} + \alpha \hat{g} \]
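As a concrete reference point, here is a minimal PyTorch-style sketch of this update; the `policy` network (mapping states to action logits), its `optimizer`, and the advantage estimates `advantages` are assumptions for illustration, not part of the lecture's notation:

```python
import torch

def vpg_update(policy, optimizer, states, actions, advantages):
    """One stochastic-gradient-ascent step on the empirical VPG objective."""
    logits = policy(states)                        # (batch, num_actions)
    log_probs = torch.log_softmax(logits, dim=-1)  # log pi(a | s, theta)
    logp_taken = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Minimizing the negative of mean(log pi * A_hat) is ascent along g_hat
    # (up to the constant batch-size factor).
    loss = -(logp_taken * advantages).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```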
However, this update has several practical problems:
- Sample efficiency is poor; preferably we would like a batch form of the objective.
- The step size \(\alpha\) is hard to get right.
- Small changes in the parameter space \(\theta\) can create drastic changes in the log probabilities of actions.
Suppose you’re climbing a mountain to reach the top:
- \(S\) – Your current location on the mountain.
- \(A_{1,\dots,k}\) – Your next movement decision (e.g., reach left, step up, lean back).
- \(R\) – A reward based on progress upward (higher = better), but falls can yield large negative rewards.
In this setup:

- Without a harness (free solo climbing), bold moves may lead to huge falls.
- With a harness, you can explore safely: you can test new moves, but you won't fall too far.

This is the intuition behind trust regions: we constrain each policy update so that, however we explore, we never move too far from a policy we already trust.
Trust Regions
Consider an infinite-horizon MDP that generates trajectories
\[ s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \dots \]
1. Defining the Preliminary Notation
We define \(\eta(\pi)\), the expected discounted return when following a stochastic policy \(\pi\), as:
\[ \eta(\pi) = \mathbb{E}_{s_0, a_0, \dots} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \]
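In practice, \(\eta(\pi)\) is estimated by averaging discounted returns over sampled (truncated) trajectories; a minimal sketch, with the batch variable name purely illustrative:

```python
def discounted_return(rewards, gamma=0.99):
    """sum_t gamma^t * r_t for one sampled (finite, truncated) trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Averaging over many sampled trajectories gives a Monte Carlo estimate of eta(pi):
# eta_hat = sum(discounted_return(rs) for rs in batch_of_reward_lists) / len(batch_of_reward_lists)
```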
With minor tweaks, we can introduce a baseline by working with the standard advantage notation, \(A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)\), which measures how much better taking action \(a\) in state \(s\) is than the policy's baseline value \(V_\pi(s)\).
The following useful identity expresses the expected return of another policy \(\tilde{\pi}\) in terms of the advantage over \(\pi\), accumulated over timesteps:
\[ \eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \dots \sim \tilde{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t) \right] \]
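To see why this identity holds, write each advantage in terms of the value function, \(A_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma V_\pi(s_{t+1}) \right] - V_\pi(s_t)\), and note that the sum telescopes:
\[ \mathbb{E}_{s_0, a_0, \dots \sim \tilde{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t) \right] = \mathbb{E}_{s_0, a_0, \dots \sim \tilde{\pi}} \left[ -V_\pi(s_0) + \sum_{t=0}^{\infty} \gamma^t r_t \right] = -\eta(\pi) + \eta(\tilde{\pi}), \]
since both policies share the same start-state distribution, so \(\mathbb{E}_{s_0}[V_\pi(s_0)] = \eta(\pi)\). Rearranging gives the identity above.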
Additionally, let \(\rho_{\pi}(s)\) be the discounted state-visitation frequencies under \(\pi\):
\[ \rho_{\pi}(s) = P(s_{0} = s) + \gamma P(s_{1} = s) + \gamma^{2} P(s_{2} = s) + \dots \]
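For a small tabular MDP where the transition matrix under \(\pi\) is known, this geometric series can be summed in closed form, \(\rho_\pi = (I - \gamma P_\pi^\top)^{-1} \mu_0\), where \(\mu_0\) is the start-state distribution. A minimal NumPy sketch (all names illustrative):

```python
import numpy as np

def visitation_frequencies(P_pi, mu0, gamma=0.99):
    """rho_pi(s) = sum_t gamma^t P(s_t = s) for a tabular MDP.

    P_pi[s, s'] = P(s_{t+1} = s' | s_t = s) under pi; mu0 is the start-state
    distribution. The geometric series sums to (I - gamma * P_pi^T)^{-1} mu0.
    """
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi.T, mu0)
```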
Now we can rewrite the \(\eta(\tilde{\pi})\) identity as a sum over states instead of timesteps by introducing \(\rho_{\tilde{\pi}}(s)\):
\[ \eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a) \]
This equation tells us something important:
If we update our policy from \(\pi\) to \(\tilde{\pi}\) so that, at every state \(s\), the expected advantage is nonnegative (i.e., \(\sum_a \tilde{\pi}(a|s)A_\pi(s, a) \geq 0\)), then the new policy will perform at least as well as the old one. If the expected advantage is strictly positive anywhere, the new policy will do better.
This is the same idea as the Policy Improvement Theorem from Lecture 4.
However, in practice, things are not so simple. For approximate or stochastic policies, there may be some states where the expected advantage is negative due to estimation errors or function approximation. This makes it hard to guarantee improvement everywhere, and direct optimization becomes challenging.
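In the tabular setting this per-state check is a one-liner; a sketch, with `pi_tilde` and `A_pi` assumed to be \((S, A)\) arrays of action probabilities and (estimated) advantages:

```python
import numpy as np

def expected_advantage_per_state(pi_tilde, A_pi):
    """sum_a pi_tilde(a|s) A_pi(s, a), evaluated for every state s."""
    return np.sum(pi_tilde * A_pi, axis=1)   # shape (S,)
```

Improvement is guaranteed only when every entry of this vector is nonnegative; with sampled or approximated advantages, some entries will typically be negative, which is exactly the difficulty described above.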
2. Defining a Local Approximation to \(\eta\)
To make things simpler, let’s use an easier approximation for \(\eta\) that doesn’t depend on the new policy’s state visitation frequencies \(\rho_{\tilde{\pi}}(s)\).
We define a local approximation:
\[ L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) A_\pi(s, a) \]
This formula is a shortcut: instead of tracking how the new policy changes which states we visit, we just use the old policy’s frequencies.
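In the tabular setting, \(L_\pi(\tilde{\pi})\) is straightforward to evaluate once \(\rho_\pi\) and \(A_\pi\) are available; a sketch continuing the NumPy examples above (array shapes and names are illustrative):

```python
import numpy as np

def local_approximation(eta_pi, rho_pi, pi_tilde, A_pi):
    """L_pi(pi_tilde) = eta(pi) + sum_s rho_pi(s) sum_a pi_tilde(a|s) A_pi(s, a).

    rho_pi:   (S,)   discounted visitation frequencies of the *old* policy.
    pi_tilde: (S, A) action probabilities of the candidate policy.
    A_pi:     (S, A) advantages of the old policy.
    """
    return eta_pi + np.sum(rho_pi[:, None] * pi_tilde * A_pi)
```

The exact expression for \(\eta(\tilde{\pi})\) would use \(\rho_{\tilde{\pi}}\) in place of `rho_pi`; that substitution is the only approximation being made.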
Why is this useful? If we use a parameterized policy \(\pi_\theta\), where \(\theta\) are the parameters, this local approximation \(L\) matches the true return \(\eta\) at the current parameters:
\[ L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \]
and the gradients also match at \(\theta_0\):
\[ \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta) \Big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta) \Big|_{\theta = \theta_0}. \]
So, if we take a small step from \(\theta_0\) to a new \(\tilde{\theta}\) that improves \(L\), it will also improve the true return \(\eta\)—at least for small steps.
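Here is a quick numerical sanity check of this gradient-matching property on a randomly generated tabular MDP with a softmax policy (the whole setup below is an illustration under my own assumptions, not part of the lecture): the directional derivatives of \(L\) and \(\eta\) along a random direction agree at \(\theta_0\), up to finite-difference error.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def eta(theta, P, R, mu0, gamma):
    """Exact eta(pi_theta): P[s,a,s'] transitions, R[s,a] expected rewards."""
    pi = softmax(theta)
    P_pi = np.einsum('sa,sap->sp', pi, P)      # state-to-state transitions under pi
    r_pi = np.sum(pi * R, axis=1)              # expected one-step reward per state
    V = np.linalg.solve(np.eye(len(mu0)) - gamma * P_pi, r_pi)
    return mu0 @ V

def L_local(theta, theta0, P, R, mu0, gamma):
    """Local approximation L_{pi_theta0}(pi_theta) from the formula above."""
    pi0, pi = softmax(theta0), softmax(theta)
    P_pi0 = np.einsum('sa,sap->sp', pi0, P)
    r_pi0 = np.sum(pi0 * R, axis=1)
    V0 = np.linalg.solve(np.eye(len(mu0)) - gamma * P_pi0, r_pi0)
    A0 = R + gamma * np.einsum('sap,p->sa', P, V0) - V0[:, None]   # Q - V
    rho0 = np.linalg.solve(np.eye(len(mu0)) - gamma * P_pi0.T, mu0)
    return eta(theta0, P, R, mu0, gamma) + np.sum(rho0[:, None] * pi * A0)

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
mu0 = np.full(S, 1.0 / S)
theta0, d, eps = rng.normal(size=(S, A)), rng.normal(size=(S, A)), 1e-5

# Central finite differences of eta and L along direction d, evaluated at theta0.
d_eta = (eta(theta0 + eps * d, P, R, mu0, gamma)
         - eta(theta0 - eps * d, P, R, mu0, gamma)) / (2 * eps)
d_L = (L_local(theta0 + eps * d, theta0, P, R, mu0, gamma)
       - L_local(theta0 - eps * d, theta0, P, R, mu0, gamma)) / (2 * eps)
print(d_eta, d_L)   # these should agree to several decimal places
```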
3. Conservative Policy Iteration
Kakade and Langford’s key insight was a guaranteed lower bound on the performance of the policy after an update:
\[ \eta(\pi_{\text{new}}) \geq L_{\pi_{\text{old}}}(\pi_{\text{new}}) - \frac{2 \epsilon \gamma}{(1 - \gamma)^2} \alpha^2 \]
where
\[ \epsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(a|s)} \left[ A_\pi(s, a) \right] \right| \]
This means that after such an update, the new policy is guaranteed to perform at least as well as the right-hand side of the bound.
However, this guarantee only works for a special kind of policy update called a “mixture policy,” which is just a weighted average of your old policy and a new candidate policy:
\[ \pi_{\text{new}}(a|s) = (1 - \alpha)\,\pi_{\text{old}}(a|s) + \alpha\,\pi^{'}(a|s) \]
In other words, you don’t jump all the way to a new policy—you blend a little bit of the new policy into the old one, controlled by \(\alpha\).
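Putting these pieces together, here is a sketch of one conservative step and the corresponding guaranteed lower bound; all names are illustrative, and `L_new` is assumed to come from a `local_approximation`-style computation as sketched earlier:

```python
import numpy as np

def cpi_step(pi_old, pi_prime, A_pi, L_new, gamma, alpha):
    """One conservative policy iteration update and its improvement lower bound.

    pi_old, pi_prime: (S, A) action probabilities; A_pi: (S, A) advantages of
    pi_old; L_new: the local approximation L_{pi_old}(pi_new) for the mixture.
    """
    # Mixture policy: blend a fraction alpha of the candidate into the old policy.
    pi_new = (1.0 - alpha) * pi_old + alpha * pi_prime

    # epsilon = max_s | E_{a ~ pi'(.|s)} [ A_pi(s, a) ] |
    epsilon = np.max(np.abs(np.sum(pi_prime * A_pi, axis=1)))

    # Kakade & Langford's guaranteed lower bound on eta(pi_new).
    lower_bound = L_new - (2.0 * epsilon * gamma) / (1.0 - gamma) ** 2 * alpha ** 2
    return pi_new, lower_bound
```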