10.2 Proximal Policy Optimization (PPO)
Mathematical Intuition
TRPO’s Main Drawback
TRPO’s main drawback is computational: each update requires second-order information about the KL divergence between the old and the new policy, namely its Hessian with respect to the policy parameters:
\[ \mathbf{H} = \nabla^2_{\theta_{t+1}} D_{KL}(\pi_{\theta_{t}} \| \pi_{\theta_{t+1}}) \]
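In practice TRPO never builds this Hessian explicitly: it uses Hessian-vector products, obtained by a second backward pass, inside a conjugate-gradient solver, and this repeated second-order computation is what makes each update expensive. A minimal sketch of one such Hessian-vector product, assuming PyTorch and a toy diagonal-Gaussian KL (the 2-D parameterization and the function name kl_diag_gaussian are illustrative only, not from the slides):

```python
import torch

# Toy parameter vector theta = (mu_1, mu_2, log_std_1, log_std_2); illustrative only.
theta = torch.randn(4, requires_grad=True)
theta_old = theta.detach().clone()

def kl_diag_gaussian(theta_new, theta_fixed):
    """Closed-form KL(pi_old || pi_new) for two diagonal Gaussians."""
    mu, log_std = theta_new[:2], theta_new[2:]
    mu_old, log_std_old = theta_fixed[:2], theta_fixed[2:]
    var, var_old = (2 * log_std).exp(), (2 * log_std_old).exp()
    return (log_std - log_std_old
            + (var_old + (mu_old - mu) ** 2) / (2 * var) - 0.5).sum()

kl = kl_diag_gaussian(theta, theta_old)
grad_kl = torch.autograd.grad(kl, theta, create_graph=True)[0]  # keep graph for a 2nd pass

v = torch.randn_like(theta)                       # arbitrary direction, e.g. from CG
hvp = torch.autograd.grad(grad_kl @ v, theta)[0]  # H v via one extra backward pass
```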
Proximal Policy Optimization (PPO)
Schulman et al., OpenAI, 2017
Research paper: Proximal Policy Optimization Algorithms (arXiv:1707.06347)
PPO: Surrogate Objectives
PPO KL-Divergence Penalty:
\[ L^{\text{KL}}(\pi_{\theta_{t+1}}) = \mathbb{E} \left[ \frac{\pi_{\theta_{t+1}}(a|s)}{\pi_{\theta_{t}}(a|s)} \hat{A}_t - \beta D_{KL}(\pi_{\theta_{t}} \| \pi_{\theta_{t+1}}) \right] \]
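Read as code, the penalty objective combines the importance-weighted advantage with a KL regularizer. A minimal sketch, assuming PyTorch tensors of stored log-probabilities and a fixed beta (the paper adapts beta over training; the per-batch KL below is the usual sample-based estimate of the KL between the old and new policy):

```python
import torch

def ppo_kl_penalty_loss(new_logp, old_logp, advantages, beta=0.01):
    """Negative of L^KL for a batch of (s, a) pairs sampled from the old policy."""
    ratio = (new_logp - old_logp).exp()          # pi_new(a|s) / pi_old(a|s)
    kl = (old_logp - new_logp).mean()            # sample estimate of D_KL(pi_old || pi_new)
    surrogate = (ratio * advantages).mean()
    return -(surrogate - beta * kl)              # negate: optimizers minimize
```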
PPO Clip:
\[ L^{\text{CLIP}}(\pi_{\theta_{t+1}}) = \mathbb{E} \left[ \min \left( \frac{\pi_{\theta_{t+1}}(a|s)}{\pi_{\theta_{t}}(a|s)} \hat{A}_t, \text{clip}\left(\frac{\pi_{\theta_{t+1}}(a|s)}{\pi_{\theta_{t}}(a|s)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right) \right] \]
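The clipped objective maps almost line-for-line onto code. A minimal sketch with the same inputs as the penalty version above (function and argument names are illustrative):

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Negative of L^CLIP for a batch of (s, a) pairs sampled from the old policy."""
    ratio = (new_logp - old_logp).exp()                        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # pessimistic bound, negated
```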
Gaussian Policy
\[ \begin{aligned} \mu(s), \sigma(s) &= \theta_{\mu, \sigma}(s) \\ \pi(a|s; \theta) &= \frac{1}{\sqrt{2 \pi \sigma^2(s)}} \exp\left(-\frac{(a - \mu(s))^2}{2 \sigma^2(s)}\right) \end{aligned} \]
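A state-conditional Gaussian policy of this form can be written directly with torch.distributions.Normal. The two-layer network below is only an illustrative architecture; the slides prescribe the distribution, not the network:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """theta(s) -> (mu(s), sigma(s)); actions are sampled from N(mu(s), sigma(s)^2)."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mu_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)

    def forward(self, s):
        h = self.backbone(s)
        mu = self.mu_head(h)
        sigma = self.log_std_head(h).exp()       # exponentiate so that sigma(s) > 0
        return torch.distributions.Normal(mu, sigma)

# Usage: sample an action and get log pi(a|s) for the PPO ratio.
policy = GaussianPolicy(obs_dim=3, act_dim=1)
dist = policy(torch.randn(1, 3))
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)
```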
Proximal Policy Optimization: Illustration
Proximal Policy Optimization: Pseudocode
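In outline, one PPO-Clip iteration collects trajectories with the current policy, estimates advantages, and then runs several epochs of minibatch gradient steps on the clipped surrogate. The sketch below is pseudocode in Python syntax; collect_rollout, compute_gae, minibatches, value_fn, and the hyperparameters are hypothetical placeholders, and ppo_clip_loss refers to the sketch above:

```python
# Pseudocode for PPO-Clip (helper names are hypothetical placeholders).
for iteration in range(num_iterations):
    # 1. Collect trajectories with the current ("old") policy.
    states, actions, rewards, old_logp = collect_rollout(policy, env, horizon)

    # 2. Estimate advantages A_hat (e.g. with GAE) and returns.
    advantages, returns = compute_gae(rewards, states, value_fn, gamma, lam)

    # 3. Several epochs of minibatch SGD on the clipped surrogate.
    for epoch in range(num_epochs):
        for batch in minibatches(states, actions, old_logp, advantages, returns):
            new_logp = policy(batch.states).log_prob(batch.actions).sum(-1)
            policy_loss = ppo_clip_loss(new_logp, batch.old_logp, batch.advantages)
            value_loss = ((value_fn(batch.states) - batch.returns) ** 2).mean()
            loss = policy_loss + value_coef * value_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```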
Exercise
What is the key mathematical difference between the true policy gradient and a surrogate policy gradient in reinforcement learning?
\[ L(\pi_{\theta_{t+1}}) = F(\pi_{\theta_{t}}) + \mathbb{E}_{s \sim \rho_{\pi_{\theta_{t}}}, a \sim \pi_{\theta_{t}}} \left[\frac{\pi_{\theta_{t+1}}(a|s)}{\pi_{\theta_{t}}(a|s)} \hat{A}_t\right] \]
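As a hint: differentiating the surrogate at the old parameters turns the likelihood ratio into the score function and recovers the true policy gradient; the two objectives only differ away from the old policy, because the surrogate keeps sampling states and actions from the old policy.
\[ \nabla_{\theta_{t+1}} L(\pi_{\theta_{t+1}}) \Big|_{\theta_{t+1} = \theta_{t}} = \mathbb{E}_{s \sim \rho_{\pi_{\theta_{t}}}, a \sim \pi_{\theta_{t}}} \left[ \nabla_{\theta} \log \pi_{\theta}(a|s) \Big|_{\theta = \theta_{t}} \hat{A}_t \right] = \nabla_{\theta_{t+1}} F(\pi_{\theta_{t+1}}) \Big|_{\theta_{t+1} = \theta_{t}} \]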