10.3 Proximal Policy Optimization (PPO)

What if you could move fast — but with guardrails that keep you from tipping over? 🎛️

TRPO’s main drawback is the cost of computing the Hessian matrix of the KL divergence between the old and new policies:

\[ \mathbf{H} = \nabla^2 D_{KL}(\pi_{\theta_{t}} \| \pi_{\theta_{t+1}}) \]
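To get a feel for why this matters, here is a minimal sketch (assuming PyTorch; the toy categorical policy, parameter count, and helper name `kl_to_old` are illustrative assumptions, not part of TRPO itself). The Hessian is an \(n \times n\) matrix in the number of policy parameters \(n\), so materializing it — or repeatedly forming Hessian-vector products for a conjugate-gradient solve — quickly becomes expensive for large networks:

```python
import torch

# Toy categorical policy over 4 actions, parameterized by logits (n = 4 parameters).
# Real policy networks have millions of parameters, making the n x n Hessian below intractable.
old_logits = torch.tensor([0.1, 0.2, -0.3, 0.05])      # frozen "old" policy pi_{theta_t}
new_logits = old_logits.clone().requires_grad_(True)   # current policy pi_{theta_{t+1}}

def kl_to_old(logits):
    """D_KL(pi_old || pi_theta) for a single categorical distribution."""
    old_probs = torch.softmax(old_logits, dim=-1)
    old_log_probs = torch.log_softmax(old_logits, dim=-1)
    new_log_probs = torch.log_softmax(logits, dim=-1)
    return (old_probs * (old_log_probs - new_log_probs)).sum()

# The full Hessian: an n x n matrix -- fine for n = 4, hopeless for a deep policy network.
H = torch.autograd.functional.hessian(kl_to_old, new_logits)
print(H.shape)  # torch.Size([4, 4])
```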

⚠️ Problem

How can we design an algorithm that achieves stable policy updates like TRPO, but avoids the computational complexity of calculating the Hessian matrix? Could clipping the probability ratio be a simpler yet effective solution?
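As a preview of the answer, here is a minimal sketch of a clipped surrogate loss (assuming PyTorch; the function name `ppo_clip_loss`, the clipping constant `eps = 0.2`, and the sample numbers are illustrative assumptions, not a reference implementation). Instead of constraining the KL divergence, it clips the probability ratio between the new and old policies so that no single update can move too far:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Clipped surrogate objective: take the minimum of the unclipped and
    clipped ratio-weighted advantages, negated so an optimizer can minimize it."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Tiny usage example with made-up numbers.
new_lp = torch.tensor([-0.9, -1.1, -0.4])
old_lp = torch.tensor([-1.0, -1.0, -0.5])
adv = torch.tensor([0.5, -0.2, 1.3])
print(ppo_clip_loss(new_lp, old_lp, adv))
```

Because only first-order gradients of this loss are needed, the expensive Hessian computation disappears entirely.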