9.4 Vanilla Policy Gradient

What if you improved your strategy by following the slope — no tricks, no constraints, just raw feedback? 🧗‍♂️

Problem

Now we just need an algorithm that:

  • Leverages a neural network with parameters \(\theta\).
  • Leverages a classical Reinforcement Learning method (Monte Carlo).
  • Leverages our empirical estimate of the gradient \(\hat{g}\).
  • Empirically performs well (see the sketch after this list).
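
One algorithm that ticks all of these boxes is vanilla policy gradient (REINFORCE). Below is a minimal sketch, assuming PyTorch and Gymnasium with the CartPole-v1 environment purely for illustration; the network size, learning rate, discount factor, and episode count are arbitrary choices, not part of the method.

```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Policy network: here theta is simply the set of network weights.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99  # discount factor (illustrative choice)

for episode in range(200):
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        # Sample A_t ~ pi_theta(.|S_t) and record log pi_theta(A_t|S_t).
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Monte Carlo returns-to-go, standing in for Q^{pi_theta}(S_t, A_t).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # Ascend the empirical policy gradient g_hat by minimizing its negation:
    # loss = - sum_t log pi_theta(A_t|S_t) * G_t   (here m = 1 episode per update).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Subtracting a baseline from the returns (to form \(\hat{A}_t\)) and averaging over \(m > 1\) episodes per update are the usual refinements of this sketch.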
Question 🤔

Match the following concepts:

  • Likelihood Ratio: \(\frac{\nabla_{\theta} \pi_{\theta}(a|s)}{\pi_{\theta}(a|s)}\)
  • Score Function: \(\nabla_{\theta} \log \pi_{\theta}(a|s)\)
  • Policy Gradient: \(\mathbb{E}_{\pi_\theta} \left[\nabla_\theta \log \pi_{\theta}(a|s)\, Q^{\pi_\theta}(s, a)\right]\)
  • Empirical Estimate: \(\hat{g} = \frac{1}{m} \sum^{m}_{i = 1} \sum^{T-1}_{t=0} \nabla_\theta \log \pi_{\theta}(A^{(i)}_{t}|S^{(i)}_{t})\, \hat{A}^{(i)}_{t}\)
  • Baseline: \(b(s)\)
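
The last row deserves one extra step, included here for completeness: subtracting a state-dependent baseline \(b(s)\) from the return (which yields \(\hat{A}_t\)) does not bias the gradient estimate, because for any fixed state \(s\)

\[
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\!\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right]
= b(s) \sum_{a} \pi_\theta(a|s)\, \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}
= b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a|s)
= b(s)\, \nabla_\theta 1 = 0.
\]

The baseline therefore changes only the variance of \(\hat{g}\), not its expectation.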