9.4 Vanilla Policy Gradient

What if you improved your strategy by following the slope — no tricks, no constraints, just raw feedback? 🧗‍♂️

Problem

Now we just need an algorithm that:

  • Leverages a neural network with parameters \(\theta\).
  • Leverages a classical Reinforcement Learning method (Monte Carlo).
  • Leverages our empirical estimate of the gradient \(\hat{g}\).
  • Empirically performs well (see the sketch after this list).
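
One algorithm that ticks all of these boxes is vanilla policy gradient (REINFORCE). Below is a minimal sketch, assuming PyTorch and Gymnasium with the CartPole-v1 environment purely for illustration; the network size, learning rate, discount factor, and episode count are arbitrary choices, not part of the method.

```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Policy network: here theta is simply the set of network weights.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99  # discount factor (illustrative choice)

for episode in range(200):
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        # Sample A_t ~ pi_theta(.|S_t) and record log pi_theta(A_t|S_t).
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(float(reward))
        done = terminated or truncated

    # Monte Carlo returns-to-go, standing in for Q^{pi_theta}(S_t, A_t).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # Ascend the empirical policy gradient g_hat by minimizing its negation:
    # loss = - sum_t log pi_theta(A_t|S_t) * G_t   (here m = 1 episode per update).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Subtracting a baseline from the returns (to form \(\hat{A}_t\)) and averaging over \(m > 1\) episodes per update are the usual refinements of this sketch.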
Question 🤔

Match the following concepts:

  • Likelihood Ratio: \(\frac{\nabla_{\theta} \pi_{\theta}(a|s)}{\pi_{\theta}(a|s)}\)
  • Score Function: \(\nabla_{\theta} \log \pi_{\theta}(a|s)\)
  • Policy Gradient: \(\mathbb{E}_{\pi_\theta} \left[\nabla_\theta \log \pi_{\theta}(a|s)\, Q^{\pi_\theta}(s, a)\right]\)
  • Empirical Estimate: \(\hat{g} = \frac{1}{m} \sum^{m}_{i = 1} \sum^{T-1}_{t=0} \nabla_\theta \log \pi_{\theta}(A^{(i)}_{t}|S^{(i)}_{t})\, \hat{A}^{(i)}_{t}\)
  • Baseline: \(b(s)\)
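
The last row deserves one extra step, included here for completeness: subtracting a state-dependent baseline \(b(s)\) from the return (which yields \(\hat{A}_t\)) does not bias the gradient estimate, because for any fixed state \(s\)

\[
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\!\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right]
= b(s) \sum_{a} \pi_\theta(a|s)\, \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}
= b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a|s)
= b(s)\, \nabla_\theta 1 = 0.
\]

The baseline therefore changes only the variance of \(\hat{g}\), not its expectation.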