9.4 Vanilla Policy Gradient
- ✅ We calculated the policy gradient analytically.
- ✅ We addressed the problem of sparse rewards.
- ✅ We know how to select actions in discrete and continuous action spaces.
Problem
Now we just need an algorithm that:
- Leverages neural networks with parameters \(\theta\).
- Leverages a classical Reinforcement Learning method (Monte Carlo).
- Leverages our empirical estimate of the gradient \(\hat{g}\).
- Empirically performs well.
Solution
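Below is a minimal sketch of vanilla policy gradient (REINFORCE with a baseline) that combines all four ingredients. The environment (gymnasium's CartPole-v1), the network size, and the hyperparameters are illustrative assumptions, not prescribed by the text: Monte Carlo returns supply the estimate of \(Q^{\pi_\theta}\), the score-function surrogate loss yields \(\hat{g}\) via autodiff, and the mean return serves as the baseline \(b\).

```python
# A minimal sketch of vanilla policy gradient (REINFORCE with a baseline).
# Environment, architecture, and hyperparameters are illustrative assumptions.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Policy network pi_theta: maps states to action logits (discrete actions).
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for epoch in range(50):
    log_probs, returns = [], []
    for _ in range(10):                          # m trajectories per gradient estimate
        obs, _ = env.reset()
        rewards, traj_log_probs = [], []
        done = False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            traj_log_probs.append(dist.log_prob(action))   # log pi_theta(A_t | S_t)
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(reward)
            done = terminated or truncated
        # Monte Carlo returns G_t (reward-to-go), computed backwards through the episode.
        G, traj_returns = 0.0, []
        for r in reversed(rewards):
            G = r + gamma * G
            traj_returns.append(G)
        traj_returns.reverse()
        log_probs.extend(traj_log_probs)
        returns.extend(traj_returns)

    returns = torch.as_tensor(returns, dtype=torch.float32)
    # Baseline b: the mean return across the batch (state-independent here).
    advantages = returns - returns.mean()
    # Surrogate loss whose gradient is -g_hat, the empirical policy gradient estimate.
    loss = -(torch.stack(log_probs) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Subtracting the mean return acts as the baseline \(b\): it lowers the variance of \(\hat{g}\) without changing its expectation, since \(\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)\, b] = 0\).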
Question 🤔
Match the following concepts:
| Concept | Notation |
|---|---|
| Likelihood Ratio | \(\frac{\nabla_{\theta} \pi_{\theta}(a|s)}{\pi_{\theta}(a|s)}\) |
| Score Function | \(\nabla_{\theta} \log \pi_{\theta}(a|s)\) |
| Policy Gradient | \(\mathbb{E}_{\pi_\theta} \left[\nabla_\theta \log \pi(a|s;\theta) Q^{\pi_\theta}(s, a)\right]\) |
| Empirical Estimate | \(\frac{1}{m} \sum^{m}_{i = 1} \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A^{(i)}_{t}|S^{(i)}_{t}; \theta)\, \hat{A}^{(i)}_{t}\) |
| Baseline | \(b(s)\) |
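As a sanity check for the matching above, the likelihood ratio and the score function are the same quantity by the chain rule for the logarithm:

\[
\nabla_{\theta} \log \pi_{\theta}(a|s) = \frac{\nabla_{\theta} \pi_{\theta}(a|s)}{\pi_{\theta}(a|s)}.
\]

This identity is what turns the gradient of an expectation into an expectation of a gradient, so the policy gradient can be sampled from trajectories, which is exactly what the empirical estimate \(\hat{g}\) does.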