9.1 Policy Gradient Theorem
DQN is unstable and does not guarantee convergence.
Following deterministic or \(\epsilon\)-soft policies is not always optimal.
How can we leverage a neural network with parameters \(\theta\) to learn a policy \(\pi\) that is more adaptive than an \(\epsilon\)-soft policy?
Suppose you’re playing Rock-Paper-Scissors against an adaptive opponent:
- \(S\) – The current game round, including your opponent’s previous moves (i.e., context).
- \(A_{1,2,3}\) – You can choose between Rock, Paper, or Scissors.
- \(R\) – You receive a reward of +1 for a win, 0 for a tie, and -1 for a loss.
In this setup, following a deterministic or even \(\epsilon\)-soft policy can be exploited by an adaptive opponent who learns your behavior.
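A more robust alternative is to parameterize the action probabilities directly and sample from them. Below is a minimal sketch of such a stochastic policy, assuming a softmax over learnable logits and, for brevity, ignoring the opponent-history context \(S\); both are illustrative assumptions, not part of the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["Rock", "Paper", "Scissors"]

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# One learnable logit per action (theta); a context-free policy for brevity
theta = np.zeros(len(ACTIONS))

def sample_action(theta):
    probs = softmax(theta)                      # pi(a; theta)
    return rng.choice(len(ACTIONS), p=probs), probs

a, probs = sample_action(theta)
print(ACTIONS[a], probs)                        # uniform play while theta = 0
```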
Stochastic Gradient Ascent
Policy gradient algorithms search for a local maximum of \(V^{\pi_{\theta}}(S_{0})\) using stochastic gradient ascent (SGA):
\[ \Delta \theta = \alpha \nabla_{\theta} V(S_{0}; \theta) \]
Here \(S_{0}\) denotes the starting state of the episode. We analyze the value at the initial state because it represents the expected return of an entire episode, as opposed to the value at an arbitrary state \(s\).
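As a sketch, the SGA loop itself is tiny; `grad_estimator` below is a hypothetical callable standing in for the gradient estimate derived in the rest of this section.

```python
import numpy as np

def gradient_ascent(theta, grad_estimator, alpha=0.01, num_steps=1000):
    """Repeated SGA steps: theta <- theta + alpha * (estimate of grad V(S_0; theta))."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(num_steps):
        theta = theta + alpha * grad_estimator(theta)
    return theta
```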
Policy Gradient Theorem
Assume \(\pi\) is differentiable where it is non-zero.
Ideally, we want to compute the gradient \(\nabla_{\theta} V(S_{0}; \theta)\) analytically.
To derive something analytically means to find an exact mathematical expression for it, rather than estimating it through sampling or approximation.
\[ V(S_{0}; \theta) = \sum_{a} \pi(a|S_{0}; \theta) Q(S_{0},a; \theta) \]
Step 1: Express Value Function in Terms of Trajectories
The value function \(V(S_{0}; \theta)\) can also be expressed in terms of trajectories \(\tau\):
\[ \begin{align*} V(S_{0}; \theta) &= \sum_{a} \pi(a|S_{0}; \theta) Q(S_{0},a; \theta) \\ &= \sum_{\tau} \underbrace{P(\tau; \theta)}_{\text{Probability of Trajectory}} \ \underbrace{R(\tau)}_{\text{Reward of Trajectory}} \end{align*} \]
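Expanding \(Q(S_{0},a;\theta)\) recursively over future actions and transitions turns the sum over first actions into a sum over whole trajectories, so the value of the start state is an expectation of the trajectory reward under the distribution over trajectories induced by \(\pi_{\theta}\) and the dynamics:
\[ V(S_{0}; \theta) = \mathbb{E}_{\tau \sim P(\cdot \,; \theta)} \left[ R(\tau) \right] \]
This is the expectation we will approximate with samples in Step 3.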
Recall that a trajectory \(\tau\) is the tuple \(\tau = (S_{0}, A_{0}, R_{1}, \ldots, S_{T-1}, A_{T-1}, R_{T})\).
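As a small illustration, a trajectory can be stored as a list of per-step records, with \(R(\tau)\) taken to be the (undiscounted) sum of its rewards; the data layout here is an assumption for illustration only.

```python
from typing import List, NamedTuple

class Step(NamedTuple):
    state: int      # S_t
    action: int     # A_t
    reward: float   # R_{t+1}

def trajectory_return(tau: List[Step]) -> float:
    # R(tau): total reward collected along the trajectory
    return sum(step.reward for step in tau)

tau = [Step(0, 2, 1.0), Step(1, 0, -1.0), Step(2, 1, 0.0)]
print(trajectory_return(tau))   # 0.0
```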
Step 2: Leverage Likelihood Ratios
Now we can take the gradient with respect to neural network parameters \(\theta\) using likelihood ratios:
\[ \nabla_\theta V(\theta) = \sum_{\tau} P(\tau; \theta) R(\tau) \nabla_\theta \log P(\tau; \theta) \]
\[ \begin{align*} \nabla_\theta V(\theta) &= \nabla_\theta \sum_{\tau} P(\tau; \theta) R(\tau) \\ &= \sum_{\tau} \nabla_\theta P(\tau; \theta) R(\tau) \quad \gets \text{Swap } \sum \text{ and } \nabla \\ &= \sum_{\tau} R(\tau) \nabla_\theta P(\tau; \theta) \quad \gets \text{Switch the order of multiplication} \\ &= \sum_{\tau} R(\tau) \nabla_\theta P(\tau; \theta) \frac{P(\tau; \theta)}{P(\tau; \theta)} \quad \gets \text{Multiply by } 1 \\ &= \sum_{\tau} R(\tau) P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) \quad \gets \text{Likelihood Ratio: } \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} = \nabla_\theta \log P(\tau; \theta) \\ &= \sum_{\tau} P(\tau; \theta) R(\tau) \nabla_\theta \log P(\tau; \theta) \end{align*} \]
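The likelihood-ratio identity is just the chain rule applied to \(\log\): \(\nabla_\theta \log P(\tau;\theta) = \nabla_\theta P(\tau;\theta) / P(\tau;\theta)\). A quick numerical sanity check on a toy one-parameter probability (the sigmoid parameterization is purely illustrative):

```python
import numpy as np

def p(theta):
    # Toy probability p(theta) = sigmoid(theta), used only to check the identity
    return 1.0 / (1.0 + np.exp(-theta))

theta, eps = 0.3, 1e-6
grad_p = (p(theta + eps) - p(theta - eps)) / (2 * eps)                      # finite-difference grad p
grad_log_p = (np.log(p(theta + eps)) - np.log(p(theta - eps))) / (2 * eps)  # finite-difference grad log p
print(grad_p / p(theta), grad_log_p)   # both ~0.4256, so grad p / p = grad log p
```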
Step 3: Approximate Empirical Gradient
We can approximate the expectation using an empirical estimate for \(m\) sample trajectories:
\[ \nabla_{\theta} V(\theta) \approx \hat{g} = \frac{1}{m} \sum^{m}_{i = 1} R(\tau^{i}) \nabla_{\theta} \log P(\tau^{i}; \theta) \]
The problem is that we do not necessarily know the trajectory distribution \(P(\tau^{i}; \theta)\), since it depends on the environment dynamics, so we cannot compute \(\nabla_\theta \log P(\tau^{i}; \theta)\) directly.
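Still, the averaging step itself is straightforward once the per-trajectory gradients are available; the sketch below assumes \(\nabla_\theta \log P(\tau^{i};\theta)\) has already been computed (Step 4 shows how to do so without knowing the dynamics), and the function name and array layout are illustrative.

```python
import numpy as np

def empirical_policy_gradient(returns, grad_log_probs):
    """g_hat = (1/m) * sum_i R(tau^i) * grad_theta log P(tau^i; theta).

    returns:        shape (m,)   -- R(tau^i) for each of the m sampled trajectories
    grad_log_probs: shape (m, d) -- grad_theta log P(tau^i; theta), see Step 4
    """
    returns = np.asarray(returns, dtype=float)
    grad_log_probs = np.asarray(grad_log_probs, dtype=float)
    return (returns[:, None] * grad_log_probs).mean(axis=0)
```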
Step 4: Decompose the Trajectory Probability
Luckily, the log-probability of a trajectory decomposes into a sum over time steps, and every term involving the dynamics drops out:
\[ \nabla_\theta \log P(\tau^{i}; \theta) = \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A_{t}|S_{t}; \theta) \]
\[ \begin{align*} \nabla_\theta \log P(\tau^{i}; \theta) &= \nabla_\theta \log \Big[\underbrace{\mu(S_{0})}_{\text{Initial State}} \prod^{T-1}_{t=0} \underbrace{\pi(A_{t}|S_{t};\theta)}_{\text{Policy}} \underbrace{P(S_{t+1}|S_{t},A_{t})}_{\text{Dynamics}}\Big] \\ &= \nabla_\theta \Big[\log \mu(S_{0}) + \sum^{T-1}_{t=0} \log \pi(A_{t}|S_{t};\theta) + \sum^{T-1}_{t=0} \log P(S_{t+1}|S_{t},A_{t})\Big] \quad \gets \text{Distribute } \log \\ &= \underbrace{\nabla_\theta \log \mu(S_{0})}_{=0} + \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A_{t}|S_{t};\theta) + \underbrace{\sum^{T-1}_{t=0} \nabla_\theta \log P(S_{t+1}|S_{t},A_{t})}_{=0} \\ &= \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A_{t}|S_{t}; \theta) \end{align*} \]
We arrive at a term that does not depend on the dynamics; \(\nabla_\theta \log \pi(A_{t}|S_{t};\theta)\) is called the score function.
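For a concrete policy class the score function has a closed form. Here is a sketch for a softmax policy over discrete actions, matching the earlier Rock-Paper-Scissors example (context-free and softmax-parameterized by assumption):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, a):
    # Score function of a softmax policy over discrete actions:
    # grad_theta log pi(a; theta) = one_hot(a) - softmax(theta)
    score = -softmax(theta)
    score[a] += 1.0
    return score

def grad_log_P_tau(theta, actions):
    # grad_theta log P(tau; theta) = sum_t grad_theta log pi(A_t | S_t; theta);
    # the initial-state and dynamics terms vanish, as derived above.
    return sum(grad_log_pi(theta, a) for a in actions)
```

The vector returned by `grad_log_P_tau` for each sampled trajectory is exactly the \(\nabla_\theta \log P(\tau^{i};\theta)\) term the Step 3 estimator needs.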
Let \(\pi(a|s;\theta)\) be a differentiable policy. The gradient of the expected reward \(F(\theta)\) (the objective we have been writing as \(V(S_{0};\theta)\)) with respect to the policy parameters \(\theta\) is given by:
\[ \nabla_\theta F(\theta) = \mathbb{E}_{\pi_\theta} \left[\nabla_\theta \log \pi(a|s;\theta) Q^{\pi_\theta}(s, a)\right] \]
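Combining Steps 3 and 4, the gradient can be estimated from sampled trajectories and the score function alone:
\[ \nabla_\theta F(\theta) \approx \hat{g} = \frac{1}{m} \sum^{m}_{i=1} R(\tau^{i}) \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A^{i}_{t}|S^{i}_{t}; \theta) \]
Plugging \(\hat{g}\) into the SGA update \(\Delta \theta = \alpha \hat{g}\) yields a basic policy gradient algorithm.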