9.1 Policy Gradient Theorem
DQN is unstable and does not guarantee convergence.
Following deterministic or \(\epsilon\)-soft policies is not always optimal.
How can we leverage a neural network with parameters \(\theta\) to learn a policy \(\pi\) that is more adaptive than an \(\epsilon\)-soft policy?
Suppose you’re playing Rock-Paper-Scissors against an adaptive opponent:
- \(S\) – The current game round, including your opponent’s previous moves (i.e., context).
- \(A_{1,2,3}\) – You can choose between Rock, Paper, or Scissors.
- \(R\) – You receive a reward of +1 for a win, 0 for a tie, and -1 for a loss.
In this setup, following a deterministic or even \(\epsilon\)-soft policy can be exploited by an adaptive opponent who learns your behavior.
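A more robust alternative is to parameterize the action probabilities directly and sample from them. Below is a minimal sketch of such a stochastic policy, assuming a softmax over learnable logits and, for brevity, ignoring the opponent-history context \(S\); both are illustrative assumptions, not part of the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["Rock", "Paper", "Scissors"]

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# One learnable logit per action (theta); a context-free policy for brevity
theta = np.zeros(len(ACTIONS))

def sample_action(theta):
    probs = softmax(theta)                      # pi(a; theta)
    return rng.choice(len(ACTIONS), p=probs), probs

a, probs = sample_action(theta)
print(ACTIONS[a], probs)                        # uniform play while theta = 0
```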
Stochastic Gradient Ascent
Policy gradient algorithms search for a local maximum of \(V^{\pi_{\theta}}(S_{0})\) using stochastic gradient ascent (SGA):
\[ \Delta \theta = \alpha \nabla_{\theta} V(S_{0}; \theta) \]
Here \(S_{0}\) denotes the starting state of the episode. We analyze the value at the initial state because it represents the expected return of an entire episode, as opposed to the value at an arbitrary state \(s\).
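As a sketch, the SGA loop itself is tiny; `grad_estimator` below is a hypothetical callable standing in for the gradient estimate derived in the rest of this section.

```python
import numpy as np

def gradient_ascent(theta, grad_estimator, alpha=0.01, num_steps=1000):
    """Repeated SGA steps: theta <- theta + alpha * (estimate of grad V(S_0; theta))."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(num_steps):
        theta = theta + alpha * grad_estimator(theta)
    return theta
```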
Policy Gradient Theorem
Assume \(\pi\) is differentiable where it is non-zero.
Ideally, we want to compute the gradient \(\nabla_{\theta} V(S_{0}; \theta)\) analytically.
To derive something analytically means to find an exact mathematical expression for it, rather than estimating it through sampling or approximation.
\[ V(S_{0}; \theta) = \sum_{a} \pi(a|S_{0}; \theta) Q(S_{0},a; \theta) \]
Step 1: Express Value Function in Terms of Trajectories
The value function \(V(S_{0}; \theta)\) can also be expressed in terms of trajectories \(\tau\):
\[ \begin{align*} V(S_{0}; \theta) &= \sum_{a} \pi(a|S_{0}; \theta) Q(S_{0},a; \theta) \\ &= \sum_{\tau} \underbrace{P(\tau; \theta)}_{\text{Probability of Trajectory}} \ \underbrace{R(\tau)}_{\text{Reward of Trajectory}} \end{align*} \]
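Expanding \(Q(S_{0},a;\theta)\) recursively over future actions and transitions turns the sum over first actions into a sum over whole trajectories, so the value of the start state is an expectation of the trajectory reward under the distribution over trajectories induced by \(\pi_{\theta}\) and the dynamics:
\[ V(S_{0}; \theta) = \mathbb{E}_{\tau \sim P(\cdot \,; \theta)} \left[ R(\tau) \right] \]
This is the expectation we will approximate with samples in Step 3.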
Recall that a trajectory \(\tau\) is the tuple \(\tau = (S_{0}, A_{0}, R_{1}, \ldots, S_{T-1}, A_{T-1}, R_{T})\).
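As a small illustration, a trajectory can be stored as a list of per-step records, with \(R(\tau)\) taken to be the (undiscounted) sum of its rewards; the data layout here is an assumption for illustration only.

```python
from typing import List, NamedTuple

class Step(NamedTuple):
    state: int      # S_t
    action: int     # A_t
    reward: float   # R_{t+1}

def trajectory_return(tau: List[Step]) -> float:
    # R(tau): total reward collected along the trajectory
    return sum(step.reward for step in tau)

tau = [Step(0, 2, 1.0), Step(1, 0, -1.0), Step(2, 1, 0.0)]
print(trajectory_return(tau))   # 0.0
```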
Step 2: Leverage Likelihood Ratios
Now we can take the gradient with respect to neural network parameters \(\theta\) using likelihood ratios:
\[ \nabla_\theta V(\theta) = \sum_{\tau} P(\tau; \theta) R(\tau) \nabla_\theta \log P(\tau; \theta) \]
\[ \begin{align*} \nabla_\theta V(\theta) &= \nabla_\theta \sum_{\tau} P(\tau; \theta) R(\tau) \\ &= \sum_{\tau} \nabla_\theta P(\tau; \theta) R(\tau) \quad \gets \text{Swap } \sum \text{ and } \nabla \\ &= \sum_{\tau} R(\tau) \nabla_\theta P(\tau; \theta) \quad \gets \text{Switch the order of multiplication} \\ &= \sum_{\tau} R(\tau) \nabla_\theta P(\tau; \theta) \frac{P(\tau; \theta)}{P(\tau; \theta)} \quad \gets \text{Multiply by } 1 \\ &= \sum_{\tau} R(\tau) P(\tau; \theta) \nabla_\theta \log P(\tau; \theta) \quad \gets \text{Likelihood Ratio: } \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} = \nabla_\theta \log P(\tau; \theta) \\ &= \sum_{\tau} P(\tau; \theta) R(\tau) \nabla_\theta \log P(\tau; \theta) \end{align*} \]
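The likelihood-ratio identity is just the chain rule applied to \(\log\): \(\nabla_\theta \log P(\tau;\theta) = \nabla_\theta P(\tau;\theta) / P(\tau;\theta)\). A quick numerical sanity check on a toy one-parameter probability (the sigmoid parameterization is purely illustrative):

```python
import numpy as np

def p(theta):
    # Toy probability p(theta) = sigmoid(theta), used only to check the identity
    return 1.0 / (1.0 + np.exp(-theta))

theta, eps = 0.3, 1e-6
grad_p = (p(theta + eps) - p(theta - eps)) / (2 * eps)                      # finite-difference grad p
grad_log_p = (np.log(p(theta + eps)) - np.log(p(theta - eps))) / (2 * eps)  # finite-difference grad log p
print(grad_p / p(theta), grad_log_p)   # both ~0.4256, so grad p / p = grad log p
```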
Step 3: Approximate Empirical Gradient
We can approximate the expectation using an empirical estimate for \(m\) sample trajectories:
\[ \nabla_{\theta} V(\theta) \approx \hat{g} = \frac{1}{m} \sum^{m}_{i = 1} R(\tau^{i}) \nabla_{\theta} \log P(\tau^{i}; \theta) \]
The problem is that we do not necessarily know the trajectory distribution \(P(\tau^{i}; \theta)\), since it depends on the environment dynamics, so we cannot compute \(\nabla_\theta \log P(\tau^{i}; \theta)\) directly.
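Still, the averaging step itself is straightforward once the per-trajectory gradients are available; the sketch below assumes \(\nabla_\theta \log P(\tau^{i};\theta)\) has already been computed (Step 4 shows how to do so without knowing the dynamics), and the function name and array layout are illustrative.

```python
import numpy as np

def empirical_policy_gradient(returns, grad_log_probs):
    """g_hat = (1/m) * sum_i R(tau^i) * grad_theta log P(tau^i; theta).

    returns:        shape (m,)   -- R(tau^i) for each of the m sampled trajectories
    grad_log_probs: shape (m, d) -- grad_theta log P(tau^i; theta), see Step 4
    """
    returns = np.asarray(returns, dtype=float)
    grad_log_probs = np.asarray(grad_log_probs, dtype=float)
    return (returns[:, None] * grad_log_probs).mean(axis=0)
```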
Step 4: Decompose the Trajectory Probability
Luckily, the log-probability of a trajectory decomposes into a sum over time steps, and every term involving the dynamics drops out:
\[ \nabla_\theta \log P(\tau^{i}; \theta) = \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A_{t}|S_{t}; \theta) \]
\[ \begin{align*} \nabla_\theta \log P(\tau^{i}; \theta) &= \nabla_\theta \log \Big[\underbrace{\mu(S_{0})}_{\text{Initial State}} \prod^{T-1}_{t=0} \underbrace{\pi(A_{t}|S_{t};\theta)}_{\text{Policy}} \underbrace{P(S_{t+1}|S_{t},A_{t})}_{\text{Dynamics}}\Big] \\ &= \nabla_\theta \Big[\log \mu(S_{0}) + \sum^{T-1}_{t=0} \log \pi(A_{t}|S_{t};\theta) + \sum^{T-1}_{t=0} \log P(S_{t+1}|S_{t},A_{t})\Big] \quad \gets \text{Distribute } \log \\ &= \underbrace{\nabla_\theta \log \mu(S_{0})}_{=0} + \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A_{t}|S_{t};\theta) + \underbrace{\sum^{T-1}_{t=0} \nabla_\theta \log P(S_{t+1}|S_{t},A_{t})}_{=0} \\ &= \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A_{t}|S_{t}; \theta) \end{align*} \]
We arrive at a term that does not depend on the dynamics; \(\nabla_\theta \log \pi(A_{t}|S_{t};\theta)\) is called the score function.
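For a concrete policy class the score function has a closed form. Here is a sketch for a softmax policy over discrete actions, matching the earlier Rock-Paper-Scissors example (context-free and softmax-parameterized by assumption):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, a):
    # Score function of a softmax policy over discrete actions:
    # grad_theta log pi(a; theta) = one_hot(a) - softmax(theta)
    score = -softmax(theta)
    score[a] += 1.0
    return score

def grad_log_P_tau(theta, actions):
    # grad_theta log P(tau; theta) = sum_t grad_theta log pi(A_t | S_t; theta);
    # the initial-state and dynamics terms vanish, as derived above.
    return sum(grad_log_pi(theta, a) for a in actions)
```

The vector returned by `grad_log_P_tau` for each sampled trajectory is exactly the \(\nabla_\theta \log P(\tau^{i};\theta)\) term the Step 3 estimator needs.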
Let \(\pi(a|s;\theta)\) be a differentiable policy. The gradient of the expected reward \(F(\theta)\) (the objective we have been writing as \(V(S_{0};\theta)\)) with respect to the policy parameters \(\theta\) is given by:
\[ \nabla_\theta F(\theta) = \mathbb{E}_{\pi_\theta} \left[\nabla_\theta \log \pi(a|s;\theta) Q^{\pi_\theta}(s, a)\right] \]
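Combining Steps 3 and 4, the gradient can be estimated from sampled trajectories and the score function alone:
\[ \nabla_\theta F(\theta) \approx \hat{g} = \frac{1}{m} \sum^{m}_{i=1} R(\tau^{i}) \sum^{T-1}_{t=0} \nabla_\theta \log \pi(A^{i}_{t}|S^{i}_{t}; \theta) \]
Plugging \(\hat{g}\) into the SGA update \(\Delta \theta = \alpha \hat{g}\) yields a basic policy gradient algorithm.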