4.3 Iterative Policy Evaluation
A Markov Decision Process (MDP) is a framework for an associative environment: a model of controlled transitions between fully observable states.
How can we find an optimal policy \(\pi_{*}\), assuming that we have a perfect model of the state transitions \(P(s', r \mid s, a)\)?
Dynamic Programming is a collection of algorithms that can be used to compute optimal policies \(\pi_{*}\) in tabular state spaces.
These algorithms have limited utility in Reinforcement Learning due to:
- Assumption of a perfect model: All state transitions \(P(s', r \mid s, a)\) are known in advance.
- Computational expense: Dynamic Programming typically requires full sweeps over the state space \(\forall s \in S\), which is only feasible in small, tabular environments (a minimal sweep is sketched below).
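
To make the full-sweep requirement concrete, here is a minimal sketch of iterative policy evaluation for a tabular MDP. The data structures are illustrative assumptions, not a fixed API: `P[(s, a)]` is taken to be a list of `(prob, next_state, reward)` tuples representing \(P(s', r \mid s, a)\), and `policy[s][a]` the probability \(\pi(a \mid s)\).

```python
def iterative_policy_evaluation(states, actions, P, policy, gamma=0.9, theta=1e-6):
    """Estimate v_pi for a tabular MDP with a known model.

    Assumed (illustrative) formats:
      P[(s, a)]    -> list of (prob, next_state, reward) tuples
      policy[s][a] -> probability of taking action a in state s
    """
    V = {s: 0.0 for s in states}           # initialize v(s) arbitrarily (here: 0)
    while True:
        delta = 0.0
        for s in states:                    # full sweep over the state space
            v_new = 0.0
            for a in actions:
                for prob, s_next, r in P[(s, a)]:
                    # expected return of taking a in s, then following pi
                    v_new += policy[s][a] * prob * (r + gamma * V[s_next])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                    # in-place (Gauss-Seidel style) update
        if delta < theta:                   # stop when the largest update is negligible
            return V
```

The in-place update shown here reuses freshly computed values within the same sweep; a two-array variant that only uses values from the previous sweep is equally valid and typically converges a little more slowly.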
We know how good it is to follow the current policy from \(s\), that is \(v_{\pi}(s)\), but would it be better or worse to change to a new policy \(\pi^{'}\)?
Imagine your commute to work every day:
- \(S\): the location you're currently in (e.g., your home or a traffic junction). More generally, \(S_{1,...,k}\) can represent multiple possible locations.
- \(A_{1,...,k}\): the route you choose (e.g., highway, back streets, scenic route, parkway, or alternate street).
- \(R\): your reward could be getting to work quickly, stress-free, or on time.




Suppose your usual route always goes through the side streets, and on average it takes 35 minutes to reach work. One day, you try a different route and notice it only takes 30 minutes, even though it looked longer on the map.
How could you systematically figure out which route is truly the best, and decide whether to stick with your usual route or switch to a new one?
One way to decide whether it is better to switch from policy \(\pi\) to \(\pi^{'}\) is to check whether the following inequality holds:
\[ q_{\pi}(s, \pi^{'}(s)) \geq v_{\pi}(s) \]
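Here \(q_{\pi}(s, \pi^{'}(s))\) is the expected return of taking action \(\pi^{'}(s)\) in \(s\) and following \(\pi\) afterwards. With a perfect model it can be computed directly from \(v_{\pi}\):

\[ q_{\pi}(s, a) = \sum_{s', r} P(s', r \mid s, a)\left[ r + \gamma\, v_{\pi}(s') \right] \]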
If selecting \(a = \pi^{'}(s)\) in \(s\) and thereafter following \(\pi\) is at least as good as simply following \(\pi\) everywhere, then \(\pi^{'}\) must be at least as good a policy as \(\pi\).
The result that this guarantee holds whenever the inequality is satisfied for all states \(s \in S\) is known as the policy improvement theorem.
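
As a rough sketch, the improvement check can be run directly against the model: compute \(q_{\pi}(s, a)\) from a previously evaluated \(v_{\pi}\) and pick the greedy action in each state. The code below reuses the illustrative `P` format from the evaluation sketch above, but represents the improved policy as a deterministic mapping `policy[s] = action`; all names are assumptions for illustration.

```python
def q_from_v(P, V, s, a, gamma=0.9):
    """Compute q_pi(s, a) from v_pi using the known model P(s', r | s, a)."""
    return sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[(s, a)])

def greedy_improvement(states, actions, P, V, policy, gamma=0.9):
    """Return a greedy policy pi' and whether it differs from the old policy.

    For every state, pi'(s) = argmax_a q_pi(s, a); the greedy choice ensures
    q_pi(s, pi'(s)) >= v_pi(s) in every state, so by the policy improvement
    theorem pi' is at least as good as pi.
    """
    new_policy, changed = {}, False
    for s in states:
        best_a = max(actions, key=lambda a: q_from_v(P, V, s, a, gamma))
        new_policy[s] = best_a
        if policy.get(s) != best_a:      # assumes the old policy is also deterministic
            changed = True
    return new_policy, changed
```

In the commute analogy, this is the moment you compare the 30-minute alternative against the 35-minute average of the route you usually follow, and switch only if the alternative is at least as good.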