4.3 Iterative Policy Evaluation

How do we measure the long-term value of a plan if we keep trying it over and over again? πŸ”

A Markov Decision Process (MDP) is a framework for an associative environment, where the best action depends on the current situation (state).

An MDP models controlled transitions between fully observable states: the agent's choice of action influences which state comes next and what reward is received.
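
As a minimal sketch (the states, actions, probabilities, and rewards below are hypothetical, chosen only for illustration), a tabular model \(P(s', r \mid s, a)\) can be stored as a nested dictionary:

```python
# Hypothetical two-state model: P[s][a] is a list of
# (probability, next_state, reward) triples, i.e. a tabular P(s', r | s, a).
P = {
    "s0": {
        "a0": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
        "a1": [(1.0, "s1", 0.5)],
    },
    "s1": {
        "a0": [(1.0, "s0", 0.0)],
        "a1": [(1.0, "s1", -1.0)],
    },
}

# The outcome probabilities for each (state, action) pair must sum to 1.
for s in P:
    for a in P[s]:
        assert abs(sum(p for p, _, _ in P[s][a]) - 1.0) < 1e-9
```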

Problem ⚠️

How can we find an optimal policy \(\pi_{*}\), assuming that we have a perfect model of state transitions \(P(s', r \mid s, a)\)?

Dynamic Programming is a collection of algorithms that can be used to compute optimal policies \(\pi_{*}\) in tabular state spaces.

These algorithms have limited utility in Reinforcement Learning due to:

  • Assumption of a perfect model: All state transitions \(P(s', r \mid s, a)\) are known in advance.
  • Computational expense: Dynamic Programming typically requires full sweeps over the state space \(\forall s \in S\), which is only feasible in small, tabular environments (a single evaluation sweep is sketched below).
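
As a minimal sketch of such a full sweep, assuming the hypothetical tabular model P from the snippet above and a stochastic policy stored as policy[s][a] = \(\pi(a \mid s)\) (both made up for illustration, not taken from the text), iterative policy evaluation repeatedly applies the Bellman expectation backup to every state until the value estimates stop changing:

```python
def iterative_policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Evaluate v_pi for a fixed policy under a known tabular model P.

    P[s][a]      -> list of (probability, next_state, reward) triples
    policy[s][a] -> probability of taking action a in state s
    """
    V = {s: 0.0 for s in P}            # initialize v(s) = 0 for all states
    while True:
        delta = 0.0
        for s in P:                    # full sweep over the state space
            v_new = 0.0
            for a, pi_sa in policy[s].items():
                for prob, s_next, r in P[s][a]:
                    # States missing from P (e.g. terminals) contribute no future value.
                    v_new += pi_sa * prob * (r + gamma * V.get(s_next, 0.0))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:              # stop when the largest update is tiny
            return V


# Example: evaluate the uniform-random policy on the hypothetical model above.
uniform = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
V = iterative_policy_evaluation(P, uniform)
```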

Question 🤔

We know how good it is to follow the current policy from \(s\) β€” that is \(v_{\pi}(s)\) β€” but would it be better or worse to change to a new policy \(\pi^{'}\)?

Real Life Example 🧠

Imagine your commute to work every day:

  • \(S\) β€” The location you’re currently in (e.g., your home or a traffic junction). More generally, \(S_{1,...,k}\) can represent multiple possible locations.
  • \(A_{1,...,k}\) — The route you choose (e.g., highway, side street, parkway, or an alternate back road).
  • \(R\) β€” Your reward could be getting to work quickly, stress-free, or on time.

Action \(A_1\): Choosing the highway from your current location \(S\). Reward \(R\): Travel time may be shorter or longer depending on traffic.

Action \(A_2\): Taking the side street from your current location \(S\). Reward \(R\): Usually reliable, but travel time is a bit longer on average.

Action \(A_3\): Taking the parkway from your current location \(S\). Reward \(R\): Might have moderate traffic but pleasant scenery.

Action \(A_4\): Using an alternate back road from your current location \(S\). Reward \(R\): Unfamiliar, so travel time is uncertain.

Suppose your current policy is to always take the side street, and on average it takes 35 minutes to reach work. One day, you try a different route and notice it only takes 30 minutes, even though it looked longer on the map.

How could you systematically figure out which route is truly the best, and decide whether to stick with your usual route or switch to a new one?
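
To make the comparison concrete (encoding the reward as negative travel time in minutes, which is just one possible choice), under your usual side-street policy \(\pi\):

\[ v_{\pi}(\text{home}) = -35, \qquad q_{\pi}(\text{home}, \text{new route}) \approx -30, \]

where \(q_{\pi}(\text{home}, \text{new route})\) is the value of taking the new route once and then continuing with \(\pi\). Since \(-30 \geq -35\), the one-step switch looks promising; the inequality below formalizes when this justifies changing policies.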

One way to determine whether it is better to switch from policy \(\pi\) to \(\pi^{'}\) is to check whether the following inequality holds for all states \(s \in S\):

\[ q_{\pi}(s, \pi^{'}(s)) \geq v_{\pi}(s) \]
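
Here \(q_{\pi}(s, a)\) is the expected return from taking action \(a\) in state \(s\) and following \(\pi\) afterwards. Given the model \(P(s', r \mid s, a)\) and the discount factor \(\gamma\), it can be computed directly from \(v_{\pi}\):

\[ q_{\pi}(s, a) = \sum_{s', r} P(s', r \mid s, a)\,\big[ r + \gamma\, v_{\pi}(s') \big] \]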

If selecting \(a = \pi^{'}(s)\) in \(s\) and thereafter following the old policy \(\pi\) is at least as good as always following \(\pi\), then \(\pi^{'}\) must be at least as good a policy as \(\pi\).

This result is known as the policy improvement theorem: if \(q_{\pi}(s, \pi^{'}(s)) \geq v_{\pi}(s)\) for all \(s \in S\), then \(v_{\pi^{'}}(s) \geq v_{\pi}(s)\) for all \(s \in S\).
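
As a sketch of how the theorem is used in practice (building on the hypothetical P, policy format, and iterative_policy_evaluation from the earlier snippets, none of which come from the original text), one can compute \(q_{\pi}(s, a)\) from the evaluated \(v_{\pi}\) and switch to the greedy action in every state:

```python
def q_from_v(P, V, s, a, gamma=0.9):
    """q_pi(s, a) under the known model P and a value estimate V of v_pi."""
    return sum(prob * (r + gamma * V.get(s_next, 0.0))
               for prob, s_next, r in P[s][a])


def improve_policy(P, V, gamma=0.9):
    """Return a greedy (deterministic) policy pi' with respect to V.

    For each state, pick the action maximizing q_pi(s, a); the policy
    improvement theorem guarantees the result is at least as good as pi.
    """
    new_policy = {}
    for s in P:
        best_a = max(P[s], key=lambda a: q_from_v(P, V, s, a, gamma))
        new_policy[s] = {a: float(a == best_a) for a in P[s]}
    return new_policy
```

Alternating this improvement step with the evaluation sweep above yields policy iteration.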