6.4 Double Q-Learning
One of the drawbacks of Q-Learning is maximization bias: because the update target takes a maximum over the estimated action values \(Q(s,a)\), the estimates tend to be higher than the true action values \(q(s,a)\), so the algorithm systematically overestimates.
Double Q-Learning addresses this bias by maintaining two independent action value estimates, \(Q_{1}(s,a)\) and \(Q_{2}(s,a)\).
On each step, with equal probability, one estimate is used to select the greedy action at the next state while the other evaluates that action, decoupling action selection from action evaluation:
\[ Q_{1}(S_{t},A_{t}) \leftarrow Q_{1}(S_{t},A_{t}) + \alpha \bigl[R_{t+1} + \gamma Q_{2}\bigl(S_{t+1}, \arg\max_{a} Q_{1}(S_{t+1},a)\bigr) - Q_{1}(S_{t},A_{t})\bigr] \]
\[ Q_{2}(S_{t},A_{t}) \leftarrow Q_{2}(S_{t},A_{t}) + \alpha \bigl[R_{t+1} + \gamma Q_{1}\bigl(S_{t+1}, \arg\max_{a} Q_{2}(S_{t+1},a)\bigr) - Q_{2}(S_{t},A_{t})\bigr] \]
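The two coupled updates can be written compactly in code. The sketch below applies one Double Q-Learning update for a single observed transition, assuming tabular NumPy arrays `Q1` and `Q2` indexed by (state, action); the function name `double_q_update` and its arguments are illustrative, not from a specific library.

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng):
    """One Double Q-Learning update for the transition (s, a, r, s_next).

    Q1, Q2: arrays of shape (n_states, n_actions); rng: np.random.Generator.
    """
    if rng.random() < 0.5:
        # Q1 selects the greedy next action, Q2 evaluates it.
        a_star = np.argmax(Q1[s_next])
        target = r + gamma * Q2[s_next, a_star]
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        # Roles swapped: Q2 selects the greedy next action, Q1 evaluates it.
        a_star = np.argmax(Q2[s_next])
        target = r + gamma * Q1[s_next, a_star]
        Q2[s, a] += alpha * (target - Q2[s, a])
```

Because the estimate that selects the action is never the one that evaluates it, the noise in one table does not inflate the target computed from the other, which is what removes the maximization bias.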
Pseudocode
\begin{algorithm}
\caption{TD Double Q-Learning}
\begin{algorithmic}
\State \textbf{Initialize:}
\State $Q_{1}(s, a) \gets 0$ for all $(s, a) \in S \times A$
\State $Q_{2}(s, a) \gets 0$ for all $(s, a) \in S \times A$
\State $\gamma \in [0, 1)$, $\alpha \in (0, 1]$, $\epsilon > 0$
\For{each episode}
    \State Initialize $S_{t}$
    \Repeat
        \State Choose $A_{t}$ from $S_{t}$ using the $\epsilon$-greedy policy with respect to $Q_{1} + Q_{2}$
        \State Take action $A_{t}$, observe $R_{t+1}$ and $S_{t+1}$
        \If{with probability $0.5$}
            \State $Q_{1}(S_{t}, A_{t}) \gets Q_{1}(S_{t}, A_{t}) + \alpha \bigl[R_{t+1} + \gamma Q_{2}\bigl(S_{t+1}, \arg\max_{a} Q_{1}(S_{t+1}, a)\bigr) - Q_{1}(S_{t}, A_{t})\bigr]$
        \Else
            \State $Q_{2}(S_{t}, A_{t}) \gets Q_{2}(S_{t}, A_{t}) + \alpha \bigl[R_{t+1} + \gamma Q_{1}\bigl(S_{t+1}, \arg\max_{a} Q_{2}(S_{t+1}, a)\bigr) - Q_{2}(S_{t}, A_{t})\bigr]$
        \EndIf
        \State $S_{t} \gets S_{t+1}$
    \Until{$S_{t}$ is terminal}
\EndFor
\end{algorithmic}
\end{algorithm}
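To complement the pseudocode, here is a minimal tabular sketch in Python. The environment interface assumed here (attributes `n_states` and `n_actions`, `reset()` returning a state index, and `step(a)` returning `(next_state, reward, done)`) is a placeholder, not any particular library's API.

```python
import numpy as np

def epsilon_greedy(Q1, Q2, s, epsilon, rng):
    """Behaviour policy: epsilon-greedy with respect to Q1 + Q2."""
    n_actions = Q1.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q1[s] + Q2[s]))

def double_q_learning(env, n_episodes, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Double Q-Learning over a hypothetical discrete environment `env`."""
    rng = np.random.default_rng(seed)
    Q1 = np.zeros((env.n_states, env.n_actions))
    Q2 = np.zeros((env.n_states, env.n_actions))
    for _ in range(n_episodes):
        s = env.reset()          # assumed to return an integer state index
        done = False
        while not done:
            a = epsilon_greedy(Q1, Q2, s, epsilon, rng)
            s_next, r, done = env.step(a)  # assumed (next_state, reward, done)
            if rng.random() < 0.5:
                # Q1 selects the greedy next action, Q2 evaluates it.
                a_star = np.argmax(Q1[s_next])
                Q1[s, a] += alpha * (r + gamma * Q2[s_next, a_star] - Q1[s, a])
            else:
                # Q2 selects the greedy next action, Q1 evaluates it.
                a_star = np.argmax(Q2[s_next])
                Q2[s, a] += alpha * (r + gamma * Q1[s_next, a_star] - Q2[s, a])
            s = s_next
    return Q1, Q2
```

Acting greedily with respect to $Q_{1} + Q_{2}$ (or equivalently their average) combines the information in both tables, while each table is still updated with a target computed from the other.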