8.2 Deep Q-Networks (DQN)
Neural networks with parameters \(\theta\), such as MLPs and CNNs, can help us learn representations of state tensors \(\mathsf{S}\).
\(\mathsf{S} =\) [illustration of an example state tensor omitted]
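Such a representation can be produced by a convolutional Q-network that maps a state tensor to one value per action. Below is a minimal PyTorch sketch, assuming the input is a stack of four \(84 \times 84\) grayscale frames; the layer sizes are illustrative and loosely follow the classic DQN architecture.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a batch of stacked frames (N, 4, 84, 84) to Q-values (N, num_actions)."""

    def __init__(self, num_actions: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 spatial size after the convolutions
            nn.Linear(512, num_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: float tensor of shape (N, 4, 84, 84); returns Q-values of shape (N, num_actions)
        return self.head(self.features(s))
```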
Now we just need an algorithm that:
- Leverages neural networks \(\theta\).
- Leverages a classical Reinforcement Learning method (TD: Q-learning).
- Implements batch form.
- Empirically performs well (DQN is one such algorithm; a skeleton of its training loop follows this list).
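Here is a schematic sketch of such a loop in Python. The helpers `preprocess` (frame preprocessing and stacking), `buffer` (an experience store with `add`, `sample`, and `__len__`), and `dqn_update` (the gradient step sketched at the end of this section) are hypothetical names, and the environment is assumed to follow the Gymnasium `step` API.

```python
import random
import torch

def epsilon_greedy(q_net, state, epsilon, num_actions):
    """Random action with probability epsilon, otherwise argmax_a Q(state, a)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))       # add a batch dimension -> (1, num_actions)
    return int(q_values.argmax(dim=1).item())

def train(env, q_net, optimizer, buffer, num_steps, batch_size=32, epsilon=0.1):
    """Schematic loop: act, store the transition, sample a minibatch, take a TD step."""
    state = preprocess(env.reset()[0])             # hypothetical preprocessing + frame stacking
    for step in range(num_steps):
        action = epsilon_greedy(q_net, state, epsilon, env.action_space.n)
        obs, reward, terminated, truncated, _ = env.step(action)
        next_state = preprocess(obs)
        done = terminated or truncated
        buffer.add(state, action, reward, next_state, done)          # store the transition
        state = preprocess(env.reset()[0]) if done else next_state
        if len(buffer) >= batch_size:
            dqn_update(q_net, optimizer, buffer.sample(batch_size))  # batched Q-learning step
```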
Consider the following Atari Breakout environment (Bellemare et al. 2013):
Suppose we stack \(4\) consecutive frames to form a single input. The state \(\mathsf{S}_{\text{batch}}\) is then represented as a stack of four consecutive image tensors \(\mathsf{S}\), capturing motion and temporal dynamics:
\[ \mathsf{S}_{\text{batch}} = \begin{bmatrix} \mathsf{S}_t \gets \text{frame at time } t \\ \mathsf{S}_{t-1} \gets \text{frame at time } t-1 \\ \mathsf{S}_{t-2} \gets \text{frame at time } t-2 \\ \mathsf{S}_{t-3} \gets \text{frame at time } t-3 \\ \end{bmatrix} \]
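A minimal sketch of this stacking, assuming each preprocessed frame arrives as an \(84 \times 84\) NumPy array; a `collections.deque` keeps only the four most recent frames.

```python
from collections import deque

import numpy as np

# Keep only the 4 most recent frames; older frames are dropped automatically.
frames = deque(maxlen=4)

def reset_stack(first_frame: np.ndarray) -> np.ndarray:
    """Initialise the stack by repeating the first frame four times."""
    frames.clear()
    for _ in range(4):
        frames.append(first_frame)
    return np.stack(list(frames))    # shape (4, 84, 84)

def push_frame(new_frame: np.ndarray) -> np.ndarray:
    """Append the newest frame and return the stacked state."""
    frames.append(new_frame)
    # The newest frame ends up at index 3; any ordering works as long as it is consistent.
    return np.stack(list(frames))    # shape (4, 84, 84)
```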
The environment has a discrete action space \(\mathcal{A}\):
\[ \mathcal{A} = \{0 \gets \text{Do nothing}, 1 \gets \text{Fire}, 2 \gets \text{Move right}, 3 \gets \text{Move left}\} \]
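These labels can be read directly from the environment itself; the snippet below assumes the `gymnasium` and `ale-py` packages are installed.

```python
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)   # make sure the ALE environments are registered with Gymnasium
env = gym.make("ALE/Breakout-v5")

print(env.action_space)                       # Discrete(4)
print(env.unwrapped.get_action_meanings())    # ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
env.close()
```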
The environment's state transition dynamics \(P(s', r \mid s, a)\) are governed by a physics engine that updates the positions of the ball, paddle, and bricks based on collisions and the selected action. These dynamics are deterministic, but complex at the pixel level.
The reward \(R\) is defined by the number of bricks destroyed:
- Hitting and destroying a brick typically yields a reward of \(+1\).
- Missing the ball (letting it fall past the paddle) typically costs a life and gives \(0\) reward.
The episode ends (the done flag \(d\) is set) if either of the following happens (see the interaction sketch after this list):
- Termination: The agent loses all lives (typically 5).
- Truncation: The agent clears all bricks or reaches a built-in time limit (varies by implementation).
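The reward and episode-end signals appear directly in the standard Gymnasium interaction loop. The sketch below takes random actions purely to expose `reward`, `terminated`, and `truncated`; it again assumes `gymnasium` and `ale-py` are installed.

```python
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)
env = gym.make("ALE/Breakout-v5")

obs, info = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()              # placeholder policy: uniform random actions
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward                        # reward from destroyed bricks, else 0
    done = terminated or truncated                  # lives lost, or board cleared / time limit
print("episode return:", episode_return)
env.close()
```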
Is there a mistake with the following loss function for DQN?
\[ F(\theta_{t}) = (\text{TD-Target}_{j} - \hat{Q}(S_{j+1},a; \theta_{t}))^{2} \]
Why might this expression suggest that we are taking the MSE between a scalar and a vector?
What is actually happening here during the gradient update step?
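As a hint, here is what a working gradient step usually looks like in PyTorch. The Q-network's output for a state is a vector with one entry per action, while the TD target is a scalar, so the implementation must first select the entry corresponding to the action \(A_j\) that was actually taken. The function name `dqn_update` and the batch layout are assumptions; full DQN also uses a separate target network for the TD target, which is omitted here to keep the sketch small.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on a minibatch of transitions (S_j, A_j, R_j, S_{j+1}, d_j)."""
    s, a, r, s_next, done = batch     # a: int64 actions; done: float 0/1 flags; leading dim B

    # TD target: one scalar per transition, computed from the *next* state and
    # evaluated without gradients (full DQN would use a slowly updated target network here).
    with torch.no_grad():
        max_q_next = q_net(s_next).max(dim=1).values             # shape (B,)
        td_target = r + gamma * (1.0 - done) * max_q_next        # shape (B,)

    # Prediction: the network outputs a (B, num_actions) matrix of Q-values;
    # gather picks out the scalar Q(S_j, A_j) for the action actually taken.
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # shape (B,)

    loss = F.mse_loss(q_pred, td_target)   # scalar vs. scalar per transition, averaged over B
    optimizer.zero_grad()
    loss.backward()                        # gradients flow through q_pred only, not the target
    optimizer.step()
    return loss.item()
```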
