8.2 Deep Q-Networks (DQN)
Neural networks with parameters \(\theta\), such as MLPs and CNNs, can help us learn representations of state tensors \(\mathsf{S}\).
\(\mathsf{S} =\) [illustration of an example state tensor omitted]
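Such a representation can be produced by a convolutional Q-network that maps a state tensor to one value per action. Below is a minimal PyTorch sketch, assuming the input is a stack of four \(84 \times 84\) grayscale frames; the layer sizes are illustrative and loosely follow the classic DQN architecture.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a batch of stacked frames (N, 4, 84, 84) to Q-values (N, num_actions)."""

    def __init__(self, num_actions: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 spatial size after the convolutions
            nn.Linear(512, num_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: float tensor of shape (N, 4, 84, 84); returns Q-values of shape (N, num_actions)
        return self.head(self.features(s))
```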
Now we just need an algorithm that:
- Leverages neural networks \(\theta\).
- Leverages a classical Reinforcement Learning method (TD: Q-learning).
- Implements batch form.
- Empirically performs well (DQN is one such algorithm; a skeleton of its training loop follows this list).
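Here is a schematic sketch of such a loop in Python. The helpers `preprocess` (frame preprocessing and stacking), `buffer` (an experience store with `add`, `sample`, and `__len__`), and `dqn_update` (the gradient step sketched at the end of this section) are hypothetical names, and the environment is assumed to follow the Gymnasium `step` API.

```python
import random
import torch

def epsilon_greedy(q_net, state, epsilon, num_actions):
    """Random action with probability epsilon, otherwise argmax_a Q(state, a)."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))       # add a batch dimension -> (1, num_actions)
    return int(q_values.argmax(dim=1).item())

def train(env, q_net, optimizer, buffer, num_steps, batch_size=32, epsilon=0.1):
    """Schematic loop: act, store the transition, sample a minibatch, take a TD step."""
    state = preprocess(env.reset()[0])             # hypothetical preprocessing + frame stacking
    for step in range(num_steps):
        action = epsilon_greedy(q_net, state, epsilon, env.action_space.n)
        obs, reward, terminated, truncated, _ = env.step(action)
        next_state = preprocess(obs)
        done = terminated or truncated
        buffer.add(state, action, reward, next_state, done)          # store the transition
        state = preprocess(env.reset()[0]) if done else next_state
        if len(buffer) >= batch_size:
            dqn_update(q_net, optimizer, buffer.sample(batch_size))  # batched Q-learning step
```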
Consider the following Atari Breakout environment (Bellemare et al. 2013):
Suppose we stack \(4\) consecutive frames to form a single input. The state \(\mathsf{S}_{\text{batch}}\) is then represented as a stack of four consecutive image tensors \(\mathsf{S}\), capturing motion and temporal dynamics:
\[ \mathsf{S}_{\text{batch}} = \begin{bmatrix} \mathsf{S}_t \gets \text{frame at time } t \\ \mathsf{S}_{t-1} \gets \text{frame at time } t-1 \\ \mathsf{S}_{t-2} \gets \text{frame at time } t-2 \\ \mathsf{S}_{t-3} \gets \text{frame at time } t-3 \\ \end{bmatrix} \]
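A minimal sketch of this stacking, assuming each preprocessed frame arrives as an \(84 \times 84\) NumPy array; a `collections.deque` keeps only the four most recent frames.

```python
from collections import deque

import numpy as np

# Keep only the 4 most recent frames; older frames are dropped automatically.
frames = deque(maxlen=4)

def reset_stack(first_frame: np.ndarray) -> np.ndarray:
    """Initialise the stack by repeating the first frame four times."""
    frames.clear()
    for _ in range(4):
        frames.append(first_frame)
    return np.stack(list(frames))    # shape (4, 84, 84)

def push_frame(new_frame: np.ndarray) -> np.ndarray:
    """Append the newest frame and return the stacked state."""
    frames.append(new_frame)
    # The newest frame ends up at index 3; any ordering works as long as it is consistent.
    return np.stack(list(frames))    # shape (4, 84, 84)
```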
The environment has a discrete action space \(\mathcal{A}\):
\[ \mathcal{A} = \{0 \gets \text{Do nothing}, 1 \gets \text{Fire}, 2 \gets \text{Move right}, 3 \gets \text{Move left}\} \]
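These labels can be read directly from the environment itself; the snippet below assumes the `gymnasium` and `ale-py` packages are installed.

```python
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)   # make sure the ALE environments are registered with Gymnasium
env = gym.make("ALE/Breakout-v5")

print(env.action_space)                       # Discrete(4)
print(env.unwrapped.get_action_meanings())    # ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
env.close()
```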
The environment's state transition dynamics \(P(s', r \mid s, a)\) are governed by a physics engine that updates the positions of the ball, paddle, and bricks based on collisions and the selected action. These dynamics are deterministic, but complex at the pixel level.
The reward \(R\) is defined by the number of bricks destroyed:
- Hitting and destroying a brick typically yields a reward of \(+1\).
- Missing the ball (letting it fall past the paddle) typically costs a life and gives \(0\) reward.
The episode ends (the done flag \(d\) is set) if either of the following happens (see the interaction sketch after this list):
- Termination: The agent loses all lives (typically 5).
- Truncation: The agent clears all bricks or reaches a built-in time limit (varies by implementation).
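The reward and episode-end signals appear directly in the standard Gymnasium interaction loop. The sketch below takes random actions purely to expose `reward`, `terminated`, and `truncated`; it again assumes `gymnasium` and `ale-py` are installed.

```python
import gymnasium as gym
import ale_py

gym.register_envs(ale_py)
env = gym.make("ALE/Breakout-v5")

obs, info = env.reset(seed=0)
episode_return, done = 0.0, False
while not done:
    action = env.action_space.sample()              # placeholder policy: uniform random actions
    obs, reward, terminated, truncated, info = env.step(action)
    episode_return += reward                        # reward from destroyed bricks, else 0
    done = terminated or truncated                  # lives lost, or board cleared / time limit
print("episode return:", episode_return)
env.close()
```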
Is there a mistake with the following loss function for DQN?
\[ F(\theta_{t}) = (\text{TD-Target}_{j} - \hat{Q}(S_{j+1},a; \theta_{t}))^{2} \]
Why might this expression suggest that we are taking the MSE between a scalar and a vector?
What is actually happening here during the gradient update step?
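As a hint, here is what a working gradient step usually looks like in PyTorch. The Q-network's output for a state is a vector with one entry per action, while the TD target is a scalar, so the implementation must first select the entry corresponding to the action \(A_j\) that was actually taken. The function name `dqn_update` and the batch layout are assumptions; full DQN also uses a separate target network for the TD target, which is omitted here to keep the sketch small.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on a minibatch of transitions (S_j, A_j, R_j, S_{j+1}, d_j)."""
    s, a, r, s_next, done = batch     # a: int64 actions; done: float 0/1 flags; leading dim B

    # TD target: one scalar per transition, computed from the *next* state and
    # evaluated without gradients (full DQN would use a slowly updated target network here).
    with torch.no_grad():
        max_q_next = q_net(s_next).max(dim=1).values             # shape (B,)
        td_target = r + gamma * (1.0 - done) * max_q_next        # shape (B,)

    # Prediction: the network outputs a (B, num_actions) matrix of Q-values;
    # gather picks out the scalar Q(S_j, A_j) for the action actually taken.
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # shape (B,)

    loss = F.mse_loss(q_pred, td_target)   # scalar vs. scalar per transition, averaged over B
    optimizer.zero_grad()
    loss.backward()                        # gradients flow through q_pred only, not the target
    optimizer.step()
    return loss.item()
```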
