8.1 Deep Learning
Assume state \(\mathsf{S}\) is represented by a tensor of continuous values.
A single parameter vector \(\mathbf{w}\), as used in linear function approximation, is generally insufficient to capture useful representations of such a state tensor \(\mathsf{S}\).
How can we approximate action-value functions, \(\hat{Q}(s,a; \mathbf{w}) \approx Q_{\pi}(s,a)\), with differentiable models that learn representations of the state tensor \(\mathsf{S}\)?
Multilayer Perceptrons (MLPs) (Hagan et al. 2014)
In Reinforcement Learning, MLPs are commonly used when states are represented by continuous-valued vectors \(\mathbf{s}\).
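As a hedged illustration of how an MLP can represent \(\hat{Q}(s,a;\mathbf{w})\): the sketch below maps a continuous state vector to one value per action. The layer sizes, tanh activation, and function names are assumptions, not prescribed by the text, and a value-estimation output is kept linear (action values are not probabilities).

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """Randomly initialize weights W^m and biases b^m for each layer (illustrative)."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def q_values(s, params):
    """Approximate Q(s, a; w) for all actions a, given a state vector s."""
    a = s
    for W, b in params[:-1]:          # hidden layers with tanh activation (assumed)
        a = np.tanh(W @ a + b)
    W, b = params[-1]                 # linear output: one value per action
    return W @ a + b

# Example with assumed sizes: 4-dimensional state, two hidden layers, 3 actions.
params = init_mlp([4, 32, 32, 3])
print(q_values(np.zeros(4), params))
```

Here \(\mathbf{w}\) corresponds to the collection of all weight matrices and bias vectors of the network.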
MLP Forward Propagation
At the beginning of MLP forward propagation, let \(\mathbf{p}\) be the input vector:
\[ \mathbf{a}^0 = \mathbf{p} \]
The input passes through \(M-1\) hidden layers:
\[ \mathbf{a}^{m+1} = \mathbf{f}^{m+1}(\mathbf{W}^{m+1} \cdot \mathbf{a}^m + \mathbf{b}^{m+1}) \quad \text{for } m = 0, 1, \dots, M-2 \]
\(\mathbf{a}^m\): Output vector of layer \(m\)
\(\mathbf{W}^{m+1}\): Weight matrix of layer \(m+1\)
\(\mathbf{b}^{m+1}\): Bias vector of layer \(m+1\)
\(\mathbf{f}^{m+1}()\): Activation function of layer \(m+1\)
For the final (output) layer \(M\):
\[ \mathbf{a}^M = \text{softmax}(\mathbf{W}^{M} \cdot \mathbf{a}^{M-1} + \mathbf{b}^{M}) \]
\(\mathbf{a}^{M-1}\): Output vector of the last hidden layer
\(\mathbf{W}^M\), \(\mathbf{b}^M\): Weights and bias for output layer
\(\text{softmax}()\): Softmax activation function
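A minimal NumPy sketch of this forward pass, assuming tanh hidden activations and the softmax output layer above (all names are illustrative):

```python
import numpy as np

def softmax(n):
    """Numerically stable softmax."""
    e = np.exp(n - n.max())
    return e / e.sum()

def forward(p, weights, biases, f=np.tanh):
    """a^0 = p; a^{m+1} = f(W^{m+1} a^m + b^{m+1}) for the M-1 hidden layers;
    a^M = softmax(W^M a^{M-1} + b^M). Returns the list [a^0, ..., a^M]."""
    a = [p]                                       # a^0 = p
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers
        a.append(f(W @ a[-1] + b))
    a.append(softmax(weights[-1] @ a[-1] + biases[-1]))  # output layer M
    return a
```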
MLP Backpropagation
In MLP backpropagation, assuming the softmax output layer above is trained with a cross-entropy loss, the sensitivity of the output layer is:
\[ \mathbf{s}^M = \mathbf{a}^M - \mathbf{t} \]
\(\mathbf{s}^M\): Sensitivity of the output layer
\(\mathbf{a}^M\): Output of the final layer
\(\mathbf{t}\): Target class vector
To propagate sensitivities back through the preceding layers:
\[ \mathbf{s}^m = \dot{\mathbf{F}}^m(\mathbf{n}^m) \cdot (\mathbf{W}^{m+1})^\top \cdot \mathbf{s}^{m+1} \quad \text{for } m = M-1, \dots, 2, 1 \]
\(\mathbf{s}^m\): Sensitivity of layer \(m\)
\(\dot{\mathbf{F}}^m(\mathbf{n}^m)\): Diagonal matrix of the activation-function derivatives of layer \(m\), evaluated at its net input
\(\mathbf{n}^m\): Net input (pre-activation) vector of layer \(m\)
\((\mathbf{W}^{m+1})^\top\): Transpose of the weight matrix of layer \(m+1\)
\(\mathbf{s}^{m+1}\): Sensitivity of layer \(m+1\)
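Continuing the forward-pass sketch above, and assuming tanh hidden layers (whose derivative can be written as \(1 - (a^m)^2\)) together with the softmax/cross-entropy output sensitivity:

```python
def sensitivities(a, weights, t):
    """s^M = a^M - t (softmax + cross-entropy), then
    s^m = diag(f'(n^m)) (W^{m+1})^T s^{m+1} for m = M-1, ..., 1."""
    s = [a[-1] - t]                           # s^M
    for m in range(len(weights) - 1, 0, -1):  # m = M-1, ..., 1
        deriv = 1.0 - a[m] ** 2               # tanh'(n^m), expressed via a^m (assumed activation)
        s.insert(0, deriv * (weights[m].T @ s[0]))
    return s                                  # s[m - 1] holds s^m
```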
MLP Weight and Bias Updates
The weights are updated using the rule:
\[ \mathbf{W}_{k+1}^m = \mathbf{W}_k^m - \alpha \, \mathbf{s}^m \cdot (\mathbf{a}^{m-1})^\top \]
\(\mathbf{W}_k^m\): Weight matrix of layer \(m\) at iteration \(k\)
\(\mathbf{W}_{k+1}^m\): Weight matrix at iteration \(k+1\)
\(\mathbf{s}^m\): Sensitivity of layer \(m\)
\((\mathbf{a}^{m-1})^\top\): Transpose of the output of layer \(m-1\)
\(\alpha\): Learning rate
The biases are updated as:
\[ \mathbf{b}_{k+1}^m = \mathbf{b}_k^m - \alpha \, \mathbf{s}^m \]
\(\mathbf{b}_k^m\): Bias vector of layer \(m\) at iteration \(k\)
\(\mathbf{b}_{k+1}^m\): Bias vector at iteration \(k+1\)
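A sketch of one such update step, reusing the outputs `a` and sensitivities `s` from the sketches above (the index shift is only due to 0-based Python lists):

```python
import numpy as np

def sgd_step(weights, biases, a, s, alpha=0.01):
    """W^m <- W^m - alpha s^m (a^{m-1})^T and b^m <- b^m - alpha s^m,
    applied to every layer (weights[m] is W^{m+1} in the text's indexing)."""
    for m in range(len(weights)):
        weights[m] -= alpha * np.outer(s[m], a[m])  # s^{m+1} (a^m)^T
        biases[m]  -= alpha * s[m]
    return weights, biases
```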
Convolutional Neural Networks (CNNs) (Martin T. Hagan 2024)
In Reinforcement Learning, CNNs are commonly used when states are represented by continuous-valued tensors \(\mathsf{S}\).
CNN Forward Propagation
At the beginning of CNN forward propagation, let \(\mathsf{P}\) be the input tensor:
\[ \mathsf{A}^0 = \mathsf{P} \]
For each convolutional layer, the input passes through a convolutional kernel followed by a non-linear activation function:
\[ \mathsf{A}^m = \mathbf{f}^m\left( \mathbf{W}^m \ast \mathsf{A}^{m-1} + \mathbf{B}^m \right) \quad \text{for } m = 1, 2, \dots, M \]
\(\mathsf{A}^{m-1}\): Input to layer \(m\)
\(\mathbf{W}^m\): Convolutional kernel(s) at layer \(m\)
\(\mathbf{B}^m\): Bias tensor at layer \(m\)
\(\mathbf{f}^m()\): Activation function (e.g., ReLU) at layer \(m\)
\(\ast\): Convolution operator
Optionally, a pooling operation can be applied after certain convolutional layers:
\[ \mathsf{A}^m_{\text{pool}} = \boxplus^{\text{pool}} \mathsf{A}^m \]
\(\boxplus^{\text{pool}}\): Pooling operator (e.g., max or average) to reduce spatial dimensions
Finally, the feature maps of the last convolutional layer are flattened into a vector for the fully connected layers:
\[ \mathbf{a}^{\text{flat}} = \text{flatten}(\mathsf{A}^M) \]
\(\text{flatten}()\): Converts the final 3D tensor into a 1D feature vector
\(\mathsf{A}^M\): Output of the final convolutional layer
This flattened vector then serves as input to the MLP component:
\[ \mathbf{a}^0_{\text{MLP}} = \mathbf{a}^{\text{flat}} \]
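A naive NumPy sketch of this forward pass, assuming ReLU activations, 2×2 max pooling after every layer, and stride-1 "valid" convolutions implemented as cross-correlation (as is common in deep-learning code); all shapes and names are illustrative:

```python
import numpy as np

def conv2d(A, W, B):
    """Valid cross-correlation of input A (C_in, H, W) with kernels
    W (C_out, C_in, k, k) plus a per-channel bias B (C_out,)."""
    c_out, c_in, k, _ = W.shape
    _, H, Wd = A.shape
    Z = np.zeros((c_out, H - k + 1, Wd - k + 1))
    for o in range(c_out):
        for i in range(Z.shape[1]):
            for j in range(Z.shape[2]):
                Z[o, i, j] = np.sum(W[o] * A[:, i:i + k, j:j + k]) + B[o]
    return Z

def max_pool(A, size=2):
    """Non-overlapping max pooling over each feature map."""
    c, H, W = A.shape
    H2, W2 = H // size, W // size
    A = A[:, :H2 * size, :W2 * size]
    return A.reshape(c, H2, size, W2, size).max(axis=(2, 4))

def cnn_forward(P, layers):
    """A^0 = P; for each layer: A^m = relu(W^m * A^{m-1} + B^m), then pool.
    Finally flatten A^M into the vector fed to the MLP head."""
    A = P
    for W, B in layers:
        A = np.maximum(conv2d(A, W, B), 0.0)   # convolution + ReLU
        A = max_pool(A)                        # optional pooling step
    return A.reshape(-1)                       # flatten(A^M) = a^flat

# Example with assumed shapes: one 1x28x28 input and two convolutional layers.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 1, 3, 3)) * 0.1, np.zeros(8)),
          (rng.normal(size=(16, 8, 3, 3)) * 0.1, np.zeros(16))]
print(cnn_forward(rng.normal(size=(1, 28, 28)), layers).shape)
```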
CNN Backpropagation
During CNN backpropagation, we compute gradients of the loss function \(F\) with respect to various intermediate variables.
\[ \mathsf{dA}^m \equiv \frac{\partial F}{\partial \mathsf{A}^m}, \quad \mathsf{dZ}^m \equiv \frac{\partial F}{\partial \mathsf{Z}^m}, \quad \mathbf{dW}^m \equiv \frac{\partial F}{\partial \mathbf{W}^m}, \quad \mathbf{dB}^m \equiv \frac{\partial F}{\partial \mathbf{B}^m} \]
\(\mathsf{A}^m\): Activated output of layer \(m\)
\(\mathsf{Z}^m\): Pre-activation output of layer \(m\) (before activation function)
\(\mathbf{W}^m\): Convolutional kernels of layer \(m\)
\(\mathbf{B}^m\): Bias tensor of layer \(m\)
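Under the chain rule, and assuming stride-1 convolutions with no padding and an element-wise activation, these gradients are related (up to indexing conventions) by:
\[ \mathsf{dZ}^m = \mathsf{dA}^m \odot \dot{\mathbf{f}}^m(\mathsf{Z}^m), \qquad \mathbf{dB}^m = \sum_{i,j} \mathsf{dZ}^m_{\,\cdot,i,j} \]
\[ \mathbf{dW}^m = \mathsf{dZ}^m \ast \mathsf{A}^{m-1}, \qquad \mathsf{dA}^{m-1} = \operatorname{rot180}(\mathbf{W}^m) \ast_{\text{full}} \mathsf{dZ}^m \]
\(\odot\): Element-wise (Hadamard) product
\(\operatorname{rot180}()\): Kernel flipped by 180 degrees
\(\ast_{\text{full}}\): Full (zero-padded) convolution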
CNN Weight and Bias Updates
The convolutional weights are updated using gradient descent as follows:
\[ \mathbf{W}_{k+1}^m = \mathbf{W}_k^m - \alpha \, \mathbf{dW}^m \]
\(\mathbf{W}_k^m\): Convolutional filter(s) of layer \(m\) at iteration \(k\)
\(\mathbf{W}_{k+1}^m\): Updated filter(s) at iteration \(k+1\)
\(\mathbf{dW}^m\): Gradient of the loss w.r.t. the weights of layer \(m\)
\(\alpha\): Learning rate
The biases are updated similarly:
\[ \mathbf{B}_{k+1}^m = \mathbf{B}_k^m - \alpha \, \mathbf{dB}^m \]
\(\mathbf{B}_k^m\): Bias tensor of layer \(m\) at iteration \(k\)
\(\mathbf{B}_{k+1}^m\): Updated bias tensor at iteration \(k+1\)
\(\mathbf{dB}^m\): Gradient of the loss w.r.t. the biases of layer \(m\)
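A minimal sketch of this update loop, assuming the gradient tensors \(\mathbf{dW}^m\) and \(\mathbf{dB}^m\) for each layer have already been computed (e.g., by the backpropagation quantities above or by an automatic-differentiation library):

```python
def cnn_gradient_step(W, B, dW, dB, alpha=0.01):
    """Apply W^m <- W^m - alpha dW^m and B^m <- B^m - alpha dB^m
    for every convolutional layer m (lists indexed by layer)."""
    for m in range(len(W)):
        W[m] -= alpha * dW[m]
        B[m] -= alpha * dB[m]
    return W, B
```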
Assume that our state tensor \(\mathsf{S}\) is an image.
Based on your intuition, do you think it makes sense to use raw pixels as input instead of preprocessed state features?