8.1 Deep Learning
Assume state \(\mathsf{S}\) is represented by a tensor of continuous values.
A single parameter vector \(\mathbf{w}\), as used in linear function approximation, is generally insufficient to capture useful representations of such a state tensor \(\mathsf{S}\).
How can we approximate action-value functions, \(\hat{Q}(s,a; \mathbf{w}) \approx Q_{\pi}(s,a)\), with differentiable models that learn representations of the state tensor \(\mathsf{S}\)?
Multilayer Perceptrons (MLPs) (Hagan et al. 2014)
In Reinforcement Learning, MLPs are commonly used when states are represented by continuous-valued vectors \(\mathbf{s}\).
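As a hedged illustration of how an MLP can represent \(\hat{Q}(s,a;\mathbf{w})\): the sketch below maps a continuous state vector to one value per action. The layer sizes, tanh activation, and function names are assumptions, not prescribed by the text, and a value-estimation output is kept linear (action values are not probabilities).

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """Randomly initialize weights W^m and biases b^m for each layer (illustrative)."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def q_values(s, params):
    """Approximate Q(s, a; w) for all actions a, given a state vector s."""
    a = s
    for W, b in params[:-1]:          # hidden layers with tanh activation (assumed)
        a = np.tanh(W @ a + b)
    W, b = params[-1]                 # linear output: one value per action
    return W @ a + b

# Example with assumed sizes: 4-dimensional state, two hidden layers, 3 actions.
params = init_mlp([4, 32, 32, 3])
print(q_values(np.zeros(4), params))
```

Here \(\mathbf{w}\) corresponds to the collection of all weight matrices and bias vectors of the network.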
MLP Forward Propagation
At the beginning of MLP forward propagation, let \(\mathbf{p}\) be the input vector:
\[ \mathbf{a}^0 = \mathbf{p} \]
The input passes through \(M-1\) hidden layers:
\[ \mathbf{a}^{m+1} = \mathbf{f}^{m+1}(\mathbf{W}^{m+1} \cdot \mathbf{a}^m + \mathbf{b}^{m+1}) \quad \text{for } m = 0, 1, \dots, M-2 \]
\(\mathbf{a}^m\): Output vector of layer \(m\)
\(\mathbf{W}^{m+1}\): Weight matrix of layer \(m+1\)
\(\mathbf{b}^{m+1}\): Bias vector of layer \(m+1\)
\(\mathbf{f}^{m+1}()\): Activation function of layer \(m+1\)
For the final (output) layer \(M\):
\[ \mathbf{a}^M = \text{softmax}(\mathbf{W}^{M} \cdot \mathbf{a}^{M-1} + \mathbf{b}^{M}) \]
\(\mathbf{a}^{M-1}\): Output vector of the last hidden layer
\(\mathbf{W}^M\), \(\mathbf{b}^M\): Weights and bias for output layer
\(\text{softmax}()\): Softmax activation function
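A minimal NumPy sketch of this forward pass, assuming tanh hidden activations and the softmax output layer above (all names are illustrative):

```python
import numpy as np

def softmax(n):
    """Numerically stable softmax."""
    e = np.exp(n - n.max())
    return e / e.sum()

def forward(p, weights, biases, f=np.tanh):
    """a^0 = p; a^{m+1} = f(W^{m+1} a^m + b^{m+1}) for the M-1 hidden layers;
    a^M = softmax(W^M a^{M-1} + b^M). Returns the list [a^0, ..., a^M]."""
    a = [p]                                       # a^0 = p
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers
        a.append(f(W @ a[-1] + b))
    a.append(softmax(weights[-1] @ a[-1] + biases[-1]))  # output layer M
    return a
```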
MLP Backpropagation
In MLP backpropagation, assuming the softmax output layer above is trained with a cross-entropy loss, the sensitivity of the output layer is:
\[ \mathbf{s}^M = \mathbf{a}^M - \mathbf{t} \]
\(\mathbf{s}^M\): Sensitivity of the output layer
\(\mathbf{a}^M\): Output of the final layer
\(\mathbf{t}\): Target class vector
To propagate sensitivities back through the preceding layers:
\[ \mathbf{s}^m = \dot{\mathbf{F}}^m(\mathbf{n}^m) \cdot (\mathbf{W}^{m+1})^\top \cdot \mathbf{s}^{m+1} \quad \text{for } m = M-1, \dots, 2, 1 \]
\(\mathbf{s}^m\): Sensitivity of layer \(m\)
\(\dot{\mathbf{F}}^m(\mathbf{n}^m)\): Diagonal matrix of the activation-function derivatives of layer \(m\), evaluated at its net input
\(\mathbf{n}^m\): Net input (pre-activation) vector of layer \(m\)
\((\mathbf{W}^{m+1})^\top\): Transpose of the weight matrix of layer \(m+1\)
\(\mathbf{s}^{m+1}\): Sensitivity of layer \(m+1\)
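Continuing the forward-pass sketch above, and assuming tanh hidden layers (whose derivative can be written as \(1 - (a^m)^2\)) together with the softmax/cross-entropy output sensitivity:

```python
def sensitivities(a, weights, t):
    """s^M = a^M - t (softmax + cross-entropy), then
    s^m = diag(f'(n^m)) (W^{m+1})^T s^{m+1} for m = M-1, ..., 1."""
    s = [a[-1] - t]                           # s^M
    for m in range(len(weights) - 1, 0, -1):  # m = M-1, ..., 1
        deriv = 1.0 - a[m] ** 2               # tanh'(n^m), expressed via a^m (assumed activation)
        s.insert(0, deriv * (weights[m].T @ s[0]))
    return s                                  # s[m - 1] holds s^m
```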
MLP Weight and Bias Updates
The weights are updated using the rule:
\[ \mathbf{W}_{k+1}^m = \mathbf{W}_k^m - \alpha \, \mathbf{s}^m \cdot (\mathbf{a}^{m-1})^\top \]
\(\mathbf{W}_k^m\): Weight matrix of layer \(m\) at iteration \(k\)
\(\mathbf{W}_{k+1}^m\): Weight matrix at iteration \(k+1\)
\(\mathbf{s}^m\): Sensitivity of layer \(m\)
\((\mathbf{a}^{m-1})^\top\): Transpose of the output of layer \(m-1\)
\(\alpha\): Learning rate
The biases are updated as:
\[ \mathbf{b}_{k+1}^m = \mathbf{b}_k^m - \alpha \, \mathbf{s}^m \]
\(\mathbf{b}_k^m\): Bias vector of layer \(m\) at iteration \(k\)
\(\mathbf{b}_{k+1}^m\): Bias vector at iteration \(k+1\)
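A sketch of one such update step, reusing the outputs `a` and sensitivities `s` from the sketches above (the index shift is only due to 0-based Python lists):

```python
import numpy as np

def sgd_step(weights, biases, a, s, alpha=0.01):
    """W^m <- W^m - alpha s^m (a^{m-1})^T and b^m <- b^m - alpha s^m,
    applied to every layer (weights[m] is W^{m+1} in the text's indexing)."""
    for m in range(len(weights)):
        weights[m] -= alpha * np.outer(s[m], a[m])  # s^{m+1} (a^m)^T
        biases[m]  -= alpha * s[m]
    return weights, biases
```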
Convolutional Neural Networks (CNNs) (Martin T. Hagan 2024)
In Reinforcement Learning, CNNs are commonly used when states are represented by continuous-valued tensors \(\mathsf{S}\).
CNN Forward Propagation
At the beginning of CNN forward propagation, let \(\mathsf{P}\) be the input tensor:
\[ \mathsf{A}^0 = \mathsf{P} \]
For each convolutional layer, the input passes through a convolutional kernel followed by a non-linear activation function:
\[ \mathsf{A}^m = \mathbf{f}^m\left( \mathbf{W}^m \ast \mathsf{A}^{m-1} + \mathbf{B}^m \right) \quad \text{for } m = 1, 2, \dots, M \]
\(\mathsf{A}^{m-1}\): Input to layer \(m\)
\(\mathbf{W}^m\): Convolutional kernel(s) at layer \(m\)
\(\mathbf{B}^m\): Bias tensor at layer \(m\)
\(\mathbf{f}^m()\): Activation function (e.g., ReLU) at layer \(m\)
\(\ast\): Convolution operator
Optionally, a pooling operation can be applied after certain convolutional layers:
\[ \mathsf{A}^m_{\text{pool}} = \boxplus^{\text{pool}} \mathsf{A}^m \]
\(\boxplus^{\text{pool}}\): Pooling operator (e.g., max or average) to reduce spatial dimensions
Finally, the feature maps of the last convolutional layer are flattened into a vector for the fully connected layers:
\[ \mathbf{a}^{\text{flat}} = \text{flatten}(\mathsf{A}^M) \]
\(\text{flatten}()\): Converts the final 3D tensor into a 1D feature vector
\(\mathsf{A}^M\): Output of the final convolutional layer
This flattened vector then serves as input to the MLP component:
\[ \mathbf{a}^0_{\text{MLP}} = \mathbf{a}^{\text{flat}} \]
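A naive NumPy sketch of this forward pass, assuming ReLU activations, 2×2 max pooling after every layer, and stride-1 "valid" convolutions implemented as cross-correlation (as is common in deep-learning code); all shapes and names are illustrative:

```python
import numpy as np

def conv2d(A, W, B):
    """Valid cross-correlation of input A (C_in, H, W) with kernels
    W (C_out, C_in, k, k) plus a per-channel bias B (C_out,)."""
    c_out, c_in, k, _ = W.shape
    _, H, Wd = A.shape
    Z = np.zeros((c_out, H - k + 1, Wd - k + 1))
    for o in range(c_out):
        for i in range(Z.shape[1]):
            for j in range(Z.shape[2]):
                Z[o, i, j] = np.sum(W[o] * A[:, i:i + k, j:j + k]) + B[o]
    return Z

def max_pool(A, size=2):
    """Non-overlapping max pooling over each feature map."""
    c, H, W = A.shape
    H2, W2 = H // size, W // size
    A = A[:, :H2 * size, :W2 * size]
    return A.reshape(c, H2, size, W2, size).max(axis=(2, 4))

def cnn_forward(P, layers):
    """A^0 = P; for each layer: A^m = relu(W^m * A^{m-1} + B^m), then pool.
    Finally flatten A^M into the vector fed to the MLP head."""
    A = P
    for W, B in layers:
        A = np.maximum(conv2d(A, W, B), 0.0)   # convolution + ReLU
        A = max_pool(A)                        # optional pooling step
    return A.reshape(-1)                       # flatten(A^M) = a^flat

# Example with assumed shapes: one 1x28x28 input and two convolutional layers.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 1, 3, 3)) * 0.1, np.zeros(8)),
          (rng.normal(size=(16, 8, 3, 3)) * 0.1, np.zeros(16))]
print(cnn_forward(rng.normal(size=(1, 28, 28)), layers).shape)
```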
CNN Backpropagation
During CNN backpropagation, we compute gradients of the loss function \(F\) with respect to various intermediate variables.
\[ \mathsf{dA}^m \equiv \frac{\partial F}{\partial \mathsf{A}^m}, \quad \mathsf{dZ}^m \equiv \frac{\partial F}{\partial \mathsf{Z}^m}, \quad \mathbf{dW}^m \equiv \frac{\partial F}{\partial \mathbf{W}^m}, \quad \mathbf{dB}^m \equiv \frac{\partial F}{\partial \mathbf{B}^m} \]
\(\mathsf{A}^m\): Activated output of layer \(m\)
\(\mathsf{Z}^m\): Pre-activation output of layer \(m\) (before activation function)
\(\mathbf{W}^m\): Convolutional kernels of layer \(m\)
\(\mathbf{B}^m\): Bias tensor of layer \(m\)
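Under the chain rule, and assuming stride-1 convolutions with no padding and an element-wise activation, these gradients are related (up to indexing conventions) by:
\[ \mathsf{dZ}^m = \mathsf{dA}^m \odot \dot{\mathbf{f}}^m(\mathsf{Z}^m), \qquad \mathbf{dB}^m = \sum_{i,j} \mathsf{dZ}^m_{\,\cdot,i,j} \]
\[ \mathbf{dW}^m = \mathsf{dZ}^m \ast \mathsf{A}^{m-1}, \qquad \mathsf{dA}^{m-1} = \operatorname{rot180}(\mathbf{W}^m) \ast_{\text{full}} \mathsf{dZ}^m \]
\(\odot\): Element-wise (Hadamard) product
\(\operatorname{rot180}()\): Kernel flipped by 180 degrees
\(\ast_{\text{full}}\): Full (zero-padded) convolution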
CNN Weight and Bias Updates
The convolutional weights are updated using gradient descent as follows:
\[ \mathbf{W}_{k+1}^m = \mathbf{W}_k^m - \alpha \, \mathbf{dW}^m \]
\(\mathbf{W}_k^m\): Convolutional filter(s) of layer \(m\) at iteration \(k\)
\(\mathbf{W}_{k+1}^m\): Updated filter(s) at iteration \(k+1\)
\(\mathbf{dW}^m\): Gradient of the loss w.r.t. the weights of layer \(m\)
\(\alpha\): Learning rate
The biases are updated similarly:
\[ \mathbf{B}_{k+1}^m = \mathbf{B}_k^m - \alpha \, \mathbf{dB}^m \]
\(\mathbf{B}_k^m\): Bias tensor of layer \(m\) at iteration \(k\)
\(\mathbf{B}_{k+1}^m\): Updated bias tensor at iteration \(k+1\)
\(\mathbf{dB}^m\): Gradient of the loss w.r.t. the biases of layer \(m\)
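A minimal sketch of this update loop, assuming the gradient tensors \(\mathbf{dW}^m\) and \(\mathbf{dB}^m\) for each layer have already been computed (e.g., by the backpropagation quantities above or by an automatic-differentiation library):

```python
def cnn_gradient_step(W, B, dW, dB, alpha=0.01):
    """Apply W^m <- W^m - alpha dW^m and B^m <- B^m - alpha dB^m
    for every convolutional layer m (lists indexed by layer)."""
    for m in range(len(W)):
        W[m] -= alpha * dW[m]
        B[m] -= alpha * dB[m]
    return W, B
```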
Assume that our state tensor \(\mathsf{S}\) is an image.
Based on your intuition, do you think it makes sense to use raw pixels as input instead of preprocessed state features?