9.2 Multinomial Logistic Regression
Now, we assume that there are K unordered, discrete classes \mathcal{Y}=\{\tt 1, \tt 2, \ldots, \tt K\}, with possibly K>2. Examples include predicting:
- the subspecies of Iris flower (K=3),
- handwritten digits (K=10).
In this case, we use Multinomial Logistic Regression (a.k.a. Softmax Regression).
Multiclass Output
Since the label is exactly one of the K classes, we apply one-hot encoding to y. We assume that the class of an observation is given by a binary K-vector \bold{y}=(y_{1}, y_{2}, \ldots, y_{K}) in which exactly one entry is 1. In other words, saying that the observation belongs to class k\in\{{\tt 1}, {\tt 2},\ldots, {\tt K}\} is equivalent to saying that the k-th entry y_k of the output vector \bold{y} is 1 and all other entries are 0, i.e., \bold{y}=(0, 0, \ldots, \underbrace{1}_{k\text{-th entry}}, \ldots, 0).
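As a quick illustration of one-hot encoding, here is a minimal sketch assuming NumPy is available; the helper name one_hot and the example labels are made up for illustration and are not part of the text.

```python
import numpy as np

def one_hot(labels, K):
    """Encode integer class labels 1, ..., K as binary K-vectors."""
    labels = np.asarray(labels)
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels - 1] = 1  # labels are 1-indexed
    return Y

# e.g. three Iris observations with classes 2, 1, 3 (K = 3)
print(one_hot([2, 1, 3], K=3))
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```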
Model Assumptions
- the conditional probability p(\bold{y}\mid \bold{x}) follows a Categorical distribution, which models the outcome of a K-sided die: p(\bold{y}\mid \bold{x})=\begin{cases} p_{1}&\text{ if }\bold{y}=(1,0,\ldots,0) \\ p_{2}&\text{ if }\bold{y}=(0,1,\ldots,0) \\ \vdots \\ p_{K}&\text{ if }\bold{y}=(0,0,\ldots,1). \end{cases} Here, p_1, p_2,\ldots, p_K are class probabilities with \sum_{k=1}^K p_k=1 and p_k\in[0,1] for all 1\leq k\leq K.
- to each class k\in\{1, \ldots, K\}, we associate (d+1) unknown weights w_{k0}, w_{k1}, \ldots, w_{kd} corresponding to the feature vector \bold{x}=(1, x_1, x_2, \ldots, x_d). Collecting these produces the (d+1)\times K weight matrix \pmb{\theta}: \begin{bmatrix}\bold{w}_0 &\bold{w}_1 &\ldots&\bold{w}_d\end{bmatrix}^T= \begin{bmatrix} w_{10} & w_{11} & w_{12} & \ldots & w_{1d} \\ w_{20} & w_{21} & w_{22} & \ldots & w_{2d} \\ w_{30} & w_{31} & w_{32} & \ldots & w_{3d} \\ \vdots & \vdots & \vdots & \ddots & \vdots\\ w_{K0} & w_{K1} & w_{K2} & \ldots & w_{Kd} \\ \end{bmatrix}^T.
- the k-th class probability p_k is given by the k-th component of the softmax1 of the linear scores of \bold{x} (see the numerical sketch below): p_k=\left[\mathrm{softmax}\left(\pmb{\theta}^T\bold{x}\right)\right]_k, \quad\text{where}\quad \pmb{\theta}^T\bold{x}=\begin{bmatrix}\bold{w}_0&\bold{w}_1&\ldots&\bold{w}_d\end{bmatrix}\bold{x}=(t_1, t_2, \ldots, t_K) \text{ and } t_k=w_{k0}+w_{k1}x_1+\cdots+w_{kd}x_d.
1 \mathrm{softmax}\colon\mathbb{R}^K\to[0,1]^K converts a K-tuple (t_1, t_2, \ldots, t_K) of real numbers into K class probabilities: \mathrm{softmax}(t_1, \ldots, t_K):= \frac{1}{\sum_{k=1}^K e^{t_k}}(e^{t_1}, \ldots, e^{t_K}).
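The following is a minimal NumPy sketch of this forward computation; the numerical values chosen for \pmb{\theta} and \bold{x} are made up for illustration and are not from the text.

```python
import numpy as np

def softmax(t):
    """Map a K-vector of scores to K class probabilities."""
    t = t - np.max(t)          # subtract the max for numerical stability
    e = np.exp(t)
    return e / e.sum()

K, d = 3, 2
theta = np.array([[ 0.5, -1.0,  0.2],   # shape (d+1, K): column k holds the
                  [ 1.0,  0.3, -0.7],   # weights w_k0, w_k1, ..., w_kd of class k
                  [-0.4,  0.8,  0.1]])
x = np.array([1.0, 2.0, -1.0])          # feature vector (1, x_1, ..., x_d)

p = softmax(theta.T @ x)                # p_k = [softmax(theta^T x)]_k
print(p, p.sum())                       # class probabilities, summing to 1

# a one-hot label drawn from the corresponding Categorical distribution
print(np.random.default_rng(0).multinomial(1, p))
```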
The (d+1)\times K weights in \pmb{\theta} are the parameters of the model. Since the number of parameters does not grow with the size of the data, this is still a parametric model.
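As a quick check of the parameter count on the Iris example, here is a hedged sketch assuming scikit-learn is available; for multiclass problems, its LogisticRegression fits the multinomial (softmax) model by default with the lbfgs solver.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # d = 4 features, K = 3 classes
clf = LogisticRegression(max_iter=500).fit(X, y)

# scikit-learn stores the weights as coef_ (K x d) plus intercept_ (K,),
# i.e. (d + 1) * K = 5 * 3 = 15 parameters, independent of the number of rows.
print(clf.coef_.shape, clf.intercept_.shape)   # (3, 4) (3,)
print(clf.predict_proba(X[:2]))                # class probabilities per row
```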