10.1 Polynomial Logistic Regression
Estimated time: 90 minutes
Recall that binary logistic regression models the posterior probability as follows: p({\tt 1}\mid\bold{x}; \bold{w}) =\frac{1}{1+e^{-(w_0+w_1x_1+\ldots+w_dx_d)}}. \tag{28.1}
Note that Equation 28.1 can alternatively be written as: \log{\frac{p({\tt 1}\mid\bold{x}; \bold{w})}{1-p({\tt 1}\mid\bold{x}; \bold{w})}} =w_0+w_1x_1+\ldots+w_dx_d. \tag{28.2} The quantity on the left-hand side is known as the log odds.
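To see why the two forms are equivalent, write z=w_0+w_1x_1+\ldots+w_dx_d and p=p({\tt 1}\mid\bold{x}; \bold{w}). From Equation 28.1, p=\frac{1}{1+e^{-z}}, so 1-p=\frac{e^{-z}}{1+e^{-z}} and therefore \frac{p}{1-p}=e^{z}; taking logarithms gives Equation 28.2.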
Quadratic Logistic Regression
Recall that in linear regression we can add polynomial terms to make the model more general. Quadratic regression with just one predictor x uses the following model: y=w_0+w_1x+w_2x^2. Similarly, in logistic regression with only one predictor x, we can add a polynomial term to Equation 28.1: p({\tt 1}\mid\bold{x}; \bold{w}) =\frac{1}{1+e^{-(w_0+w_1x+w_2x^2)}}. \tag{28.3} This is called quadratic logistic regression for one feature. Analogously, quadratic logistic regression for two features \bold{x}=(x_1, x_2) adds three polynomial terms to Equation 28.1: p({\tt 1}\mid\bold{x}; \bold{w}) =\frac{1}{1+e^{-(w_0+w_1x_1+w_2x_2+w_3x_1^2+w_4x_1x_2+w_5x_2^2)}}. \tag{28.4}
Demo Problem
Let us create a random dataset \mathcal{D} of size N from an unknown population. Each sample point has two features \bold{x}=(x_1, x_2) and a label y\in\{{\tt 0}, {\tt 1}\}.
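The exact data-generating code is not shown here; the following is a minimal sketch, in which make_moons, the noise level, the random seed, and N = 200 are all assumptions chosen to produce a two-feature binary dataset with a non-linear class boundary.

```python
# A minimal sketch of creating the dataset; the original data is not shown.
# make_moons, noise=0.25, random_state=0, and N = 200 are assumptions.
import numpy as np
from sklearn.datasets import make_moons

N = 200  # assumed sample size
X, y = make_moons(n_samples=N, noise=0.25, random_state=0)
print(X.shape, y.shape)  # (200, 2) (200,)
```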
Let us first fit our usual (linear) logistic regression classifier to the data.
We run the model and print the classification report.
The trained coefficients are:
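A minimal sketch of this step, assuming the X, y arrays from the dataset sketch above and scikit-learn's default solver settings:

```python
# Fit plain (linear) logistic regression, print the classification report,
# and inspect the trained weights. X, y come from the dataset sketch above.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression()  # scikit-learn defaults (l2 penalty, C = 1.0)
clf.fit(X, y)

print(classification_report(y, clf.predict(X)))
print("intercept w0:       ", clf.intercept_)
print("coefficients w1, w2:", clf.coef_)
```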
Decision Boundary
The decision boundary of a classification model is the surface in feature space that separates the regions assigned to different classes.
In the case of plain (degree-1) logistic regression, the decision boundary is always linear: it is the set of points where the log odds in Equation 28.2 equals zero, i.e., where w_0+w_1x_1+\ldots+w_dx_d=0, which defines a hyperplane (a straight line when d=2).
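The plotting code is not shown in the text; below is a minimal sketch that draws the p({\tt 1}\mid\bold{x})=0.5 contour of the fitted model together with the data. It reuses X, y, clf, and np from the sketches above, plus matplotlib, all of which are assumptions.

```python
# Draw the decision boundary p(1|x) = 0.5, i.e. the points where the
# log odds w0 + w1*x1 + w2*x2 = 0; for the linear model this is a straight line.
import matplotlib.pyplot as plt

x1g, x2g = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300),
)
grid = np.c_[x1g.ravel(), x2g.ravel()]
p1 = clf.predict_proba(grid)[:, 1].reshape(x1g.shape)

plt.contour(x1g, x2g, p1, levels=[0.5])                        # the boundary
plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", edgecolor="k")  # the data
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.show()
```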
Polynomial Logistic Regression
So, we ask: can the decision boundary for logistic regression be curved or non-linear? The answer is yes, if we run polynomial logistic regression with degree K\geq2.
When the degree K=1, it reduces to our usual logistic regression. With higher degree K, the logistic regression model becomes more and more flexible, i.e., it has more and more parameters (weights). For example, see Equation 28.3 and Equation 28.4 for degree K=2 with one and two features, respectively.
Increase the degree K to see how it changes the decision boundary and model accuracy.
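The course's own implementation is not shown; one way to sketch degree-K polynomial logistic regression is a scikit-learn pipeline that expands the features and then fits a logistic regression. The helper name poly_logreg is hypothetical.

```python
# Degree-K polynomial logistic regression as a scikit-learn pipeline.
# PolynomialFeatures expands (x1, x2) into all monomials up to degree K;
# for K = 2 that is x1, x2, x1^2, x1*x2, x2^2, matching Equation 28.4.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

def poly_logreg(K):
    """Hypothetical helper: logistic regression on degree-K polynomial features."""
    return make_pipeline(
        PolynomialFeatures(degree=K, include_bias=False),
        LogisticRegression(max_iter=5000),
    )

model = poly_logreg(K=2)
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```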
Overfitting
Overfitting occurs when a machine learning model learns the training data too well, to the point that it also memorizes random noise and irrelevant details, causing it to perform poorly on new, unseen data.
We already briefly introduced overfitting in the context of polynomial regression. In this subsection, we take another look at it in the context of classification.
For a better assessment, let us plot training and test accuracies across different values of K.
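A minimal sketch of this comparison, assuming a 70/30 train/test split and degrees 1 through 10 (both assumptions), and reusing poly_logreg and plt from the sketches above:

```python
# Compare training and test accuracy as the degree K grows.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

degrees = range(1, 11)
train_acc, test_acc = [], []
for K in degrees:
    m = poly_logreg(K).fit(X_train, y_train)
    train_acc.append(m.score(X_train, y_train))
    test_acc.append(m.score(X_test, y_test))

plt.plot(degrees, train_acc, marker="o", label="train")
plt.plot(degrees, test_acc, marker="o", label="test")
plt.xlabel("degree $K$")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```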
A Remedy: Regularization
One possible remedy is to penalize the loss function for model complexity. In this case, we use the magnitude of the model weights \bold{w} as a proxy for model complexity.
We see that with higher-degree polynomials, the magnitudes of the coefficients get larger. So, we use the following loss function: \mathcal L_{\lambda}(\bold{w})=\mathcal{L}(\bold{w}) + \lambda\|\bold{w}\|_2^2. \tag{28.5} The first term is the usual average binary cross entropy as defined in Equation 26.4, and the second term is simply the sum of squares of the weights. This is called l_2-regularization or weight decay. The larger the value of \lambda>0, the more the weights are penalized for being “large” (deviating from the zero-mean prior), and thus the less flexible the model.
Commonly, C=\frac{1}{\lambda} is used to set the strength of the regularizer: the smaller the value of C, the stronger the regularization.
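In scikit-learn this is exactly how the parameter is exposed: LogisticRegression applies the l_2 penalty by default, with C as the inverse regularization strength.

```python
# C = 1/lambda in scikit-learn: smaller C means stronger regularization.
from sklearn.linear_model import LogisticRegression

strong_reg = LogisticRegression(penalty="l2", C=0.01)   # large lambda, heavy shrinkage
weak_reg   = LogisticRegression(penalty="l2", C=100.0)  # small lambda, almost unregularized
```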
l_2-Regularized Logistic Regression
As before, let us plot training and test accuracies across different values of K, this time with l_2 regularization.
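A sketch of the regularized sweep, reusing degrees, the train/test split, and the imports from the earlier sketches; the value C = 0.1 is an assumption chosen for illustration.

```python
# Repeat the degree sweep, now with a fixed, fairly strong l2 penalty.
def poly_logreg_l2(K, C):
    return make_pipeline(
        PolynomialFeatures(degree=K, include_bias=False),
        LogisticRegression(penalty="l2", C=C, max_iter=5000),
    )

train_acc_l2, test_acc_l2 = [], []
for K in degrees:
    m = poly_logreg_l2(K, C=0.1).fit(X_train, y_train)
    train_acc_l2.append(m.score(X_train, y_train))
    test_acc_l2.append(m.score(X_test, y_test))

plt.plot(degrees, train_acc_l2, marker="o", label="train ($l_2$)")
plt.plot(degrees, test_acc_l2, marker="o", label="test ($l_2$)")
plt.xlabel("degree $K$")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```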
C is a hyperparameter in the model. We will see later how to choose the best value of C using cross-validation.