10.1 Polynomial Logistic Regression


Estimated time: 90 minutes

Recall that binary logistic regression models the posterior probability as follows: p({\tt 1}\mid\bold{x}; \bold{w}) =\frac{1}{1+e^{-(w_0+w_1x_1+\ldots+w_dx_d)}}. \tag{28.1}

Tip Parametric vs Non-parametric

Exercise 28.1 Is logistic regression a parametric or non-parametric model?

Note that Equation 28.1 can alternatively be written as: \log{\frac{p({\tt 1}\mid\bold{x}; \bold{w})}{1-p({\tt 1}\mid\bold{x}; \bold{w})}} =w_0+w_1x_1+\ldots+w_dx_d. \tag{28.2} The quantity on the left-hand side is known as the log odds.
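To see why, write z=w_0+w_1x_1+\ldots+w_dx_d. Then Equation 28.1 gives p({\tt 1}\mid\bold{x}; \bold{w})=\frac{1}{1+e^{-z}} and 1-p({\tt 1}\mid\bold{x}; \bold{w})=\frac{e^{-z}}{1+e^{-z}}, so the odds are \frac{p({\tt 1}\mid\bold{x}; \bold{w})}{1-p({\tt 1}\mid\bold{x}; \bold{w})}=e^{z}, and taking logarithms yields Equation 28.2.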

Tip Logistic vs Linear Regression

Exercise 28.2 Compare and contrast logistic regression and linear regression.

Tip Interpretation in Logistic Regression

Exercise 28.3 Interpret the weights/coefficients w_j in Equation 28.2.

Quadratic Logistic Regression

Recall that in linear regression we can add polynomial terms to make the model more general. Quadratic regression with just one predictor x uses the following model: y=w_0+w_1x+w_2x^2. Similarly, in logistic regression with only one predictor x, we can add a polynomial term to Equation 28.1: p({\tt 1}\mid\bold{x}; \bold{w}) =\frac{1}{1+e^{-(w_0+w_1x+w_2x^2)}}. \tag{28.3} This is called quadratic logistic regression for one feature. Analogously, quadratic logistic regression for two features \bold{x}=(x_1, x_2) adds three polynomial terms to Equation 28.1: p({\tt 1}\mid\bold{x}; \bold{w}) =\frac{1}{1+e^{-(w_0+w_1x_1+w_2x_2+w_3x_1^2+w_4x_1x_2+w_5x_2^2)}}. \tag{28.4}
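As an aside, the quadratic terms in Equation 28.4 need not be typed by hand. The following minimal sketch (assuming scikit-learn's PolynomialFeatures, with illustrative sample values and feature names) generates them automatically:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative sample points with features (x1, x2).
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Degree-2 expansion: 1, x1, x2, x1^2, x1*x2, x2^2 (the terms in Equation 28.4).
poly = PolynomialFeatures(degree=2, include_bias=True)
X_quad = poly.fit_transform(X)

print(poly.get_feature_names_out(["x1", "x2"]))
print(X_quad)
```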

Tip Model Parameters

Exercise 28.4 How many parameters do we train in quadratic logistic regression model (Equation 28.4)?

Demo Problem

Let us create a random dataset \mathcal{D} of size N from an unknown population. Each sample point has two features \bold{x}=(x_1, x_2) and a label y\in\{{\tt 0}, {\tt 1}\}.
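The original data-generating code is not shown here; the following is a minimal stand-in, assuming a synthetic two-class dataset such as scikit-learn's make_moons, with the sample size, noise level, and random seed chosen purely for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the unknown population: two interleaving half-moons.
N = 200  # illustrative; the actual size used in the demo is not shown here
X, y = make_moons(n_samples=N, noise=0.25, random_state=0)

# Hold out part of the data so we can measure test accuracy later.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```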

Let us first fit our usual logistic regression classifier to the data.

Tip Choosing Classifier

Exercise 28.5 Which of the following is more appropriate for the classification task above: binary or multinomial logistic regression?

Tip Choosing Classifier

Exercise 28.6 How many parameters or weights do we train in this case?

We run the model and print the classification report.
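A minimal sketch of this step, assuming scikit-learn's LogisticRegression with the penalty argument set to None (no regularization) and the train/test split from above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Plain (degree-1) logistic regression; penalty=None disables regularization
# (in older scikit-learn versions this was spelled penalty='none').
clf = LogisticRegression(penalty=None)
clf.fit(X_train, y_train)

print(classification_report(y_train, clf.predict(X_train)))
```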

Tip Penalty

Exercise 28.7 Consult the documentation to figure out the impact of setting the penalty argument to None.

Next, we print the trained coefficients.
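Continuing the sketch above, they can be read from the fitted model's intercept_ and coef_ attributes:

```python
# w0 is the intercept; w1 and w2 are the weights on the two features x1 and x2.
print("w0:", clf.intercept_)
print("w1, w2:", clf.coef_)
```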

Tip Interpretation

Exercise 28.8 Interpret the above coefficients for x_1.

Decision Boundary

The decision boundary of a classification model divides the feature space into regions assigned to different classes.

In the case of binary logistic regression, the decision boundary is always linear (a hyperplane in the feature space).
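One way to see this on our demo data is to color the plane by the predicted class (a sketch assuming the fitted clf from above and scikit-learn's DecisionBoundaryDisplay):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

# Shade each region by the predicted class; the boundary is where the shading changes.
disp = DecisionBoundaryDisplay.from_estimator(
    clf, X_train, response_method="predict", alpha=0.4
)
disp.ax_.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor="k")
disp.ax_.set_xlabel("x1")
disp.ax_.set_ylabel("x2")
plt.show()
```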

Tip Why Linear?

Exercise 28.9 Justify why the decision boundary of binary logistic regression is always linear.

Tip Decision Boundary

Exercise 28.10 Why do you think the training accuracy is not 1? Is that to be expected?

Polynomial Logistic Regression

So, we ask: can the decision boundary of logistic regression be curved or non-linear? The answer is yes, if we run polynomial logistic regression with degree K\geq2.

When the degree is K=1, the model reduces to our usual logistic regression. With higher degree K, the logistic regression model becomes more and more flexible, i.e., it has more and more parameters (weights) to train. For example, see Equation 28.3 and Equation 28.4 for degree K=2 with 1 and 2 features, respectively.
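A minimal sketch of a degree-K model, assuming a scikit-learn pipeline that chains PolynomialFeatures with the same unregularized logistic regression used earlier:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

K = 3  # polynomial degree

# Expand the features to all monomials of degree <= K, then fit logistic regression.
poly_clf = make_pipeline(
    PolynomialFeatures(degree=K, include_bias=False),
    LogisticRegression(penalty=None, max_iter=5000),
)
poly_clf.fit(X_train, y_train)

print("train accuracy:", poly_clf.score(X_train, y_train))
print("test accuracy: ", poly_clf.score(X_test, y_test))
```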

Increase the degree K to see how it changes the decision boundary and model accuracy.

Overfitting

Overfitting occurs when a machine learning model learns the training data too well, to the point that it also memorizes random noise and irrelevant details, causing it to perform poorly on new, unseen data.

We already briefly introduced overfitting in the context of polynomial regression. In this subsection, we take another look at it in the context of classification.

Tip Degree and Overfitting

Exercise 28.11 Do you think overfitting in polynomial logistic regression becomes less severe with higher degree?

For a better assessment, let us plot training and test accuracies across different values of K.
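A sketch of such a plot, assuming the pipeline and train/test split from the earlier snippets (the range of degrees is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

degrees = range(1, 11)
train_acc, test_acc = [], []

for K in degrees:
    model = make_pipeline(
        PolynomialFeatures(degree=K, include_bias=False),
        LogisticRegression(penalty=None, max_iter=5000),
    )
    model.fit(X_train, y_train)
    train_acc.append(model.score(X_train, y_train))
    test_acc.append(model.score(X_test, y_test))

plt.plot(degrees, train_acc, marker="o", label="train accuracy")
plt.plot(degrees, test_acc, marker="o", label="test accuracy")
plt.xlabel("degree K")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```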

A Remedy: Regularization

One possible remedy is to penalize the loss function for model complexity. In this case, we use the magnitude of the model weights \bold{w} as a proxy for model complexity.

Tip Model Weights vs Degree

Exercise 28.12 How to change the code under section Polynomial Logistic Regression to print the learned weights?

We see that with higher-degree polynomials, the magnitude of the coefficients gets larger. So, we use the following loss function: \mathcal L_{\lambda}(\bold{w})=\mathcal{L}(\bold{w}) + \lambda\|\bold{w}\|_2^2. \tag{28.5} The first term is the usual average binary cross entropy as defined in Equation 26.4, and the second term is the sum of squares of the weights. This is called l_2-regularization or weight decay. The larger the value of \lambda>0, the more the weights are penalized for being “large” (i.e., for deviating from the zero-mean prior), and thus the less flexible the model.

Commonly, C=\frac{1}{\lambda} is used to set the strength of the regularizer; a smaller C corresponds to a larger \lambda and thus stronger regularization.
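Assuming scikit-learn's LogisticRegression, C is passed directly to the constructor, for example:

```python
from sklearn.linear_model import LogisticRegression

# Smaller C  <=>  larger lambda  <=>  stronger l2 penalty on the weights.
reg_clf = LogisticRegression(penalty="l2", C=0.1)
```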

l_2 Regularized

Again, for a better assessment, let us plot training and test accuracies across different values of K, this time with the l_2-regularized model.
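This can be done by repeating the earlier loop with the regularized estimator (a sketch; C is held fixed at an illustrative value):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

degrees = range(1, 11)

def accuracies(K, C=1.0):
    # Same pipeline as before, but with the l2 penalty (the value of C is illustrative).
    model = make_pipeline(
        PolynomialFeatures(degree=K, include_bias=False),
        LogisticRegression(penalty="l2", C=C, max_iter=5000),
    )
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)

train_acc, test_acc = zip(*[accuracies(K) for K in degrees])

plt.plot(degrees, train_acc, marker="o", label="train accuracy")
plt.plot(degrees, test_acc, marker="o", label="test accuracy")
plt.xlabel("degree K")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```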

C is a hyperparameter in the model. We will see later how to choose the best value of C using cross-validation.