8.2 Linear Regression (Ordinary Least Squares)


Estimated time: 45 minutes

Ordinary least squares, or simply linear regression, is a popular regression technique for fitting a line (more generally, a hyperplane1) to a collection of (possibly noisy) data points.

1 A hyperplane generalizes the notion of a two-dimensional plane in three-dimensional space to arbitrary dimension.

To recall, we denote by y\in\mathbb{R} our response and by \bold{x}=(x_1, x_2, \ldots, x_D) our D features or predictors. The linear regression model is parametric: it has a fixed number of parameters, irrespective of the training size. In contrast, the number of parameters in a non-parametric model grows with the size of the training data.

Mean Squared Error (MSE)

In linear regression, we assume that our guess mapping \widehat{f} is a linear function of the inputs. In other words, for any input \bold{x}, our guess is \widehat{f}(\bold{x})\coloneqq w_0+\bold{w}^T\bold{x}=w_0+\sum_{j=1}^D w_jx_j. Here, the intercept w_0 and the slopes \bold{w}=(w_1, w_2, \ldots, w_D) are called the model weights or coefficients. They are the only parameters \pmb{\theta}=(w_0, \bold{w}) of the model.
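For concreteness, here is a minimal NumPy sketch of this guess function; the weight values are hypothetical and only illustrate how w_0 and \bold{w} enter the prediction.

```python
import numpy as np

# Hypothetical weights for a model with D = 3 features.
w0 = 1.5                          # intercept w_0
w = np.array([0.4, -2.0, 0.7])    # slopes (w_1, ..., w_D)

def f_hat(x, w0=w0, w=w):
    """Linear guess: f_hat(x) = w_0 + w^T x."""
    return w0 + w @ x

x = np.array([2.0, 0.5, 1.0])     # one input with D = 3 features
print(f_hat(x))                   # 1.5 + 0.4*2.0 - 2.0*0.5 + 0.7*1.0 = 2.0
```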

One could choose the coefficients at random, but it is better to let the regression algorithm pick them for us so that the residual error2 is minimized across our training data.

2 the discrepancy between our guess and the true response: y-(w_0+\bold{w}^T\bold{x})

The overall error for a training data set \mathcal{D}=\{(\bold{x}_i, y_i)\}_{i=1}^N is called the mean squared error (MSE): \mathrm{MSE}(\pmb{\theta})\coloneqq\frac{1}{N}\sum_{i=1}^N\left[y_i-(w_0+\bold{w}^T\bold{x}_i)\right]^2. The linear regression algorithm (discussed in Machine Learning I next semester) finds the optimal weights \widehat{\pmb{\theta}} that minimize the MSE across all choices of the parameters.
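The sketch below, using NumPy and a small made-up dataset, shows how the MSE of a given parameter choice \pmb{\theta}=(w_0, \bold{w}) can be computed; it is only an illustration of the error measure, not the fitting algorithm itself.

```python
import numpy as np

def mse(w0, w, X, y):
    """Mean squared error of the linear model (w0, w) on data (X, y).

    X has shape (N, D) -- one row per training point -- and y has shape (N,).
    """
    residuals = y - (w0 + X @ w)          # y_i - (w_0 + w^T x_i) for every i
    return np.mean(residuals ** 2)

# Toy data roughly following y = 3 + 2x (illustrative values only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([3.1, 4.9, 7.2, 8.8])

print(mse(3.0, np.array([2.0]), X, y))    # small: close to the true line
print(mse(0.0, np.array([0.0]), X, y))    # large: a poor constant-zero model
```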

Tip: A Motivating Demo

We present a univariate (single-predictor) motivating example using the famous Palmer Penguins dataset. The dataset contains n=344 data points. There are 3 different species of penguins in this dataset, collected from 3 islands in the Palmer Archipelago, Antarctica.

Among the many variables available in the data, we select body_weight as our only predictor and flipper_length_mm as our response variable. A scatter plot is shown below.
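One way to reproduce such a scatter plot, assuming the palmerpenguins Python package (whose body-weight column is named body_mass_g) and matplotlib, is sketched below.

```python
import matplotlib.pyplot as plt
from palmerpenguins import load_penguins   # assumes the palmerpenguins package is installed

# Drop rows with missing values in the two columns we plot.
penguins = load_penguins().dropna(subset=["body_mass_g", "flipper_length_mm"])

x = penguins["body_mass_g"]         # body weight (grams), our predictor
y = penguins["flipper_length_mm"]   # flipper length (mm), our response

plt.scatter(x, y, alpha=0.6)
plt.xlabel("body mass (g)")
plt.ylabel("flipper length (mm)")
plt.title("Palmer Penguins: flipper length vs body mass")
plt.show()
```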

Overfitting

When fitting highly flexible (complex) models, one needs to be mindful of overfitting the data. Overfitting is a phenomenon where we try to model every minor variation in our data too tightly. Since our data may contain noise, an overfit model may capture the noise more than the true signal.

Linear regression is less flexible than polynomial regression, where we fit a high-degree polynomial through the data. If the degree is too high, the polynomial curve may, in fact, fit the data perfectly with \mathrm{MSE}=0. However, such a model may not generalize well to unseen data.
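The toy sketch below, using hypothetical noisy samples from a straight line, illustrates the point: a degree-9 polynomial fitted through 10 points can drive the training MSE to (numerically) zero, while the degree-1 fit cannot.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Noisy samples from a straight line (hypothetical data, for illustration only).
x = np.linspace(0.0, 1.0, 10)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=x.size)

for degree in (1, 9):
    poly = Polynomial.fit(x, y, deg=degree)   # least-squares polynomial fit
    train_mse = np.mean((y - poly(x)) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.6f}")

# With 10 points, the degree-9 polynomial can pass through every point, so its
# training MSE is essentially zero -- yet it usually generalizes poorly.
```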

Linear vs Polynomial Regression