9.1 Binary Logistic Regression

The binary classification problem assumes that the output set \mathcal{Y} has two unordered, discrete labels or classes: \mathcal{Y}=\{\tt 0, \tt 1\}, and each sample belongs to exactly one of them. Examples include predicting whether an email is spam or whether a patient tests positive for a disease.

We use Binary Logistic Regression (a.k.a. Sigmoid Regression) to solve the binary classification problem.

Tip: Feature Vector

We assume there are d predictors (x_1, x_2, \ldots, x_d). For notational convenience with weights, we prepend 1 to the vector when denoting our feature vector, i.e., \bold{x}=(1, x_1, x_2, \ldots, x_d).

When d=1, the problem is called univariate.
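A minimal sketch of the prepending convention above (NumPy, with illustrative values):

```python
import numpy as np

x_raw = np.array([3.2, -1.5, 0.7])      # d = 3 predictors (x1, x2, x3)
x = np.concatenate(([1.0], x_raw))      # feature vector x = (1, x1, x2, x3)
```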

Model Assumptions

Recall that the underlying rule f is unknown to us, much like a “blackbox”. In binary logistic regression, we make two assumptions about the blackbox:

  1. the output y, conditioned on \bold{x}, follows the Bernoulli distribution1 with mean \mu(\bold{x}), i.e., p(y = {\tt 1}\mid \bold{x})=\mu(\bold{x});

  2. the conditional mean \mu(\bold{x}) is the sigmoid2 of a linear function of \bold{x}, i.e., \mu(\bold{x})=\mathrm{sigm}(\bold{w}^T\bold{x})=\frac{1}{1+\exp(-\bold{w}^T\bold{x})}.

1 A random variable Z with two outcomes \{\tt 0, \tt 1\} is said to follow the Bernoulli distribution with mean \mu\in[0,1] if its probability mass function is p(z)=\begin{cases}\mu,&\text{ if } z={\tt 1}\\ 1-\mu,&\text{ if }z={\tt 0}. \end{cases}

2 The sigmoid (S-shaped) function \mathrm{sigm}\colon\mathbb{R}\to(0,1) is given by \mathrm{sigm}(t)=\frac{1}{1+\exp(-t)}; it maps the entire real line into the interval (0,1). See Figure 26.1.

The (d+1) weights w_0, w_1, \ldots, w_d are the parameters \pmb{\theta} of the model. Since the number of parameters does not grow with the size of the data, this is a parametric model.

Figure 26.1: The sigmoid (S-shaped) function
Tip: Model Distribution

Since we are making a model assumption in terms of a conditional probability for the first time in this course, it deserves an explanation. For simplicity, assume the univariate case (d=1) with just one predictor, say x. For each fixed value of x, the model specifies a Bernoulli distribution over the label y whose mean \mu(x)=\mathrm{sigm}(w_0+w_1x) varies with x: the larger w_0+w_1x is, the closer the probability of label \tt 1 is to 1, and the more negative it is, the closer that probability is to 0.

Mathematical Formulation

We now describe the mathematical formulation for the multivariate case with d\geq1 predictors x_1, x_2, \ldots, x_d.

  • Feature vector: \bold{x}=(1, x_1, \ldots, x_d)
  • Binary output: y={\tt 0}\text{ or }{\tt 1}
  • Model weights: \bold{w}=(w_0, w_1, \ldots, w_d)
  • w_0 is called the bias or intercept

As per our model assumption, we have

p({\tt 1}\mid\bold{x}; \bold{w})=\tfrac{1}{1+\exp(-\bold{w}^T\bold{x})} =\tfrac{1}{1+\exp(-(w_0+w_1x_1+\ldots+w_dx_d))} \tag{26.1}
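As a quick illustration of Equation 26.1, the following minimal sketch (with made-up weights and features, not values from the course) evaluates p({\tt 1}\mid\bold{x};\bold{w}) for one feature vector:

```python
import numpy as np

def sigmoid(t):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

# hypothetical weights (w0, w1, w2) and a feature vector with the prepended 1
w = np.array([-1.0, 2.0, 0.5])
x = np.array([1.0, 0.8, -0.3])

p_one = sigmoid(w @ x)    # Equation 26.1: P(y = 1 | x; w)
print(p_one)              # approximately 0.61
```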

Prediction

Given a dataset \mathcal{D}=\{(\bold{x}_i, y_i)\}_{i=1}^N of size N from the blackbox f, we learn an approximation \hat{f} by learning the weights \bold{w}.

Here, each (boldfaced) input \bold{x}_i represents a vector (1, x_{i1}, x_{i2}, \ldots, x_{id}) and corresponding output y_i is either \tt 1 or \tt 0.

If \bold{w}^*=(w_0^*, w_1^*, \ldots, w_d^*) denotes our learned weights, we can write our learned probability distribution as \hat{p}({\tt 1}\mid \bold{x};\bold{w}^*) =\frac{1}{1+\exp(-(w^*_0+w^*_1x_1+\ldots+w^*_dx_d))} \tag{26.2}

Then, we can predict the label or class of a new data point with feature vector \bold{x} using the following threshold function: \hat{y}=\begin{cases} {\tt 1}&\text{ if }\hat{p}({\tt 1}\mid\bold{x};\bold{w}^*)\geq0.5 \\ {\tt 0}&\text{ if }\hat{p}({\tt 1}\mid\bold{x};\bold{w}^*)<0.5. \end{cases}
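A minimal sketch of this threshold rule, assuming a vector w_star of learned weights (the helper names are illustrative, not part of the course code):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict(x, w_star, threshold=0.5):
    """Predict the class (1 or 0) of a feature vector x = (1, x1, ..., xd)."""
    p_one = sigmoid(w_star @ x)            # Equation 26.2
    return 1 if p_one >= threshold else 0
```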

Optimization

Now, we turn our attention to finding the best or optimal weights w.r.t. some notion of error. We use the maximum likelihood estimator (MLE), which chooses the weights that maximize the probability of observing the given data \mathcal{D} under the model.
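To make the objective explicit, write \mu_i=\hat{p}({\tt 1}\mid\bold{x}_i;\bold{w}) for the model's probability of label \tt 1 on the i-th sample. Under the Bernoulli assumption, the likelihood of the data is L(\bold{w})=\prod_{i=1}^N \mu_i^{y_i}(1-\mu_i)^{1-y_i}, and the MLE maximizes its logarithm \ell(\bold{w})=\sum_{i=1}^N\bigl[y_i\log\mu_i+(1-y_i)\log(1-\mu_i)\bigr]. Unlike linear regression, there is no closed-form solution, so the maximization is done numerically. A minimal gradient-ascent sketch, assuming a design matrix X whose rows are the feature vectors (1, x_{i1}, \ldots, x_{id}) and a 0/1 label vector y (the function name, learning rate, and iteration count are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    """Maximize the log-likelihood by batch gradient ascent.

    X : (N, d+1) array whose first column is all ones
    y : (N,) array of 0/1 labels
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(X @ w)                    # current P(y = 1 | x_i; w) for each sample
        w += lr * (X.T @ (y - mu)) / len(y)    # gradient of the average log-likelihood
    return w
```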

Metrics for Classification

In regression, we used the Mean Squared Error (MSE) and \mathcal{R}^2 as measures of “goodness of fit”. In classification, we use a different set of metrics:

  • Confusion Matrix
  • Accuracy
  • Precision, Recall, and F1-score

Confusion Matrix

In the binary classification problem, the confusion matrix turns out to be a 2\times2 matrix, which captures to what degree a classifier misclassifies the (training/test) data.

Using the convention that \tt 1=POSITIVE and \tt 0=NEGATIVE, we define:

  • \text{\color{green}True Positive} (TP): (as desired) an observation is classified by the classifier as POSITIVE while its original label was POSITIVE;
  • \text{\color{green}True Negative} (TN): (as desired) an observation is classified by the classifier as NEGATIVE while its original label was NEGATIVE;
  • \text{\color{red}False Positive} (FP): an observation is (mis)classified by the classifier as POSITIVE while its original label was NEGATIVE;
  • \text{\color{red}False Negative} (FN): an observation is (mis)classified by the classifier as NEGATIVE while its original label was POSITIVE.

Since the above cases are exhaustive, we have the following Venn diagram:

Figure 26.2: The Venn Diagram

An even better representation scheme is to report the number of observations that fall into the above four cases as a matrix: the confusion matrix.

Figure 26.3: The Confusion Matrix
It’s evident from Figure 26.3 above that \begin{aligned} P &= TP + FN\\ N &= FP + TN\\ \hat P &=TP + FP\\ \hat N &=TN + FN. \end{aligned}
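For concreteness, a minimal sketch of counting these four cases from 0/1 labels and predictions (the row/column layout below is actual-by-predicted and may differ from the ordering used in Figure 26.3):

```python
import numpy as np

def confusion_matrix(y_true, y_pred):
    """2x2 confusion matrix with rows = actual (1, 0) and columns = predicted (1, 0)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return np.array([[tp, fn],
                     [fp, tn]])
```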
Note: Accuracy

The accuracy measures the rate of correct classification: acc = \frac{TP+TN}{TP+TN+FP+FN} \tag{26.3}

For imbalanced data, it's recommended to use class-specific precision as a more robust performance metric for a classifier.

Note: Precision

The precision of a class measures the rate of correct classification among observations classified as that class: p_{\tt 1} = \frac{TP}{TP+FP}\text{ and } p_{\tt 0} = \frac{TN}{TN+FN}. \tag{26.4}

Note: Recall

The recall of a class measures the rate of correct classification among observations whose true label is that class: r_{\tt 1} = \frac{TP}{TP+FN}\text{ and } r_{\tt 0} = \frac{TN}{TN+FP}. \tag{26.5}

Note: F1

The F1 scores of the positive and negative classes are defined, respectively, as follows: f_{\tt 1} = \frac{2p_{\tt 1}r_{\tt 1}}{p_{\tt 1}+r_{\tt 1}}\text{ and } f_{\tt 0} = \frac{2p_{\tt 0}r_{\tt 0}}{p_{\tt 0}+r_{\tt 0}}. \tag{26.6}
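The sketch below collects Equations 26.3 through 26.6 in one place, taking the four counts TP, FN, FP, TN as inputs (function and variable names are illustrative):

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, per-class precision/recall, and F1 scores from the four counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)        # Equation 26.3
    p1, p0 = tp / (tp + fp), tn / (tn + fn)      # Equation 26.4 (precision)
    r1, r0 = tp / (tp + fn), tn / (tn + fp)      # Equation 26.5 (recall)
    f1 = 2 * p1 * r1 / (p1 + r1)                 # Equation 26.6 (F1, positive class)
    f0 = 2 * p0 * r0 / (p0 + r0)                 # Equation 26.6 (F1, negative class)
    return acc, (p1, p0), (r1, r0), (f1, f0)
```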

Note: Confusion Matrix

Compute the accuracy of a model that renders the following confusion matrix: \begin{bmatrix} 22 & 10 \\ 5 & 25 \end{bmatrix}.
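(One way to check your answer, assuming the diagonal entries 22 and 25 count the correctly classified observations: by Equation 26.3, acc=\frac{22+25}{22+10+5+25}=\frac{47}{62}\approx 0.76.)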

