11.1 K Nearest Neighbors



The K nearest neighbors (KNN) method is one of the simplest classifiers: a non-parametric, supervised model that can be used for both binary and multiclass classification.

Tip Non-parametric Classifiers

Exercise 30.1 Can you name another non-parametric classifier?

The idea behind KNN is very simple. To classify a new test data point with feature vector \bold{x}:

  1. we find the K closest examples to \bold{x} in the training set \mathcal{D}, denoted \mathcal{N}_K(\bold{x}, \mathcal{D});

  2. we then look at their labels to derive a posterior probability distribution over the output classes for the local region \mathcal{N}_K(\bold{x}, \mathcal{D}) around \bold{x}. In other words, p(y=c\mid \bold{x}, \mathcal{D})\coloneqq\frac{1}{K}\sum_{n\in\mathcal{N}_K(\bold{x}, \mathcal{D})}\mathbb{I}(y_n=c). \tag{30.1}

The main hyperparameter here is K: the number of neighbors to poll.
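As a concrete illustration, Eq. (30.1) can be computed directly with a few lines of NumPy. This is only a minimal sketch: the function name knn_posterior and the array shapes are assumptions, not part of the demo below.

```python
# Minimal sketch of the KNN posterior in Eq. (30.1); names are illustrative.
import numpy as np

def knn_posterior(x, X_train, y_train, K, n_classes):
    """Estimate p(y = c | x, D) as the fraction of the K nearest
    training points whose label equals c."""
    # Euclidean distance from the query point x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K closest training examples: N_K(x, D)
    neighbors = np.argsort(dists)[:K]
    # Fraction of those neighbors carrying each class label
    return np.bincount(y_train[neighbors], minlength=n_classes) / K
```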

Tip Parametric vs Non-parametric

Exercise 30.2 Why do you think KNN is non-parametric?

Tip Soft vs Hard Labels

Exercise 30.3 Name a classifier that outputs hard labels instead of (conditional) class probabilities.

A Demo

Let us assume a classification task with two features \bold{x}=(x_1, x_2) and two classes \{{\color{green}\tt 0}, {\color{red}\tt 1}\}.

Some example scenarios are

  1. spam detection: x_1,x_2 can be counts of two particular phrases or words with labels \{{\color{green}\tt 0}={\color{green}\tt INBOX}, {\color{red}\tt 1}={\color{red}\tt SPAM}\}

  2. disease detection: x_1,x_2 can be measurements of the tumor with labels \{{\color{green}\tt 0}={\color{green}\tt NEG}, {\color{red}\tt 1}={\color{red}\tt POS}\}

We first generate our training and validation data \mathcal{D}_\text{train} and \mathcal{D}_\text{val}, with N examples in total.

Tip

Exercise 30.4 Do you think this data represents the problem of spam detection?

Play around with the class proportion r=\frac{\#\tt 0}{\#\tt 1}.
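If you want to reproduce something similar offline, here is one hypothetical way to generate such a two-feature, two-class dataset with a controllable class proportion r. The cluster centres, sizes, and the helper name make_toy_data are assumptions, not the data used in the demo.

```python
# Sketch: synthetic 2-feature, 2-class data with class proportion r = #0 / #1.
import numpy as np
from sklearn.model_selection import train_test_split

def make_toy_data(N=300, r=1.0, seed=0):
    """Return N points with features (x1, x2) and labels {0, 1}."""
    rng = np.random.default_rng(seed)
    n1 = int(round(N / (1 + r)))    # number of class-1 points, since N = n1 * (1 + r)
    n0 = N - n1                     # number of class-0 points
    X0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(n0, 2))  # class 0 cluster
    X1 = rng.normal(loc=[+1.0, +1.0], scale=1.0, size=(n1, 2))  # class 1 cluster
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
    return X, y

X, y = make_toy_data(N=300, r=2.0)  # twice as many 0s as 1s
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
```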

Applying KNN

In order to create our KNN classifier, we need to pick an integer value for K\geq1. Note that K is a hyperparameter.

Tip Hyperparameter Tuning

Exercise 30.5 How is a hyperparameter tuned or chosen in a model?

To apply KNN, we first choose K:

Now, click on a test point to see its K nearest neighbors in the training set.

Note Class Probabilities

For each test point, we get the estimated class probability p({\tt 1}\mid\bold{x}) by computing the fraction of \color{red}\tt red neighbors among the K nearest neighbors.

Finally, we predict the (hard) label by choosing a threshold. For a given threshold \tau, we get the following prediction.
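As a rough sketch of these steps with scikit-learn, assuming the X_train, y_train, X_val, y_val split from the toy-data snippet above, with K = 5 and \tau = 0.5 as example values:

```python
# Sketch: fit KNN, inspect nearest neighbors, estimate p(1 | x), threshold it.
from sklearn.neighbors import KNeighborsClassifier

K = 5
knn = KNeighborsClassifier(n_neighbors=K)
knn.fit(X_train, y_train)

# Indices (into X_train) of the K nearest neighbors of each validation point
dists, neighbor_idx = knn.kneighbors(X_val)

# Estimated p(1 | x): the fraction of the K neighbors labelled 1, as in Eq. (30.1)
p1_val = knn.predict_proba(X_val)[:, 1]

# Hard labels at threshold tau = 0.5 (for odd K this matches knn.predict)
tau = 0.5
y_hat = (p1_val >= tau).astype(int)
```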

We will discuss the consequences of choosing a different threshold in the next subsection.

Choosing K

The KNN classifier has K as the only hyperparameter. As discussed last week, we use either validation or cross-validation to choose an appropriate value of K that minimizes the validation error on \mathcal{D}_\text{val}.
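A minimal sketch of this selection on the held-out validation set, assuming the variables from the snippets above; the candidate grid is an arbitrary example, and cross_val_score from sklearn.model_selection gives the cross-validated variant.

```python
# Sketch: pick the K with the lowest validation error.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

candidate_Ks = [1, 3, 5, 7, 9, 15, 25]
val_errors = []
for K in candidate_Ks:
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    val_errors.append(1.0 - knn.score(X_val, y_val))  # misclassification rate

best_K = candidate_Ks[int(np.argmin(val_errors))]
```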

Choosing the Threshold

For any fixed threshold \tau, we label the predicted class by considering the following decision rule: \hat{y}_\tau(\bold{x})\coloneqq \begin{cases} {\color{red}\tt 1}&\text{ if } p({\tt 1}\mid\bold{x}) \geq \tau \\ {\color{green}\tt 0}&\text{ if } p({\tt 1}\mid\bold{x}) < \tau \end{cases} The most common choice for \tau is 0.5. However, a different value of \tau can be used when the default is not deemed ideal.

Change the default \tau below to see how accuracy, precision, recall, etc. change.

Figure 30.1: The Confusion Matrix
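The same sweep can be sketched in code, assuming the validation scores p1_val and labels y_val from the snippets above; the three \tau values are arbitrary examples.

```python
# Sketch: recompute confusion-matrix counts and derived metrics for several tau.
import numpy as np

for tau in [0.3, 0.5, 0.7]:
    y_hat = (p1_val >= tau).astype(int)
    tp = np.sum((y_hat == 1) & (y_val == 1))  # true positives
    fp = np.sum((y_hat == 1) & (y_val == 0))  # false positives
    fn = np.sum((y_hat == 0) & (y_val == 1))  # false negatives
    tn = np.sum((y_hat == 0) & (y_val == 0))  # true negatives
    accuracy = (tp + tn) / len(y_val)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    print(f"tau={tau:.1f}  acc={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```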
Tip Precision vs Recall

Exercise 30.6 Which of the two should be prioritized in spam detection, and which in cancer diagnosis?

Consider a scenario where a predictive model is being deployed to assist physicians in detecting tumors. In this setting, physicians will most likely be interested in identifying all patients with cancer and not missing anyone with cancer so that they can provide them with the right treatment.

In other words, physicians prioritize achieving a high recall rate. This emphasis on recall comes, of course, with the trade-off of potentially more false-positive predictions, reducing the precision of the model. That is a risk physicians are willing to take because the cost of a missed cancer is much higher than the cost of further diagnostic tests. Consequently, when it comes to deciding whether to classify a patient as having cancer or not, it may be more beneficial to classify them as positive for cancer when the conditional probability estimate is much lower than 0.5.


Receiver Operating Characteristic (ROC)

A receiver operating characteristic curve, or ROC curve, is a graph that illustrates the performance of a binary classifier model at varying threshold values. ROC analysis is commonly applied in the assessment of diagnostic test performance in clinical epidemiology.

The ROC curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields, starting in 1941, which led to its name (“receiver operating characteristic”).

The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold \tau.
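A short sketch of tracing this curve with scikit-learn's roc_curve, again assuming the validation scores p1_val and labels y_val from the snippets above:

```python
# Sketch: one (FPR, TPR) point per threshold, plus the chance diagonal.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_val, p1_val)
plt.plot(fpr, tpr, label="KNN")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.legend()
plt.show()
```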

Area Under Curve (AUC)

A point summary of the ROC curve is the area under the curve (AUC).

Tip AUC

Exercise 30.7 Compute the AUC.