11.1 K Nearest Neighbors



The K nearest neighbors (KNN) method is one of the simplest classifiers: a non-parametric, supervised model that can be used for both binary and multiclass classification.

Tip Non-parametric Classifiers

Exercise 30.1 Can you name another non-parametric classifier?

The idea behind KNN is very simple. To classify a new test data point with feature vector \bold{x}:

  1. we find the K closest examples to \bold{x} in the training set \mathcal{D}, denoted \mathcal{N}_K(\bold{x}, \mathcal{D});

  2. we then look at their labels to derive a posterior probability distribution over the output classes for the local region \mathcal{N}_K(\bold{x}, \mathcal{D}) around \bold{x}. In other words, p(y=c\mid \bold{x}, \mathcal{D})\coloneqq\frac{1}{K}\sum_{n\in\mathcal{N}_K(\bold{x}, \mathcal{D})}\mathbb{I}(y_n=c). \tag{30.1}

The main hyperparameter here is K: the number of neighbors to poll.
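As a concrete illustration, Eq. (30.1) can be computed directly with a few lines of NumPy. This is only a minimal sketch: the function name knn_posterior and the array shapes are assumptions, not part of the demo below.

```python
# Minimal sketch of the KNN posterior in Eq. (30.1); names are illustrative.
import numpy as np

def knn_posterior(x, X_train, y_train, K, n_classes):
    """Estimate p(y = c | x, D) as the fraction of the K nearest
    training points whose label equals c."""
    # Euclidean distance from the query point x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K closest training examples: N_K(x, D)
    neighbors = np.argsort(dists)[:K]
    # Fraction of those neighbors carrying each class label
    return np.bincount(y_train[neighbors], minlength=n_classes) / K
```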

Tip Parametric vs Non-parametric

Exercise 30.2 Why do you think KNN is non-parametric?

Tip Soft vs Hard Labels

Exercise 30.3 Name a classifier that outputs hard labels instead of (conditional) class probabilities.

A Demo

Let us assume a classification task with two features \bold{x}=(x_1, x_2) and two classes \{{\color{green}\tt 0}, {\color{red}\tt 1}\}.

Some example scenarios are

  1. spam detection: x_1,x_2 can be counts of two particular phrases or words with labels \{{\color{green}\tt 0}={\color{green}\tt INBOX}, {\color{red}\tt 1}={\color{red}\tt SPAM}\}

  2. disease detection: x_1,x_2 can be measurements of the tumor with labels \{{\color{green}\tt 0}={\color{green}\tt NEG}, {\color{red}\tt 1}={\color{red}\tt POS}\}

We first generate our training and validation data \mathcal{D}_\text{train} and \mathcal{D}_\text{val}, with N examples in total.

Tip

Exercise 30.4 Do you think this data represents the problem of spam detection?

Play around with the class proportion r=\frac{\#\tt 0}{\#\tt 1}.
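If you want to reproduce something similar offline, here is one hypothetical way to generate such a two-feature, two-class dataset with a controllable class proportion r. The cluster centres, sizes, and the helper name make_toy_data are assumptions, not the data used in the demo.

```python
# Sketch: synthetic 2-feature, 2-class data with class proportion r = #0 / #1.
import numpy as np
from sklearn.model_selection import train_test_split

def make_toy_data(N=300, r=1.0, seed=0):
    """Return N points with features (x1, x2) and labels {0, 1}."""
    rng = np.random.default_rng(seed)
    n1 = int(round(N / (1 + r)))    # number of class-1 points, since N = n1 * (1 + r)
    n0 = N - n1                     # number of class-0 points
    X0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(n0, 2))  # class 0 cluster
    X1 = rng.normal(loc=[+1.0, +1.0], scale=1.0, size=(n1, 2))  # class 1 cluster
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
    return X, y

X, y = make_toy_data(N=300, r=2.0)  # twice as many 0s as 1s
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
```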

Applying KNN

In order to create our KNN classifier, we need to pick an integer value for K\geq1. Note that K is a hyperparameter.

Tip Hyperparameter Tuning

Exercise 30.5 How is a hyperparameter tuned or chosen in a model?

To apply KNN, we first choose K:

Now, click on a test point to see its K nearest neighbors in the training set.

Note Class Probabilities

For each test point, we get the estimated class probability p({\tt 1}\mid\bold{x}) by computing the fraction of \color{red}\tt red neighbors among the K nearest neighbors.

Finally, we predict the (hard) label by choosing a threshold. For a given threshold \tau, we get the following prediction.
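As a rough sketch of these steps with scikit-learn, assuming the X_train, y_train, X_val, y_val split from the toy-data snippet above, with K = 5 and \tau = 0.5 as example values:

```python
# Sketch: fit KNN, inspect nearest neighbors, estimate p(1 | x), threshold it.
from sklearn.neighbors import KNeighborsClassifier

K = 5
knn = KNeighborsClassifier(n_neighbors=K)
knn.fit(X_train, y_train)

# Indices (into X_train) of the K nearest neighbors of each validation point
dists, neighbor_idx = knn.kneighbors(X_val)

# Estimated p(1 | x): the fraction of the K neighbors labelled 1, as in Eq. (30.1)
p1_val = knn.predict_proba(X_val)[:, 1]

# Hard labels at threshold tau = 0.5 (for odd K this matches knn.predict)
tau = 0.5
y_hat = (p1_val >= tau).astype(int)
```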

We will discuss the consequences of choosing a different threshold in the next subsection.

Choosing K

The KNN classifier has K as the only hyperparameter. As discussed last week, we use either validation or cross-validation to choose an appropriate value of K that minimizes the validation error on \mathcal{D}_\text{val}.
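A minimal sketch of this selection on the held-out validation set, assuming the variables from the snippets above; the candidate grid is an arbitrary example, and cross_val_score from sklearn.model_selection gives the cross-validated variant.

```python
# Sketch: pick the K with the lowest validation error.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

candidate_Ks = [1, 3, 5, 7, 9, 15, 25]
val_errors = []
for K in candidate_Ks:
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_train, y_train)
    val_errors.append(1.0 - knn.score(X_val, y_val))  # misclassification rate

best_K = candidate_Ks[int(np.argmin(val_errors))]
```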

Choosing the Threshold

For any fixed threshold \tau, we label the predicted class by considering the following decision rule: \hat{y}_\tau(\bold{x})\coloneqq \begin{cases} {\color{red}\tt 1}&\text{ if } p({\tt 1}\mid\bold{x}) \geq \tau \\ {\color{green}\tt 0}&\text{ if } p({\tt 1}\mid\bold{x}) < \tau \end{cases} The most common choice for \tau is 0.5. However, a different value of \tau can be used when the default is not deemed ideal.

Change the default \tau below to see how accuracy, precision, recall, etc. change.

Figure 30.1: The Confusion Matrix
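The same sweep can be sketched in code, assuming the validation scores p1_val and labels y_val from the snippets above; the three \tau values are arbitrary examples.

```python
# Sketch: recompute confusion-matrix counts and derived metrics for several tau.
import numpy as np

for tau in [0.3, 0.5, 0.7]:
    y_hat = (p1_val >= tau).astype(int)
    tp = np.sum((y_hat == 1) & (y_val == 1))  # true positives
    fp = np.sum((y_hat == 1) & (y_val == 0))  # false positives
    fn = np.sum((y_hat == 0) & (y_val == 1))  # false negatives
    tn = np.sum((y_hat == 0) & (y_val == 0))  # true negatives
    accuracy = (tp + tn) / len(y_val)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    print(f"tau={tau:.1f}  acc={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```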
Tip Precision vs Recall

Exercise 30.6 Which of the two should be prioritized in spam detection, and which in cancer diagnosis?

Consider a scenario where a predictive model is being deployed to assist physicians in detecting tumors. In this setting, physicians will most likely be interested in identifying all patients with cancer and not missing anyone with cancer so that they can provide them with the right treatment.

In other words, physicians prioritize achieving a high recall rate. This emphasis on recall comes, of course, with the trade-off of potentially more false-positive predictions, reducing the precision of the model. That is a risk physicians are willing to take because the cost of a missed cancer is much higher than the cost of further diagnostic tests. Consequently, when it comes to deciding whether to classify a patient as having cancer or not, it may be more beneficial to classify them as positive for cancer when the conditional probability estimate is much lower than 0.5.


Receiver Operating Characteristic (ROC)

A receiver operating characteristic curve, or ROC curve, is a graph that illustrates the performance of a binary classifier model at varying threshold values. ROC analysis is commonly applied in the assessment of diagnostic test performance in clinical epidemiology.

The ROC curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields, starting in 1941, which led to its name (“receiver operating characteristic”).

The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold \tau.
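A short sketch of tracing this curve with scikit-learn's roc_curve, again assuming the validation scores p1_val and labels y_val from the snippets above:

```python
# Sketch: one (FPR, TPR) point per threshold, plus the chance diagonal.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_val, p1_val)
plt.plot(fpr, tpr, label="KNN")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.legend()
plt.show()
```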

Area Under Curve (AUC)

A point summary of the ROC curve is the area under the curve (AUC).

Tip AUC

Exercise 30.7 Compute the AUC.