8.1 Machine Learning


Estimated time: 45 minutes

Reading: (James et al. 2021), page 15 (machine learning introduction), page 26 (types of learning).

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R and Python. 2nd ed. Springer Texts in Statistics. Springer. https://doi.org/10.1007/978-1-0716-1418-1.

A well-established definition of machine learning, due to Tom Mitchell (1997), is the following:

A computer program is said to learn from experience E with respect to some class of tasks T, and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Suppose we observe a quantitative response variable Y and p different predictors X_1, X_2, \dots, X_p. We assume that there exists an underlying relationship between Y and the predictors, which can be expressed as:

Y = f(X) + \varepsilon

Here, f is an unknown function that captures the systematic relationship between the predictors X and the response Y, while \varepsilon is a random error term that is independent of X and has mean zero.

This formulation reflects our central assumption: after completing exploratory data analysis (EDA), the goal is to estimate f, producing an estimate \hat{f} that tells us how X influences Y.
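To make the setup concrete, here is a minimal simulation of the model Y = f(X) + \varepsilon (the particular f and noise level below are invented for illustration; in practice f is unknown):

```python
import numpy as np

rng = np.random.default_rng(42)

n = 200
X = rng.uniform(0, 10, size=n)      # a single predictor
f = lambda x: 3.0 + 0.5 * x         # the (normally unknown) systematic part f
eps = rng.normal(0, 1.0, size=n)    # error term: mean zero, independent of X
Y = f(X) + eps                      # the observed response

print(X[:3], Y[:3])
```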

Prediction

In many situations, the predictor variables X are easy to observe, but the response Y is difficult, expensive, or time-consuming to obtain. In such cases, our main objective is prediction rather than interpretation.

Because the error term \varepsilon has mean zero, we can use an estimate \hat{f} of the true function f to make predictions:

\hat{Y} = \hat{f}(X)

Here, the focus is not on the exact form of \hat{f}, but rather on its ability to produce accurate predictions. The function \hat{f} may act as a black box, where interpretability is secondary to predictive performance.
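As a sketch of this black-box use of \hat{f}, the snippet below fits a random forest (one arbitrary choice of flexible model, not the only one) to synthetic data and evaluates \hat{Y} = \hat{f}(X) at new inputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))            # predictors, shape (n, p)
Y = 3.0 + 0.5 * X[:, 0] + rng.normal(0, 1, 200)  # response with mean-zero noise

# Fit a black-box estimate of f; its internal form is not of interest here
f_hat = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, Y)

X_new = np.array([[2.5], [7.0]])  # easy-to-observe predictors
Y_hat = f_hat.predict(X_new)      # hat{Y} = hat{f}(X): the prediction
print(Y_hat)
```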

Inference

In other cases, our goal is not to predict Y, but rather to understand the relationship between the response and the predictors X_1, \dots, X_p. This is the task of inference.

Here, we still estimate f, but we cannot treat \hat{f} as a black box — we need to understand its exact form in order to draw meaningful conclusions.

Typical questions of interest include:

  • Which predictors are associated with the response?
  • What is the nature of the relationship between each predictor and Y?
  • Is this relationship approximately linear, or does it require a more complex form?

In this setting, the emphasis is on interpretability rather than pure predictive accuracy.
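A minimal inference sketch, assuming simulated income data and using ordinary least squares via statsmodels: here the exact form of \hat{f} is a linear equation, so its coefficients and p-values answer the questions above directly.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
age = rng.uniform(20, 65, n)
education = rng.uniform(8, 20, n)
income = 5 + 1.2 * age + 3.0 * education + rng.normal(0, 10, n)  # simulated

X = sm.add_constant(np.column_stack([age, education]))
model = sm.OLS(income, X).fit()
print(model.params)   # estimated coefficients: direction and size of each effect
print(model.pvalues)  # which predictors are associated with the response
```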

Types of Learning

Most statistical learning problems fall into one of two categories:

Figure: Types of Learning in Machine Learning

Supervised Learning

In the supervised learning setting, for each observation i = 1, \dots, n, we observe a predictor measurement \mathbf{x}_i and an associated response value y_i.

This setup, where both inputs and outputs are observed, is the primary focus of this book. The goal is to learn a function that maps \mathbf{x}_i to y_i, either for prediction or inference.

Unsupervised Learning

In unsupervised learning, for each observation i = 1, \dots, n, we observe a vector of features \mathbf{x}_i, but there is no associated response y_i.

This makes the task more challenging, as the goal is to uncover hidden structure or patterns in the data.

Tip: Reinforcement Learning

There is also a third category called Reinforcement Learning (RL), which is beyond the scope of this class (though it is taught in the GWU Data Science program). Reinforcement learning is concerned with learning from trial and error to make decisions over time.

Unlike supervised and unsupervised learning, RL involves learning how to act in uncertain environments where outcomes are delayed and depend on sequences of actions. The goal is to learn an optimal policy for decision-making that maximizes cumulative reward.

Types of Tasks

The tasks \mathcal{T} covered in this class, based on the machine learning types introduced earlier, are:

Figure: Types of Tasks in Machine Learning

Regression

A regression task involves predicting a quantitative (numerical) outcome based on one or more input variables. For example, predicting someone’s income based on their age, education, and job title.

The goal is to model the relationship between predictors and a continuous response.

Y = f(X) + \varepsilon

Y is the quantitative response, X represents the predictor(s), f(X) is an unknown function capturing the relationship between X and Y, and \varepsilon is the irreducible error term.
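A short regression sketch along the lines of the income example (all simulated values below are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
age = rng.uniform(22, 60, size=(300, 1))
income = 20_000 + 1_500 * age[:, 0] + rng.normal(0, 5_000, 300)  # simulated

reg = LinearRegression().fit(age, income)
print(reg.intercept_, reg.coef_)      # estimated f: intercept and slope
print(reg.predict([[30.0], [50.0]]))  # quantitative predictions for new ages
```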

Classification

A classification task involves predicting a qualitative (categorical) outcome, such as assigning a label or class to an observation. For instance, predicting whether an email is spam or not, or diagnosing a patient as having a specific type of cancer.

The goal is to determine which category an observation belongs to.

Y \in \{1, 2, \dots, K\}, \quad \mathbb{P}(Y = k \mid X) = f_k(X)

Y is a categorical response variable taking values in a set of K discrete classes. X represents the predictor(s). The model estimates class probabilities f_k(X) = \mathbb{P}(Y = k \mid X) for each class k \in \{1, 2, \dots, K\}.

The final prediction \hat{Y} is typically assigned to the class with the highest estimated probability:

\hat{Y} = \arg\max_k \mathbb{P}(Y = k \mid X)

This is known as the Bayes classifier when using the true underlying probabilities.
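The snippet below sketches this two-step logic, with logistic regression standing in for the estimated probabilities f_k(X) (the data are synthetic): predict_proba estimates \mathbb{P}(Y = k \mid X), and the argmax rule assigns the most probable class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 300
X = rng.normal(0, 1, size=(n, 2))
Y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)  # classes 0, 1

clf = LogisticRegression().fit(X, Y)

X_new = np.array([[1.0, 0.5], [-1.0, -0.5]])
probs = clf.predict_proba(X_new)  # estimated P(Y = k | X) for each class k
Y_hat = np.argmax(probs, axis=1)  # assign the class with highest probability
print(probs)
print(Y_hat)                      # same result as clf.predict(X_new)
```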

Clustering

A clustering task involves identifying groups or clusters of similar observations without knowing any response labels. The goal is to discover structure in the data based only on the predictors.

For instance, clustering might be used to segment customers by purchasing behavior, group articles by topic, or detect communities in a network.

\mathbf{x}_1, \dots, \mathbf{x}_n \quad \rightarrow \quad \text{Cluster Assignments } \{1, 2, \dots, K\}

There is no response variable Y in clustering. We are given only feature vectors \mathbf{x}_i, and the task is to assign each observation to one of K clusters based on similarity.

The clustering algorithm groups observations so that those within the same cluster are more similar (according to some distance or similarity metric) than those in different clusters.
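A minimal clustering sketch with k-means (the choice K = 3 and the synthetic blobs are illustrative only): only feature vectors go in, and cluster assignments come out.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Feature vectors only: no response labels are used below
X, _ = make_blobs(n_samples=300, centers=3, random_state=4)

km = KMeans(n_clusters=3, n_init=10, random_state=4).fit(X)
labels = km.labels_         # cluster assignment in {0, 1, 2} per observation
print(np.bincount(labels))  # cluster sizes
```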