1.2 What is Reinforcement Learning?
Good news! We can sum up the core idea of Reinforcement Learning in just one powerful sentence (Brunskill 2022):
Reinforcement Learning is learning to make optimal sequential decisions.
But what exactly does that mean? Let’s break it down!
Learning
At its core, learning in Reinforcement Learning occurs through trial and error, where an agent refines its actions based on evaluative feedback from the environment.
Evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action. Intuitively, this type of feedback can be thought of as learning through experience.
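To make this concrete, below is a minimal sketch of evaluative feedback using a made-up two-armed bandit (an illustration, not from the source): the agent only ever observes a scalar reward for the arm it actually pulled, never which arm would have been best.

```python
import random

# Hypothetical two-armed bandit; the reward probabilities are made up
# and hidden from the agent.
ARM_REWARD_PROBS = [0.3, 0.7]

def pull(arm: int) -> float:
    """Evaluative feedback: a scalar reward for the chosen arm only."""
    return 1.0 if random.random() < ARM_REWARD_PROBS[arm] else 0.0

estimates = [0.0, 0.0]  # running average reward per arm
counts = [0, 0]

for _ in range(1000):
    arm = random.randrange(2)   # trial...
    reward = pull(arm)          # ...and error: learn from the outcome
    counts[arm] += 1
    # Incremental average: estimate += (reward - estimate) / n
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # roughly [0.3, 0.7], learned from consequences alone
```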
Example: Learning How to Walk

Learning to walk involves trial and error, where feedback comes from the outcome of each attempt—success or falling—rather than explicit instruction. This aligns with evaluative feedback in reinforcement learning, where the agent learns from the consequences of its actions, not direct guidance.
Example: Learning to Distinguish Right from Wrong

Learning to distinguish right from wrong often relies on experiencing the outcomes of decisions and receiving approval or disapproval from others. This reflects evaluative feedback in reinforcement learning, where behavior is shaped by rewards or penalties rather than explicit rules.
This contrasts with both Supervised and Unsupervised Learning, which rely on instructive feedback delivered through gradient-based optimization.
Instructive feedback indicates the correct action to take, independently of the action actually taken. Intuitively, this type of feedback can be thought of as learning from ground truth.
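By contrast, here is a minimal sketch of instructive feedback using a made-up one-parameter regression problem (again an illustration, not from the source): the ground-truth label tells the model what the correct output should have been, regardless of what it predicted.

```python
# Hypothetical supervised setup: learn y = 2x from labeled pairs.
w = 0.0
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, ground-truth label)

for _ in range(100):
    for x, y in data:
        pred = w * x
        # Instructive feedback: the error (pred - y) points directly at
        # the correct answer, independent of the action actually taken.
        grad = 2 * (pred - y) * x
        w -= 0.01 * grad  # gradient-based optimization step

print(w)  # roughly 2.0, recovered from ground truth
```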
Example: Cheetah
Supervised and unsupervised learning focus on identifying what makes an image a cheetah by learning patterns from a dataset of animal images. In contrast, Reinforcement Learning is about teaching a cheetah how to run through interaction with its environment (Lecture 10).

“Here are some examples (images), now learn patterns in these examples…”

“Here’s an environment, now learn patterns by exploring it…”
Optimal
The goal of Reinforcement Learning is to maximize rewards over time by finding the best possible strategy. This involves seeking:
- A maximized discounted sum of rewards, known as the return \(G\).
- Optimal Value Functions \(V^{*}\).
- Optimal Action-Value Functions \(Q^{*}\).
- Optimal Policies \(\pi^{*}\).
- A balance between exploration and exploitation, e.g., via an \(\epsilon\)-greedy strategy (sketched below).
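As a rough illustration (with made-up rewards, Q-values, and hyperparameters), the return \(G\) and an \(\epsilon\)-greedy rule might look like this in code:

```python
import random

def discounted_return(rewards: list[float], gamma: float = 0.9) -> float:
    """G = R_1 + gamma * R_2 + gamma^2 * R_3 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def epsilon_greedy(q_values: list[float], epsilon: float = 0.1) -> int:
    """With probability epsilon explore a random action; else exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.9 + 0.81 = 2.71
print(epsilon_greedy([0.2, 0.8, 0.5]))     # usually action 1
```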
Sequential Decision-Making
Unlike a one-time choice, Reinforcement Learning involves a chain of decisions where each action affects the next.

\(\tau: S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, \dots, S_{T-1}, A_{T-1}, R_{T}\)
- A Markov Decision Process (MDP) is the formal framework for modeling this kind of sequential decision-making.
- The agent selects actions \(A_t\) over multiple time steps, shaping its future states \(S_t\) and rewards \(R_t\).
- Each decision affects not only immediate rewards but also the trajectory \(\tau\) of future outcomes, as the rollout sketch below illustrates.
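Tying these pieces together, here is a minimal rollout sketch in a made-up corridor MDP (all dynamics below are assumptions for illustration): a random policy generates one trajectory \(\tau\) of states, actions, and rewards, with each action shaping the next state.

```python
import random

N_STATES, GOAL = 5, 4  # hypothetical 1-D corridor; state 4 is the goal

def step(state: int, action: int) -> tuple[int, float]:
    """Made-up environment dynamics: action 0 = left, 1 = right."""
    move = 1 if action == 1 else -1
    next_state = max(0, min(N_STATES - 1, state + move))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

# Roll out one trajectory tau: S_0, A_0, R_1, S_1, A_1, R_2, ...
state, trajectory = 0, []
while state != GOAL:
    action = random.randrange(2)        # a random policy, for simplicity
    next_state, reward = step(state, action)
    trajectory.append((state, action, reward))
    state = next_state                  # each action shapes the next state

print(trajectory)  # the chain of decisions that led to the goal
```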