1.2 What is Reinforcement Learning?

Now that we have answered the fundamental question of why, it’s time for the next big mystery:

What is Reinforcement Learning? 🤔

Good news! We can sum up the core idea of Reinforcement Learning in just one powerful sentence (Brunskill 2022):

Learning Optimal Sequential Decision-Making Under Uncertainty

But what exactly does that mean? Let’s break it down!

Learning

At its core, learning in Reinforcement Learning occurs through trial and error, where an agent refines its actions based on evaluative feedback from the environment.

Tip: Evaluative Feedback

Evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action. Intuitively, this type of feedback can be thought of as learning through experience.
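Evaluative feedback can be made concrete with a toy multi-armed bandit: the agent only ever observes a noisy reward for the arm it actually pulled, never which arm was best. This is a minimal sketch with illustrative arm means, not from any library:

```python
import random

# Toy 3-armed bandit: the agent only sees a noisy reward for the arm it
# pulled (evaluative feedback), never which arm was best.
# The arm means below are illustrative values chosen for this sketch.
ARM_MEANS = [0.2, 0.5, 0.8]

def pull(arm: int) -> float:
    """Return a noisy reward for the chosen arm."""
    return ARM_MEANS[arm] + random.gauss(0, 0.1)

random.seed(0)
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]

for _ in range(300):
    arm = random.randrange(3)            # trial and error: pick any arm
    reward = pull(arm)                   # feedback says "how good", not "which is best"
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

best = max(range(3), key=lambda a: estimates[a])
print(best)  # -> 2: the agent infers the best arm purely from experience
```

Notice that no step ever tells the agent the correct arm; the ranking emerges only from accumulated evaluative feedback.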

Example: Learning How to Walk

Learning to Walk

Learning to walk involves trial and error, where feedback comes from the outcome of each attempt—success or falling—rather than explicit instruction. This aligns with evaluative feedback in reinforcement learning, where the agent learns from the consequences of its actions, not direct guidance.

Example: Learning to Distinguish Right from Wrong

Learning to Distinguish Right from Wrong

Learning to distinguish right from wrong often relies on experiencing the outcomes of decisions and receiving approval or disapproval from others. This reflects evaluative feedback in reinforcement learning, where behavior is shaped by rewards or penalties rather than explicit rules.

This contrasts with Supervised and Unsupervised Learning, both of which rely on instructive feedback delivered through gradient-based optimization.

Tip: Instructive Feedback

Instructive feedback indicates the correct action to take, independently of the action actually taken. Intuitively, this type of feedback can be thought of as learning from ground truth.
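To contrast with the bandit setting, here is a minimal sketch of instructive feedback: a toy 1-D classifier where each training example carries the correct label, independent of what the learner predicted. The dataset and threshold-update rule are illustrative, not a standard algorithm:

```python
# Instructive feedback: the learner is told the correct answer for each
# input, regardless of what it predicted. Toy data for illustration only.
dataset = [(0.0, 0), (0.2, 0), (0.8, 1), (1.0, 1)]  # (feature, true label)

threshold = 0.0  # hypothetical 1-D classifier: predict 1 if x >= threshold
for x, label in dataset:
    prediction = int(x >= threshold)
    # The ground-truth label itself is the feedback: we know exactly what
    # the right answer was, not merely "how good" the guess turned out.
    if prediction == 1 and label == 0:
        threshold = x + 0.1   # move the boundary above the misclassified 0
    elif prediction == 0 and label == 1:
        threshold = x         # move the boundary down to include this 1

predictions = [int(x >= threshold) for x, _ in dataset]
print(predictions)  # -> [0, 0, 1, 1]
```

The key difference from the bandit sketch: here every update uses the known correct answer (ground truth), whereas evaluative feedback only scores the action that was taken.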

Supervised/Unsupervised Learning

Example: Cheetah

Supervised/Unsupervised learning focuses on identifying what makes an image a cheetah by learning patterns from a dataset of animal images. In contrast, Reinforcement Learning is about teaching a cheetah how to run by interacting with its environment (Lecture 10).

Supervised Learning

“Here are some examples (images), now learn patterns in these examples…”

Reinforcement Learning

“Here’s an environment, now learn patterns by exploring it…”
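The "here's an environment" idea can be sketched as an interaction loop: instead of iterating over a fixed dataset, the agent acts, observes the outcome, and acts again. The corridor environment below is illustrative, not from any library:

```python
import random

# Minimal sketch of the RL interaction loop: no dataset, just an
# environment the agent explores. The 5-cell "corridor" and its reward
# (+1 for reaching the rightmost cell) are illustrative assumptions.
def step(state: int, action: int):
    """Move left (-1) or right (+1); reward 1.0 at the corridor's end."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

random.seed(1)
state, total_reward = 0, 0.0
for _ in range(50):                      # "explore the environment"
    action = random.choice([-1, +1])     # trial-and-error action choice
    state, reward, done = step(state, action)
    total_reward += reward
    if done:
        state = 0                        # restart an episode at the goal
```

The loop's shape (act, observe state and reward, repeat) mirrors the `reset`/`step` interface used by common RL libraries such as Gymnasium.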

Optimal

The goal of Reinforcement Learning is to maximize rewards over time by finding the best possible strategy. This involves seeking:

  • A maximized discounted sum of rewards, the return \(G\).
  • Optimal Value Functions \(V^{*}\).
  • Optimal Action-Value Functions \(Q^{*}\).
  • Optimal Policies \(\pi^{*}\).
  • A balance between exploration and exploitation, often tuned via an exploration rate \(\epsilon\).
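Two of the quantities above can be sketched in a few lines: computing a discounted return \(G\), and an \(\epsilon\)-greedy rule that balances exploration and exploitation. The reward sequence and discount factor are illustrative values:

```python
import random

# Discounted return: G_0 = R_1 + gamma*R_2 + gamma^2*R_3 + gamma^3*R_4.
# The rewards and gamma below are illustrative, not from the text.
rewards = [1.0, 0.0, 2.0, 3.0]   # hypothetical rewards R_1..R_4
gamma = 0.9                      # discount factor

G = sum(gamma**t * r for t, r in enumerate(rewards))
print(round(G, 3))  # -> 4.807

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

With `epsilon=0.0` the rule always exploits (pure greedy); with `epsilon=1.0` it always explores, so \(\epsilon\) directly controls the trade-off.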

Sequential Decision-Making

Unlike a one-time choice, Reinforcement Learning involves a chain of decisions where each action affects the next.

Markov Decision Process

\(\tau: S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, \dots, S_{T-1}, A_{T-1}, R_{T}\)

  • A Markov Decision Process (MDP) is a formal framework for modeling sequential decision-making.
  • The agent selects actions \(A_t\) over multiple time steps, shaping its future states \(S_t\) and rewards \(R_t\).
  • Each decision affects not only immediate rewards but also the trajectory \(\tau\) of future outcomes.
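The bullets above can be sketched as an MDP rollout that produces exactly the trajectory \(S_0, A_0, R_1, S_1, A_1, R_2, \dots, R_T\). The two-state chain, its dynamics, and the random policy are illustrative assumptions:

```python
import random

# Sketch of an MDP rollout. Toy dynamics: action 1 toggles the state,
# and landing in state 1 pays +1. This chain is illustrative, not a
# standard benchmark.
ACTIONS = [0, 1]

def transition(state: int, action: int):
    next_state = state ^ action          # XOR: action 1 flips the state
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

random.seed(0)
trajectory = []
state = 0                                 # S_0
for t in range(3):                        # T = 3 time steps
    action = random.choice(ACTIONS)       # A_t, here from a random policy
    next_state, reward = transition(state, action)
    trajectory += [state, action, reward] # ..., S_t, A_t, R_{t+1}, ...
    state = next_state                    # each decision shapes the future
trajectory.append(state)                  # terminal state S_T
print(len(trajectory))  # -> 10: three (S, A, R) triples plus S_T
```

Because each `next_state` feeds back in as the next step's `state`, every action influences not just the immediate reward but the whole remaining trajectory \(\tau\).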