1.2 What is Reinforcement Learning?
Good news! We can sum up the core idea of Reinforcement Learning in just one powerful sentence (Brunskill 2022):
Learning Optimal Sequential Decision-Making Under Uncertainty
But what exactly does that mean? Let’s break it down!
Learning
At its core, learning in Reinforcement Learning occurs through trial and error, where an agent refines its actions based on evaluative feedback from the environment.
Intuition: Learning through experience
This contrasts with supervised and unsupervised learning, which rely on instructive feedback from a fixed dataset, typically through gradient-based optimization.
Intuition: Learning through ground truth
For example, supervised and unsupervised learning focus on identifying what makes an image a cheetah by learning patterns from a dataset of animal images. In contrast, Reinforcement Learning is about teaching a cheetah how to run by interacting with its environment (Lecture 10).
“Here are some examples (images), now learn patterns in these examples…”
“Here’s an environment, now learn patterns by exploring it…”
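The contrast above can be sketched as trial-and-error learning in miniature. Below is a hypothetical two-armed bandit (the arm probabilities, seed, and variable names are illustrative assumptions, not from the lecture): the agent is never told the correct action, yet its value estimates are refined purely from evaluative reward feedback.

```python
import random

# Hypothetical 2-armed bandit: hidden reward probabilities (assumed values).
true_means = [0.3, 0.7]
estimates = [0.0, 0.0]   # agent's running value estimates, learned from experience
counts = [0, 0]

random.seed(0)
for step in range(1000):
    action = random.randrange(2)  # explore by trying actions at random
    reward = 1.0 if random.random() < true_means[action] else 0.0
    counts[action] += 1
    # Incremental average: refine the estimate using only the reward signal.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # estimates drift toward the hidden means through experience
```

No dataset of labeled "correct arms" ever appears; the environment's rewards are the only teacher.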
Optimal
The goal of Reinforcement Learning is to maximize rewards over time by finding the best possible strategy. This involves seeking:
- A maximized discounted sum of rewards, the return \(G\).
- Optimal Value Functions \(V^{*}\).
- Optimal Action-Value Functions \(Q^{*}\).
- Optimal Policies \(\pi^{*}\).
- A balance between exploration and exploitation.
Sequential Decision-Making
Unlike a one-time choice, Reinforcement Learning involves a chain of decisions where each action affects the next.
A trajectory generated under a policy \(\pi\): \(S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, \ldots, S_{T-1}, A_{T-1}, R_{T}\)
- The Markov Decision Process (MDP) is the formal framework for modeling this kind of sequential decision-making.
- The agent selects actions over multiple time steps, shaping its future states and rewards.
- Each decision affects not only immediate rewards but also the trajectory of future outcomes.
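The chain of decisions above can be sketched as a tiny interaction loop. The two-state dynamics, action names, and seed below are illustrative assumptions (not an MDP from the lecture); the point is that each action determines the next state, so every decision shapes which rewards are reachable later.

```python
import random

# Hypothetical 2-state MDP: (state, action) -> (next_state, reward).
transitions = {
    (0, "stay"): (0, 0.0),
    (0, "move"): (1, 1.0),
    (1, "stay"): (1, 2.0),
    (1, "move"): (0, 0.0),
}

random.seed(1)
state, trajectory = 0, []
for t in range(5):
    action = random.choice(["stay", "move"])      # a stand-in random policy
    next_state, reward = transitions[(state, action)]
    trajectory.append((state, action, reward))    # records S_t, A_t, R_{t+1}
    state = next_state                            # the action shapes the next state

print(trajectory)
```

Replacing the random choice with a learned policy \(\pi\) that maximizes the discounted return over such trajectories is exactly the problem Reinforcement Learning sets out to solve.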