3.2 ε-Greedy

Can you stay curious without straying too far from what works? 🎯

We are exploring when we randomly select an action.

We are exploiting, or acting greedily, when we select the action with the highest estimated expected value.

Warning: Problem

Exploring all of the time does not permit you to exploit your knowledge of expected values.

Exploiting all of the time does not permit you to explore all of the options.

How can we select the actions with the highest expected value while still leaving room for exploration?

Note: Real Life Example 🧠

Suppose you are in a Multi-Armed Bandit scenario:

  • \(S\) – You are hungry and want to treat yourself to a restaurant meal.
  • \(A_{1,\dots,k}\) – You can choose from \(k\) different restaurants in your area.
  • \(R\) – After eating, you rate your experience — maybe based on taste, service, or price satisfaction.

Action \(A_1\): Choosing Arepas. Reward \(R\): Based on past visits, this option seems to give the highest satisfaction — taste, service, and value are consistently strong.

Action \(A_2\): Choosing Chipotle. Reward \(R\): A familiar option with decent satisfaction, though not as rewarding as Arepas on average.

Action \(A_3\): Trying Falafel Bowls. Reward \(R\): An unexplored option — the reward is uncertain until you try it.

How can you select a restaurant meal that maximizes your enjoyment while still being open to exploring something new like Falafel Bowls?

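Epsilon-greedy (ε-greedy) is a simple algorithm for balancing exploration and exploitation.

Epsilon (\(\epsilon\)) is a fixed probability between 0 and 1. At each step we explore with probability \(\epsilon\); otherwise, with probability \(1 - \epsilon\), we exploit by choosing the action with the highest estimated value \(Q(a)\).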

\[ A_t \gets \begin{cases} \text{a random action with probability } \epsilon \\ \arg\max_a Q(a) \text{ with probability } 1 - \epsilon \end{cases} \]
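As a small worked example, assume the random action is drawn uniformly from all \(k\) actions (the rule above does not state the distribution, so this is an assumption) and that exactly one action \(a^*\) is currently greedy. Because the random draw can also land on \(a^*\), the selection probabilities would be:

\[ P(A_t = a^*) = (1 - \epsilon) + \frac{\epsilon}{k}, \qquad P(A_t = a) = \frac{\epsilon}{k} \text{ for each } a \neq a^* \]

With \(k = 3\) restaurants and \(\epsilon = 0.1\), the current favorite would be chosen with probability \(0.9 + 0.1/3 \approx 0.933\), and each of the other two with probability \(0.1/3 \approx 0.033\).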

In our real life example, at first, you try different restaurants (exploration) to see what’s good. Over time, you start favoring the ones with better rewards (exploitation). But occasionally, you still try a new one — just in case it’s better than your current favorite.
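The restaurant scenario can be sketched in Python as follows. This is a minimal illustration rather than a definitive implementation: the function names `epsilon_greedy` and `run_bandit`, the made-up average satisfaction scores, the Gaussian reward noise, the uniform random exploration, and the sample-average update of \(Q(a)\) are all assumptions introduced for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random (assumed distribution).
        return rng.randrange(len(q_values))
    # Exploit: pick an action with the highest estimate, breaking ties randomly.
    best = max(q_values)
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def run_bandit(true_means, epsilon=0.1, steps=1000, seed=0):
    """Simulate repeated restaurant visits and return (estimates, visit counts)."""
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k  # Q(a): estimated satisfaction for each restaurant
    n = [0] * k    # number of visits to each restaurant
    for _ in range(steps):
        a = epsilon_greedy(q, epsilon, rng)
        # Hypothetical reward model: noisy rating around a hidden true mean.
        reward = rng.gauss(true_means[a], 1.0)
        n[a] += 1
        # Sample-average update: move Q(a) toward the observed reward.
        q[a] += (reward - q[a]) / n[a]
    return q, n


# Made-up average satisfaction scores, for illustration only.
restaurants = ["Arepas", "Chipotle", "Falafel Bowls"]
q, n = run_bandit([4.5, 3.5, 4.0], epsilon=0.1, steps=1000)
for name, estimate, visits in zip(restaurants, q, n):
    print(f"{name}: estimate={estimate:.2f}, visits={visits}")
```

Setting `epsilon=0.0` in this sketch makes every choice greedy, and `epsilon=1.0` makes every choice random, which correspond to the two extremes described in the problem statement above.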


Pseudocode