3.2 ε-Greedy
We are exploring when we select an action at random.
We are exploiting, or acting greedily, when we select the action with the highest expected value.
Exploring all of the time never lets us exploit what we have learned about expected values.
Exploiting all of the time never lets us discover whether a better option exists.
How can we favor the actions with the highest expected value while still leaving room for exploration?
Suppose you are in a Multi-Armed Bandit scenario:
- \(S\) – You are hungry and want to treat yourself to a restaurant meal.
- \(A_{1,\dots,k}\) – You can choose from \(k\) different restaurants in your area.
- \(R\) – After eating, you rate your experience — maybe based on taste, service, or price satisfaction.

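To make this setup concrete, here is a minimal sketch of the restaurant bandit in Python. The three restaurants and their mean ratings are hypothetical numbers chosen purely for illustration, and each meal's rating is modeled as the restaurant's true mean plus Gaussian noise:

```python
import random

# A sketch of the restaurant bandit: each arm is a restaurant with an
# unknown mean rating. The mean ratings below are made up for illustration.
class RestaurantBandit:
    def __init__(self, mean_ratings):
        self.mean_ratings = mean_ratings  # true (hidden) expected reward per arm

    def pull(self, action):
        # A meal's rating: the restaurant's true mean plus visit-to-visit noise.
        return random.gauss(self.mean_ratings[action], 1.0)

# Three hypothetical restaurants, indexed 0..2.
bandit = RestaurantBandit(mean_ratings=[2.0, 3.5, 3.0])
```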
How can you select a restaurant meal that maximizes your enjoyment while still staying open to something new, like Falafel Bowls?
ε-greedy is a simple algorithm that balances exactly this trade-off.
Epsilon (\(\epsilon\)) is a fixed probability that determines, on each step, whether we explore or exploit.
\[ A_t \gets \begin{cases} \text{a random action} & \text{with probability } \epsilon \\ \arg\max_a Q(a) & \text{with probability } 1 - \epsilon \end{cases} \]
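As a minimal sketch, the selection rule can be written as a short function, assuming `Q` is a list of current value estimates indexed by action (ties in the greedy case are broken at random):

```python
import random

def epsilon_greedy(Q, epsilon):
    """Explore with probability epsilon; otherwise exploit the best estimate."""
    if random.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return random.randrange(len(Q))
    # Exploit: pick the action with the highest estimated value,
    # breaking ties uniformly at random.
    best = max(Q)
    return random.choice([a for a, q in enumerate(Q) if q == best])
```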
In our real-life example, you first try different restaurants (exploration) to see what's good. Over time, you start favoring the ones with better rewards (exploitation). But occasionally you still try a new one, just in case it's better than your current favorite.
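Putting the two sketches above together, a short simulation shows this shift from exploration to exploitation, using the incremental sample-average update \(Q(a) \gets Q(a) + \frac{1}{N(a)}\big(R - Q(a)\big)\) to estimate each restaurant's value:

```python
k = 3
Q = [0.0] * k  # value estimate per restaurant
N = [0] * k    # number of visits per restaurant

for t in range(10_000):
    a = epsilon_greedy(Q, epsilon=0.1)  # choose a restaurant
    r = bandit.pull(a)                  # eat, then rate the meal
    N[a] += 1
    # Incremental sample-average update: Q(a) <- Q(a) + (r - Q(a)) / N(a)
    Q[a] += (r - Q[a]) / N[a]

print(Q)  # the estimates approach the true means [2.0, 3.5, 3.0]
```

With a fixed \(\epsilon = 0.1\), about 10% of meals remain exploratory no matter how long you run, which is the price ε-greedy pays for never ruling out a better restaurant.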