3.2 ε-Greedy

A simple strategy that balances exploration and exploitation: with a small fixed probability we act randomly, and otherwise we act greedily on our current value estimates. 🎲

Exploring vs. Exploiting

  • We are exploring when we randomly select an action.

Intuition: Acting randomly.

  • We are exploiting when we select the action with the highest estimated value. When we act this way, we are said to be acting in a greedy manner.

Intuition: Acting systematically.

Conflict of Exploring vs. Exploiting

  • Exploring all of the time prevents us from exploiting our knowledge of expected values.

  • Exploiting all of the time prevents us from exploring all of the options.

Thus, our decision-making must strike a balance between exploring and exploiting.

The Role of ε

Epsilon (\(\epsilon\)) is a small, fixed probability that determines whether we explore or exploit at each step. \[ A_t \gets \begin{cases} \text{a random action} & \text{with probability } \epsilon \\ \arg\max_a Q(a) & \text{with probability } 1 - \epsilon \end{cases} \]

Hence, \(\epsilon\)-greedy is an algorithm that balances our decision-making in this simple manner, as the sketch below illustrates.
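To make the rule concrete, here is a minimal Python sketch of the selection step. The array Q of current value estimates and the NumPy random generator are illustrative assumptions, not part of the original notes.

import numpy as np

def select_action(Q, epsilon, rng):
    """Epsilon-greedy selection: a random action with probability epsilon,
    otherwise the action with the highest estimated value."""
    if rng.random() < epsilon:           # explore
        return int(rng.integers(len(Q)))
    return int(np.argmax(Q))             # exploit

For example, select_action(np.zeros(10), 0.1, np.random.default_rng()) picks among ten arms whose estimates are all initialized to zero.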

Pseudocode

\begin{algorithm}
\caption{MAB $\epsilon$-Greedy}
\begin{algorithmic}
\State Initialize, for $a = 1$ to $k$:
\State \quad $Q(a) \gets 0$
\State \quad $N(a) \gets 0$
\For{$t = 1, 2, \ldots, T$}
    \State $A_t \gets \begin{cases} \text{a random action} & \text{with probability } \epsilon \\ \arg\max_a Q(a) & \text{with probability } 1 - \epsilon \end{cases}$
    \State $R_t \gets \text{bandit}(A_t)$
    \State $N(A_t) \gets N(A_t) + 1$
    \State $Q(A_t) \gets Q(A_t) + \frac{1}{N(A_t)}\bigl[R_t - Q(A_t)\bigr]$
\EndFor
\end{algorithmic}
\end{algorithm}
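The pseudocode translates directly into a short, runnable Python sketch. Here bandit is a hypothetical stand-in for the environment (unit-variance Gaussian rewards with unknown means), and the horizon T and default epsilon are arbitrary choices; none of these specifics come from the original notes.

import numpy as np

def epsilon_greedy(true_means, epsilon=0.1, T=1000, seed=0):
    """Run epsilon-greedy on a k-armed bandit with sample-average updates."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)  # Q(a): estimated value of each arm
    N = np.zeros(k)  # N(a): number of times each arm was pulled

    def bandit(a):
        # Hypothetical environment: unit-variance Gaussian reward.
        return rng.normal(true_means[a], 1.0)

    for t in range(T):
        if rng.random() < epsilon:   # explore: random action
            A = int(rng.integers(k))
        else:                        # exploit: greedy action
            A = int(np.argmax(Q))
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]    # incremental sample-average update
    return Q, N

For instance, epsilon_greedy([0.2, 0.5, 0.8]) should return estimates Q close to the true means, with the pull counts N concentrated on the best arm.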
