3.2 ε-Greedy
Exploring vs. Exploiting
- We are exploring when we randomly select an action.
Intuition: Acting randomly.
- We are exploiting when we select the action with the highest estimated value. When we act this way, we are said to be acting in a greedy manner.
Intuition: Acting on our current knowledge.
Conflict of Exploring vs. Exploiting
Exploring all of the time never lets you exploit your knowledge of the expected values.
Exploiting all of the time never lets you discover whether an option you have under-sampled is actually better.
Thus, our decision-making must strike a balance between exploring and exploiting.
The Role of ε
Epsilon (\(\epsilon\)) is a fixed probability that determines, at each time step, whether we explore or exploit: \[ A_t \gets \begin{cases} \text{a random action} & \text{with probability } \epsilon \\ \arg\max_a Q(a) & \text{with probability } 1 - \epsilon \end{cases} \]
Hence, \(\epsilon\)-greedy is an algorithm that balances exploration and exploitation in this simple manner, as sketched below.
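To make the rule concrete, here is a minimal Python sketch of the selection step. The value-estimate array `Q` and the NumPy generator `rng` are assumptions introduced for this example, not part of the rule itself:

```python
import numpy as np

def epsilon_greedy_action(Q, epsilon, rng):
    """Pick a random action with probability epsilon,
    otherwise the greedy action argmax_a Q(a)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))  # explore: uniform random action
    return int(np.argmax(Q))              # exploit: current best estimate

# Example: 4 arms, 10% exploration (illustrative values)
rng = np.random.default_rng(0)
Q = np.array([0.2, 0.5, 0.1, 0.4])
action = epsilon_greedy_action(Q, epsilon=0.1, rng=rng)
```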
Pseudocode
\begin{algorithm}
\caption{MAB $\epsilon$-Greedy}
\begin{algorithmic}
\State Initialize, for $a = 1$ to $k$:
\State \quad $Q(a) \gets 0$
\State \quad $N(a) \gets 0$
\For{$t = 1, 2, \ldots, T$}
    \State $A_t \gets \begin{cases} \text{a random action} & \text{with probability } \epsilon \\ \arg\max_a Q(a) & \text{with probability } 1-\epsilon \end{cases}$
    \State $R_t \gets \text{bandit}(A_t)$
    \State $N(A_t) \gets N(A_t) + 1$
    \State $Q(A_t) \gets Q(A_t) + \frac{1}{N(A_t)}\bigl[R_t - Q(A_t)\bigr]$
\EndFor
\end{algorithmic}
\end{algorithm}
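Translated directly into Python, the full loop might look like the sketch below. The Bernoulli arms, their success probabilities, and the horizon `T` are assumptions chosen for the example; the update to `Q` is the incremental sample-average rule from the pseudocode.

```python
import numpy as np

def run_epsilon_greedy(arm_probs, T=1000, epsilon=0.1, seed=0):
    """epsilon-greedy on a k-armed Bernoulli bandit (assumed setup)."""
    rng = np.random.default_rng(seed)
    k = len(arm_probs)
    Q = np.zeros(k)             # value estimates: Q(a) <- 0
    N = np.zeros(k, dtype=int)  # action counts:   N(a) <- 0
    rewards = np.empty(T)
    for t in range(T):
        # Explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            a = int(rng.integers(k))
        else:
            a = int(np.argmax(Q))
        r = float(rng.random() < arm_probs[a])  # bandit(A_t): Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]  # incremental sample-average update
        rewards[t] = r
    return Q, N, rewards

# Example: three arms with unknown payout rates 0.3, 0.5, 0.7
Q, N, rewards = run_epsilon_greedy([0.3, 0.5, 0.7], T=5000, epsilon=0.1)
```

Because \(\epsilon > 0\) keeps every arm being sampled occasionally, the sample-average estimates \(Q(a)\) continue to improve even while roughly a \(1-\epsilon\) fraction of the steps exploit the current best arm.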