3.3 Upper Confidence Bound (UCB)
Upper confidence bounds allow us to select among the non-greedy actions according to their potential for actually being optimal:
\[ A_t \gets \arg\max_a \left[ Q(a) + \sqrt{\frac{2 \ln(t)}{N(a)}} \right] \]
Where:
- \(\sqrt{\frac{2 \ln(t)}{N(a)}}\) is a measure of the uncertainty (or variance) in the estimate of action \(a\)'s value.
- The natural logarithm's increments get smaller over time, but it is unbounded, so every action will eventually be selected (a Python sketch of the selection rule appears below).
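A minimal sketch of the selection rule in Python. The names `ucb_select`, `Q`, and `N` are assumptions for illustration, as is the convention of giving untried arms infinite priority (which ensures each arm is pulled at least once before the bonus term takes over):

```python
import math

def ucb_select(Q, N, t):
    """Pick the arm maximizing Q(a) + sqrt(2 ln(t) / N(a)).

    Q and N are lists of value estimates and pull counts; t is the
    current timestep (1-indexed). Untried arms (N[a] == 0) receive
    infinite priority so every arm is sampled at least once.
    """
    scores = [
        q + math.sqrt(2 * math.log(t) / n) if n > 0 else float("inf")
        for q, n in zip(Q, N)
    ]
    return scores.index(max(scores))
```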
UCB Exploring vs. Exploiting
Each time \(a\) is selected, the uncertainty is presumably reduced: \(N(a)\) increments, and as it appears in the denominator, the uncertainty term decreases.
\[ \text{uncertainty} \downarrow \;=\; \sqrt{\frac{2 \ln(t)}{N(a)\uparrow}} \]
Each time an action other than \(a\) is selected, \(t\) increases but \(N(a)\) does not; because \(t\) appears in the numerator, the uncertainty estimate increases.
\[ \text{uncertainty} \uparrow \;=\; \sqrt{\frac{2 \ln(t)\uparrow}{N(a)}} \]
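Both movements can be checked numerically. A toy calculation in Python (the specific values of \(t\) and \(N(a)\) are illustrative only):

```python
import math

def uncertainty(t, n):
    # sqrt(2 ln(t) / N(a)): the exploration bonus for a single arm
    return math.sqrt(2 * math.log(t) / n)

# Selecting a: N(a) grows along with t, so the bonus shrinks.
print(uncertainty(t=10, n=5))   # ~0.960
print(uncertainty(t=11, n=6))   # ~0.894

# Not selecting a: t grows while N(a) stays fixed, so the bonus creeps up.
print(uncertainty(t=10, n=5))   # ~0.960
print(uncertainty(t=20, n=5))   # ~1.095
```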
Pseudocode
\begin{algorithm}
\caption{MAB Upper Confidence Bound (UCB)}
\begin{algorithmic}
\State Initialize, for $a = 1$ to $k$:
\State \quad $Q(a) \gets 0$
\State \quad $N(a) \gets 0$
\For{$t = 1, 2, \ldots, T$}
    \State $A_t \gets \arg\max_a \left[ Q(a) + \sqrt{\frac{2 \ln(t)}{N(a)}} \right]$ \Comment{if $N(a) = 0$, $a$ is considered maximizing}
    \State $R_t \gets \text{bandit}(A_t)$
    \State $N(A_t) \gets N(A_t) + 1$
    \State $Q(A_t) \gets Q(A_t) + \frac{1}{N(A_t)}\left[R_t - Q(A_t)\right]$
\EndFor
\end{algorithmic}
\end{algorithm}
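The pseudocode translates directly to runnable Python. A sketch under assumed conditions: the Gaussian `bandit` reward model and the arm means below are illustrative, not part of the pseudocode.

```python
import math
import random

def run_ucb(true_means, T=1000):
    """Run UCB for T steps on a Gaussian bandit with the given arm means."""
    k = len(true_means)
    Q = [0.0] * k  # value estimates
    N = [0] * k    # pull counts

    def bandit(a):
        # Reward: unit-variance Gaussian around the arm's true mean (illustrative).
        return random.gauss(true_means[a], 1.0)

    for t in range(1, T + 1):
        # Untried arms are treated as maximizing, so each is pulled once.
        A = max(
            range(k),
            key=lambda a: Q[a] + math.sqrt(2 * math.log(t) / N[a])
            if N[a] > 0 else float("inf"),
        )
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]  # incremental sample-average update
    return Q, N

Q, N = run_ucb([0.1, 0.5, 0.9])
print(Q)  # estimates should approach the true means
print(N)  # the best arm (index 2) should dominate the pull counts
```

Note the incremental update $Q(A_t) \gets Q(A_t) + \frac{1}{N(A_t)}[R_t - Q(A_t)]$ is just the running sample average written so that no reward history needs to be stored.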