3.3 Upper Confidence Boundary (UCB)

A formal framework that defines probability using three fundamental rules, ensuring consistency in measuring uncertainty. 🎲

Upper Confidence Boundaries

Upper Confidence Boundaries allow us to select among the non-greedy actions according to their potential for actually being optimal.

\[ A_t \gets \arg\max_a \left[ Q(a) + \sqrt{\frac{2 \ln(t)}{N(a)}} \right] \]

Where:

  • \(\sqrt{\frac{2 \ln(t)}{N(a)}}\) is the measure of variance of the action \(a\).
  • The natural logarithm increases get smaller over time but are unbounded, so all actions will be selected.

UCB Exploring vs. Exploiting

Each time \(a\) is selected, the uncertainty is presumably reduced: \(N(a)\) increments, and as it appears in the denominator, the uncertainty term decreases.

\[ VAR \downarrow = \sqrt{\frac{2 \ln(t)}{N(a)\uparrow}} \]

Each time an action other than \(a\) is selected, \(t\) increases but \(N(a)\) does not; because \(t\) appears in the numerator, the uncertainty estimate increases.

\[ VAR \uparrow = \sqrt{\frac{2 \ln(t) \uparrow}{N(a)}} \]

Pseudocode

\begin{algorithm} \caption{MAB Upper Confidence Boundary (UCB)} \begin{algorithmic} \State Initialize, for $a = 1$ to $k$: \State $Q(a) \gets 0$ \State $N(a) \gets 0$ \\ \For{$t$ in range($len(data)$)} \State $A_t \gets argmax_a[Q(a) + \sqrt{(\frac{2 ln(t)}{N(a)})}]$ \State $R_t \gets \text{bandit}(A_t)$ \State $N(A_t) \gets N(A_t) + 1$ \State $Q(A_t) \gets Q(A_t) + \frac{1}{N(A_t)}[R_t - Q(A_t)]$ \Endfor \end{algorithmic} \end{algorithm}

GitHub