3.3 Upper Confidence Bound (UCB)
One risk we incur with \(\epsilon\)-greedy is that it explores purely at random, which can lead to suboptimal or redundant choices.
Is there a way that we can explore more intelligently?
Hint: Think about Lecture 2
Suppose you are in a Multi-Armed Bandit scenario:
- \(S\) – It’s a beautiful day, and you’re deciding where to go for a run.
- \(A_{1,\dots,k}\) – You can choose from \(k\) different running trails nearby.
- \(R\) – After each run, you mentally rate the experience — scenery, terrain, or how energized you felt.



Should you keep exploiting Trail A’s high reward, or sometimes choose Trail B — which has modest results but higher potential upside — instead of immediately exploring the completely unknown Trail C?
Upper Confidence Bounds (UCB) allow us to select among the non-greedy actions according to their potential for actually being optimal.
\[ A_t \gets \arg\max_a \left[ Q(a) + \sqrt{\frac{2 \ln(t)}{N(a)}} \right] \]
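For concreteness, here is a minimal sketch of this selection rule in Python; the names `Q`, `N`, and `ucb_select` are illustrative assumptions, with `Q` and `N` standing for arrays of current value estimates and selection counts:

```python
import numpy as np

def ucb_select(Q, N, t):
    """Pick the action maximizing Q(a) + sqrt(2 ln t / N(a))."""
    # Try each action at least once so N(a) > 0 before computing the bonus.
    untried = np.where(N == 0)[0]
    if untried.size > 0:
        return int(untried[0])
    bonus = np.sqrt(2 * np.log(t) / N)   # exploration (uncertainty) term
    return int(np.argmax(Q + bonus))     # exploit value + explore uncertainty
```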
Balancing Exploration and Exploitation
Each time \(a\) is selected, \(N(a)\) increments; because \(N(a)\) appears in the denominator, the uncertainty term shrinks, reflecting that our uncertainty about \(a\) is presumably reduced.
\[ VAR \downarrow = \sqrt{\frac{2 \ln(t)}{N(a)\uparrow}} \]
Each time an action other than \(a\) is selected, \(t\) increases but \(N(a)\) does not; because \(t\) appears in the numerator, the uncertainty estimate increases.
\[ VAR \uparrow = \sqrt{\frac{2 \ln(t) \uparrow}{N(a)}} \]
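As a quick numerical illustration (the counts here are made up, not from the lecture): with \(t = 100\),
\[ \sqrt{\frac{2 \ln 100}{10}} \approx 0.96, \qquad \sqrt{\frac{2 \ln 100}{50}} \approx 0.43, \]
so an action tried only 10 times carries a much larger bonus than one tried 50 times; and if that action is then ignored until \(t = 1000\) while \(N(a)\) stays at 10, its bonus grows back to \(\sqrt{2 \ln 1000 / 10} \approx 1.18\).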
In our real-life example, UCB says: “Don’t just pick randomly. Prefer trails that are either great or still uncertain — they might surprise you.”
Pseudocode
\(\sqrt{\frac{2 \ln(t)}{N(a)}}\) is a measure of the uncertainty (variance) in our estimate of action \(a\)'s value. The natural logarithm's increments get smaller over time, but it is unbounded, so every action will eventually be selected.
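Since the pseudocode itself is not written out above, here is one possible sketch of the full UCB loop in Python. The Gaussian reward model, the function name `run_ucb`, and all variable names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def run_ucb(true_means, steps=1000, seed=0):
    """Run the UCB rule on a simulated k-armed bandit (illustrative only)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)   # estimated value of each action
    N = np.zeros(k)   # number of times each action was selected
    rewards = []

    for t in range(1, steps + 1):
        # Select each action once before trusting the bonus term.
        if t <= k:
            a = t - 1
        else:
            a = int(np.argmax(Q + np.sqrt(2 * np.log(t) / N)))

        # Simulated reward: Gaussian around the arm's true mean (assumption).
        r = rng.normal(true_means[a], 1.0)

        # Incremental update of the sample-average estimate.
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
        rewards.append(r)

    return Q, N, np.mean(rewards)

# Example: three "trails" with different (hidden) average rewards.
Q, N, avg = run_ucb([1.0, 0.5, 0.2])
print("estimates:", Q.round(2), "counts:", N, "avg reward:", round(avg, 2))
```

In this sketch, the agent initially tries every arm once, then keeps revisiting arms whose estimates are either high or still uncertain, which is exactly the exploration–exploitation balance described above.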