3.3 Upper Confidence Bound (UCB)
One risk we incur with \(\epsilon\)-greedy is that it explores purely at random, which can lead to suboptimal or redundant choices.
Is there a way that we can explore more intelligently?
Hint: Think about Lecture 2
Suppose you are in a Multi-Armed Bandit scenario:
- \(S\) – It’s a beautiful day, and you’re deciding where to go for a run.
- \(A_{1,\dots,k}\) – You can choose from \(k\) different running trails nearby.
- \(R\) – After each run, you mentally rate the experience — scenery, terrain, or how energized you felt.



Should you keep exploiting Trail A’s high reward, or sometimes choose Trail B — which has modest results but higher potential upside — instead of immediately exploring the completely unknown Trail C?
Upper Confidence Bounds (UCB) allow us to select among the non-greedy actions according to their potential for actually being optimal.
\[ A_t \gets \arg\max_a \left[ Q(a) + \sqrt{\frac{2 \ln(t)}{N(a)}} \right] \]
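For concreteness, here is a minimal sketch of this selection rule in Python; the names `Q`, `N`, and `ucb_select` are illustrative assumptions, with `Q` and `N` standing for arrays of current value estimates and selection counts:

```python
import numpy as np

def ucb_select(Q, N, t):
    """Pick the action maximizing Q(a) + sqrt(2 ln t / N(a))."""
    # Try each action at least once so N(a) > 0 before computing the bonus.
    untried = np.where(N == 0)[0]
    if untried.size > 0:
        return int(untried[0])
    bonus = np.sqrt(2 * np.log(t) / N)   # exploration (uncertainty) term
    return int(np.argmax(Q + bonus))     # exploit value + explore uncertainty
```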
Balancing Exploration and Exploitation
Each time \(a\) is selected, \(N(a)\) increments; because \(N(a)\) appears in the denominator, the uncertainty term shrinks, reflecting that our uncertainty about \(a\) is presumably reduced.
\[ VAR \downarrow = \sqrt{\frac{2 \ln(t)}{N(a)\uparrow}} \]
Each time an action other than \(a\) is selected, \(t\) increases but \(N(a)\) does not; because \(t\) appears in the numerator, the uncertainty estimate increases.
\[ VAR \uparrow = \sqrt{\frac{2 \ln(t) \uparrow}{N(a)}} \]
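As a quick numerical illustration (the counts here are made up, not from the lecture): with \(t = 100\),
\[ \sqrt{\frac{2 \ln 100}{10}} \approx 0.96, \qquad \sqrt{\frac{2 \ln 100}{50}} \approx 0.43, \]
so an action tried only 10 times carries a much larger bonus than one tried 50 times; and if that action is then ignored until \(t = 1000\) while \(N(a)\) stays at 10, its bonus grows back to \(\sqrt{2 \ln 1000 / 10} \approx 1.18\).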
In our real-life example, UCB says: “Don’t just pick randomly. Prefer trails that are either great or still uncertain — they might surprise you.”
Pseudocode
\(\sqrt{\frac{2 \ln(t)}{N(a)}}\) is a measure of the uncertainty (variance) in our estimate of action \(a\)'s value. The natural logarithm's increments get smaller over time, but it is unbounded, so every action will eventually be selected.
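Since the pseudocode itself is not written out above, here is one possible sketch of the full UCB loop in Python. The Gaussian reward model, the function name `run_ucb`, and all variable names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def run_ucb(true_means, steps=1000, seed=0):
    """Run the UCB rule on a simulated k-armed bandit (illustrative only)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)   # estimated value of each action
    N = np.zeros(k)   # number of times each action was selected
    rewards = []

    for t in range(1, steps + 1):
        # Select each action once before trusting the bonus term.
        if t <= k:
            a = t - 1
        else:
            a = int(np.argmax(Q + np.sqrt(2 * np.log(t) / N)))

        # Simulated reward: Gaussian around the arm's true mean (assumption).
        r = rng.normal(true_means[a], 1.0)

        # Incremental update of the sample-average estimate.
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]
        rewards.append(r)

    return Q, N, np.mean(rewards)

# Example: three "trails" with different (hidden) average rewards.
Q, N, avg = run_ucb([1.0, 0.5, 0.2])
print("estimates:", Q.round(2), "counts:", N, "avg reward:", round(avg, 2))
```

In this sketch, the agent initially tries every arm once, then keeps revisiting arms whose estimates are either high or still uncertain, which is exactly the exploration–exploitation balance described above.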