3.3 Upper Confidence Bound (UCB)

What if you trusted things that haven’t let you down — but gave others a fair shot too? 🔍

One risk we incur with \(\epsilon\)-greedy is that it explores at random, which can lead to suboptimal or redundant choices.

Warning: Problem

Is there a way that we can explore more intelligently?

Hint: Think about Lecture 2

Note: Real Life Example 🧠

Suppose you are in a Multi-Armed Bandit scenario:

  • \(S\) – It’s a beautiful day, and you’re deciding where to go for a run.
  • \(A_{1,\dots,k}\) – You can choose from \(k\) different running trails nearby.
  • \(R\) – After each run, you mentally rate the experience — scenery, terrain, or how energized you felt.

Action \(A_1\): Running Trail A. Reward \(R\): This loop has consistently provided the highest enjoyment — strong scenery and terrain make it the best-known choice.

Action \(A_2\): Running Trail B. Reward \(R\): A modest performer so far, but not explored much. With more runs, it could reveal higher potential than currently estimated.

Action \(A_3\): Running Trail C. Reward \(R\): A completely new option — its value is entirely uncertain, and initial selections may not be the most informative.

Should you keep exploiting Trail A’s high reward, or sometimes choose Trail B — which has modest results but higher potential upside — instead of immediately exploring the completely unknown Trail C?

Upper Confidence Bound (UCB) action selection lets us choose among the non-greedy actions according to their potential for actually being optimal.

\[ A_t \gets \arg\max_a \left[ Q(a) + \sqrt{\frac{2 \ln(t)}{N(a)}} \right] \]
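As a concrete illustration, here is the rule applied to the three trails with made-up statistics (the values of \(t\), \(Q(a)\), and \(N(a)\) below are illustrative assumptions, not numbers from the example above): each action's score is its value estimate plus an exploration bonus, and the least-explored action gets the largest bonus.

```python
import math

# Hypothetical statistics after t = 12 total runs (illustrative values only).
t = 12
stats = {
    "Trail A": {"Q": 0.80, "N": 8},  # well explored, high estimated value
    "Trail B": {"Q": 0.55, "N": 3},  # modest value, still fairly uncertain
    "Trail C": {"Q": 0.40, "N": 1},  # barely explored, very uncertain
}

for name, s in stats.items():
    bonus = math.sqrt(2 * math.log(t) / s["N"])          # sqrt(2 ln t / N(a))
    print(f"{name}: Q = {s['Q']:.2f}, bonus = {bonus:.2f}, "
          f"UCB = {s['Q'] + bonus:.2f}")
```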


Balancing Exploration and Exploitation

Each time \(a\) is selected, \(N(a)\) increments; because it appears in the denominator, the uncertainty term decreases.

\[ VAR \downarrow = \sqrt{\frac{2 \ln(t)}{N(a)\uparrow}} \]

Each time an action other than \(a\) is selected, \(t\) increases but \(N(a)\) does not; because \(t\) appears in the numerator, the uncertainty estimate increases.

\[ VAR \uparrow = \sqrt{\frac{2 \ln(t) \uparrow}{N(a)}} \]
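A quick numerical sketch of this dynamic (the step counts and pull counts below are made up for illustration): selecting \(a\) shrinks its bonus, while leaving \(a\) unselected lets the bonus creep back up as \(t\) grows.

```python
import math

def bonus(t, n):
    """Exploration bonus sqrt(2 ln t / N(a)) after t total steps and n pulls of a."""
    return math.sqrt(2 * math.log(t) / n)

# Action a is selected at step 10: N(a) goes 4 -> 5, so its bonus shrinks.
print(round(bonus(10, 4), 3), "->", round(bonus(11, 5), 3))

# Afterwards other actions are selected: t keeps growing while N(a) stays at 5,
# so a's bonus slowly grows again and a will eventually be re-tried.
for t in (12, 20, 50, 200):
    print(t, round(bonus(t, 5), 3))
```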

In our Real Life Example, UCB says: "Don't just pick randomly. Prefer trails that are either great or still uncertain; they might surprise you."


Pseudocode

\(\sqrt{\frac{2 \ln(t)}{N(a)}}\) is a measure of the uncertainty (variance) in our estimate of action \(a\). The increments of the natural logarithm get smaller over time, but it is unbounded, so every action will eventually be selected.
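A minimal runnable sketch of the full UCB loop, assuming a simulated k-armed bandit with Gaussian rewards; the `ucb_bandit` function, the reward distribution, and the true means passed in at the bottom are illustrative assumptions, not part of the lecture.

```python
import math
import random

def ucb_bandit(true_means, steps=1000):
    """UCB action selection on a simulated k-armed bandit (illustrative sketch)."""
    k = len(true_means)
    Q = [0.0] * k          # estimated value of each action
    N = [0] * k            # number of times each action has been selected

    for t in range(1, steps + 1):
        if t <= k:
            # Play every action once first so N(a) > 0 in the bonus term.
            a = t - 1
        else:
            # A_t = argmax_a [ Q(a) + sqrt(2 ln t / N(a)) ]
            a = max(range(k), key=lambda i: Q[i] + math.sqrt(2 * math.log(t) / N[i]))

        # Simulated reward: Gaussian noise around the (hypothetical) true mean.
        r = random.gauss(true_means[a], 1.0)

        # Incremental update of the sample-average value estimate.
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]

    return Q, N

# Hypothetical trail values: A is best, B is modest, C is worse but initially unknown.
Q, N = ucb_bandit([1.0, 0.6, 0.3])
print("Estimates:", [round(q, 2) for q in Q])
print("Pull counts:", N)
```

Run a few times and the best action accumulates most of the pulls, while the others are still sampled occasionally because their bonus keeps growing between selections.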