3.4 Thompson Sampling

What if you made decisions by imagining how the world might be — over and over? 🧠

We want to explore options in proportion to how likely they are to be optimal, rather than purely at random or based on confidence bounds.

Problem

Is there a way to select actions by sampling from a distribution that represents our beliefs?

Hint: Think about Lecture 2

Real Life Example 🧠

Suppose you are in a Multi-Armed Bandit scenario:

  • \(S\) – It’s Friday night, and you’re deciding what show or YouTube channel to watch.
  • \(A_{1,\dots,k}\) – You can choose from \(k\) different shows or creators.
  • \(R\) – After each episode or video, you mentally rate how entertaining or satisfying it was.

Action \(A_1\): Watching Seinfeld. Reward \(R\): Your comfort show — consistently delivers strong enjoyment, so its reward distribution is centered high.

Action \(A_2\): Watching The Office. Reward \(R\): Only a few episodes watched — the distribution is more uncertain, leaving room for it to surprise you.

Action \(A_3\): Watching House. Reward \(R\): A completely new show — no data yet, so its distribution is wide and uncertain.

How can you select a show using your prior beliefs, while still leaving room for updates as new evidence comes in?

Thompson sampling is an algorithm that models each action value with a Beta distribution and selects actions by sampling from those distributions.

Initialize the Beta distribution parameters of every action to 1, which corresponds to a uniform prior over \([0, 1]\):

\[ \alpha(a) = 1, \quad a = 1, \dots, k \] \[ \beta(a) = 1, \quad a = 1, \dots, k \]

Now for each action \(a\), we model the action value as \(Q(a) \sim \mathrm{Beta}(\alpha(a), \beta(a))\), so the prior probability density of \(Q(a)\) at a value \(q \in [0, 1]\) is:

\[ p(q) = \frac{\Gamma(\alpha(a) + \beta(a))}{\Gamma(\alpha(a)) \, \Gamma(\beta(a))} \, q^{\alpha(a)-1} (1 - q)^{\beta(a)-1} \]
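A minimal sketch of this prior, assuming NumPy and a Bernoulli bandit; the names `k`, `alpha`, and `beta_` are placeholders, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

k = 3                  # number of actions (e.g., the three shows above)
alpha = np.ones(k)     # alpha(a) = 1 for every action
beta_ = np.ones(k)     # beta(a) = 1 for every action

# With alpha = beta = 1, each Q(a) ~ Beta(1, 1) is uniform on [0, 1].
q_samples = rng.beta(alpha, beta_)

# Thompson sampling selects the action whose sampled value is largest.
action = int(np.argmax(q_samples))
```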


Balancing Exploration and Exploitation

After observing a binary reward \(R_t \in \{0, 1\}\) for the selected action \(A_t\), the agent updates its posterior belief with:

\[ \alpha(A_{t}) \gets \alpha(A_{t}) + R_{t} \] \[ \beta(A_{t}) \gets \beta(A_{t}) + 1 - R_{t} \]

Notice that when the selected action \(A_t\) yields a reward of \(R_t = 1\), its \(\alpha\) parameter increases by 1 while its \(\beta\) parameter stays the same (\(1 - R_t = 0\)); when \(R_t = 0\), the opposite happens: \(\beta\) increases by 1 and \(\alpha\) is unchanged.

As rewards accumulate, each action's posterior concentrates around its true success rate, so well-explored actions produce increasingly accurate samples, while rarely tried actions keep wide posteriors and still win the sampling step occasionally. This is how Thompson sampling balances exploration and exploitation.
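A compact sketch of the full loop under the same assumptions as before; the vector `true_p` of success probabilities is hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
k, steps = 3, 1000
true_p = np.array([0.8, 0.6, 0.5])   # hypothetical true success probabilities

alpha = np.ones(k)   # alpha(a) = 1
beta_ = np.ones(k)   # beta(a) = 1

for t in range(steps):
    # Sample Q(a) ~ Beta(alpha(a), beta(a)) for every action, pick the largest.
    q_samples = rng.beta(alpha, beta_)
    a_t = int(np.argmax(q_samples))

    # Observe a Bernoulli reward for the chosen action.
    r_t = rng.binomial(1, true_p[a_t])

    # Posterior update from above.
    alpha[a_t] += r_t
    beta_[a_t] += 1 - r_t

print("posterior means:", alpha / (alpha + beta_))
```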


Pseudocode