3.1 Multi-Armed Bandit Framework

What if every choice was a gamble — and you only get feedback after pulling the lever? 🎰

We saw in Lecture 1 that Reinforcement Learning is about sequential decision-making:

Markov Decision Process

\(\pi: S_{0}, A_{0}, R_{1}, S_{1}, A_{1}, R_{2}, ... , S_{T-1}, A_{T-1}, R_{T}\)

Problem ⚠️

How can we learn to act optimally without the added complexity of changing states?

A nonassociative environment is one in which the agent learns how to act in a single state, so there is no need to associate different actions with different situations. A common example of a nonassociative environment is the Multi-Armed Bandit problem.

\(\pi: S, A_{0}, R_{1}, S, A_{1}, R_{2}, ... , S, A_{T-1}, R_{T}\)

This simplified setting helps us develop foundational ideas in Reinforcement Learning — like action selection and reward maximization — without worrying about state transitions.

Bandit

A bandit is a slot machine.

It is used as an analogy for the actions an agent can take in a single state.

Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot, according to its underlying probability distribution.

[Figure: pulling a lever \(\to\) Rewards]

Multi-Armed Bandit

A Multi-Armed Bandit can be interpreted as k actions, or k arms of a slot machine, to choose from.

Through repeated action selections, you maximize your winnings by concentrating actions on the best levers.
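
To make the setup concrete, here is a minimal Python sketch of a k-armed bandit (not from the lecture; the class name `Bandit` and the Gaussian payoff distributions are illustrative assumptions):

```python
import numpy as np

class Bandit:
    """A k-armed bandit: one state, k levers, each with its own payoff distribution.

    Illustrative assumption: lever i pays a reward drawn from a Normal
    distribution whose mean q_star[i] is hidden from the agent.
    """

    def __init__(self, k=10, seed=0):
        self.k = k
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(loc=0.0, scale=1.0, size=k)  # true expected rewards

    def pull(self, action):
        """Pull lever `action` and receive a noisy reward."""
        return self.rng.normal(loc=self.q_star[action], scale=1.0)

# A few plays: the agent only ever sees the rewards, never q_star.
bandit = Bandit(k=10)
for a in (0, 3, 3, 7):
    print(f"lever {a} -> reward {bandit.pull(a):+.2f}")
```

The agent's job is to estimate how good each lever is purely from these noisy rewards, which is exactly what the action value below formalizes.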

Question 🤔

How do we decide the most appropriate action?

We calculate the expected reward of each action!

Each action has an expected reward, given that it is selected, called its action value.

\[ Q_t(a) = \mathbb{E}[R_t | A_t = a] \]

Note that \(Q_t(a)\) tells us how good a particular action \(a\) is at time step \(t\). This is a fundamental concept in value-based Reinforcement Learning algorithms.

\(Q_t(a)\) is the conditional expectation of the reward \(R_t\) given that action \(a\) is selected. \(R_t\) is the random variable for the reward at time step \(t\), and \(A_t\) is the random variable for the action selected at time step \(t\).
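
As a concrete (made-up) example, suppose a lever pays a jackpot of \(10\) with probability \(0.1\) and \(0\) otherwise. Its action value is then

\[ Q_t(a) = \mathbb{E}[R_t | A_t = a] = 0.1 \cdot 10 + 0.9 \cdot 0 = 1 \]

so the lever is worth \(1\) per pull on average, even though any single pull returns either \(10\) or nothing. In practice we do not know the payoff distribution, so we have to estimate \(Q_t(a)\) from observed rewards, as shown next.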

Action Value: Predicate Method

One way to compute action values is to average the rewards received when an action was selected, using a predicate:

\[ Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}} \]

\(Q_t(a)\) is the estimate of the action value for a particular action \(a\). \(\mathbf{1}_{A_i = a}\) is a predicate (indicator) that equals \(1\) when \(A_i = a\) is true and \(0\) otherwise, so the numerator sums the rewards received when \(a\) was selected and the denominator counts how many times \(a\) was selected.

If the denominator is \(0\), meaning the action has never been selected, we define \(Q_t(a)\) to be \(0\).
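
As a sketch, the predicate method translates almost line-for-line into Python (the function and variable names here are my own, not from the lecture):

```python
import numpy as np

def action_value(actions, rewards, a):
    """Predicate (sample-average) estimate of Q_t(a).

    actions[i-1] and rewards[i-1] hold A_i and R_i for i = 1, ..., t-1.
    """
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    selected = actions == a          # the predicate 1_{A_i = a}
    if selected.sum() == 0:          # action never selected yet
        return 0.0                   # convention from above: Q_t(a) = 0
    return rewards[selected].sum() / selected.sum()

# Example history A_1..A_4, R_1..R_4
actions = [0, 1, 0, 1]
rewards = [1.0, 0.0, 3.0, 2.0]
print(action_value(actions, rewards, a=0))  # (1.0 + 3.0) / 2 = 2.0
print(action_value(actions, rewards, a=2))  # never selected -> 0.0
```

Note that this keeps the entire history of actions and rewards and re-scans it for every estimate.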

Action Value: Incremental Update

The predicate method requires storing the entire reward history and recomputing the sums at every step. To avoid this, we can update action values incrementally:

\[ \underbrace{Q_{t+1}}_{\text{New Estimate}} = \underbrace{Q_t}_{\text{Old Estimate}} + \underbrace{\frac{1}{t}}_{\text{Step Size}} (\underbrace{R_t}_{\text{Target}} - \underbrace{Q_t}_{\text{Old Estimate}}) \]
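
Here is a minimal sketch of this rule (names are illustrative), showing that it reproduces the sample average while storing only the current estimate and a step count:

```python
def incremental_update(q, r, t):
    """One step of Q_{t+1} = Q_t + (1/t) * (R_t - Q_t)."""
    return q + (1.0 / t) * (r - q)

rewards = [1.0, 0.0, 3.0, 2.0]          # R_1, ..., R_4 for a single lever
q = 0.0
for t, r in enumerate(rewards, start=1):
    q = incremental_update(q, r, t)

print(q)                                # 1.5
print(sum(rewards) / len(rewards))      # 1.5 -- identical to the sample average
```

With several levers, \(t\) in the step size is the number of times that particular lever has been selected so far, not the global time step.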

Question 🤔

Should we always pick actions with the highest expected value?

No! Always picking the action with the highest expected value would prevent us from exploring other actions that might turn out to be even better.

Let’s see how we can solve this problem on the next page 😊