3.1 Multi-Armed Bandit Framework
Bandit
A bandit is a slot machine (a "one-armed bandit").
It serves as an analogy for the actions an agent can take in a single state.
Each action selection is like a play of one of the slot machine's levers, and the rewards are the payoffs for hitting the jackpot, drawn from that lever's underlying probability distribution.
Multi-Armed Bandit
A nonassociative environment is a setting that involves learning how to act in only one situation (a single state). The classic example of a nonassociative environment is the multi-armed bandit problem.
A multi-armed bandit can be interpreted as k actions, the k arms (levers) of a slot machine, to choose from.
Through repeated action selections, you maximize your winnings by concentrating actions on the best levers.
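As an illustration, here is a minimal sketch of a k-armed bandit environment in Python. The Gaussian reward distributions, the class name `KArmedBandit`, and the default of ten arms are assumptions made for this example, not part of the problem definition:

```python
import random

class KArmedBandit:
    """A k-armed bandit: each arm pays out a noisy reward around a
    fixed, hidden mean (assumed Gaussian here for illustration)."""

    def __init__(self, k=10, seed=0):
        self.rng = random.Random(seed)
        self.k = k
        # Hidden true mean reward for each arm.
        self.means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, arm):
        """Play one lever and return its payoff (mean plus unit-variance noise)."""
        return self.rng.gauss(self.means[arm], 1.0)

# Usage: play lever 2 and observe a payoff.
bandit = KArmedBandit(k=5)
reward = bandit.pull(2)
```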
Expectation of a Bandit
Each action has an expected reward given that it is selected; this is called the action value, and its true (unknown) value is denoted \(q_*(a)\).
\[ q_*(a) = \mathbb{E}[R_t \mid A_t = a] \]
Where:
- \(q_*(a)\) is the true value of action \(a\): the conditional expectation of the reward \(R_t\) given that action \(a\) is selected.
- \(R_t\) is the random variable for the reward at time step \(t\).
- \(A_t\) is the random variable for the action selected at time step \(t\).
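As a concrete, hypothetical example, suppose an arm \(a\) pays out \(1\) with probability \(0.4\) and \(0\) otherwise. Its action value is then
\[ q_*(a) = \mathbb{E}[R_t \mid A_t = a] = 1 \cdot 0.4 + 0 \cdot 0.6 = 0.4 \]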
Action Value Method
Since the true action values \(q_*(a)\) are unknown, we estimate them from observed rewards and use the estimates, denoted \(Q_t(a)\), to select actions. Methods that do this are called action-value methods; the simplest is the sample average:
\[ Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbf{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbf{1}_{A_i = a}} \]
Where:
- \(Q_t(a)\) is the estimate, at time step \(t\), of the value of action \(a\).
- \(\mathbf{1}_{A_i = a}\) is the indicator function: it equals \(1\) if \(A_i = a\) is true and \(0\) otherwise.
If the denominator is \(0\) (i.e., action \(a\) has never been selected), we define \(Q_t(a)\) as a default value such as \(0\).
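A minimal sketch of the sample-average estimate in Python, assuming the `KArmedBandit` environment sketched above and uniformly random action selection (this section does not cover exploration strategies, so the selection rule here is purely illustrative):

```python
import random

def sample_average_estimates(bandit, steps=1000, seed=1):
    """Estimate Q_t(a) for every action by the sample-average method."""
    rng = random.Random(seed)
    reward_sums = [0.0] * bandit.k   # running sum of rewards per action
    counts = [0] * bandit.k          # how many times each action was selected
    for _ in range(steps):
        a = rng.randrange(bandit.k)  # pick an action (uniformly, for illustration)
        r = bandit.pull(a)           # observe the reward R_i
        reward_sums[a] += r
        counts[a] += 1
    # Q_t(a) = (sum of rewards when a was selected) / (times a was selected),
    # defaulting to 0 for actions that were never selected.
    return [reward_sums[a] / counts[a] if counts[a] > 0 else 0.0
            for a in range(bandit.k)]
```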
Action Value Method Update
Recomputing this sample average from scratch at every step requires storing all past rewards and re-summing them. Instead, we can update action values incrementally; here \(Q_t\) is the estimate for a single action and \(t\) counts how many times that action has been selected:
\[ Q_{t+1} = Q_t + \frac{1}{t} (R_t - Q_t) \]
or
\[ \text{NewEstimate} \gets \text{OldEstimate} + \text{StepSize} \, [\text{Target} - \text{OldEstimate}] \]
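A minimal sketch of this incremental update in Python (the function name and the toy reward sequence are illustrative only); only the current estimate and a selection count are stored, not the full reward history:

```python
def incremental_update(q, n, reward):
    """One incremental sample-average step:
    NewEstimate = OldEstimate + StepSize * (Target - OldEstimate),
    with StepSize = 1/n, where n counts selections of this action."""
    n += 1                          # this action has now been selected n times
    q += (1.0 / n) * (reward - q)   # move the estimate toward the new reward
    return q, n

# Usage: the running estimate matches the plain sample average.
q, n = 0.0, 0
for r in [1.0, 0.0, 2.0]:           # hypothetical rewards for one action
    q, n = incremental_update(q, n, r)
print(q)                            # 1.0, the average of the three rewards
```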