9.3 Action Selection

What if every choice you made wasn't based on a value function, but came straight from instinct? 🎲

Now that we have an empirical estimator of the policy gradient:

\[ \hat{g} = \frac{1}{m} \sum^{m}_{i = 1} \sum^{T-1}_{t=0} \nabla_\theta \log \pi\!\left(A^{(i)}_{t} \mid S^{(i)}_{t}, \theta\right) \hat{A}^{(i)}_{t} \]
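As a concrete illustration, here is a minimal PyTorch sketch of this estimator. The function name and the assumption that per-step log-probabilities and advantages have already been collected are mine, not from the lecture:

```python
import torch

def policy_gradient_estimate(log_probs, advantages):
    """Accumulate the empirical gradient g_hat in the policy parameters.

    log_probs:  list of m tensors, each of shape (T,), holding
                log pi(A_t | S_t, theta) along one sampled trajectory
                (computed through the policy network, so gradients flow).
    advantages: list of m tensors of matching shapes holding A_hat_t
                (treated as constants, hence .detach()).
    """
    m = len(log_probs)
    # Surrogate objective whose gradient w.r.t. theta is exactly g_hat:
    # (1/m) * sum_i sum_t  grad_theta log pi(A_t|S_t, theta) * A_hat_t
    surrogate = sum(
        (lp * adv.detach()).sum() for lp, adv in zip(log_probs, advantages)
    ) / m
    # After backward(), each parameter's .grad holds its slice of g_hat;
    # a gradient-ascent step is then theta <- theta + alpha * g_hat.
    surrogate.backward()
```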

Problem

If the policy \(\pi(A_{t}|S_{t}, \theta)\) is parametrized by a neural network with parameters \(\theta\), how do we select actions when the action space \(\mathcal{A}\) is discrete, and how when it is continuous?

Hint: Think about Lecture 8
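One standard answer, sketched below rather than taken verbatim from Lecture 8, is to let the network output the parameters of a probability distribution over \(\mathcal{A}\) and sample from it: a softmax (Categorical) head for discrete spaces, and, for example, a Gaussian head with a learned mean and standard deviation for continuous ones. The layer sizes here are illustrative:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

obs_dim = 8  # illustrative state dimension, not from the lecture

# Discrete A: one logit per action; sample from a Categorical (softmax).
discrete_head = nn.Linear(obs_dim, 4)   # |A| = 4 actions
# Continuous A: one mean per action dimension; the log-std is a free
# learnable parameter here (one common design choice among several).
mean_head = nn.Linear(obs_dim, 7)       # dim(A) = 7, as in Pusher
log_std = nn.Parameter(torch.zeros(7))

s = torch.randn(obs_dim)                # a dummy state

# Discrete action selection
dist_d = Categorical(logits=discrete_head(s))
a_d = dist_d.sample()                   # an integer action index
logp_d = dist_d.log_prob(a_d)           # log pi(a|s, theta)

# Continuous action selection
dist_c = Normal(mean_head(s), log_std.exp())
a_c = dist_c.sample()                   # a 7-dim real-valued action
logp_c = dist_c.log_prob(a_c).sum()     # sum over independent dimensions
```

In both cases the sampled action's log-probability is exactly the \(\log \pi(A_t|S_t, \theta)\) term needed by the gradient estimator above.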

Example: Pusher

Consider the Pusher MuJoCo environment (Todorov, Erez, and Tassa 2012):

The environment has a continuous action space \(\mathcal{A}\):

\[ \mathcal{A} = \begin{bmatrix} \text{Rotation of the shoulder panning joint} \in (-2,2) \\ \text{Rotation of the shoulder lifting joint} \in (-2,2) \\ \text{Rotation of the shoulder rolling joint} \in (-2,2) \\ \text{Rotation of the elbow flexing joint} \in (-2,2) \\ \text{Rotation of the forearm rolling joint} \in (-2,2) \\ \text{Rotation of the wrist flexing joint} \in (-2,2) \\ \text{Rotation of the wrist rolling joint} \in (-2,2) \\ \end{bmatrix} \]
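To make the continuous case concrete, a Gaussian policy for Pusher samples a 7-dimensional action and clips it into \((-2, 2)\). The sketch below assumes the Gymnasium package with MuJoCo installed, and stands in a fixed Gaussian for a real policy network:

```python
import gymnasium as gym
import torch
from torch.distributions import Normal

env = gym.make("Pusher-v5")        # "Pusher-v4" in older Gymnasium releases
obs, _ = env.reset(seed=0)

# Stand-in for a policy network: a fixed zero-mean, unit-std Gaussian.
mean = torch.zeros(7)
std = torch.ones(7)

action = Normal(mean, std).sample()
action = action.clamp(-2.0, 2.0)   # keep the sample inside A = (-2, 2)^7
obs, reward, terminated, truncated, _ = env.step(action.numpy())
```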