9.3 Action Selection
Now that we have a robust empirical gradient:
\[ \hat{g} = \frac{1}{m} \sum^{m}_{i = 1} \sum^{T-1}_{t=0} \nabla_\theta \log \pi\left(A^{(i)}_{t} \mid S^{(i)}_{t}, \theta\right) \hat{A}^{(i)}_{t} \]
If the policy \(\pi(A_{t}|S_{t}, \theta)\) is parameterized by a neural network with parameters \(\theta\), how do we select actions when the action space \(\mathcal{A}\) is discrete or continuous?
Hint: Think about Lecture 8
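One common answer, shown here as a minimal PyTorch sketch rather than the lecture's own implementation: for a discrete \(\mathcal{A}\) the network outputs the logits of a Categorical (softmax) distribution, while for a continuous \(\mathcal{A}\) it outputs the mean of a diagonal Gaussian with a learned standard deviation. All class and variable names below (`DiscretePolicy`, `GaussianPolicy`, `obs_dim`, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical, Normal

class DiscretePolicy(nn.Module):
    """pi(a|s, theta): softmax over a finite action set."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        logits = self.net(obs)              # unnormalized log-probabilities
        return Categorical(logits=logits)   # softmax happens inside the distribution

class GaussianPolicy(nn.Module):
    """pi(a|s, theta): diagonal Gaussian with a state-dependent mean."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                      nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent log std

    def forward(self, obs):
        return Normal(self.mean_net(obs), self.log_std.exp())

# Sampling an action and its log-probability, the quantity that enters
# the estimator grad_theta log pi(A_t|S_t, theta) * A_hat_t above.
obs = torch.randn(1, 8)                     # dummy observation
dist = DiscretePolicy(obs_dim=8, n_actions=4)(obs)
action = dist.sample()
log_prob = dist.log_prob(action)
```

In both cases the policy defines a distribution object; sampling from it gives the action executed in the environment, and its `log_prob` is what the gradient estimator differentiates.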
Consider the following Pusher MuJoCo environment (Todorov, Erez, and Tassa 2012):
The environment has a continuous action space \(\mathcal{A}\):
\[ \mathcal{A} = \begin{bmatrix} \text{Rotation of the shoulder panning joint} \in (-2,2) \\ \text{Rotation of the shoulder lifting joint} \in (-2,2) \\ \text{Rotation of the shoulder rolling joint} \in (-2,2) \\ \text{Rotation of the hinge joint that flexes the elbow} \in (-2,2) \\ \text{Rotation of the hinge joint that rolls the forearm} \in (-2,2) \\ \text{Rotation of the wrist flexing joint} \in (-2,2) \\ \text{Rotation of the wrist rolling joint} \in (-2,2) \\ \end{bmatrix} \]
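For Pusher specifically, a Gaussian policy with 7 outputs is a natural fit. The sketch below is an assumption on my part, not the lecture's code: it uses Gymnasium's `Pusher-v5` registration (or `Pusher-v4` on older versions, with the MuJoCo extras installed) and clips sampled torques to the \((-2, 2)\) control range.

```python
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Normal

env = gym.make("Pusher-v5")                 # 7-dimensional continuous action space
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.shape[0]         # 7 joint torques

mean_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                         nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))

obs, _ = env.reset()
done = False
while not done:
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    dist = Normal(mean_net(obs_t), log_std.exp())
    action = dist.sample()
    log_prob = dist.log_prob(action).sum(-1)          # joint log-prob over the 7 joints
    obs, reward, terminated, truncated, _ = env.step(
        action.clamp(-2.0, 2.0).numpy())              # respect the (-2, 2) control range
    done = terminated or truncated
```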
