how to use ppo for gpu-parallelised robot manipulation
there are lots of tricks beyond what was introduced; here I keep a list of the ones I have found relevant for continuous-control robotic manipulation.
resources:
Adam optimizer (PyTorch signature with its defaults): `torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)`
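A minimal sketch of instantiating Adam for a PPO update; the parameter tensor and the `lr`/`eps` values here are illustrative choices, not prescribed by the line above (many PPO codebases lower `eps` from the 1e-8 default, but that is a tuning decision):

```python
import torch

# a single parameter tensor standing in for a policy network (assumed shape)
params = [torch.nn.Parameter(torch.ones(4))]

# PyTorch defaults except lr and eps, which are common PPO tuning choices
optimizer = torch.optim.Adam(params, lr=3e-4, betas=(0.9, 0.999),
                             eps=1e-5, weight_decay=0)

# one gradient step on a toy quadratic loss
loss = (params[0] ** 2).sum()
loss.backward()
optimizer.step()
```
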
orthogonal initialization of weights and constant initialization of biases: policy head gain 0.01, value head gain 1, encoder gain √2.
entropy: Andrychowicz et al. (2021) overall find no evidence that the entropy term improves performance on continuous-control environments (decision C13, figures 76 and 77).
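For context, a sketch of where the entropy coefficient enters a PPO-style loss; the distribution parameters and coefficient value are placeholders, not from the source:

```python
import torch
from torch.distributions import Normal

# hypothetical batch of Gaussian action-distribution parameters
mean = torch.zeros(32, 2)
std = torch.ones(32, 2)
dist = Normal(mean, std)

ent_coef = 0.0  # the finding above suggests 0 is a reasonable default
entropy_bonus = ent_coef * dist.entropy().sum(-1).mean()

# the bonus is subtracted from the total loss, so entropy is maximised:
# loss = policy_loss + vf_coef * value_loss - entropy_bonus
```
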
continuous actions via normal distributions: the Gaussian is the most popular choice,
while Haarnoja et al. (2018) use a state-dependent standard deviation, i.e. the network outputs the mean and the standard deviation together.
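A sketch of such a state-dependent-std Gaussian head; the class name, layer sizes, and the log-std clamp range are assumptions (the clamp is a common stability trick, not something the source specifies):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    # the same trunk outputs both mean and log-std, so the std
    # depends on the state, as in Haarnoja et al. (2018)
    def __init__(self, obs_dim=8, act_dim=2, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.trunk(obs)
        mean = self.mean_head(h)
        # clamp log-std for numerical stability (assumed range)
        log_std = self.log_std_head(h).clamp(-20, 2)
        return Normal(mean, log_std.exp())

policy = GaussianPolicy()
dist = policy(torch.zeros(4, 8))  # batch of 4 observations
action = dist.sample()            # shape (4, 2)
```

The common PPO alternative is a state-independent learned log-std parameter shared across all states; the variant above instead lets the exploration noise vary per state.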