Zeroth-Order Supervised Policy Improvement
Policy gradient (PG) algorithms have been widely used in reinforcement
learning (RL). However, PG algorithms exploit the learned value function
only locally, through first-order updates, which limits their
sample efficiency. In this work, we propose an alternative method called
Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the
estimated value function globally while preserving the local exploitation
of the PG methods based on zeroth-order policy optimization. This learning
paradigm follows Q-learning but overcomes the difficulty of efficiently
computing the argmax in a continuous action space. It finds the max-valued action within
a small number of samples. The policy learning of ZOSPI has two steps: first,
it samples actions and evaluates those actions with a learned value estimator,
and then it learns to perform the action with the highest value through
supervised learning. We further demonstrate that such a supervised learning
framework can learn multi-modal policies. Experiments show that ZOSPI achieves
competitive results on continuous control benchmarks with remarkable
sample efficiency.
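
The two-step policy learning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `q_value` is a hypothetical stand-in for a learned value estimator, candidate actions are drawn uniformly over an assumed action range of [-1, 1], and the policy is an assumed linear map fitted by least squares. All function names and parameters here are illustrative.

```python
import numpy as np

def q_value(state, actions):
    # Hypothetical stand-in for a learned value estimator Q(s, a).
    # Assumption for this toy: the value peaks at a = 0.5 * s.
    return -np.sum((actions - 0.5 * state) ** 2, axis=-1)

def best_sampled_action(state, q_fn, rng, n_samples=64, low=-1.0, high=1.0):
    # Step 1: sample candidate actions and score them with the value
    # estimator. Only zeroth-order information (values) is used; no
    # gradient of Q with respect to the action is needed.
    actions = rng.uniform(low, high, size=(n_samples, state.shape[0]))
    values = q_fn(state, actions)
    return actions[np.argmax(values)]

def fit_linear_policy(states, targets):
    # Step 2: supervised learning. Regress the highest-valued sampled
    # action on the state (ordinary least squares, assumed linear policy).
    weights, *_ = np.linalg.lstsq(states, targets, rcond=None)
    return weights

rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(200, 2))
targets = np.stack([best_sampled_action(s, q_value, rng) for s in states])
weights = fit_linear_policy(states, targets)
print(weights)  # expected to be close to 0.5 * identity in this toy setup
```

In this toy setting the fitted policy approximates the maximizer of the stand-in value function even though each per-state target is only the best of 64 random samples, because the regression averages out the sampling error across states.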