Deterministic Value-Policy Gradients
Reinforcement learning algorithms such as the deep deterministic policy gradient algorithm (DDPG) have been widely used in continuous control tasks. However, the model-free DDPG algorithm suffers from high sample complexity. In this paper we consider deterministic value gradients to improve the sample efficiency of deep reinforcement learning algorithms. Previous works consider deterministic value gradients over a finite horizon, which is too myopic compared with the infinite-horizon setting. We first give a theoretical guarantee of the existence of the value gradients in this infinite-horizon setting. Based on this guarantee, we propose a class of deterministic value gradient (DVG) algorithms with infinite horizon, in which different rollout steps of the analytical gradients through the learned model trade off between the variance of the value gradients and the model bias. Furthermore, to better combine the model-based deterministic value gradient estimators with the model-free deterministic policy gradient estimator, we propose the deterministic value-policy gradient (DVPG) algorithm. Finally, we conduct extensive experiments comparing DVPG with state-of-the-art methods on several standard continuous control benchmarks. Results demonstrate that DVPG substantially outperforms the other baselines.
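As a rough illustration of the combination the abstract describes, the sketch below blends a model-based deterministic value gradient with a model-free DDPG-style gradient through a convex mixing weight. The function names, the weight `alpha`, and the stand-in gradient estimators are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def blended_gradient(theta, alpha, model_value_gradient, ddpg_policy_gradient):
    # Model-based estimator: analytical gradient through a k-step rollout
    # of the learned model (lower variance, but carries model bias).
    g_model = model_value_gradient(theta)
    # Model-free estimator: critic-based deterministic policy gradient.
    g_free = ddpg_policy_gradient(theta)
    # Convex combination: alpha near 1 trusts the learned model,
    # alpha near 0 falls back to pure model-free DDPG.
    return alpha * g_model + (1.0 - alpha) * g_free

# Toy usage with stand-in gradient estimators.
theta = np.zeros(4)
g = blended_gradient(
    theta,
    alpha=0.5,
    model_value_gradient=lambda th: -2.0 * th + 1.0,  # stand-in
    ddpg_policy_gradient=lambda th: -2.0 * th + 0.8,  # stand-in
)
theta += 0.01 * g  # gradient ascent step on the expected return
```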
Scaling Reinforcement Learning Paradigms for Motor Control
Reinforcement learning offers a general framework to explain reward-related learning in artificial and biological motor control. However, current reinforcement learning methods rarely scale to high-dimensional movement systems and mainly operate in discrete, low-dimensional domains such as game playing and artificial toy problems. This drawback makes them unsuitable for application to human or bio-mimetic motor control. In this poster, we look at promising approaches that can potentially scale and suggest a novel formulation of the actor-critic algorithm that takes steps towards alleviating the current shortcomings. We argue that methods based on greedy policies are not likely to scale to high-dimensional domains, as they are problematic when used with function approximation, which is a necessity when dealing with continuous domains. We adopt the path of direct policy-gradient-based policy improvements, since they avoid the problems of destabilizing dynamics encountered in traditional value-iteration-based updates. While regular policy gradient methods have demonstrated promising results in the domain of humanoid motor control, we demonstrate that these methods can be significantly improved by using the natural policy gradient instead of the regular policy gradient. Based on this, we prove that Kakade's average natural policy gradient is indeed the true natural gradient. A general algorithm for estimating the natural gradient, the Natural Actor-Critic algorithm, is introduced. This algorithm converges with probability one to the nearest local minimum of the cost function in Riemannian space. The algorithm outperforms non-natural policy gradients by far in a cart-pole balancing evaluation and offers a promising route for the development of reinforcement learning for truly high-dimensional continuous state-action systems. Keywords: reinforcement learning, neuro-dynamic programming, actor-critic methods, policy gradient methods, natural policy gradient
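For intuition about the natural gradient step at the heart of the Natural Actor-Critic, here is a minimal sketch: the vanilla policy gradient is preconditioned by the inverse of an empirical Fisher information matrix built from per-sample score vectors. The damping term, step size, and stand-in data are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def natural_gradient_step(theta, grad, scores, lr=0.1, damping=1e-3):
    # Empirical Fisher information: average outer product of per-sample
    # score vectors grad_theta log pi_theta(a|s), shape (num_samples, dim).
    fisher = scores.T @ scores / scores.shape[0]
    # Damping keeps the Fisher estimate well conditioned and invertible.
    fisher += damping * np.eye(theta.size)
    # Precondition the vanilla gradient: solve F x = grad for x.
    natural_grad = np.linalg.solve(fisher, grad)
    return theta + lr * natural_grad

# Toy usage with random score samples and a stand-in vanilla gradient.
rng = np.random.default_rng(0)
scores = rng.normal(size=(256, 3))   # stand-in score vectors
grad = np.array([0.5, -0.2, 0.1])    # stand-in vanilla policy gradient
theta = natural_gradient_step(np.zeros(3), grad, scores)
```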