Compatible natural gradient policy search
Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust, resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
Scaling Reinforcement Learning Paradigms for Motor Control
Reinforcement learning offers a general framework to explain reward-related learning in artificial and biological motor control. However, current reinforcement learning methods rarely scale to high-dimensional movement systems and mainly operate in discrete, low-dimensional domains like game-playing, artificial toy problems, etc. This drawback makes them unsuitable for application to human or bio-mimetic motor control. In this poster, we look at promising approaches that can potentially scale and suggest a novel formulation of the actor-critic algorithm which takes steps towards alleviating the current shortcomings. We argue that methods based on greedy policies are not likely to scale into high-dimensional domains as they are problematic when used with function approximation, a must when dealing with continuous domains. We adopt the path of direct policy-gradient-based policy improvements since they avoid the problems of destabilizing dynamics encountered in traditional value-iteration-based updates. While regular policy gradient methods have demonstrated promising results in the domain of humanoid motor control, we demonstrate that these methods can be significantly improved using the natural policy gradient instead of the regular policy gradient. Based on this, it is proved that Kakade's average natural policy gradient is indeed the true natural gradient. A general algorithm for estimating the natural gradient, the Natural Actor-Critic algorithm, is introduced. This algorithm converges with probability one to the nearest local minimum in Riemannian space of the cost function. The algorithm outperforms non-natural policy gradients by far in a cart-pole balancing evaluation, and offers a promising route for the development of reinforcement learning for truly high-dimensional continuous state-action systems. Keywords: Reinforcement learning, neurodynamic programming, actor-critic methods, policy gradient methods, natural policy gradient
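The difference between the regular and the natural policy gradient can be illustrated on a toy problem. The sketch below is not the Natural Actor-Critic algorithm from the abstract; it assumes a single-state bandit with a tabular soft-max policy, and all names (`vanilla_gradient`, `natural_gradient`, `fisher_times`, `ascend`, the reward values) are hypothetical. In this setting the Fisher matrix is diag(pi) - pi pi^T, which is singular, so the natural gradient is taken to be one solution w of F w = g, namely the advantage r(a) - J.

```python
import math

def softmax(theta):
    """Numerically stable soft-max over a list of action preferences."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(pi, rewards):
    """J = sum_a pi(a) * r(a)."""
    return sum(p * r for p, r in zip(pi, rewards))

def vanilla_gradient(theta, rewards):
    """Regular policy gradient: dJ/dtheta_a = pi(a) * (r(a) - J)."""
    pi = softmax(theta)
    j = expected_reward(pi, rewards)
    return [p * (r - j) for p, r in zip(pi, rewards)]

def natural_gradient(theta, rewards):
    """One solution w of F w = g for the soft-max policy: w_a = r(a) - J."""
    pi = softmax(theta)
    j = expected_reward(pi, rewards)
    return [r - j for r in rewards]

def fisher_times(theta, w):
    """Multiply the soft-max Fisher matrix diag(pi) - pi pi^T by w."""
    pi = softmax(theta)
    dot = sum(p * x for p, x in zip(pi, w))
    return [p * (x - dot) for p, x in zip(pi, w)]

def ascend(theta, rewards, direction, step, iters):
    """Plain gradient ascent along the chosen direction."""
    for _ in range(iters):
        d = direction(theta, rewards)
        theta = [t + step * x for t, x in zip(theta, d)]
    return theta

rewards = [1.0, 0.0, 0.5]   # hypothetical arm rewards
theta0 = [0.0, 0.0, 0.0]    # uniform initial policy
for name, direction in [("vanilla", vanilla_gradient), ("natural", natural_gradient)]:
    theta = ascend(theta0, rewards, direction, step=0.5, iters=20)
    print(name, expected_reward(softmax(theta), rewards))
```

Because the natural direction here does not shrink with pi(a), it keeps making progress on rarely chosen arms, which is one intuition for why it can outpace the regular gradient in this toy setting.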
f-Divergence constrained policy improvement
To ensure stability of learning, state-of-the-art generalized policy
iteration algorithms augment the policy improvement step with a trust region
constraint bounding the information loss. The size of the trust region is
commonly determined by the Kullback-Leibler (KL) divergence, which not only
captures the notion of distance well but also yields closed-form solutions. In
this paper, we consider a more general class of f-divergences and derive the
corresponding policy update rules. The generic solution is expressed through
the derivative of the convex conjugate function to f and includes the KL
solution as a special case. Within the class of f-divergences, we further focus
on a one-parameter family of α-divergences to study effects of the
choice of divergence on policy improvement. Previously known as well as new
policy updates emerge for different values of α. We show that every type
of policy update comes with a compatible policy evaluation resulting from the
chosen f-divergence. Interestingly, the mean-squared Bellman error minimization
is closely related to policy evaluation with the Pearson χ²-divergence
penalty, while the KL divergence results in the soft-max policy update and a
log-sum-exp critic. We carry out asymptotic analysis of the solutions for
different values of α and demonstrate the effects of using different
divergence functions on a multi-armed bandit problem and on common standard
reinforcement learning problems.
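The KL special case mentioned above can be checked numerically on a multi-armed bandit. The sketch below makes several simplifying assumptions that are not in the abstract: a single state, a KL penalty with a fixed temperature `eta` instead of a trust-region constraint, and known arm rewards. Under those assumptions, maximizing sum_a pi(a) r(a) - eta * KL(pi || q) gives the soft-max update pi(a) ∝ q(a) exp(r(a) / eta), and the optimal objective value is the log-sum-exp eta * log sum_a q(a) exp(r(a) / eta). All function names are hypothetical.

```python
import math

def kl_policy_update(q, rewards, eta):
    """Soft-max update: pi(a) proportional to q(a) * exp(r(a) / eta)."""
    m = max(rewards)  # subtract the max reward for numerical stability
    weights = [qa * math.exp((r - m) / eta) for qa, r in zip(q, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def log_sum_exp_value(q, rewards, eta):
    """Optimal penalized value: eta * log sum_a q(a) * exp(r(a) / eta)."""
    m = max(rewards)
    s = sum(qa * math.exp((r - m) / eta) for qa, r in zip(q, rewards))
    return m + eta * math.log(s)

def kl(p, q):
    """KL(p || q) for discrete distributions with q(a) > 0."""
    return sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)

def penalized_objective(pi, q, rewards, eta):
    """Expected reward minus eta times the KL divergence from q."""
    return sum(p * r for p, r in zip(pi, rewards)) - eta * kl(pi, q)

q = [0.5, 0.3, 0.2]        # hypothetical old policy over three arms
rewards = [1.0, 2.0, 0.0]  # hypothetical arm rewards
eta = 0.7                  # hypothetical penalty temperature
pi = kl_policy_update(q, rewards, eta)
print(pi)
print(penalized_objective(pi, q, rewards, eta), log_sum_exp_value(q, rewards, eta))
```

The two printed values on the last line should agree up to floating-point error, which is the bandit analogue of pairing the soft-max policy update with a log-sum-exp critic.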
Expected Policy Gradients
We propose expected policy gradients (EPG), which unify stochastic policy
gradients (SPG) and deterministic policy gradients (DPG) for reinforcement
learning. Inspired by expected sarsa, EPG integrates across the action when
estimating the gradient, instead of relying only on the action in the sampled
trajectory. We establish a new general policy gradient theorem, of which the
stochastic and deterministic policy gradient theorems are special cases. We
also prove that EPG reduces the variance of the gradient estimates without
requiring deterministic policies and, for the Gaussian case, with no
computational overhead. Finally, we show that it is optimal in a certain sense
to explore with a Gaussian policy such that the covariance is proportional to
the exponential of the scaled Hessian of the critic with respect to the
actions. We present empirical results confirming that this new form of
exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic
in four challenging MuJoCo domains.
Comment: Conference paper, AAAI-18, 12 pages including supplement
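The core idea of integrating across actions, rather than using only the sampled action, can be sketched for a discrete action set. The example below is an illustration under assumptions that are not in the abstract: a single state, a tabular soft-max policy, and a known critic `q_values`. The names `spg_estimate`, `epg_estimate` and `spg_variance` are hypothetical. The single-sample stochastic estimate is grad log pi(a) * Q(a) for one sampled action a, while the expected estimate sums pi(a) * grad log pi(a) * Q(a) over all actions.

```python
import math

def softmax(theta):
    """Numerically stable soft-max over a list of action preferences."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def score(pi, a):
    """Gradient of log pi(a) w.r.t. soft-max preferences: e_a - pi."""
    return [(1.0 if k == a else 0.0) - pi[k] for k in range(len(pi))]

def spg_estimate(pi, q_values, a):
    """Single-sample stochastic policy gradient for the sampled action a."""
    return [s * q_values[a] for s in score(pi, a)]

def epg_estimate(pi, q_values):
    """Expected gradient: sum the per-action estimate over all actions."""
    grad = [0.0] * len(pi)
    for a, p in enumerate(pi):
        for k, g in enumerate(spg_estimate(pi, q_values, a)):
            grad[k] += p * g
    return grad

def spg_variance(pi, q_values):
    """Total variance of the single-sample estimate around its mean."""
    mean = epg_estimate(pi, q_values)
    return sum(
        p * sum((g - m) ** 2 for g, m in zip(spg_estimate(pi, q_values, a), mean))
        for a, p in enumerate(pi)
    )

pi = softmax([0.1, -0.3, 0.5])  # hypothetical policy preferences
q_values = [1.0, 0.0, 2.0]      # hypothetical critic values
print(epg_estimate(pi, q_values))
print(spg_variance(pi, q_values))
```

In this single-state setting the expected estimate has no sampling noise from the action at all, whereas the single-sample estimate has the variance reported by `spg_variance`. With many states, noise from state sampling would remain in both.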