Deterministic Value-Policy Gradients
Reinforcement learning algorithms such as the deep deterministic policy
gradient algorithm (DDPG) have been widely used in continuous control tasks.
However, the model-free DDPG algorithm suffers from high sample complexity. In
this paper we consider the deterministic value gradients to improve the sample
efficiency of deep reinforcement learning algorithms. Previous work considers
deterministic value gradients only over a finite horizon, which is too myopic
compared with the infinite-horizon setting. We first give a theoretical
guarantee of the existence of the value gradients in this infinite-horizon
setting. Based on this guarantee, we propose a class of deterministic value
gradient (DVG) algorithms with infinite horizon, in which the number of rollout
steps taken through the learned model's analytical gradients trades off the
variance of the value gradients against the model bias. Furthermore, to better combine the
model-based deterministic value gradient estimators with the model-free
deterministic policy gradient estimator, we propose the deterministic
value-policy gradient (DVPG) algorithm. Finally, we conduct extensive
experiments comparing DVPG with state-of-the-art methods on several standard
continuous control benchmarks. The results demonstrate that DVPG substantially
outperforms the other baselines.
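
To make the rollout-length trade-off concrete, the following is a minimal Python sketch (not the authors' code) of a k-step deterministic value-gradient actor update; the differentiable dynamics model model(s, a) -> (s_next, r), critic Q(s, a), and deterministic policy pi(s) are assumed stand-ins, and gradients flow through the learned model for k steps before a critic bootstrap.

# Hedged sketch of a k-step deterministic value-gradient actor update; model,
# Q, and pi are illustrative callables, not the paper's implementation.
import torch

def k_step_value(s, pi, model, Q, k, gamma=0.99):
    """Differentiable k-step return: model-based rewards, then a critic bootstrap.
    Small k leans on the critic (more gradient variance), large k on the model (more bias)."""
    ret, discount = torch.zeros(s.shape[0]), 1.0
    for _ in range(k):
        a = pi(s)
        s, r = model(s, a)               # gradients flow through the learned model
        ret = ret + discount * r
        discount *= gamma
    return ret + discount * Q(s, pi(s))  # model-free bootstrap at the horizon

def actor_update(states, pi, model, Q, optimizer, k):
    optimizer.zero_grad()
    loss = -k_step_value(states, pi, model, Q, k).mean()
    loss.backward()                      # analytic value gradient w.r.t. policy parameters
    optimizer.step()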
Expected Policy Gradients
We propose expected policy gradients (EPG), which unify stochastic policy
gradients (SPG) and deterministic policy gradients (DPG) for reinforcement
learning. Inspired by expected Sarsa, EPG integrates across actions when
estimating the gradient, instead of relying only on the action in the sampled
trajectory. We establish a new general policy gradient theorem, of which the
stochastic and deterministic policy gradient theorems are special cases. We
also prove that EPG reduces the variance of the gradient estimates without
requiring deterministic policies and, for the Gaussian case, with no
computational overhead. Finally, we show that it is optimal in a certain sense
to explore with a Gaussian policy such that the covariance is proportional to
the exponential of the scaled Hessian of the critic with respect to the
actions. We present empirical results confirming that this new form of
exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic
in four challenging MuJoCo domains.
Comment: Conference paper, AAAI-18, 12 pages including supplement
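
As a rough illustration of integrating across actions (a sketch under assumptions, not the paper's implementation), the Python snippet below estimates the gradient of the expected critic value with respect to the mean of a one-dimensional Gaussian policy using Gauss-Hermite quadrature, so many action points contribute rather than the single sampled action; Q(s, a), mu_s, and sigma are assumed stand-ins.

# Sketch of the EPG idea for a 1-D Gaussian policy: integrate the critic over
# actions via Gauss-Hermite quadrature rather than using only the sampled action.
import numpy as np

def epg_grad_mu(s, Q, mu_s, sigma, n_points=16):
    """Approximates d/d(mu) of E_{a ~ N(mu_s, sigma^2)}[Q(s, a)]."""
    # Probabilists' Hermite rule: integral of f(x) exp(-x^2/2) dx ~ sum_i w_i f(x_i)
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    grad = 0.0
    for x, w in zip(nodes, weights):
        a = mu_s + sigma * x                          # quadrature point in action space
        grad += w * Q(s, a) * (a - mu_s) / sigma**2   # score-function weighting
    return grad / np.sqrt(2.0 * np.pi)                # normalizing constant of the Gaussian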
Deep Residual Reinforcement Learning
We revisit residual algorithms in both model-free and model-based
reinforcement learning settings. We propose the bidirectional target network
technique to stabilize residual algorithms, yielding a residual version of DDPG
that significantly outperforms vanilla DDPG in the DeepMind Control Suite
benchmark. Moreover, we find the residual algorithm an effective approach to
the distribution mismatch problem in model-based planning. Compared with the
existing TD(k) method, our residual-based method makes weaker assumptions
about the model and yields a greater performance boost.
Comment: AAMAS 2020
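
For intuition only, here is a short Python sketch contrasting the usual semi-gradient DDPG critic loss with a residual loss that differentiates the Bellman error through both Q(s, a) and Q(s', pi(s')); the mixing weight eta and the module names are assumptions, and the paper's bidirectional target network technique is not reproduced here.

# Sketch of semi-gradient vs. residual critic losses; Q, pi, and the batch layout
# are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def critic_loss(Q, pi, batch, gamma=0.99, eta=0.5):
    s, a, r, s_next, done = batch
    q = Q(s, a)
    target = r + gamma * (1.0 - done) * Q(s_next, pi(s_next))
    semi = F.mse_loss(q, target.detach())    # gradient flows only through Q(s, a)
    resid = F.mse_loss(q, target)            # gradient flows through both sides of the error
    return (1.0 - eta) * semi + eta * resid  # eta interpolates between the two losses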