Conservative State Value Estimation for Offline Reinforcement Learning
Offline reinforcement learning faces a significant challenge of value over-estimation due to the distributional drift between the dataset and the currently learned policy, which leads to learning failure in practice. The common approach is to incorporate a penalty term into the reward or value estimate in the Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution (OOD) states and actions, existing methods focus on conservative Q-function estimation. In this paper, we propose Conservative State Value Estimation (CSVE), a new approach that learns a conservative V-function by directly imposing a penalty on OOD states. Compared to prior work, CSVE allows more effective in-data policy optimization with conservative value guarantees. Further, we apply CSVE to develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing the states around the dataset, and the actor applies advantage-weighted updates, extended with state exploration, to improve the policy. We evaluate our method on the classic continuous control tasks of D4RL, showing that it performs better than conservative Q-function learning methods and is strongly competitive among recent SOTA methods.
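A minimal sketch of the conservative critic update described above, assuming a PyTorch setting; the Gaussian perturbation used to sample states around the dataset, the penalty weight beta, and the function names are illustrative assumptions rather than the paper's exact formulation.

    # Sketch of a CSVE-style conservative V-function loss (assumptions noted above).
    import torch

    def conservative_v_loss(v_net, states, bellman_targets, beta=1.0, noise_std=0.1):
        # Regress V(s) towards the Bellman targets on dataset states.
        values = v_net(states)
        td_loss = ((values - bellman_targets) ** 2).mean()
        # Sample states "around" the dataset and penalize (push down) their values,
        # while pushing up the values of in-dataset states.
        ood_states = states + noise_std * torch.randn_like(states)
        penalty = v_net(ood_states).mean() - values.mean()
        return td_loss + beta * penalty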
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
Value-based reinforcement-learning algorithms provide state-of-the-art
results in model-free discrete-action settings, and tend to outperform
actor-critic algorithms. We argue that actor-critic algorithms are limited by
their need for an on-policy critic. We propose Bootstrapped Dual Policy
Iteration (BDPI), a novel model-free reinforcement-learning algorithm for
continuous states and discrete actions, with an actor and several off-policy
critics. Off-policy critics are compatible with experience replay, ensuring
high sample-efficiency, without the need for off-policy corrections. The actor,
by slowly imitating the average greedy policy of the critics, leads to
high-quality and state-specific exploration, which we compare to Thompson
sampling. Because the actor and critics are fully decoupled, BDPI is remarkably
stable, and unusually robust to its hyper-parameters. BDPI is significantly
more sample-efficient than Bootstrapped DQN, PPO, and ACKTR on discrete, continuous, and pixel-based tasks. Source code:
https://github.com/vub-ai-lab/bdpi
Comment: Accepted at the European Conference on Machine Learning 2019 (ECML 2019).
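A rough sketch of the actor update described above, in which the actor slowly imitates the average greedy policy of several off-policy critics; the per-state tabular representation and the step size lam are assumptions for illustration, not BDPI's exact learning rule.

    # Toy BDPI-style actor step for a single state (assumptions noted above).
    import numpy as np

    def actor_update(actor_probs, critic_q_values, lam=0.05):
        # actor_probs: (n_actions,) current actor distribution for this state.
        # critic_q_values: (n_critics, n_actions) Q-values from the off-policy critics.
        greedy = np.zeros_like(actor_probs)
        for q in critic_q_values:
            greedy[np.argmax(q)] += 1.0 / len(critic_q_values)  # average greedy policy
        new_probs = (1.0 - lam) * actor_probs + lam * greedy     # slow imitation step
        return new_probs / new_probs.sum()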
Safe Exploration Method for Reinforcement Learning under Existence of Disturbance
Recent rapid developments in reinforcement learning algorithms have opened up novel possibilities in many fields. However, due to their exploratory nature, we have to take risk into consideration when applying these algorithms to safety-critical problems, especially in real environments. In this study, we deal with a safe exploration problem in reinforcement learning under the existence of disturbance. We define safety during learning as the satisfaction of constraint conditions explicitly defined in terms of the state, and we propose a safe exploration method that uses partial prior knowledge of the controlled object and the disturbance. The proposed method assures satisfaction of the explicit state constraints with a pre-specified probability even if the controlled object is exposed to a stochastic disturbance following a normal distribution. As theoretical results, we introduce sufficient conditions for constructing the conservative (non-exploring) inputs used in the proposed method and prove that safety in the above sense is guaranteed by the proposed method. Furthermore, we illustrate the validity and effectiveness of the proposed method through numerical simulations of an inverted pendulum and a four-bar parallel link robot manipulator.
Comment: Accepted to the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD) 202
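A toy sketch of the kind of chance-constrained input selection the abstract describes, assuming a scalar linear system x_next = a*x + b*u + w with a Gaussian disturbance w and a single upper state bound; the dynamics, constraint, and fallback rule are assumptions for illustration, not the paper's construction.

    # Keep the exploratory input only if the state constraint holds with
    # probability at least delta under the assumed Gaussian disturbance;
    # otherwise fall back to a conservative (non-exploring) input.
    from scipy.stats import norm

    def choose_input(x, u_explore, u_conservative, a=1.0, b=1.0,
                     x_max=1.0, sigma_w=0.05, delta=0.95):
        def prob_safe(u):
            mean_next = a * x + b * u               # mean of x_next
            return norm.cdf((x_max - mean_next) / sigma_w)
        return u_explore if prob_safe(u_explore) >= delta else u_conservative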
Conservative Dual Policy Optimization for Efficient Model-Based Reinforcement Learning
Provably efficient Model-Based Reinforcement Learning (MBRL) based on optimism or posterior sampling (PSRL) is guaranteed to attain global optimality asymptotically by introducing the complexity measure of the model. However, the complexity might grow exponentially for even the simplest nonlinear models, in which case global convergence is impossible within finitely many iterations. When the model suffers a large generalization error, quantitatively measured by the model complexity, the uncertainty can be large. The sampled model on which the current policy is greedily optimized will thus be unstable, resulting in aggressive policy updates and over-exploration. In this work, we propose Conservative Dual Policy Optimization (CDPO), which consists of a Referential Update and a Conservative Update. The policy is first optimized under a reference model, which imitates the mechanism of PSRL while offering more stability. A conservative range of randomness is guaranteed by maximizing the expectation of the model value. Without harmful sampling procedures, CDPO can still achieve the same regret as PSRL. More importantly, CDPO enjoys monotonic policy improvement and global optimality simultaneously. Empirical results also validate the exploration efficiency of CDPO.
Comment: Published at NeurIPS 202
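A toy, fully runnable illustration of the shape of the two CDPO updates; representing the model posterior as a list of sampled value functions and restricting the conservative step to two candidate policies are simplifying assumptions, not the algorithm as analyzed in the paper.

    import random

    def cdpo_iteration(posterior_models, candidate_policies, policy):
        # Referential Update: greedily optimize under one reference model drawn
        # from the posterior, imitating the mechanism of PSRL.
        reference_model = random.choice(posterior_models)
        reference_policy = max(candidate_policies, key=reference_model)
        # Conservative Update: keep the policy that maximizes the value expected
        # over the whole posterior, tempering the randomness of the sampled model.
        def expected_value(p):
            return sum(m(p) for m in posterior_models) / len(posterior_models)
        return max([policy, reference_policy], key=expected_value)

    # Example usage with toy "models" that map a scalar policy to a value.
    models = [lambda p, w=w: w * p for w in (0.5, 1.0, 1.5)]
    print(cdpo_iteration(models, candidate_policies=[0.0, 0.5, 1.0], policy=0.0))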
Careful at Estimation and Bold at Exploration
Exploration strategies in continuous action spaces are often heuristic because the set of actions is infinite, and such methods do not lead to general conclusions. Prior work has shown that policy-based exploration is beneficial for continuous action spaces in deterministic policy reinforcement learning (DPRL). However, policy-based exploration in DPRL has two prominent issues, aimless exploration and policy divergence, and the policy gradient for exploration is only sometimes helpful due to inaccurate estimation. Based on the double Q-function framework, we introduce a novel exploration strategy, separate from the policy gradient, to mitigate these issues. We first propose a greedy Q softmax update scheme for the Q-value update. The expected Q-value is derived by a weighted sum of the conservative Q-values over actions, where the weight is the corresponding greedy Q-value. Greedy Q takes the maximum of the two Q-functions, and conservative Q takes the minimum of the two Q-functions. For practicality, this theoretical basis is then extended to allow us to combine action exploration with the Q-value update, on the premise that we have a surrogate policy that behaves like the exploration policy. In practice, we construct such an exploration policy from a few sampled actions, and, to meet the premise, we learn the surrogate policy by minimizing the KL divergence between the target policy and the exploration policy constructed from the conservative Q. We evaluate our method on the MuJoCo benchmark and demonstrate superior performance compared to previous state-of-the-art methods across various environments, particularly in the most complex Humanoid environment.
Comment: 20 pages
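A small sketch of the greedy Q softmax target described above, computed over a few sampled actions at one state; the explicit softmax weighting and the temperature are illustrative assumptions about how the weights are formed.

    import numpy as np

    def expected_q_target(q1_values, q2_values, temperature=1.0):
        # q1_values, q2_values: (n_sampled_actions,) estimates from the two Q-functions.
        greedy_q = np.maximum(q1_values, q2_values)        # greedy Q: max of the two
        conservative_q = np.minimum(q1_values, q2_values)  # conservative Q: min of the two
        # Weight the conservative Q-values by a softmax over the greedy Q-values.
        weights = np.exp(greedy_q / temperature)
        weights /= weights.sum()
        return float(np.sum(weights * conservative_q))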