6 research outputs found
Learning to Combat Compounding-Error in Model-Based Reinforcement Learning
Despite its potential to improve sample complexity relative to model-free
approaches, model-based reinforcement learning can fail catastrophically if the
model is inaccurate. Ideally, an algorithm should be able to trust an imperfect
model over a reasonably long planning horizon and fall back on model-free
updates only when the model errors grow too large. In this paper, we
investigate techniques for choosing the planning horizon on a state-dependent
basis, where a state's planning horizon is determined by the maximum cumulative
model error around that state. We demonstrate that these state-dependent model
errors can be learned with Temporal Difference methods, based on a novel
approach of temporally decomposing the cumulative model errors. Experimental
results show that the proposed method can successfully adapt the planning
horizon to account for state-dependent model accuracy, significantly improving
the efficiency of policy learning compared to model-based and model-free
baselines.
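To make the mechanism concrete, below is a minimal Python sketch, under assumed names and a tabular setting, of how a state-dependent cumulative model error could be learned with a TD(0)-style update and then used to pick a planning horizon. The identifiers (err_value, error_threshold, max_horizon) and the horizon schedule are illustrative assumptions, not the paper's implementation.

```python
# Sketch: TD-style learning of a state-dependent cumulative model error,
# then a horizon rule that trusts the model less where that error is large.
import numpy as np

n_states = 100
gamma = 0.95           # discount applied to future model errors (assumption)
alpha = 0.1            # TD learning rate
error_threshold = 0.5  # tolerable cumulative error before trusting the model less
max_horizon = 10

err_value = np.zeros(n_states)  # E(s): estimated cumulative model error from s

def td_update_error(s, s_next, one_step_model_error):
    """TD(0) update that temporally decomposes the cumulative model error:
    E(s) <- E(s) + alpha * [e(s, a, s') + gamma * E(s') - E(s)]."""
    target = one_step_model_error + gamma * err_value[s_next]
    err_value[s] += alpha * (target - err_value[s])

def planning_horizon(s):
    """Shorter horizons where the learned cumulative error is large."""
    if err_value[s] >= error_threshold:
        return 1  # barely trust the model around this state
    frac = 1.0 - err_value[s] / error_threshold  # crude, illustrative schedule
    return max(1, int(round(frac * max_horizon)))

# Toy usage: observe a one-step model error of 0.2 on transition 3 -> 4
td_update_error(3, 4, 0.2)
h = planning_horizon(3)
```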
Domain Knowledge Integration By Gradient Matching For Sample-Efficient Reinforcement Learning
Model-free deep reinforcement learning (RL) agents can learn an effective
policy directly from repeated interactions with a black-box environment.
However, in practice, these algorithms often require large amounts of training
experience to learn and generalize well. In addition, classic model-free
learning ignores the domain information contained in the state transition
tuples. Model-based RL, on the other hand, attempts to learn a model of the
environment from experience and is substantially more sample efficient, but
suffers from significant asymptotic bias owing to the imperfect
dynamics model. In this paper, we propose a gradient matching algorithm to
improve sample efficiency by utilizing target slope information from the
dynamics predictor to aid the model-free learner. Concretely, we present a
technique for matching the gradient information from the model-based learner
with the model-free component in an abstract low-dimensional space, and we
validate the proposed technique through experiments that demonstrate the
efficacy of this approach.
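As a rough illustration of the idea rather than the paper's algorithm, the PyTorch sketch below adds a gradient-matching penalty that aligns the slope of a model-free value head with the slope of a model-based estimate, both taken with respect to a shared low-dimensional embedding. The network shapes, the placeholder TD loss, and the weight lambda_gm are assumptions.

```python
# Sketch: auxiliary gradient-matching loss between model-based and
# model-free estimates in a shared low-dimensional ("abstract") space.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))  # s -> z
q_head = nn.Linear(4, 1)      # model-free value head on the embedding
model_head = nn.Linear(4, 1)  # stand-in for a value derived from the learned dynamics

lambda_gm = 0.1
state = torch.randn(32, 8)

z = embed(state)
q_val = q_head(z).sum()
m_val = model_head(z).sum()

# Slopes of both estimates with respect to the shared abstract space
grad_q = torch.autograd.grad(q_val, z, create_graph=True)[0]
grad_m = torch.autograd.grad(m_val, z, create_graph=True)[0]

# Gradient-matching penalty added to the usual model-free loss
td_loss = q_head(embed(state)).pow(2).mean()        # placeholder for a real TD error
gm_loss = (grad_q - grad_m.detach()).pow(2).mean()  # match the model-based slope
loss = td_loss + lambda_gm * gm_loss
loss.backward()
```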
Planning with Exploration: Addressing Dynamics Bottleneck in Model-based Reinforcement Learning
Model-based reinforcement learning (MBRL) is believed to have higher sample
efficiency than model-free reinforcement learning (MFRL). However, MBRL is
plagued by the dynamics bottleneck dilemma: the phenomenon in which an
algorithm's performance settles into a local optimum rather than improving as
the number of interaction steps with the environment increases, so that more
data does not bring better performance. In this paper, we show through
theoretical analysis that trajectory reward estimation error is the main cause
of the dynamics bottleneck dilemma. We give an upper bound on the trajectory
reward estimation error and point out that increasing the agent's exploration
ability is the key to reducing this error, thereby alleviating the dynamics
bottleneck dilemma.
Motivated by this, we propose MOdel-based Progressive Entropy-based Exploration
(MOPE2), a model-based control method that incorporates exploration. We conduct
experiments on several complex continuous control benchmark tasks. The results
verify that MOPE2 effectively alleviates the dynamics bottleneck dilemma
and achieves higher sample efficiency than previous MBRL and MFRL algorithms.
Comment: 15 pages, 8 figures
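A minimal sketch of the general recipe the abstract describes: model-based rollouts whose objective includes an exploration bonus with a scheduled coefficient. The interfaces (model, policy.sample, policy.entropy) and the decay schedule are assumptions and do not reproduce the published MOPE2 algorithm.

```python
# Sketch: model rollouts scored by predicted reward plus an entropy-style
# exploration bonus whose weight follows a "progressive" schedule.
import numpy as np

def entropy_coeff(iteration, beta0=0.2, decay=0.99):
    """Assumed schedule: start exploratory, decay as training progresses."""
    return beta0 * decay ** iteration

def rollout_with_exploration(model, reward_fn, policy, state, horizon, beta):
    """Accumulate predicted reward plus an entropy bonus along a model rollout."""
    total, s = 0.0, state
    for _ in range(horizon):
        a = policy.sample(s)
        total += reward_fn(s, a) + beta * policy.entropy(s)
        s = model(s, a)  # learned dynamics prediction
    return total

# Toy usage with stand-in components
class StubPolicy:
    def sample(self, s): return np.tanh(np.random.randn(2))
    def entropy(self, s): return 1.0  # constant stand-in for the policy entropy

model = lambda s, a: s + 0.1 * np.concatenate([a, a])  # dummy 4-d dynamics
reward_fn = lambda s, a: -float(np.sum(s ** 2))
value = rollout_with_exploration(model, reward_fn, StubPolicy(),
                                 np.zeros(4), horizon=5, beta=entropy_coeff(0))
```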
Bidirectional Model-based Policy Optimization
Model-based reinforcement learning approaches leverage a forward dynamics
model to support planning and decision making; however, they may fail
catastrophically if the model is inaccurate. Although several existing methods
are dedicated to combating model error, the potential of a single forward
model is still limited. In this paper, we propose to
additionally construct a backward dynamics model to reduce the reliance on
accuracy in forward model predictions. We develop a novel method, called
Bidirectional Model-based Policy Optimization (BMPO) to utilize both the
forward model and backward model to generate short branched rollouts for policy
optimization. Furthermore, we theoretically derive a tighter bound of return
discrepancy, which shows the superiority of BMPO against the one using merely
the forward model. Extensive experiments demonstrate that BMPO outperforms
state-of-the-art model-based methods in terms of sample efficiency and
asymptotic performance.
Comment: Accepted at ICML 2020
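The sketch below illustrates the branched-rollout idea under assumed interfaces (forward_model, backward_model, policy, backward_policy); it is not the authors' BMPO code, only a minimal picture of how forward and backward predictions can both contribute synthetic transitions for policy optimization.

```python
# Sketch: short branched rollouts grown in both directions from a real state.
import numpy as np

def bidirectional_rollout(s0, forward_model, backward_model,
                          policy, backward_policy, k=3):
    """Return (state, action, next_state) tuples from both rollout directions."""
    transitions = []

    # Forward branch: s0 -> s1 -> ... -> sk
    s = s0
    for _ in range(k):
        a = policy(s)
        s_next = forward_model(s, a)
        transitions.append((s, a, s_next))
        s = s_next

    # Backward branch: s_{-k} <- ... <- s_{-1} <- s0
    s = s0
    for _ in range(k):
        a_prev = backward_policy(s)         # action assumed to have led into s
        s_prev = backward_model(s, a_prev)  # predicted predecessor state
        transitions.append((s_prev, a_prev, s))
        s = s_prev

    return transitions

# Toy usage with stand-in models and policies
f = lambda s, a: s + a
b = lambda s, a: s - a
pi = lambda s: 0.1 * np.ones_like(s)
synthetic = bidirectional_rollout(np.zeros(3), f, b, pi, pi, k=2)
```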
Learning Accurate Long-term Dynamics for Model-based Reinforcement Learning
Accurately predicting the dynamics of robotic systems is crucial for
model-based control and reinforcement learning. The most common way to estimate
dynamics is by fitting a one-step ahead prediction model and using it to
recursively propagate the predicted state distribution over long horizons.
Unfortunately, this approach is known to compound even small prediction errors,
making long-term predictions inaccurate. In this paper, we propose a new
parametrization for supervised learning on state-action data that predicts
stably over longer horizons, which we call a trajectory-based model. This
trajectory-based model takes an initial state, a future time index, and control
parameters as inputs, and predicts the state at the future time. Our results in
simulated and experimental robotic tasks show that our trajectory-based models
yield significantly more accurate long-term predictions, improved sample
efficiency, and the ability to predict task reward.
Comment: 8 pages, +2 pages appendix
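A minimal PyTorch sketch of such a parametrization, assuming a simple MLP over the concatenated inputs; the dimensions and the interpretation of the control parameters (e.g. controller gains) are illustrative and not taken from the paper.

```python
# Sketch: a trajectory-based model mapping (initial state, future time index,
# control parameters) directly to the predicted state at that time, avoiding
# the recursive composition of a one-step model.
import torch
import torch.nn as nn

class TrajectoryModel(nn.Module):
    def __init__(self, state_dim=4, ctrl_dim=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1 + ctrl_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s0, t, ctrl):
        # t: (batch, 1) future time indices (normalized in practice)
        return self.net(torch.cat([s0, t, ctrl], dim=-1))

model = TrajectoryModel()
s0 = torch.randn(8, 4)       # batch of initial states
t = torch.rand(8, 1) * 50.0  # future time indices
ctrl = torch.randn(8, 3)     # control parameters, e.g. assumed PD gains
pred = model(s0, t, ctrl)    # predicted state at time t, shape (8, 4)
```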
A Contraction Approach to Model-based Reinforcement Learning
Despite its experimental success, Model-based Reinforcement Learning still
lacks a complete theoretical understanding. To this end, we analyze the error
in the cumulative reward using a contraction approach. We consider both
stochastic and deterministic state transitions for continuous (non-discrete)
state and action spaces. This approach does not require strong assumptions and
recovers the typical error bound that is quadratic in the horizon. We prove that branched
rollouts can reduce this error and are essential for deterministic transitions
to have a Bellman contraction. Our analysis of policy mismatch error also
applies to Imitation Learning. In this case, we show that GAN-type learning has
an advantage over Behavioral Cloning when its discriminator is well-trained.
Comment: The 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021
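For context, bounds of this kind are usually stated in a simulation-lemma style; the LaTeX below sketches the generic quadratic-in-horizon dependence, with H the horizon, \epsilon_m the one-step model error, and R_max the reward bound. It is an assumed, generic form, not the paper's exact theorem.

```latex
% Generic simulation-lemma-style bound (sketch, not the paper's statement):
% with per-step model error \epsilon_m, horizon H, and rewards bounded by
% R_{\max}, the value under the learned model \hat{M} deviates from the true
% value by a term that grows quadratically in the horizon.
\[
\bigl| V^{\pi}_{M}(s) - V^{\pi}_{\hat{M}}(s) \bigr|
\;=\; O\!\bigl( H^{2}\, \epsilon_{m}\, R_{\max} \bigr)
\]
```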