Randomized Prior Functions for Deep Reinforcement Learning
Dealing with uncertainty is essential for efficient reinforcement learning.
There is a growing literature on uncertainty estimation for deep learning from
fixed datasets, but many of the most popular approaches are poorly-suited to
sequential decision problems. Other methods, such as bootstrap sampling, have
no mechanism for uncertainty that does not come from the observed data. We
highlight why this can be a crucial shortcoming and propose a simple remedy
through the addition of a randomized, untrainable 'prior' network to each ensemble
member. We prove that this approach is efficient with linear representations,
provide simple illustrations of its efficacy with nonlinear representations and
show that this approach scales to large-scale problems far better than previous
attempts.
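As a concrete illustration of the additive-prior idea, here is a minimal PyTorch sketch of one ensemble member; the layer sizes and prior scale are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of a randomized-prior ensemble member (PyTorch).
# The prior net is randomly initialized and frozen; only `trainable` learns.
import torch
import torch.nn as nn

class PriorQNet(nn.Module):
    def __init__(self, obs_dim, n_actions, prior_scale=3.0):
        super().__init__()
        make = lambda: nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
        self.trainable = make()
        self.prior = make()
        for p in self.prior.parameters():      # freeze the prior network
            p.requires_grad_(False)
        self.prior_scale = prior_scale

    def forward(self, obs):
        # Gradients flow only through the trainable component.
        return self.trainable(obs) + self.prior_scale * self.prior(obs)

ensemble = [PriorQNet(obs_dim=4, n_actions=2) for _ in range(10)]
q = ensemble[0](torch.randn(1, 4))             # Q-values for one observation
```

Because each member keeps its own fixed random prior, ensemble disagreement persists even in regions with no data, which is exactly the uncertainty source bootstrap sampling alone lacks.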
Successor Uncertainties: Exploration and Uncertainty in Temporal Difference Learning
Posterior sampling for reinforcement learning (PSRL) is an effective method
for balancing exploration and exploitation in reinforcement learning.
Randomised value functions (RVF) can be viewed as a promising approach to
scaling PSRL. However, we show that most contemporary algorithms combining RVF
with neural network function approximation do not possess the properties which
make PSRL effective, and provably fail in sparse reward problems. Moreover, we
find that propagation of uncertainty, a property of PSRL previously thought
important for exploration, does not preclude this failure. We use these
insights to design Successor Uncertainties (SU), a cheap and easy-to-implement
RVF algorithm that retains key properties of PSRL. SU is highly effective on
hard tabular exploration benchmarks. Furthermore, on the Atari 2600 domain, it
surpasses human performance on 38 of 49 games tested (achieving a median human
normalised score of 2.09), and outperforms its closest RVF competitor,
Bootstrapped DQN, on 36 of those.
Comment: Camera-ready version, NeurIPS 2019.
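The posterior-sampling step such RVF methods rely on can be sketched in a few lines. This toy assumes fixed successor-style features `Psi` and a given Gaussian posterior over the weights; it omits how SU actually learns both.

```python
# Hedged sketch of the sampling step in RVF methods like SU: Q is linear in
# (successor) features, Q(s, a) = psi(s, a) @ w, with a Gaussian posterior
# over w. One w is drawn per episode and the agent acts greedily under it.
import numpy as np

rng = np.random.default_rng(0)
d, n_actions = 8, 4
Psi = rng.normal(size=(n_actions, d))          # psi(s, a) for one state s
w_mean, w_cov = np.zeros(d), np.eye(d)         # assumed posterior over w

w = rng.multivariate_normal(w_mean, w_cov)     # one sample per episode
action = int(np.argmax(Psi @ w))               # greedy w.r.t. the sampled Q
```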
Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
Model-based reinforcement learning (RL) algorithms can attain excellent
sample efficiency, but often lag behind the best model-free algorithms in terms
of asymptotic performance. This is especially true with high-capacity
parametric function approximators, such as deep networks. In this paper, we
study how to bridge this gap, by employing uncertainty-aware dynamics models.
We propose a new algorithm called probabilistic ensembles with trajectory
sampling (PETS) that combines uncertainty-aware deep network dynamics models
with sampling-based uncertainty propagation. Our comparison to state-of-the-art
model-based and model-free deep RL algorithms shows that our approach matches
the asymptotic performance of model-free algorithms on several challenging
benchmark tasks, while requiring significantly fewer samples (e.g., 8 and 125
times fewer samples than Soft Actor Critic and Proximal Policy Optimization
respectively, on the half-cheetah task).
Comment: NIPS 2018; video and code available at https://sites.google.com/view/drl-in-a-handful-of-trials
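A heavily simplified sketch of the planning loop follows: each rollout step queries a randomly chosen ensemble member and samples a particle from its predictive Gaussian. The toy linear "models", quadratic cost, and random-shooting search are stand-ins for learned networks and the CEM optimizer typically used in practice.

```python
# Hedged sketch of PETS-style planning: propagate candidate action sequences
# through a probabilistic ensemble and execute the first action of the best.
import numpy as np

rng = np.random.default_rng(0)

def make_member(k):
    coef = 0.9 + 0.02 * k                      # stand-in for a learned network
    def step(s, a):                            # predicts a Gaussian next state
        return coef * s + 0.1 * a, 0.01 * np.ones_like(s)
    return step

models = [make_member(k) for k in range(5)]    # the probabilistic ensemble

def rollout_return(s0, actions):               # trajectory sampling (TS)
    s, total = s0.copy(), 0.0
    for a in actions:
        step = models[rng.integers(len(models))]  # random member per step
        mean, std = step(s, a)
        s = rng.normal(mean, std)              # sample one particle
        total += -np.sum(s**2)                 # toy cost: distance from origin
    return total

s0 = np.ones(2)
plans = [rng.normal(size=(10, 2)) for _ in range(64)]  # random-shooting MPC
best = max(plans, key=lambda acts: rollout_return(s0, acts))
first_action = best[0]                         # execute only the first action
```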
A Model-Based Reinforcement Learning Approach for a Rare Disease Diagnostic Task
In this work, we present our various contributions to the objective of
building a decision support tool for the diagnosis of rare diseases. Our goal
is to achieve a state of knowledge where the uncertainty about the patient's
disease is below a predetermined threshold. We aim to reach such states while
minimizing the average number of medical tests to perform. In doing so, we take
into account the need, in many medical applications, to avoid, as much as
possible, any misdiagnosis. To solve this optimization task, we investigate
several reinforcement learning algorithms and make them operable in our
high-dimensional and sparse-reward setting. We also present a way to combine
expert knowledge, expressed as conditional probabilities, with real clinical
data. This is crucial because the scarcity of data in the field of rare
diseases prevents any approach based solely on clinical data. Finally, we show
that it is possible to integrate ontological information about symptoms while
remaining within our probabilistic reasoning framework. This enables our decision
support tool to process information given at different levels of precision by the user.
Comment: 24 pages
Efficient Exploration through Bayesian Deep Q-Networks
We study reinforcement learning (RL) in high dimensional episodic Markov
decision processes (MDP). We consider value-based RL when the optimal Q-value
is a linear function of d-dimensional state-action feature representation. For
instance, in deep-Q networks (DQN), the Q-value is a linear function of the
feature representation layer (output layer). We propose two algorithms, one
based on optimism, LINUCB, and another based on posterior sampling, LINPSRL. We
guarantee frequentist and Bayesian regret upper bounds of O(d√T) for
these two algorithms, where T is the number of episodes. We extend these
methods to deep RL and propose Bayesian deep Q-networks (BDQN), which uses an
efficient Thompson sampling algorithm for high dimensional RL. We deploy the
double DQN (DDQN) approach, and instead of learning the last layer of Q-network
using linear regression, we use Bayesian linear regression, resulting in an
approximated posterior over Q-function. This allows us to directly incorporate
the uncertainty over the Q-function and deploy Thompson sampling on the learned
posterior distribution resulting in efficient exploration/exploitation
trade-off. We empirically study the behavior of BDQN on a wide range of Atari
games. Since BDQN carries out more efficient exploration and exploitation, it
is able to reach higher returns substantially faster than DDQN.
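A sketch of the last-layer idea: treat the DQN's final features as fixed, fit a Bayesian linear regression to TD targets, and Thompson-sample the weights. The priors, noise variance, and synthetic data below are illustrative, not the paper's settings.

```python
# Hedged sketch of the BDQN mechanism: Bayesian linear regression over the
# last-layer features phi(s), followed by Thompson sampling of the weights.
import numpy as np

rng = np.random.default_rng(0)
d, sigma2, lam = 16, 1.0, 1.0                  # feature dim, noise var, prior
Phi = rng.normal(size=(500, d))                # features of replayed states
y = Phi @ rng.normal(size=d) + rng.normal(size=500)  # stand-in TD targets

# Posterior of w for y = Phi w + noise is Gaussian N(mean, cov):
cov = np.linalg.inv(Phi.T @ Phi / sigma2 + lam * np.eye(d))
mean = cov @ Phi.T @ y / sigma2

w = rng.multivariate_normal(mean, cov)         # Thompson sample of the weights
phi_s = rng.normal(size=(4, d))                # phi(s), stacked per action
action = int(np.argmax(phi_s @ w))             # greedy under the sampled w
```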
BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems
We present a new algorithm that significantly improves the efficiency of
exploration for deep Q-learning agents in dialogue systems. Our agents explore
via Thompson sampling, drawing Monte Carlo samples from a Bayes-by-Backprop
neural network. Our algorithm learns much faster than common exploration
strategies such as ε-greedy, Boltzmann, bootstrapping, and
intrinsic-reward-based ones. Additionally, we show that spiking the replay
buffer with experiences from just a few successful episodes can make Q-learning
feasible when it might otherwise fail.
Comment: 13 pages, 9 figures
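The Thompson-sampling mechanism can be sketched with a single Bayes-by-Backprop-style layer whose forward pass draws one weight sample via the reparameterization trick; this is one illustrative layer, not the full dialogue agent.

```python
# Hedged sketch: a variational Gaussian posterior over weights, with one
# Monte Carlo weight sample per forward pass (one Thompson sample of Q).
import torch
import torch.nn as nn

class BayesLinear(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.rho = nn.Parameter(torch.full((n_out, n_in), -3.0))

    def forward(self, x):
        sigma = torch.nn.functional.softplus(self.rho)  # keep sigma > 0
        w = self.mu + sigma * torch.randn_like(sigma)   # reparameterized draw
        return x @ w.t()

layer = BayesLinear(8, 4)
q_sample = layer(torch.randn(1, 8))            # one Thompson sample of Q-values
```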
Model-Augmented Actor-Critic: Backpropagating through Paths
Current model-based reinforcement learning approaches use the model simply as
a learned black-box simulator to augment the data for policy optimization or
value function learning. In this paper, we show how to make more effective use
of the model by exploiting its differentiability. We construct a policy
optimization algorithm that uses the pathwise derivative of the learned model
and policy across future timesteps. Instabilities of learning across many
timesteps are prevented by using a terminal value function, learning the policy
in an actor-critic fashion. Furthermore, we present a derivation on the
monotonic improvement of our objective in terms of the gradient error in the
model and value function. We show that our approach (i) is consistently more
sample efficient than existing state-of-the-art model-based algorithms, (ii)
matches the asymptotic performance of model-free algorithms, and (iii) scales
to long horizons, a regime where typically past model-based approaches have
struggled.
Comment: Accepted paper at ICLR 2020.
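A minimal sketch of the pathwise objective: unroll a differentiable learned model for H steps, add a terminal value, and backpropagate through the whole rollout into the policy. The linear `model`, `value`, and reward below are stand-ins for learned networks, not the paper's architecture.

```python
# Hedged sketch of backpropagating through paths: the return of a model-based
# rollout, closed by a terminal critic, is differentiated w.r.t. the policy.
import torch
import torch.nn as nn

H, s_dim, a_dim = 5, 3, 2
policy = nn.Sequential(nn.Linear(s_dim, 32), nn.Tanh(), nn.Linear(32, a_dim))
model = nn.Linear(s_dim + a_dim, s_dim)        # stand-in learned dynamics
value = nn.Linear(s_dim, 1)                    # stand-in terminal critic
reward = lambda s, a: -(s**2).sum(-1)          # stand-in differentiable reward

s = torch.zeros(1, s_dim)
ret = torch.zeros(1)
for _ in range(H):                             # differentiable rollout
    a = policy(s)
    ret = ret + reward(s, a)
    s = model(torch.cat([s, a], dim=-1))
ret = ret + value(s).squeeze(-1)               # terminal value closes the horizon

loss = -ret.mean()
loss.backward()                                # gradients flow through the path
```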
Deep Exploration via Randomized Value Functions
We study the use of randomized value functions to guide deep exploration in
reinforcement learning. This offers an elegant means for synthesizing
statistically and computationally efficient exploration with common practical
approaches to value function learning. We present several reinforcement
learning algorithms that leverage randomized value functions and demonstrate
their efficacy through computational studies. We also prove a regret bound that
establishes statistical efficiency with a tabular representation.
Comment: Accepted for publication in the Journal of Machine Learning Research, 2019.
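One concrete instance of a randomized value function is an RLSVI-style update: fit Q to noise-perturbed targets, regularized toward a random prior draw, so that each fit behaves like an approximate posterior sample. The features and targets below are synthetic placeholders.

```python
# Hedged sketch of one RLSVI-style randomized value-function fit.
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 200
Phi = rng.normal(size=(n, d))                  # state-action features
targets = Phi @ rng.normal(size=d)             # stand-in TD targets

noisy = targets + rng.normal(scale=0.5, size=n)       # perturb the targets
w0 = rng.normal(size=d)                               # random prior draw
lam = 1.0
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d),    # regularized least
                    Phi.T @ noisy + lam * w0)         # squares toward w0
# Acting greedily w.r.t. Phi @ w for a whole episode drives deep exploration.
```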
Efficient exploration with Double Uncertain Value Networks
This paper studies directed exploration for reinforcement learning agents by
tracking uncertainty about the value of each available action. We identify two
sources of uncertainty that are relevant for exploration. The first originates
from limited data (parametric uncertainty), while the second originates from
the distribution of the returns (return uncertainty). We identify methods to
learn these distributions with deep neural networks, where we estimate
parametric uncertainty with Bayesian drop-out, while return uncertainty is
propagated through the Bellman equation as a Gaussian distribution. Then, we
identify that both can be jointly estimated in one network, which we call the
Double Uncertain Value Network. The policy is directly derived from the learned
distributions based on Thompson sampling. Experimental results show that both
types of uncertainty may vastly improve learning in domains with a strong
exploration challenge.
Comment: Deep Reinforcement Learning Symposium @ Conference on Neural Information Processing Systems (NIPS) 2017.
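A sketch of how the two uncertainties can live in one network: dropout kept active at action time supplies parametric uncertainty, while a (mean, variance) head models the return distribution. The architecture below is an illustrative guess, not the paper's exact network.

```python
# Hedged sketch of a double-uncertainty value network: MC dropout for
# parametric uncertainty, a Gaussian head for return uncertainty.
import torch
import torch.nn as nn

class UncertainValueNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                  nn.Dropout(p=0.1))   # kept on when acting
        self.mean = nn.Linear(64, n_actions)
        self.log_var = nn.Linear(64, n_actions)

    def forward(self, obs):
        h = self.body(obs)
        return self.mean(h), self.log_var(h).exp()

net = UncertainValueNet(4, 2)
net.train()                                    # keep dropout stochastic
mean, var = net(torch.randn(1, 4))             # one dropout (parametric) sample
q_sample = mean + var.sqrt() * torch.randn_like(var)  # return-distribution draw
action = int(q_sample.argmax(dim=-1))          # Thompson-style action choice
```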
A Tour of Reinforcement Learning: The View from Continuous Control
This manuscript surveys reinforcement learning from the perspective of
optimization and control with a focus on continuous control applications. It
surveys the general formulation, terminology, and typical experimental
implementations of reinforcement learning and reviews competing solution
paradigms. In order to compare the relative merits of various techniques, this
survey presents a case study of the Linear Quadratic Regulator (LQR) with
unknown dynamics, perhaps the simplest and best-studied problem in optimal
control. The manuscript describes how merging techniques from learning theory
and control can provide non-asymptotic characterizations of LQR performance and
shows that these characterizations tend to match experimental behavior. In
turn, when revisiting more complex applications, many of the observed phenomena
in LQR persist. In particular, theory and experiment demonstrate the role and
importance of models and the cost of generality in reinforcement learning
algorithms. This survey concludes with a discussion of some of the challenges
in designing learning systems that safely and reliably interact with complex
and uncertain environments and how tools from reinforcement learning and
control might be combined to approach these challenges.
Comment: minor revision with a few clarifying passages and corrected typos.
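For readers unfamiliar with the running example, the nominal LQR problem fits in a few lines: with known dynamics, the optimal feedback gain follows from iterating the Riccati recursion. The system matrices below are an arbitrary toy double integrator; the "unknown dynamics" setting the survey studies replaces A and B with estimates.

```python
# Worked instance of the survey's running example: discrete-time LQR solved
# by iterating the Riccati recursion to a fixed point.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])         # toy double-integrator dynamics
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)                    # state and input costs

P = Q.copy()
for _ in range(500):                           # Riccati fixed-point iteration
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
# Optimal policy: u_t = -K x_t. With unknown dynamics, A and B must be learned.
```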