A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning
We present a tutorial on Bayesian optimization, a method of finding the
maximum of expensive cost functions. Bayesian optimization employs the Bayesian
technique of setting a prior over the objective function and combining it with
evidence to get a posterior function. This permits a utility-based selection of
the next observation to make on the objective function, which must take into
account both exploration (sampling from areas of high uncertainty) and
exploitation (sampling areas likely to offer improvement over the current best
observation). We also present two detailed extensions of Bayesian optimization,
with experiments---active user modelling with preferences, and hierarchical
reinforcement learning---and a discussion of the pros and cons of Bayesian
optimization based on our experiences.
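A minimal sketch of the loop the abstract describes, assuming a 1-D objective, an RBF-kernel Gaussian-process prior, and expected improvement as the acquisition function; the objective f below is a hypothetical stand-in, not from the paper:

```python
# Bayesian optimization sketch: GP posterior + expected improvement (EI).
import numpy as np
from scipy.stats import norm

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel between two 1-D point sets.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def gp_posterior(X, y, Xq, noise=1e-6):
    # GP posterior mean and standard deviation at query points Xq.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ y
    var = 1.0 - np.sum((Ks @ Kinv) * Ks, axis=1)  # prior variance is 1.0
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sd, best):
    # EI balances exploitation (mu - best) against exploration (sd).
    z = (mu - best) / sd
    return (mu - best) * norm.cdf(z) + sd * norm.pdf(z)

f = lambda x: np.sin(3 * x) + 0.5 * x        # hypothetical "expensive" objective
X = np.array([0.1, 0.9])                     # initial observations
y = f(X)
Xq = np.linspace(0.0, 1.0, 200)              # candidate query points
for _ in range(10):
    mu, sd = gp_posterior(X, y, Xq)
    x_next = Xq[np.argmax(expected_improvement(mu, sd, y.max()))]
    X, y = np.append(X, x_next), np.append(y, f(x_next))
print("best x:", X[np.argmax(y)], "best f(x):", y.max())
```

Each iteration conditions the prior on all observations so far, then queries the point whose expected improvement over the incumbent is highest.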
Bayesian Policy Gradients via Alpha Divergence Dropout Inference
Policy gradient methods have had great success in solving continuous control
tasks, yet the stochastic nature of such problems makes deterministic value
estimation difficult. We propose an approach which instead estimates a
distribution by fitting the value function with a Bayesian Neural Network. We
optimize an alpha-divergence objective with Bayesian dropout approximation
to learn and estimate this distribution. We show that using the Monte Carlo
posterior mean of the Bayesian value function distribution, rather than a
deterministic network, improves stability and performance of policy gradient
methods in continuous control MuJoCo simulations.
Comment: Accepted to the Bayesian Deep Learning Workshop at NIPS 2017
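A hedged sketch of the core mechanism as described: fit the value function with a dropout network and use the Monte Carlo posterior mean as the value estimate. The architecture, dropout rate, and sample count below are illustrative choices, not the authors' settings, and plain MC dropout stands in for the alpha-divergence training objective:

```python
# Monte Carlo dropout value estimate: average stochastic forward passes.
import torch
import torch.nn as nn

class DropoutValueNet(nn.Module):
    def __init__(self, obs_dim, hidden=64, p_drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs)

def mc_value(model, obs, n_samples=20):
    # Keep dropout active at inference so each pass draws from the
    # approximate posterior; the sample mean is the posterior-mean value.
    model.train()
    with torch.no_grad():
        samples = torch.stack([model(obs) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

value_fn = DropoutValueNet(obs_dim=11)  # e.g., the MuJoCo Hopper obs size
obs = torch.randn(32, 11)               # a batch of observations
v_mean, v_std = mc_value(value_fn, obs)
```

The returned standard deviation gives a per-state uncertainty signal alongside the mean value used by the policy gradient.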
Value-Distributional Model-Based Reinforcement Learning
Quantifying uncertainty about a policy's long-term performance is important
to solve sequential decision-making tasks. We study the problem from a
model-based Bayesian reinforcement learning perspective, where the goal is to
learn the posterior distribution over value functions induced by parameter
(epistemic) uncertainty of the Markov decision process. Previous work restricts
the analysis to a few moments of the distribution over values or imposes a
particular distribution shape, e.g., Gaussians. Inspired by distributional
reinforcement learning, we introduce a Bellman operator whose fixed point is
the value distribution function. Based on our theory, we propose Epistemic
Quantile-Regression (EQR), a model-based algorithm that learns a value
distribution function that can be used for policy optimization. Evaluation
across several continuous-control tasks shows performance benefits with respect
to established model-based and model-free algorithms.
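A minimal sketch of the quantile-regression ingredient: represent the value distribution at a state by a set of quantile estimates and fit them with the pinball loss against Bellman targets. The targets below are synthetic stand-ins for targets computed under sampled posterior MDP models (the source of epistemic spread); this is illustrative, not the EQR algorithm itself:

```python
# Quantile regression toward Bellman targets via the pinball loss.
import torch

n_quantiles = 8
taus = (torch.arange(n_quantiles, dtype=torch.float32) + 0.5) / n_quantiles
theta = torch.zeros(n_quantiles, requires_grad=True)  # quantile values, one state
opt = torch.optim.Adam([theta], lr=1e-2)

def pinball_loss(theta, targets, taus):
    # Asymmetric loss whose minimizer is the tau-quantile of the targets.
    u = targets[:, None] - theta[None, :]             # (batch, n_quantiles)
    return (torch.where(u >= 0, taus, taus - 1.0) * u).mean()

for _ in range(500):
    # Hypothetical Bellman targets r + gamma * V', varying across posterior
    # model samples; here simulated as a noisy target distribution.
    targets = 1.0 + 0.99 * torch.randn(64)
    loss = pinball_loss(theta, targets, taus)
    opt.zero_grad(); loss.backward(); opt.step()

print(theta.detach())  # approximates the quantiles of the target distribution
```

After training, theta approximates the quantile function of the value distribution, which a policy-optimization step could then consume.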