3,627 research outputs found
Control Regularization for Reduced Variance Reinforcement Learning
Dealing with high variance is a significant challenge in model-free
reinforcement learning (RL). Existing methods are unreliable, exhibiting high
variance in performance from run to run using different initializations/seeds.
Focusing on problems arising in continuous control, we propose a functional
regularization approach to augmenting model-free RL. In particular, we
regularize the behavior of the deep policy to be similar to a policy prior,
i.e., we regularize in function space. We show that functional regularization
yields a bias-variance trade-off, and propose an adaptive tuning strategy to
optimize this trade-off. When the policy prior has control-theoretic stability
guarantees, we further show that this regularization approximately preserves
those stability guarantees throughout learning. We validate our approach
empirically on a range of settings, and demonstrate significantly reduced
variance, guaranteed dynamic stability, and more efficient learning than deep
RL alone.Comment: Appearing in ICML 201
Bayesian Policy Gradients via Alpha Divergence Dropout Inference
Policy gradient methods have had great success in solving continuous control
tasks, yet the stochastic nature of such problems makes deterministic value
estimation difficult. We propose an approach which instead estimates a
distribution by fitting the value function with a Bayesian Neural Network. We
optimize an -divergence objective with Bayesian dropout approximation
to learn and estimate this distribution. We show that using the Monte Carlo
posterior mean of the Bayesian value function distribution, rather than a
deterministic network, improves stability and performance of policy gradient
methods in continuous control MuJoCo simulations.Comment: Accepted to Bayesian Deep Learning Workshop at NIPS 201
Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch learning
from logged bandit feedback. This learning setting is ubiquitous in online
systems (e.g., ad placement, web search, recommendation), where an algorithm
makes a prediction (e.g., ad ranking) for a given input (e.g., query) and
observes bandit feedback (e.g., user clicks on presented ads). We first address
the counterfactual nature of the learning problem through propensity scoring.
Next, we prove generalization error bounds that account for the variance of the
propensity-weighted empirical risk estimator. These constructive bounds give
rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM
can be used to derive a new learning method -- called Policy Optimizer for
Exponential Models (POEM) -- for learning stochastic linear rules for
structured output prediction. We present a decomposition of the POEM objective
that enables efficient stochastic gradient optimization. POEM is evaluated on
several multi-label classification problems showing substantially improved
robustness and generalization performance compared to the state-of-the-art.Comment: 10 page
- …