Learning-based Control of Unknown Linear Systems with Thompson Sampling
We propose a Thompson sampling-based learning algorithm for the Linear
Quadratic (LQ) control problem with unknown system parameters. The algorithm is
called Thompson sampling with dynamic episodes (TSDE), where two stopping
criteria determine the lengths of the dynamic episodes in Thompson sampling.
The first stopping criterion controls the growth rate of episode length. The
second stopping criterion is triggered when the determinant of the sample
covariance matrix is less than half of the previous value. We show under some
conditions on the prior distribution that the expected (Bayesian) regret of
TSDE accumulated up to time T is bounded by \tilde{O}(\sqrt{T}). Here
\tilde{O}(\cdot) hides constants and logarithmic factors. This is the first
\tilde{O}(\sqrt{T}) bound on
expected regret of learning in LQ control. By introducing a reinitialization
schedule, we also show that the algorithm is robust to time-varying drift in
model parameters. Numerical simulations are provided to illustrate the
performance of TSDE.
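The two stopping criteria are simple enough to sketch end to end. Below is a minimal, self-contained Python sketch for a scalar system, our illustration rather than the paper's implementation: the plant, the Gaussian posterior, the Riccati iteration, and the stabilizability guard are illustrative assumptions; only the two episode-switching tests follow the description above.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative scalar plant x' = a*x + b*u + w; (a, b) are unknown to the learner.
a_true, b_true, Q, R, noise = 0.9, 0.5, 1.0, 1.0, 0.1

def lqr_gain(a, b, iters=200):
    # Certainty-equivalent scalar Riccati iteration for the sampled parameters.
    p = Q
    for _ in range(iters):
        p = Q + a * a * p - (a * b * p) ** 2 / (R + b * b * p)
    return -(a * b * p) / (R + b * b * p)

def sample_params(mu, cov):
    # Crude stabilizability guard for this sketch; the paper instead places
    # conditions on the prior distribution.
    while True:
        a, b = rng.multivariate_normal(mu, cov)
        if abs(a) < 1.0 or abs(b) > 0.05:
            return a, b

# Gaussian posterior over theta = (a, b), updated by Bayesian linear regression.
mu, cov = np.zeros(2), np.eye(2)
x, t, T = 0.0, 0, 2000
t_start, prev_len = 0, 0
det_start = np.linalg.det(cov)
theta = sample_params(mu, cov)                 # Thompson sample for episode 1
gain = lqr_gain(*theta)

while t < T:
    u = gain * x
    z = np.array([x, u])
    x = a_true * x + b_true * u + noise * rng.standard_normal()
    prec_old = np.linalg.inv(cov)
    cov = np.linalg.inv(prec_old + np.outer(z, z) / noise**2)
    mu = cov @ (prec_old @ mu + z * x / noise**2)
    t += 1
    # Criterion 1: an episode may exceed the previous episode's length by at most one.
    # Criterion 2: det of the sample covariance halved since the episode started.
    if (t - t_start) > prev_len + 1 or np.linalg.det(cov) < 0.5 * det_start:
        prev_len, t_start = t - t_start, t
        det_start = np.linalg.det(cov)
        theta = sample_params(mu, cov)         # resample at the episode boundary
        gain = lqr_gain(*theta)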
A Tour of Reinforcement Learning: The View from Continuous Control
This manuscript surveys reinforcement learning from the perspective of
optimization and control with a focus on continuous control applications. It
covers the general formulation, terminology, and typical experimental
implementations of reinforcement learning and reviews competing solution
paradigms. In order to compare the relative merits of various techniques, this
survey presents a case study of the Linear Quadratic Regulator (LQR) with
unknown dynamics, perhaps the simplest and best-studied problem in optimal
control. The manuscript describes how merging techniques from learning theory
and control can provide non-asymptotic characterizations of LQR performance and
shows that these characterizations tend to match experimental behavior. In
turn, when revisiting more complex applications, many of the observed phenomena
in LQR persist. In particular, theory and experiment demonstrate the role and
importance of models and the cost of generality in reinforcement learning
algorithms. This survey concludes with a discussion of some of the challenges
in designing learning systems that safely and reliably interact with complex
and uncertain environments and how tools from reinforcement learning and
control might be combined to approach these challenges.
Comment: minor revision with a few clarifying passages and corrected typos
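The certainty-equivalence baseline at the heart of the survey's LQR case study fits in a few lines. The sketch below, whose toy double-integrator plant and noise levels are our assumptions rather than the survey's, estimates (A, B) by least squares from randomly excited rollouts and then solves the Riccati equation for the estimated model.

import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(1)

# Hypothetical discretized double integrator standing in for the unknown plant.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

# Roll out the system under random excitation and record (state, input) pairs.
X_next, Z = [], []
x = np.zeros(2)
for _ in range(500):
    u = rng.standard_normal(1)
    x_next = A @ x + B @ u + 0.01 * rng.standard_normal(2)
    Z.append(np.concatenate([x, u]))
    X_next.append(x_next)
    x = x_next

# Least-squares estimate of [A B] from the regression x_next ~ [x; u].
Z, X_next = np.asarray(Z), np.asarray(X_next)
AB_hat, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
A_hat, B_hat = AB_hat.T[:, :2], AB_hat.T[:, 2:]

# Certainty equivalence: plan as if the estimate were the truth.
P = solve_discrete_are(A_hat, B_hat, Q, R)
K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
print("estimated gain for u = -K x:", K)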
Extragradient method with variance reduction for stochastic variational inequalities
We propose an extragradient method with stepsizes bounded away from zero for
stochastic variational inequalities requiring only pseudo-monotonicity. We
provide convergence and complexity analyses, allowing for an unbounded feasible
set, an unbounded operator, and non-uniform variance of the oracle; moreover, we
do not require any regularization. Alongside the stochastic approximation
procedure, we iteratively reduce the variance of the stochastic error. Our
method attains the optimal oracle complexity O(\epsilon^{-2}) (up to
a logarithmic term) and a faster O(1/K) rate in terms of the mean
(quadratic) natural residual and the D-gap function, where K is the number of
iterations required for a given tolerance \epsilon. Such a convergence rate
represents an acceleration with respect to the stochastic error. The generated
sequence also enjoys a new feature: the sequence is bounded in L^p if the
stochastic error has a finite p-moment. Explicit estimates for the convergence
rate, the oracle complexity, and the p-moments are given, depending on problem
parameters and the distance of the initial iterate to the solution set. Moreover,
sharper constants are possible if the variance is uniform over the solution set
or the feasible set. Our results provide new classes of stochastic variational
inequalities for which a convergence rate of O(1/K) holds in terms
of the mean-squared distance to the solution set. Our analysis includes the
distributed solution of pseudo-monotone Cartesian variational inequalities
under partial coordination of parameters between users of a network.
Comment: 39 pages. To appear in SIAM Journal on Optimization (submitted July 2015, accepted December 2016). Uploaded to IMPA's preprint server at http://preprint.impa.br/visualizar?id=688
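To make the variance-reduction mechanism concrete, here is a minimal sketch of one possible instantiation: a plain extragradient step with a constant stepsize in which each oracle call is replaced by a mini-batch average whose size grows with the iteration count. The linear operator, the ball constraint, and the quadratic batch schedule are illustrative assumptions, not the paper's general setting.

import numpy as np

rng = np.random.default_rng(2)

# Illustrative monotone linear operator F(z) = M z (rotation plus a small
# symmetric part), observed only through a noisy oracle.
M = np.array([[0.0, 1.0], [-1.0, 0.0]]) + 0.1 * np.eye(2)

def oracle(z):
    return M @ z + 0.5 * rng.standard_normal(2)

def project(z, radius=10.0):
    # Projection onto a ball, standing in for a general feasible set.
    n = np.linalg.norm(z)
    return z if n <= radius else z * (radius / n)

z = np.array([5.0, -3.0])
alpha = 0.2                          # stepsize bounded away from zero
for k in range(1, 100):
    N_k = k ** 2                     # growing batch size drives the variance down
    F1 = np.mean([oracle(z) for _ in range(N_k)], axis=0)
    z_half = project(z - alpha * F1)                 # extrapolation step
    F2 = np.mean([oracle(z_half) for _ in range(N_k)], axis=0)
    z = project(z - alpha * F2)                      # update step
print("approximate solution:", z)    # the unique solution here is z* = 0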
Estimation Considerations in Contextual Bandits
Contextual bandit algorithms are sensitive to the estimation method of the
outcome model as well as the exploration method used, particularly in the
presence of rich heterogeneity or complex outcome models, which can lead to
difficult estimation problems along the path of learning. We study a
consideration for the exploration vs. exploitation framework that does not
arise in multi-armed bandits but is crucial in contextual bandits: the way
exploration and exploitation are conducted in the present affects the bias and
variance of the potential outcome model estimation in subsequent stages of
learning. We develop parametric and non-parametric contextual bandits that
integrate balancing methods from the causal inference literature into their
estimation, making it less prone to estimation bias. We provide the
first regret bound analyses for contextual bandits with balancing in the domain
of linear contextual bandits that match the state-of-the-art regret bounds. We
demonstrate the strong practical advantage of balanced contextual bandits on a
large number of supervised learning datasets and on a synthetic example that
simulates model mis-specification and prejudice in the initial training data.
Additionally, we develop contextual bandits with simpler assignment policies by
leveraging sparse model estimation methods from the econometrics literature and
demonstrate empirically that in the early stages they can improve the rate of
learning and decrease regret.
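A minimal sketch of the balancing idea in the linear case: record the assignment probability of the chosen arm at decision time, and weight that observation by its inverse in the arm's ridge regression, so that contexts an arm was unlikely to receive are not under-represented in its model. The epsilon-greedy assignment rule and the synthetic linear rewards are our illustrative assumptions; the paper's balancing estimators are more elaborate.

import numpy as np

rng = np.random.default_rng(3)
d, n_arms, T, eps = 5, 3, 2000, 0.1
theta_true = rng.standard_normal((n_arms, d))    # hidden reward parameters

# Per-arm inverse-propensity-weighted ridge statistics: A = I + X'WX, b = X'Wy.
A = np.stack([np.eye(d)] * n_arms)
b = np.zeros((n_arms, d))

for t in range(T):
    x = rng.standard_normal(d)
    theta_hat = np.stack([np.linalg.solve(A[a], b[a]) for a in range(n_arms)])
    greedy = int(np.argmax(theta_hat @ x))
    # Epsilon-greedy assignment; the propensities are known at decision time.
    probs = np.full(n_arms, eps / n_arms)
    probs[greedy] += 1 - eps
    arm = rng.choice(n_arms, p=probs)
    reward = theta_true[arm] @ x + 0.1 * rng.standard_normal()
    # Balancing: reweight the observation by the inverse assignment probability.
    w = 1.0 / probs[arm]
    A[arm] += w * np.outer(x, x)
    b[arm] += w * reward * x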
Spectral approximation properties of isogeometric analysis with variable continuity
We study the spectral approximation properties of isogeometric analysis with
local continuity reduction of the basis. Such continuity reduction decreases the
interconnection between the degrees of freedom of the mesh, which allows for
large savings in computational cost during the solution of the resulting linear
system. The continuity reduction also introduces extra degrees of freedom that
modify the approximation properties of the
method. The convergence rate of such refined isogeometric analysis is
equivalent to that of the maximum continuity basis. We show how the breaks in
continuity and inhomogeneity of the basis lead to artefacts in the frequency
spectra, such as stopping bands and outliers, and present a unified description
of these effects in finite element method, isogeometric analysis, and refined
isogeometric analysis. Accuracy of the refined isogeometric analysis
approximations can be improved by using non-standard quadrature rules. In
particular, optimal quadrature rules lead to large reductions in the eigenvalue
errors and yield two extra orders of convergence, similar to classical
isogeometric analysis.
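The kind of spectral comparison described above can be reproduced in miniature with classical linear finite elements on the 1D Laplace eigenproblem, where the discrete generalized eigenvalues of K u = \lambda M u are compared against the exact values (\pi k)^2. The sketch below only illustrates the eigenvalue-error behaviour across the spectrum; the variable-continuity isogeometric bases and optimal quadrature rules of the paper are beyond it.

import numpy as np
from scipy.linalg import eigh

# -u'' = lambda * u on (0, 1) with u(0) = u(1) = 0, discretized by linear
# finite elements: stiffness K and consistent mass M on a uniform mesh.
n = 100                                # interior nodes
h = 1.0 / (n + 1)
K = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h
M = (4.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)) * h / 6.0

lam = eigh(K, M, eigvals_only=True)    # discrete spectrum, ascending
exact = (np.pi * np.arange(1, n + 1)) ** 2
rel_err = lam / exact - 1.0
# The relative eigenvalue error grows toward the high-frequency end of the
# spectrum; this is the branch where stopping bands and outliers appear for
# bases with reduced continuity.
print(rel_err[[0, n // 2, n - 1]])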
Optimal Reinforcement Learning for Gaussian Systems
The exploration-exploitation trade-off is among the central challenges of
reinforcement learning. The optimal Bayesian solution is intractable in
general. This paper studies to what extent analytic statements about optimal
learning are possible if all beliefs are Gaussian processes. A first order
approximation of learning of both loss and dynamics, for nonlinear,
time-varying systems in continuous time and space, subject to a relatively weak
restriction on the dynamics, is described by an infinite-dimensional partial
differential equation. An approximate finite-dimensional projection gives an
impression of how this result may be helpful.
Comment: final pre-conference version of this NIPS 2011 paper. Once again,
please note some nontrivial changes to the exposition and interpretation of the
results, in particular in Equation (9) and Eqs. 11-14. The algorithm and
results have remained the same, but their theoretical interpretation has
changed.
Horde of Bandits using Gaussian Markov Random Fields
The gang of bandits (GOB) model (Cesa-Bianchi et al., 2013) is a recent contextual
bandits framework that shares information between a set of bandit problems,
related by a known (possibly noisy) graph. This model is useful in problems
like recommender systems where the large number of users makes it vital to
transfer information between users. Despite its effectiveness, the existing GOB
model can only be applied to small problems due to its quadratic
time-dependence on the number of nodes. Existing solutions to combat the
scalability issue require an often-unrealistic clustering assumption. By
exploiting a connection to Gaussian Markov random fields (GMRFs), we show that
the GOB model can be made to scale to much larger graphs without additional
assumptions. In addition, we propose a Thompson sampling algorithm which uses
the recent GMRF sampling-by-perturbation technique, allowing it to scale to
even larger problems (leading to a "horde" of bandits). We give regret bounds
and experimental results for GOB with Thompson sampling and epoch-greedy
algorithms, indicating that these methods are as good as or significantly
better than ignoring the graph or adopting a clustering-based approach.
Finally, when an existing graph is not available, we propose a heuristic for
learning it on the fly and show promising results.
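The sampling-by-perturbation step admits a short sketch. For a Gaussian posterior with sparse precision A = Lambda + Phi^T Phi / sigma^2 and linear term b = Phi^T y / sigma^2, an exact sample is theta = A^{-1}(b + c) with c ~ N(0, A), and c can be assembled term by term without factorizing A, because a graph-Laplacian prior factors through the incidence matrix. The chain graph, ridge term, and toy observations below are our illustrative assumptions.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(4)

# Chain graph over n users; the graph Laplacian factors as E.T @ E with E the
# edge-node incidence matrix.
n = 200
rows = np.arange(n - 1)
E = (sp.csr_matrix((np.ones(n - 1), (rows, rows)), shape=(n - 1, n))
     - sp.csr_matrix((np.ones(n - 1), (rows, rows + 1)), shape=(n - 1, n)))
ridge = 0.1
Lambda = ridge * sp.eye(n) + E.T @ E            # sparse prior precision

# Toy observations: noisy scalar rewards for a subset of users.
m, sigma = 50, 0.5
obs = rng.choice(n, size=m, replace=False)
Phi = sp.csr_matrix((np.ones(m), (np.arange(m), obs)), shape=(m, n))
y = rng.standard_normal(m)

A = Lambda + (Phi.T @ Phi) / sigma**2           # posterior precision (sparse)
b = (Phi.T @ y) / sigma**2

# Perturbation sample c ~ N(0, A), assembled from the three precision terms;
# one sparse solve then yields an exact posterior (Thompson) sample.
c = (np.sqrt(ridge) * rng.standard_normal(n)
     + E.T @ rng.standard_normal(n - 1)
     + (Phi.T @ rng.standard_normal(m)) / sigma)
theta = spsolve(sp.csc_matrix(A), b + c)        # Thompson sample over all users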
Exploration versus exploitation in reinforcement learning: a stochastic control approach
We consider reinforcement learning (RL) in continuous time and study the
problem of achieving the best trade-off between exploration of a black box
environment and exploitation of current knowledge. We propose an
entropy-regularized reward function involving the differential entropy of the
distributions of actions, and motivate and devise an exploratory formulation
for the state dynamics that captures repetitive learning under exploration.
The resulting optimization problem is a revitalization of the classical relaxed
stochastic control. We carry out a complete analysis of the problem in the
linear--quadratic (LQ) setting and deduce that the optimal feedback control
distribution for balancing exploitation and exploration is Gaussian. This in
turn interprets and justifies the widely adopted Gaussian exploration in RL,
beyond its simplicity for sampling. Moreover, the exploitation and exploration
are captured, respectively and mutually exclusively, by the mean and variance of
the Gaussian distribution. We also find that a more random environment contains
more learning opportunities in the sense that less exploration is needed. We
characterize the cost of exploration, which, for the LQ case, is shown to be
proportional to the entropy regularization weight and inversely proportional to
the discount rate. Finally, as the weight of exploration decays to zero, we
prove the convergence of the solution of the entropy-regularized LQ problem to
that of the classical LQ problem.
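Why a Gaussian emerges can be seen in one standard step (scalar notation ours, not the paper's): maximizing an expected reward plus a differential-entropy bonus over action densities yields a Gibbs distribution, which is Gaussian whenever the reward is quadratic in the action:

\max_{\pi} \int f(u)\,\pi(u)\,du + \lambda\,\mathcal{H}(\pi)
    \;\Longrightarrow\; \pi^*(u) \propto \exp\big(f(u)/\lambda\big),

and for f(u) = -a u^2 + b u with a > 0,

\pi^*(u) = \mathcal{N}\Big(\frac{b}{2a},\; \frac{\lambda}{2a}\Big).

The mean b/(2a) does not depend on \lambda (pure exploitation), while the variance \lambda/(2a) scales with the exploration weight, consistent with the mean/variance separation and the exploration cost described above.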
Posterior Sampling for Large Scale Reinforcement Learning
We propose a practical non-episodic PSRL algorithm that, unlike recent
state-of-the-art PSRL algorithms, uses a deterministic, model-independent
episode-switching schedule. Our algorithm, termed deterministic schedule PSRL
(DS-PSRL), is efficient in terms of time, sample, and space complexity. We prove
a Bayesian regret bound under mild assumptions. Our result is more generally
applicable to multiple parameters and continuous state action problems. We
compare our algorithm with state-of-the-art PSRL algorithms on standard
discrete and continuous problems from the literature. Finally, we show how the
assumptions of our algorithm are satisfied by a sensible parametrization for a
large class of problems in sequential recommendations.
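A self-contained sketch of the deterministic-schedule idea on a toy two-state MDP: one posterior sample and one planning step per episode, with episode k lasting k steps. The MDP, the Dirichlet posterior over transitions, and the linear schedule are illustrative assumptions, not the paper's exact construction.

import numpy as np

rng = np.random.default_rng(5)

# Toy 2-state, 2-action MDP; transitions are unknown, rewards are known.
P_true = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.7, 0.3], [0.05, 0.95]]])   # P_true[s, a] = next-state dist
R = np.array([[0.0, 0.1], [1.0, 0.5]])            # R[s, a]

def greedy_policy(P, gamma=0.95, iters=500):
    # Value iteration on the sampled model; returns the greedy policy.
    V = np.zeros(2)
    for _ in range(iters):
        V = (R + gamma * (P @ V)).max(axis=1)
    return (R + gamma * (P @ V)).argmax(axis=1)

counts = np.ones((2, 2, 2))     # Dirichlet(1, 1) posterior via transition counts
s, t, T, k = 0, 0, 5000, 0
while t < T:
    k += 1
    # One posterior sample per episode; replanning happens only at episode
    # boundaries, on the deterministic schedule "episode k lasts k steps".
    P_sample = np.array([[rng.dirichlet(counts[si, ai]) for ai in range(2)]
                         for si in range(2)])
    policy = greedy_policy(P_sample)
    for _ in range(k):
        a = policy[s]
        s_next = rng.choice(2, p=P_true[s, a])
        counts[s, a, s_next] += 1
        s = s_next
        t += 1
        if t >= T:
            break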
Probabilistic Programming with Gaussian Process Memoization
Gaussian Processes (GPs) are widely used tools in statistics, machine
learning, robotics, computer vision, and scientific computation. However,
despite their popularity, they can be difficult to apply; all but the simplest
classification or regression applications require specification and inference
over complex covariance functions that do not admit simple analytical
posteriors. This paper shows how to embed Gaussian processes in any
higher-order probabilistic programming language, using an idiom based on
memoization, and demonstrates its utility by implementing and extending classic
and state-of-the-art GP applications. The interface to Gaussian processes,
called gpmem, takes an arbitrary real-valued computational process as input and
returns a statistical emulator that automatically improves as the original
process is invoked and its input-output behavior is recorded. The flexibility
of gpmem is illustrated via three applications: (i) robust GP regression with
hierarchical hyper-parameter learning, (ii) discovering symbolic expressions
from time-series data by fully Bayesian structure learning over kernels
generated by a stochastic grammar, and (iii) a bandit formulation of Bayesian
optimization with automatic inference and action selection. All applications
share a single 50-line Python library and require fewer than 20 lines of
probabilistic code each.
Comment: 36 pages, 9 figures
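The memoization idiom itself is easy to sketch outside a probabilistic programming language. The scikit-learn version below, our illustration rather than the paper's gpmem, wraps a real-valued process and refits a GP emulator every time the process is invoked and its input-output behavior is recorded.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

class GPMemo:
    # Wrap a real-valued process f; every invocation records (x, f(x)) and
    # refits the GP, so the emulator improves as the process is exercised.
    def __init__(self, f):
        self.f = f
        self.X, self.y = [], []
        self.gp = GaussianProcessRegressor(
            kernel=RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-4))

    def __call__(self, x):
        y = self.f(x)
        self.X.append([x])
        self.y.append(y)
        self.gp.fit(np.array(self.X), np.array(self.y))
        return y

    def emulate(self, x):
        # Query the statistical emulator instead of the underlying process.
        mean, std = self.gp.predict(np.array([[x]]), return_std=True)
        return mean[0], std[0]

g = GPMemo(lambda x: np.sin(3 * x) + 0.1 * np.random.randn())
for x in np.linspace(0.0, 2.0, 15):
    g(x)                         # each call enriches the emulator
print(g.emulate(1.0))            # posterior mean and uncertainty at x = 1.0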