38,865 research outputs found

    Learning-based Control of Unknown Linear Systems with Thompson Sampling

    Full text link
    We propose a Thompson sampling-based learning algorithm for the Linear Quadratic (LQ) control problem with unknown system parameters. The algorithm is called Thompson sampling with dynamic episodes (TSDE) where two stopping criteria determine the lengths of the dynamic episodes in Thompson sampling. The first stopping criterion controls the growth rate of episode length. The second stopping criterion is triggered when the determinant of the sample covariance matrix is less than half of the previous value. We show under some conditions on the prior distribution that the expected (Bayesian) regret of TSDE accumulated up to time T is bounded by O(\sqrt{T}). Here O(.) hides constants and logarithmic factors. This is the first O(\sqrt{T} ) bound on expected regret of learning in LQ control. By introducing a reinitialization schedule, we also show that the algorithm is robust to time-varying drift in model parameters. Numerical simulations are provided to illustrate the performance of TSDE

    Posterior Sampling for Large Scale Reinforcement Learning

    Full text link
    We propose a practical non-episodic PSRL algorithm that unlike recent state-of-the-art PSRL algorithms uses a deterministic, model-independent episode switching schedule. Our algorithm termed deterministic schedule PSRL (DS-PSRL) is efficient in terms of time, sample, and space complexity. We prove a Bayesian regret bound under mild assumptions. Our result is more generally applicable to multiple parameters and continuous state action problems. We compare our algorithm with state-of-the-art PSRL algorithms on standard discrete and continuous problems from the literature. Finally, we show how the assumptions of our algorithm satisfy a sensible parametrization for a large class of problems in sequential recommendations

    Learning to Optimize via Information-Directed Sampling

    Full text link
    We propose information-directed sampling -- a new approach to online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. We illustrate through simple analytic examples how information-directed sampling accounts for kinds of information that alternative approaches do not adequately address and that this can lead to dramatic performance gains. For the widely studied Bernoulli, Gaussian, and linear bandit problems, we demonstrate state-of-the-art simulation performance.Comment: arXiv admin note: substantial text overlap with arXiv:1403.534

    A Practical Method for Solving Contextual Bandit Problems Using Decision Trees

    Full text link
    Many efficient algorithms with strong theoretical guarantees have been proposed for the contextual multi-armed bandit problem. However, applying these algorithms in practice can be difficult because they require domain expertise to build appropriate features and to tune their parameters. We propose a new method for the contextual bandit problem that is simple, practical, and can be applied with little or no domain expertise. Our algorithm relies on decision trees to model the context-reward relationship. Decision trees are non-parametric, interpretable, and work well without hand-crafted features. To guide the exploration-exploitation trade-off, we use a bootstrapping approach which abstracts Thompson sampling to non-Bayesian settings. We also discuss several computational heuristics and demonstrate the performance of our method on several datasets.Comment: Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI 2017

    Generalized Thompson Sampling for Contextual Bandits

    Full text link
    Thompson Sampling, one of the oldest heuristics for solving multi-armed bandits, has recently been shown to demonstrate state-of-the-art performance. The empirical success has led to great interests in theoretical understanding of this heuristic. In this paper, we approach this problem in a way very different from existing efforts. In particular, motivated by the connection between Thompson Sampling and exponentiated updates, we propose a new family of algorithms called Generalized Thompson Sampling in the expert-learning framework, which includes Thompson Sampling as a special case. Similar to most expert-learning algorithms, Generalized Thompson Sampling uses a loss function to adjust the experts' weights. General regret bounds are derived, which are also instantiated to two important loss functions: square loss and logarithmic loss. In contrast to existing bounds, our results apply to quite general contextual bandits. More importantly, they quantify the effect of the "prior" distribution on the regret bounds

    Scalable Coordinated Exploration in Concurrent Reinforcement Learning

    Full text link
    We consider a team of reinforcement learning agents that concurrently operate in a common environment, and we develop an approach to efficient coordinated exploration that is suitable for problems of practical scale. Our approach builds on seed sampling (Dimakopoulou and Van Roy, 2018) and randomized value function learning (Osband et al., 2016). We demonstrate that, for simple tabular contexts, the approach is competitive with previously proposed tabular model learning methods (Dimakopoulou and Van Roy, 2018). With a higher-dimensional problem and a neural network value function representation, the approach learns quickly with far fewer agents than alternative exploration schemes.Comment: NIPS 201

    Bayesian model predictive control: Efficient model exploration and regret bounds using posterior sampling

    Full text link
    Tight performance specifications in combination with operational constraints make model predictive control (MPC) the method of choice in various industries. As the performance of an MPC controller depends on a sufficiently accurate objective and prediction model of the process, a significant effort in the MPC design procedure is dedicated to modeling and identification. Driven by the increasing amount of available system data and advances in the field of machine learning, data-driven MPC techniques have been developed to facilitate the MPC controller design. While these methods are able to leverage available data, they typically do not provide principled mechanisms to automatically trade off exploitation of available data and exploration to improve and update the objective and prediction model. To this end, we present a learning-based MPC formulation using posterior sampling techniques, which provides finite-time regret bounds on the learning performance while being simple to implement using off-the-shelf MPC software and algorithms. The performance analysis of the method is based on posterior sampling theory and its practical efficiency is illustrated using a numerical example of a highly nonlinear dynamical car-trailer system

    Estimation Considerations in Contextual Bandits

    Full text link
    Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We study a consideration for the exploration vs. exploitation framework that does not arise in multi-armed bandits but is crucial in contextual bandits; the way exploration and exploitation is conducted in the present affects the bias and variance in the potential outcome model estimation in subsequent stages of learning. We develop parametric and non-parametric contextual bandits that integrate balancing methods from the causal inference literature in their estimation to make it less prone to problems of estimation bias. We provide the first regret bound analyses for contextual bandits with balancing in the domain of linear contextual bandits that match the state of the art regret bounds. We demonstrate the strong practical advantage of balanced contextual bandits on a large number of supervised learning datasets and on a synthetic example that simulates model mis-specification and prejudice in the initial training data. Additionally, we develop contextual bandits with simpler assignment policies by leveraging sparse model estimation methods from the econometrics literature and demonstrate empirically that in the early stages they can improve the rate of learning and decrease regret

    Note on Thompson sampling for large decision problems

    Full text link
    There is increasing interest in using streaming data to inform decision making across a wide range of application domains including mobile health, food safety, security, and resource management. A decision support system formalizes online decision making as a map from up-to-date information to a recommended decision. Online estimation of an optimal decision strategy from streaming data requires simultaneous estimation of components of the underlying system dynamics as well as the optimal decision strategy given these dynamics; thus, there is an inherent trade-off between choosing decisions that lead to improved estimates and choosing decisions that appear to be optimal based on current estimates. Thompson (1933) was among the first to formalize this trade-off in the context of choosing between two treatments for a stream of patients; he proposed a simple heuristic wherein a treatment is selected randomly at each time point with selection probability proportional to the posterior probability that it is optimal. We consider a variant of Thompson sampling that is simple to implement and can be applied to large and complex decision problems. We show that the proposed Thompson sampling estimator is consistent for the optimal decision support system and provide rates of convergence and finite sample error bounds. The proposed algorithm is illustrated using an agent-based model of the spread of influenza on a network and management of mallard populations in the United States

    A Tour of Reinforcement Learning: The View from Continuous Control

    Full text link
    This manuscript surveys reinforcement learning from the perspective of optimization and control with a focus on continuous control applications. It surveys the general formulation, terminology, and typical experimental implementations of reinforcement learning and reviews competing solution paradigms. In order to compare the relative merits of various techniques, this survey presents a case study of the Linear Quadratic Regulator (LQR) with unknown dynamics, perhaps the simplest and best-studied problem in optimal control. The manuscript describes how merging techniques from learning theory and control can provide non-asymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when revisiting more complex applications, many of the observed phenomena in LQR persist. In particular, theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. This survey concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments and how tools from reinforcement learning and control might be combined to approach these challenges.Comment: minor revision with a few clarifying passages and corrected typo
    • …
    corecore