19 research outputs found

    A Theory of Model Selection in Reinforcement Learning

    Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to accomplish sequential decision-making tasks from experience. Applications of RL are found in robotics and control, dialog systems, medical treatment, etc. Despite the generality of the framework, most empirical successes of RL to date are restricted to simulated environments, where hyperparameters are tuned by trial and error using large amounts of data. In contrast, collecting data with active intervention in the real world can be costly, time-consuming, and sometimes unsafe. Choosing the hyperparameters and understanding their effects in the face of these data limitations, i.e., model selection, is an important yet open direction that we need to study to enable such applications of RL, and it is the main theme of this thesis. More concretely, this thesis presents theoretical results that improve our understanding of three hyperparameters in RL: planning horizon, state representation (abstraction), and reward function. The first part of the thesis focuses on the interplay between the planning horizon and a limited amount of data, and establishes a formal explanation for how a long planning horizon can cause overfitting. The second part considers the problem of choosing the right state abstraction using limited batch data; I show that cross-validation-type methods require importance sampling and suffer from exponential variance, while a novel regularization-based algorithm enjoys an oracle-like property. The third part investigates reward misspecification and tries to resolve it by leveraging expert demonstrations; it is inspired by AI safety concerns and bears close connections to inverse reinforcement learning. A recurring theme of the thesis is the deployment of formulations and techniques from other areas of machine learning theory (mostly statistical learning theory): the planning-horizon work explains the overfitting phenomenon by making a formal analogy to empirical risk minimization and by proving planning loss bounds that are similar to generalization error bounds; the main result in the abstraction-selection work takes the form of an oracle inequality, a concept from structural risk minimization for model selection in supervised learning; the inverse RL work provides a mistake-bound-type analysis under arbitrarily chosen environments, which can be viewed as a form of no-regret learning. Overall, by borrowing ideas from mature theories of machine learning, we can develop analogies for RL that allow us to better understand the impact of hyperparameters, and develop algorithms that set them automatically in an effective manner.
    Ph.D. thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/138518/1/nanjiang_1.pd
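
    The overfitting claim about long planning horizons can be illustrated with a small experiment. The sketch below is our own toy construction, not the thesis's experiments; the MDP, sample sizes, and the gamma_plan/gamma_eval names are all hypothetical. It plans in a model estimated from a handful of transitions using a guidance discount factor, then evaluates the resulting greedy policy in the true MDP.

# Toy illustration: planning with a shorter horizon in a model learned from
# little data can act as regularization. All quantities here are made up.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma_eval = 5, 2, 0.99

# Hypothetical true MDP: random transition kernel P[a, s, s'] and reward table R[s, a].
P = rng.dirichlet(np.ones(S), size=(A, S))
R = rng.uniform(size=(S, A))

def value_iteration(P, R, gamma, iters=500):
    # Approximate optimal Q-values for the given model and discount factor.
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q

def policy_value(P, R, pi, gamma):
    # Average state value of the deterministic policy pi in the true MDP.
    Ppi = P[pi, np.arange(S), :]
    Rpi = R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * Ppi, Rpi).mean()

def estimate_model(n):
    # Empirical transition model from n sampled transitions per (s, a),
    # with add-one smoothing so no row is all zeros.
    P_hat = np.ones((A, S, S))
    for a in range(A):
        for s in range(S):
            next_states = rng.choice(S, size=n, p=P[a, s])
            np.add.at(P_hat[a, s], next_states, 1)
    return P_hat / P_hat.sum(axis=2, keepdims=True)

P_hat = estimate_model(n=10)  # deliberately small data set
for gamma_plan in (0.5, 0.9, 0.99):
    pi = value_iteration(P_hat, R, gamma_plan).argmax(axis=1)
    print(gamma_plan, round(policy_value(P, R, pi, gamma_eval), 3))

    In runs of this toy, a shorter guidance horizon often matches or beats planning with the full evaluation horizon when data are scarce, which is the qualitative effect the first part of the thesis formalizes.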

    A Dantzig Selector Approach to Temporal Difference Learning

    LSTD is a popular algorithm for value function approximation. Whenever the number of features is larger than the number of samples, it must be paired with some form of regularization. In particular, L1-regularization methods tend to perform feature selection by promoting sparsity and are thus well suited to high-dimensional problems. However, since LSTD is not a simple regression algorithm but solves a fixed-point problem, its integration with L1-regularization is not straightforward and might come with some drawbacks (e.g., the P-matrix assumption for LASSO-TD). In this paper, we introduce a novel algorithm obtained by integrating LSTD with the Dantzig Selector. We investigate the performance of the proposed algorithm and its relationship with existing regularized approaches, and show how it addresses some of their drawbacks.
    Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
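
    As a rough sketch of the idea (not the paper's implementation; the synthetic data, the regularization level lam, and the use of scipy's LP solver are our own choices), a Dantzig-Selector variant of LSTD can be posed as a linear program: minimize the L1 norm of theta subject to the LSTD residual A*theta - b being small in the max norm, where A = Phi^T (Phi - gamma * Phi') and b = Phi^T r are the usual LSTD quantities.

# Minimal sketch of a Dantzig-Selector-style LSTD estimate on synthetic data.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, d, gamma, lam = 50, 100, 0.95, 0.1      # fewer samples than features

Phi = rng.normal(size=(n, d))              # features of current states
Phi_next = rng.normal(size=(n, d))         # features of next states
r = rng.normal(size=n)                     # observed rewards

A = Phi.T @ (Phi - gamma * Phi_next) / n   # LSTD matrix
b = Phi.T @ r / n                          # LSTD vector

# LP: write theta = u - v with u, v >= 0; minimize sum(u + v)
# subject to  A(u - v) - b <= lam  and  -(A(u - v) - b) <= lam.
c = np.ones(2 * d)
A_ub = np.block([[A, -A], [-A, A]])
b_ub = np.concatenate([lam + b, lam - b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
assert res.success, res.message

theta = res.x[:d] - res.x[d:]
print("nonzero coefficients:", np.count_nonzero(np.abs(theta) > 1e-8))

    With more features than samples the plain LSTD system is ill-posed; here the max-norm constraint keeps the estimate consistent with the data while the L1 objective promotes sparsity, which is the behaviour the abstract attributes to L1-style regularization.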

    Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

    We address the problem of automatic feature generation for value function approximation. Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast, and robust algorithm based on random projections to generate BEBFs for sparse feature spaces. We provide a finite-sample analysis of the proposed method and prove that projections logarithmic in the dimension of the original space are enough to guarantee contraction in the error. Empirical results demonstrate the strength of this method.
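
    A simplified sketch of the recipe described above (our own reconstruction under stated assumptions, not the authors' code): project the sparse raw features with a random Gaussian matrix whose width is much smaller than the ambient dimension, estimate the Bellman (TD) residual of the current value function, regress the residual on the projected features, and append the fitted predictor as a new basis function. Sizes, densities, and the plain LSTD refit are placeholders.

# Sketch: growing a value-function basis with BEBFs built from random projections.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)
n, D, k, gamma = 200, 5000, 60, 0.95       # k is much smaller than D

X = sparse.random(n, D, density=0.01, format="csr", random_state=3)       # sparse features, current states
X_next = sparse.random(n, D, density=0.01, format="csr", random_state=4)  # sparse features, next states
r = rng.normal(size=n)                                                    # rewards

Proj = rng.normal(scale=1.0 / np.sqrt(k), size=(D, k))  # random projection matrix
Z, Z_next = X @ Proj, X_next @ Proj                     # dense n x k projections

def add_bebf(F, F_next, w):
    # Estimate the TD residual of the current approximation F @ w and
    # regress it on the projected features to obtain one new basis function.
    residual = r + gamma * (F_next @ w) - (F @ w)
    beta, *_ = np.linalg.lstsq(Z, residual, rcond=None)
    return (np.column_stack([F, Z @ beta]),
            np.column_stack([F_next, Z_next @ beta]))

# Start from a constant feature and grow the basis, refitting weights by LSTD.
F, F_next = np.ones((n, 1)), np.ones((n, 1))
w = np.zeros(1)
for _ in range(5):
    F, F_next = add_bebf(F, F_next, w)
    A_lstd = F.T @ (F - gamma * F_next)
    w = np.linalg.solve(A_lstd + 1e-6 * np.eye(F.shape[1]), F.T @ r)
print("basis size:", F.shape[1])

    In the paper's analysis, a projection dimension logarithmic in the original dimension suffices to guarantee contraction of the error; the fixed k above is just a placeholder.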

    A Dantzig Selector for Temporal Difference Learning

    In reinforcement learning, LSTD is one of the most popular value function approximation algorithms. When there are more basis functions than samples, a problem arises that can be handled by combining LSTD with some form of regularization. In particular, L1-regularization methods tend to select basis functions (by promoting sparse solutions) and are therefore well suited to high-dimensional problems. However, LSTD is not a simple regression algorithm; it solves a fixed-point problem, so integrating L1-regularization is not straightforward and can bring certain drawbacks (such as the P-matrix assumption for LASSO-TD). This contribution introduces a new algorithm that integrates LSTD with the Dantzig Selector, generalizing the latter to temporal difference learning. In particular, we study the performance of the proposed algorithm and its connection to state-of-the-art approaches, notably how it overcomes some drawbacks of existing solutions.