
    Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

    We address the problem of automatic generation of features for value function approximation. Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast, and robust algorithm based on random projections to generate BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space are enough to guarantee contraction in the error. Empirical results demonstrate the strength of this method.
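    A minimal sketch of the idea, assuming a batch of transitions under a fixed policy represented by dense feature matrices Phi (current states), Phi_next (next states) and rewards r; the initial value fit, the projection width, and the regression used to fit the Bellman error are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def add_bebf(Phi, r, Phi_next, gamma=0.99, proj_dim=None, rng=None):
    """Append one Bellman-error basis function (BEBF) to the feature matrices."""
    rng = np.random.default_rng(rng)
    n, d = Phi.shape
    # Number of projections logarithmic in the original dimension (illustrative choice).
    proj_dim = proj_dim or max(1, int(np.ceil(np.log2(d))))

    # Crude value estimate on the existing features, standing in for the
    # current policy-evaluation iterate.
    w, *_ = np.linalg.lstsq(Phi, r, rcond=None)
    bellman_err = r + gamma * Phi_next @ w - Phi @ w  # sample Bellman errors

    # Random projection of the (possibly sparse) features down to proj_dim dimensions.
    R = rng.normal(0.0, 1.0 / np.sqrt(proj_dim), size=(d, proj_dim))

    # Regress the Bellman error in the projected space; the fitted function
    # becomes the new basis function, evaluated on current and next states.
    beta, *_ = np.linalg.lstsq(Phi @ R, bellman_err, rcond=None)
    return (np.column_stack([Phi, Phi @ R @ beta]),
            np.column_stack([Phi_next, Phi_next @ R @ beta]))
```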

    Cover Tree Bayesian Reinforcement Learning

    This paper proposes an online tree-based Bayesian approach for reinforcement learning. For inference, we employ a generalised context tree model. This defines a distribution on multivariate Gaussian piecewise-linear models, which can be updated in closed form. The tree structure itself is constructed using the cover tree method, which remains efficient in high dimensional spaces. We combine the model with Thompson sampling and approximate dynamic programming to obtain effective exploration policies in unknown environments. The flexibility and computational simplicity of the model render it suitable for many reinforcement learning problems in continuous state spaces. We demonstrate this in an experimental comparison with least squares policy iteration.
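    A minimal sketch of the Thompson-sampling loop described above, with the cover-tree/context-tree Gaussian model replaced by simple Dirichlet transition posteriors and empirical reward means in a small tabular setting, purely to illustrate the sample-model / solve / act cycle; all names, priors and the env_step callback are illustrative assumptions.

```python
import numpy as np

def thompson_episode(env_step, n_states, n_actions, trans_counts, rew_sum, rew_cnt,
                     gamma=0.95, horizon=200, rng=None):
    """Run one episode of Thompson sampling on a tabular MDP posterior."""
    rng = np.random.default_rng(rng)

    # 1) Sample a plausible MDP from the posterior statistics.
    P = np.stack([[rng.dirichlet(trans_counts[s, a] + 1.0)
                   for a in range(n_actions)] for s in range(n_states)])
    visits = np.maximum(rew_cnt, 1)
    R = rew_sum / visits + rng.normal(0.0, 1.0 / np.sqrt(visits))

    # 2) Solve the sampled MDP with value iteration (approximate DP stand-in).
    V = np.zeros(n_states)
    for _ in range(200):
        V = (R + gamma * P @ V).max(axis=1)
    policy = (R + gamma * P @ V).argmax(axis=1)

    # 3) Act greedily with respect to the sampled model and update the posterior.
    s = 0
    for _ in range(horizon):
        a = policy[s]
        s_next, r, done = env_step(s, a)  # hypothetical environment callback
        trans_counts[s, a, s_next] += 1
        rew_sum[s, a] += r
        rew_cnt[s, a] += 1
        s = s_next
        if done:
            break
```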

    The Military Inventory Routing Problem: Utilizing Heuristics within a Least Squares Temporal Differences Algorithm to Solve a Multiclass Stochastic Inventory Routing Problem with Vehicle Loss

    Military commanders currently resupply forward operating bases (FOBs) from a central location within an area of operations, mainly via convoy operations, in a way that closely resembles vendor-managed inventory practices. Commanders must decide when and how much inventory to distribute throughout their area of operations while minimizing soldier risk. Because of the dangers of convoy operations, technology now exists that makes cargo unmanned aerial vehicles (CUAVs) an attractive resupply alternative. However, enemy actions in wartime environments pose a significant risk to a CUAV's ability to safely deliver supplies to a FOB. We develop a Markov decision process (MDP) model to examine this military inventory routing problem (MILIRP). In our first paper we examine the structure of the MILIRP on a small problem instance and prove that the value function is monotone when a sufficient penalty is applied. Moreover, we develop a monotone least squares temporal differences (MLSTD) algorithm that exploits this structure and demonstrate its efficacy for approximately solving this problem class. We compare MLSTD to least squares temporal differences (LSTD), a similar ADP algorithm that does not exploit monotonicity. MLSTD attains a 3.05% optimality gap for a baseline scenario and outperforms LSTD by 31.86% on average in our computational experiments. Our second paper expands the problem with additional FOBs. We develop two new algorithms for the routing portion, Index and Rollout, and implement an LSTD algorithm that utilizes them to produce solutions that are, on average, 22% better than myopically generated solutions. Our third paper greatly increases problem complexity by adding supply classes. We formulate an MDP model to handle the increased complexity and implement LSTD-Index and LSTD-Rollout algorithms that solve this larger problem instance and perform 21% better on average than a myopic policy.
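    Since all three papers build on least squares temporal differences, a minimal batch LSTD(0) sketch may help fix ideas; the monotone projection via isotonic regression is only a rough stand-in for the paper's MLSTD construction, and the matrices Phi and Phi_next, the rewards r, and the inventory_level vector are illustrative assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def lstd_weights(Phi, r, Phi_next, gamma=0.95, ridge=1e-6):
    """Batch LSTD(0): solve A w = b with A = Phi^T (Phi - gamma Phi'), b = Phi^T r."""
    A = Phi.T @ (Phi - gamma * Phi_next) + ridge * np.eye(Phi.shape[1])
    b = Phi.T @ r
    return np.linalg.solve(A, b)

def monotone_values(Phi, w, inventory_level):
    """Project the fitted values onto functions nondecreasing in inventory level
    (an illustrative monotone projection, not the paper's exact MLSTD step)."""
    v = Phi @ w
    iso = IsotonicRegression(increasing=True)
    return iso.fit_transform(inventory_level, v)
```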

    Policy evaluation with temporal differences: a survey and comparison

    Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value function, the quality assessment of states for a given policy, which can be used in a policy improvement step. Since the late 1980s, this research area has been dominated by temporal-difference (TD) methods due to their data efficiency. However, core issues such as stability guarantees in the off-policy scenario, improved sample efficiency and probabilistic treatment of the uncertainty in the estimates have only been tackled recently, which has led to a large number of new approaches. This paper aims at making these new developments accessible in a concise overview, with a focus on the underlying cost functions, the off-policy scenario, and regularization in high-dimensional feature spaces. By presenting the first extensive, systematic comparative evaluation of TD, LSTD, LSPE, FPKF, the residual-gradient algorithm, Bellman residual minimization, GTD, GTD2 and TDC, we shed light on the strengths and weaknesses of the methods. Moreover, we present alternative versions of LSTD and LSPE with drastically improved off-policy performance.
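    For readers unfamiliar with the baseline that the surveyed methods extend, a minimal TD(0) policy-evaluation sketch with linear function approximation is given below; the transition format and the feature map phi are illustrative assumptions.

```python
import numpy as np

def td0_evaluate(transitions, phi, n_features, gamma=0.99, alpha=0.05):
    """TD(0) policy evaluation with linear features.

    transitions: iterable of (state, reward, next_state, done) generated
    under the target policy; phi maps a state to a length-n_features vector.
    """
    w = np.zeros(n_features)
    for s, r, s_next, done in transitions:
        v = phi(s) @ w
        v_next = 0.0 if done else phi(s_next) @ w
        td_error = r + gamma * v_next - v    # temporal-difference error
        w += alpha * td_error * phi(s)       # semi-gradient update
    return w
```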