59 research outputs found

    Factored Value Iteration Converges

    In this paper we propose a novel algorithm, factored value iteration (FVI), for the approximate solution of factored Markov decision processes (fMDPs). The traditional approximate value iteration algorithm is modified in two ways. First, the least-squares projection operator is modified so that it does not increase the max-norm, and thus preserves convergence. Second, we uniformly draw polynomially many samples from the (exponentially large) state space. This way, the complexity of our algorithm becomes polynomial in the size of the fMDP description. We prove that the algorithm is convergent. We also derive an upper bound on the difference between our approximate solution and the optimal one, as well as on the error introduced by sampling. We analyze various projection operators with respect to their computational complexity and their convergence when combined with approximate value iteration. Comment: 17 pages, 1 figure
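
    The two modifications described in the abstract (a max-norm non-expansive projection and uniform state sampling) can be sketched as a generic sampled approximate value iteration loop. The sketch below is illustrative only: the names (sampled_avi, sample_next) and the simple rescaling used to keep the projection from increasing the max-norm are assumptions, not the paper's actual construction.

        import numpy as np

        def sampled_avi(H, reward, sample_next, gamma=0.95, iters=200, tol=1e-6):
            # H: (n, k) feature matrix of uniformly sampled states (n polynomial).
            # reward: (n, n_actions) immediate rewards at the sampled states.
            # sample_next(i, a) -> (probs, next_feats): successor probabilities and
            #   an array of successor feature vectors for sampled state i, action a.
            n, k = H.shape
            w = np.zeros(k)
            for _ in range(iters):
                # Bellman backup at every sampled state.
                targets = np.empty(n)
                for i in range(n):
                    q = []
                    for a in range(reward.shape[1]):
                        probs, next_feats = sample_next(i, a)
                        q.append(reward[i, a] + gamma * probs @ (next_feats @ w))
                    targets[i] = max(q)
                # Least-squares projection of the backed-up values onto the features.
                w_new, *_ = np.linalg.lstsq(H, targets, rcond=None)
                # Rescale so the fitted values never exceed the targets in max-norm
                # (a crude stand-in for the paper's non-expansive projection).
                fitted = H @ w_new
                scale = np.max(np.abs(targets)) / max(np.max(np.abs(fitted)), 1e-12)
                w_new *= min(1.0, scale)
                if np.max(np.abs(H @ w_new - H @ w)) < tol:
                    return w_new
                w = w_new
            return w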

    Factored temporal difference learning in the new ties environment

    Although reinforcement learning is a popular method for training an agent to make reward-based decisions, well-studied tabular methods are not applicable to large, realistic problems. In this paper, we experiment with a factored version of temporal difference learning, which boils down to a linear function approximation scheme utilising natural features coming from the structure of the task. We conducted experiments in the New Ties environment, which is a novel platform for multi-agent simulations. We show that learning with a factored representation is effective even in large state spaces; furthermore, it outperforms tabular methods even in smaller problems, in both learning speed and stability, because of its generalisation capabilities.
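
    Factored TD learning of this kind reduces to the standard TD(0) update on a linearly parameterised value function, with the feature vector assembled from the task's factored state variables. The sketch below shows that update; the environment interface (reset, step, sample_action) and the features mapping are assumptions introduced for illustration, not the New Ties API.

        import numpy as np

        def linear_td0(env, features, n_features, episodes=500, alpha=0.05, gamma=0.99):
            # TD(0) with linear function approximation: V(s) ~= features(s) @ w,
            # where features(s) is built from the factored state variables.
            w = np.zeros(n_features)
            for _ in range(episodes):
                s = env.reset()
                done = False
                while not done:
                    a = env.sample_action()                 # behaviour policy (assumed)
                    s2, r, done = env.step(a)
                    phi, phi2 = features(s), features(s2)
                    target = r + (0.0 if done else gamma * (phi2 @ w))
                    w += alpha * (target - phi @ w) * phi   # gradient-style TD(0) update
                    s = s2
            return w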

    Planning under risk and uncertainty

    This thesis concentrates on the optimization of large-scale management policies under conditions of risk and uncertainty. In paper I, we address the problem of solving large-scale spatial and temporal natural resource management problems. To model these types of problems, the framework of graph-based Markov decision processes (GMDPs) can be used. Two algorithms for computation of high-quality management policies are presented: the first is based on approximate linear programming (ALP) and the second is based on mean-field approximation and approximate policy iteration (MF-API). The applicability and efficiency of the algorithms were demonstrated by their ability to compute near-optimal management policies for two large-scale management problems. It was concluded that the two algorithms compute policies of similar quality. However, the MF-API algorithm should be used when both the policy and the expected value of the computed policy are required, while the ALP algorithm may be preferred when only the policy is required. In paper II, a number of reinforcement learning algorithms are presented that can be used to compute management policies for GMDPs when the transition function can only be simulated because its explicit formulation is unknown. Studies of the efficiency of the algorithms for three management problems led us to conclude that some of these algorithms were able to compute near-optimal management policies. In paper III, we used the GMDP framework to optimize long-term forestry management policies under stochastic wind-damage events. The model was demonstrated by a case study of an estate consisting of 1,200 ha of forest land, divided into 623 stands. We concluded that managing the estate according to the risk of wind damage increased the expected net present value (NPV) of the whole estate only slightly (by less than 2%) under different wind-risk assumptions. Most of the stands were managed in the same manner as when the risk of wind damage was not considered. However, the analysis rests on properties of the model that need to be refined before definite conclusions can be drawn.
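
    The GMDP framework factors both the state and the policy over the nodes of a graph. As a rough illustration of the per-node dynamic programming this enables, the sketch below sweeps local Bellman backups node by node while assuming neighbour influence has already been averaged into the supplied transition function (a mean-field-style simplification). It is a structural sketch only; the ALP and MF-API algorithms of paper I are considerably more involved, and the callback names (local_reward, local_trans) are assumptions.

        import numpy as np

        def local_value_iteration(n_nodes, n_states, n_actions,
                                  local_reward, local_trans, gamma=0.95, sweeps=50):
            # Approximate the global value as a sum of per-node values V[i, x] and
            # extract a per-node policy by greedy local Bellman backups.
            #   local_reward(i, x, a) -> float
            #   local_trans(i, x, a)  -> length-n_states array of next-state probs,
            #                            with neighbour influence averaged in.
            V = np.zeros((n_nodes, n_states))
            policy = np.zeros((n_nodes, n_states), dtype=int)
            for _ in range(sweeps):
                for i in range(n_nodes):
                    for x in range(n_states):
                        q = [local_reward(i, x, a) + gamma * local_trans(i, x, a) @ V[i]
                             for a in range(n_actions)]
                        policy[i, x] = int(np.argmax(q))
                        V[i, x] = max(q)
            return policy, V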

    Improving the Practicality of Model-Based Reinforcement Learning: An Investigation into Scaling up Model-Based Methods in Online Settings

    This thesis is a response to the current scarcity of practical model-based control algorithms in the reinforcement learning (RL) framework. As of yet there is no consensus on how best to integrate imperfect transition models into RL whilst mitigating policy improvement instabilities in online settings. Current state-of-the-art policy learning algorithms that surpass human performance often rely on model-free approaches that enjoy unmitigated sampling of transition data. Model-based RL (MBRL) instead attempts to distil experience into transition models that allow agents to plan new policies without needing to return to the environment and sample more data. The initial focus of this investigation is on kernel conditional mean embeddings (CMEs) (Song et al., 2009) deployed in an approximate policy iteration (API) algorithm (Grünewälder et al., 2012a). This existing MBRL algorithm boasts theoretically stable policy updates in continuous state and discrete action spaces. The Bellman operator's value function and (transition) conditional expectation are modelled and embedded respectively as functions in a reproducing kernel Hilbert space (RKHS). The resulting finite induced approximate pseudo-MDP (Yao et al., 2014a) can be solved exactly in a dynamic programming algorithm with policy improvement suboptimality guarantees. However, model construction and policy planning scale cubically and quadratically, respectively, with the training set size, rendering the CME impractical for sample-abundant tasks in online settings. Three variants of CME API are investigated to strike a balance between stable policy updates and reduced computational complexity. The first variant models the value function and state-action representation explicitly in a parametric CME (PCME) algorithm with favourable computational complexity. However, a soft conservative policy update technique is developed to mitigate policy learning oscillations in the planning process. The second variant returns to the non-parametric embedding and contributes (along with external work) to the compressed CME (CCME), a sparse and computationally more favourable CME. The final variant is a fully end-to-end differentiable embedding trained with stochastic gradient updates. The value function remains modelled in an RKHS such that backprop is driven by a non-parametric RKHS loss function. Actively compressed CME (ACCME) satisfies the pseudo-MDP contraction constraint using a sparse softmax activation function. The size of the pseudo-MDP (i.e. the size of the embedding's last layer) is controlled by sparsifying the last-layer weight matrix, extending the truncated gradient method (Langford et al., 2009) with group lasso updates in a novel 'use it or lose it' neuron pruning mechanism. Surprisingly, this technique does not require extensive fine-tuning between control tasks.
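
    The kernel CME at the heart of this line of work has a simple closed form: the conditional expectation E[V(s') | s, a] is approximated as a data-weighted sum of V at the observed successor states, with weights coming from a regularised Gram-matrix solve. The sketch below shows that estimator; the RBF kernel, regularisation constant and function names (rbf_kernel, fit_cme) are illustrative choices rather than the thesis's implementation, and the cubic cost of the Gram-matrix solve is exactly the scaling bottleneck the abstract describes.

        import numpy as np

        def rbf_kernel(A, B, gamma=1.0):
            # Gaussian RBF kernel matrix between row-vector sets A and B.
            sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
            return np.exp(-gamma * sq)

        def fit_cme(SA, S_next, lam=1e-3, gamma=1.0):
            # SA: (n, d_sa) state-action inputs; S_next: (n, d_s) observed successors.
            # Returns weights(sa): coefficients alpha with
            #   E[V(s') | s, a] ~= alpha @ V(S_next).
            # The O(n^3) Gram-matrix solve below is the model-construction cost.
            K = rbf_kernel(SA, SA, gamma)
            n = K.shape[0]
            W = np.linalg.solve(K + lam * n * np.eye(n), np.eye(n))

            def weights(sa):
                k = rbf_kernel(SA, np.atleast_2d(sa), gamma)[:, 0]
                return W @ k

            return weights

        # Illustrative use inside an approximate Bellman backup:
        #   alpha = weights(query_sa)
        #   backup = reward(query_sa) + discount * alpha @ V_at_successors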

    Applications of Probabilistic Inference to Planning & Reinforcement Learning

    Optimal control is a profound and fascinating subject that regularly attracts interest from numerous scientific disciplines, including both pure and applied Mathematics, Computer Science, Artificial Intelligence, Psychology, Neuroscience and Economics. In 1960 Rudolf Kalman discovered that there exists a duality between the problems of filtering and optimal control in linear systems [84]. This is now regarded as a seminal piece of work and it has since motivated a large amount of research into the discovery of similar dualities between optimal control and statistical inference. This is especially true of recent years, where there has been much research into recasting problems of optimal control as problems of statistical/approximate inference. Broadly speaking, this is the perspective that we take in this work, and in particular we present various applications of methods from the fields of statistical/approximate inference to optimal control, planning and Reinforcement Learning. Some of the methods would be more accurately described as originating from other fields of research, such as the dual decomposition techniques used in chapter 5, which originate from convex optimisation. However, the original motivation for the application of these techniques came from the field of approximate inference. The study of dualities between optimal control and statistical inference has been a subject of research for over 50 years and we do not claim to encompass the entire subject. Instead, we present what we consider to be a range of interesting and novel applications from this field of research.
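
    The 1960 duality referred to here can be stated concretely: the backward Riccati recursion of the linear-quadratic regulator and the forward covariance recursion of the Kalman filter have the same algebraic form. The equations below are the standard textbook statement, with generic symbols A, B, C for the dynamics and observation matrices, Q, R for the cost weights and W, V for the noise covariances; they are background, not a result specific to this thesis.

        % LQR (control, backward in time):
        \[
          P_t = Q + A^\top P_{t+1} A
                - A^\top P_{t+1} B \left(R + B^\top P_{t+1} B\right)^{-1} B^\top P_{t+1} A
        \]
        % Kalman filter predicted covariance (estimation, forward in time):
        \[
          \Sigma_{t+1} = W + A \Sigma_t A^\top
                - A \Sigma_t C^\top \left(V + C \Sigma_t C^\top\right)^{-1} C \Sigma_t A^\top
        \]
        % Swapping A <-> A^T, B <-> C^T, Q <-> W, R <-> V (and reversing time)
        % maps one recursion onto the other; this is Kalman's duality.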

    Acta Cybernetica: Volume 18, Number 4.


    Primal-dual interior-point algorithms for linear programs with many inequality constraints

    Linear programs (LPs) are one of the most basic and important classes of constrained optimization problems, involving the optimization of linear objective functions over sets defined by linear equality and inequality constraints. LPs have applications to a broad range of problems in engineering and operations research, and often arise as subproblems for algorithms that solve more complex optimization problems. "Unbalanced" inequality-constrained LPs with many more inequality constraints than variables are an important subclass of LPs. Under a basic non-degeneracy assumption, only a small number of the constraints can be active at the solution; it is only this active set that is critical to the problem description. On the other hand, the additional constraints make the problem harder to solve. While modern "interior-point" algorithms have become recognized as some of the best methods for solving large-scale LPs, they may not be recommended for unbalanced problems, because their per-iteration work does not scale well with the number of constraints. In this dissertation, we investigate "constraint-reduced" interior-point algorithms designed to efficiently solve unbalanced LPs. At each iteration, these methods construct search directions based only on a small working set of constraints, while ignoring the rest. In this way, they significantly reduce their per-iteration work and, hopefully, their overall running time. In particular, we focus on constraint-reduction methods for the highly efficient primal-dual interior-point (PDIP) algorithms. We propose and analyze a convergent constraint-reduced variant of Mehrotra's predictor-corrector PDIP algorithm, the algorithm implemented in virtually every interior-point software package for linear (and convex-conic) programming. We prove global and local quadratic convergence of this algorithm under a very general class of constraint selection rules and under minimal assumptions. We also propose and analyze two regularized constraint-reduced PDIP algorithms (with similar convergence properties) designed to deal directly with a type of degeneracy to which constraint-reduced interior-point algorithms are often subject. Prior schemes for dealing with this degeneracy could end up negating the benefit of constraint reduction. Finally, we investigate the performance of our algorithms by applying them to several test and application problems, and show that our algorithms often outperform alternative approaches.
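
    The per-iteration saving in constraint-reduced PDIP methods comes from building the normal-equations matrix from a working set of constraints only. The sketch below illustrates that single step for a standard-form LP with far more columns than rows; the smallest-slack selection rule shown is just one simple member of the general class of rules the dissertation analyses, and the function name is an assumption.

        import numpy as np

        def reduced_normal_matrix(A, x, s, q):
            # For the LP  min c^T x  s.t.  A x = b, x >= 0  (A is m-by-n, n >> m),
            # the full normal-equations matrix is  A D^2 A^T  with D^2 = diag(x/s),
            # costing O(m^2 n) to form.  Constraint reduction keeps only the q
            # columns with the smallest dual slacks s_i (those most likely active
            # at the solution) and forms  A_Q D_Q^2 A_Q^T  at cost O(m^2 q).
            Q = np.argsort(s)[:q]                    # working-set indices
            d2 = x[Q] / s[Q]                         # diagonal scaling on the working set
            AQ = A[:, Q]
            return AQ @ (d2[:, None] * AQ.T), Q      # m-by-m reduced matrix, set used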