    Projections for Approximate Policy Iteration Algorithms

    Approximate policy iteration is a class of reinforcement learning (RL) algorithms in which the policy is encoded by a function approximator; this class has been especially prominent in RL with continuous action spaces. In these algorithms, ensuring that the policy return increases during a policy update often requires constraining the change in the action distribution. Several approximations exist in the literature for solving this constrained policy update problem. In this paper, we propose to improve on such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one, which is then solved by standard gradient descent. Using these projections, we empirically demonstrate that our approach can improve both the policy update solution and the control over exploration of existing approximate policy iteration algorithms.
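
    The recipe above can be illustrated with a toy sketch (ours, not the projections derived in the paper): for a one-dimensional Gaussian policy under a KL trust-region constraint, a projection map sends arbitrary parameters to the feasible set by interpolating toward the old parameters, so that composing the objective with this map yields an unconstrained problem for a standard optimizer. All names below are hypothetical.

        import numpy as np

        def kl_gauss(mu_q, sig_q, mu_p, sig_p):
            """KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ) for 1-D Gaussians."""
            return np.log(sig_p / sig_q) + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5

        def project_to_kl_ball(mu, sig, mu_old, sig_old, eps, tol=1e-8):
            """Map arbitrary (mu, sig) to parameters whose KL to the old policy is <= eps,
            by bisecting on an interpolation weight between old and proposed parameters."""
            if kl_gauss(mu, sig, mu_old, sig_old) <= eps:
                return mu, sig
            lo, hi = 0.0, 1.0  # lo = old params (feasible), hi = proposed params (infeasible)
            while hi - lo > tol:
                a = 0.5 * (lo + hi)
                m = (1 - a) * mu_old + a * mu
                s = (1 - a) * sig_old + a * sig
                if kl_gauss(m, s, mu_old, sig_old) <= eps:
                    lo = a
                else:
                    hi = a
            return (1 - lo) * mu_old + lo * mu, (1 - lo) * sig_old + lo * sig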

    Controlled Sequential Monte Carlo

    Sequential Monte Carlo methods, also known as particle methods, are a popular set of techniques for approximating high-dimensional probability distributions and their normalizing constants. These methods have found numerous applications in statistics and related fields, e.g. for inference in non-linear, non-Gaussian state space models and in complex static models. Like many Monte Carlo sampling schemes, they rely on proposal distributions which crucially impact their performance. We introduce here a class of controlled sequential Monte Carlo algorithms, where the proposal distributions are determined by approximating the solution to an associated optimal control problem using an iterative scheme. This method builds upon a number of existing algorithms in econometrics, physics, and statistics for inference in state space models, and generalizes these methods so as to accommodate complex static models. We provide a theoretical analysis of the fluctuation and stability of this methodology, which also gives insight into the properties of related algorithms. We demonstrate significant gains over state-of-the-art methods at a fixed computational complexity on a variety of applications.
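
    For background on the role of the proposal, the following is a minimal bootstrap particle filter sketch (plain SMC, not the controlled variant): particles are propagated through a transition sampler, weighted by the observation likelihood, and resampled, while an estimate of the log normalizing constant is accumulated. In controlled SMC, the prior transition proposal used below would be replaced by a twisted proposal obtained from an approximate optimal-control solve; the function names here are placeholders of ours.

        import numpy as np

        def bootstrap_smc(T, n_particles, init_sample, trans_sample, obs_loglik, rng=None):
            """Minimal bootstrap particle filter.

            init_sample(n, rng)     -> (n, d) initial particles
            trans_sample(x, t, rng) -> particles propagated from time t-1 to t
            obs_loglik(x, t)        -> (n,) log p(y_t | x_t) for each particle

            Returns an estimate of the log normalizing constant log p(y_{0:T-1}).
            """
            rng = np.random.default_rng() if rng is None else rng
            x = init_sample(n_particles, rng)
            log_Z = 0.0
            for t in range(T):
                if t > 0:
                    x = trans_sample(x, t, rng)       # propagate through the model prior
                logw = obs_loglik(x, t)               # weight by the observation density
                m = logw.max()
                w = np.exp(logw - m)
                log_Z += m + np.log(w.mean())         # running log normalizing-constant estimate
                idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
                x = x[idx]                            # multinomial resampling
            return log_Z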

    On Resource Allocation in Fading Multiple Access Channels - An Efficient Approximate Projection Approach

    We consider the problem of rate and power allocation in a multiple-access channel. Our objective is to obtain rate and power allocation policies that maximize a general concave utility function of average transmission rates over the information-theoretic capacity region of the multiple-access channel. Our policies do not require queue-length information. We consider several different scenarios. First, we address the utility maximization problem in a non-fading channel to obtain the optimal operating rates, and present an iterative gradient projection algorithm that uses approximate projection. By exploiting the polymatroid structure of the capacity region, we show that the approximate projection can be implemented in time polynomial in the number of users. Second, we consider resource allocation in a fading channel. Optimal rate and power allocation policies are presented for the case where power control is possible and channel statistics are available. For the case where transmission power is fixed and channel statistics are unknown, we propose a greedy rate allocation policy and provide bounds on the performance difference between this policy and the optimal policy in terms of channel variations and the structure of the utility function. We present numerical results that demonstrate superior convergence-rate performance for the greedy policy compared to queue-length-based policies. In order to reduce the computational complexity of the greedy policy, we present approximate rate allocation policies which track the greedy policy within a certain neighborhood that is characterized in terms of the speed of fading. Comment: 32 pages; submitted to IEEE Transactions on Information Theory.
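
    To make "gradient projection with approximate projection" concrete, here is a generic sketch (a toy illustration of ours; the paper's approximate projection instead exploits the polymatroid structure of the capacity region): a projected gradient ascent loop over an abstract projection operator, instantiated with a deliberately crude projection onto a small two-user rate region.

        import numpy as np

        def projected_gradient_ascent(grad, project, r0, step=0.05, iters=500):
            """Generic projected gradient ascent: r <- project(r + step * grad(r))."""
            r = np.asarray(r0, dtype=float)
            for _ in range(iters):
                r = project(r + step * np.asarray(grad(r)))
            return r

        # Toy two-user region {r >= 0, r[0] <= 1.0, r[1] <= 1.2, r[0] + r[1] <= 1.5},
        # shaped like a simple polymatroid; the projection below is deliberately crude.
        def project_toy(r):
            r = np.clip(r, 0.0, [1.0, 1.2])
            excess = r.sum() - 1.5
            if excess > 0:
                r = np.clip(r - excess / 2, 0.0, None)   # shave the violation equally
            return r

        # Maximize the concave utility sum(log(1 + r)); its gradient is 1 / (1 + r).
        r_opt = projected_gradient_ascent(lambda r: 1.0 / (1.0 + r), project_toy, [0.0, 0.0])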

    Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view

    We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fix-point computation (TD(0)) and Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of (Schoknecht, 2002) and the recent analysis of (Yu & Bertsekas, 2008). Finally, we describe simulations suggesting that, although the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability makes it very poor in some cases, and thus worse on average.
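
    For reference, in standard notation (ours, not quoted from the abstract): with a linear approximation $V_\theta = \Phi\theta$ and Bellman operator $T^\pi V = r^\pi + \gamma P^\pi V$, the TD(0) solution satisfies the projected fixed-point equation $\Phi\theta = \Pi T^\pi \Phi\theta$, where $\Pi$ is the orthogonal projection onto the span of $\Phi$, while the BR minimizer solves $\min_\theta \|\Phi\theta - T^\pi \Phi\theta\|^2$; the paper's unified view expresses both solutions through oblique projections of the Bellman equation.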

    On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

    We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $\epsilon$ at each iteration, it is well known that one can compute stationary policies that are $\frac{2\gamma}{(1-\gamma)^2}\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant improvement in the usual situation when $\gamma$ is close to 1. Surprisingly, this shows that the problem of "computing near-optimal non-stationary policies" is much simpler than that of "computing near-optimal stationary policies".
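
    As a quick arithmetic illustration of the gap (ours, not from the abstract): with $\gamma = 0.99$, the stationary guarantee is $\frac{2\gamma}{(1-\gamma)^2}\epsilon = 19800\,\epsilon$, whereas the non-stationary one is $\frac{2\gamma}{1-\gamma}\epsilon = 198\,\epsilon$, i.e. a factor of $\frac{1}{1-\gamma} = 100$ tighter.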

    Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

    We address the problem of automatic generation of features for value function approximation. Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast, and robust algorithm based on random projections to generate BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space are enough to guarantee contraction in the error. Empirical results demonstrate the strength of this method.
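
    A rough sketch of the kind of step involved (our simplified illustration; the paper's algorithm and its handling of sparse spaces differ): project the high-dimensional features with a random Gaussian matrix and fit the empirical Bellman error by least squares to obtain a new basis function. All names and signatures are hypothetical.

        import numpy as np

        def bebf_via_random_projection(X, X_next, r, gamma, V, k, rng=None):
            """Construct one Bellman-error basis function on randomly projected features.

            X, X_next : (n, D) features of sampled states s and their successors s'
            r         : (n,) observed rewards
            V         : callable returning the current value estimate for a feature matrix
            k         : target dimension of the random projection (k << D)

            Returns (proj, w); the new basis function is f(x) = (x @ proj) @ w.
            """
            rng = np.random.default_rng() if rng is None else rng
            n, D = X.shape
            proj = rng.normal(0.0, 1.0 / np.sqrt(k), size=(D, k))   # random projection matrix
            target = r + gamma * V(X_next) - V(X)                   # empirical Bellman error
            Z = X @ proj                                            # low-dimensional features
            w, *_ = np.linalg.lstsq(Z, target, rcond=None)          # fit the BEBF by least squares
            return proj, w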

    Computing Probabilistic Bisimilarity Distances for Probabilistic Automata

    The probabilistic bisimilarity distance of Deng et al. has been proposed as a robust quantitative generalization of Segala and Lynch's probabilistic bisimilarity for probabilistic automata. In this paper, we present a characterization of the bisimilarity distance as the solution of a simple stochastic game. The characterization gives us an algorithm to compute the distances by applying Condon's simple policy iteration on these games. The correctness of Condon's approach, however, relies on the assumption that the games are stopping. Our games may be non-stopping in general, yet we are able to prove termination for this extended class of games. Other algorithms have already been proposed in the literature to compute these distances, with complexity in $\textbf{UP} \cap \textbf{coUP}$ and $\textbf{PPAD}$. Despite their theoretical relevance, these algorithms are inefficient in practice. To the best of our knowledge, our algorithm is the first practical solution. The characterization of the probabilistic bisimilarity distance mentioned above crucially uses a dual presentation of the Hausdorff distance due to Mémoli. As an additional contribution, in this paper we show that Mémoli's result can also be used to prove that the bisimilarity distance bounds the difference in the maximal (or minimal) probability of two states satisfying arbitrary $\omega$-regular properties, expressed, e.g., as LTL formulas.
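
    As rough background in generic notation of ours (the paper's own contribution is the game characterization): the distance can be presented as the least fixed point of an operator comparing one-step behaviour, roughly $d(s,t) = \max_a \mathcal{H}_{\mathcal{K}(d)}(\delta(s,a), \delta(t,a))$, where $\delta(s,a)$ is the set of distributions reachable from $s$ by an $a$-labelled transition, $\mathcal{K}(d)$ is the Kantorovich lifting of $d$, and $\mathcal{H}$ is the induced Hausdorff distance. Recasting this fixed point as the value of a simple stochastic game is what allows Condon's simple policy iteration to be applied.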