
    Two Timescale Stochastic Approximation with Controlled Markov noise and Off-policy temporal difference learning

    Full text link
    We present for the first time an asymptotic convergence analysis of two time-scale stochastic approximation driven by "controlled" Markov noise. In particular, both the faster and slower recursions have non-additive controlled Markov noise components in addition to martingale difference noise. We analyze the asymptotic behavior of our framework by relating it to limiting differential inclusions in both time-scales that are defined in terms of the ergodic occupation measures associated with the controlled Markov processes. Finally, we present a solution to the off-policy convergence problem for temporal difference learning with linear function approximation, using our results. Comment: 23 pages (relaxed some important assumptions from the previous version), accepted in Mathematics of Operations Research in Feb, 201
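
    As a rough illustration of the setting (not of the paper's analysis), the sketch below runs a generic two time-scale stochastic approximation scheme with martingale-difference noise only; the drift functions h and g, the step-size exponents, and the noise model are assumptions invented for this example, and the controlled Markov noise central to the paper is not modeled.

```python
import numpy as np

# Minimal sketch of a generic two time-scale stochastic approximation scheme.
# The fast iterate w tracks the slow iterate theta because its step size b_n
# decays more slowly than a_n. All drifts and noise terms are illustrative.

rng = np.random.default_rng(0)

def h(theta, w):
    # hypothetical slow-timescale drift
    return -(theta - w)

def g(theta, w):
    # hypothetical fast-timescale drift
    return -(w - 2.0 * theta)

theta, w = 1.0, 0.0
for n in range(1, 10_000):
    a_n = 1.0 / n          # slow step size
    b_n = 1.0 / n**0.6     # fast step size (decays more slowly than a_n)
    noise_theta = 0.1 * rng.standard_normal()   # martingale-difference noise
    noise_w = 0.1 * rng.standard_normal()
    theta += a_n * (h(theta, w) + noise_theta)  # slow recursion
    w += b_n * (g(theta, w) + noise_w)          # fast recursion

print(theta, w)   # both iterates approach the coupled fixed point (0, 0)
```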

    The Essential Dynamics Algorithm: Essential Results

    Get PDF
    This paper presents a novel algorithm for learning in a class of stochastic Markov decision processes (MDPs) with continuous state and action spaces that trades speed for accuracy. A transform of the stochastic MDP into a deterministic one is presented which captures the essence of the original dynamics, in a sense made precise. In this transformed MDP, the calculation of values is greatly simplified. The online algorithm estimates the model of the transformed MDP and simultaneously performs policy search against it. Bounds on the error of this approximation are proven, and experimental results in a bicycle riding domain are presented. The algorithm learns near-optimal policies in orders of magnitude fewer interactions with the stochastic MDP, using less domain knowledge. All code used in the experiments is available on the project's web site.
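
    The toy sketch below illustrates only the general idea of replacing a stochastic MDP with a deterministic surrogate and running policy search against the surrogate; the dynamics, cost, and grid search here are hypothetical and do not reproduce the paper's essential dynamics transform or its error bounds.

```python
import numpy as np

# Hypothetical illustration: approximate a stochastic transition model by a
# deterministic one built from its mean, then search for a policy on the
# cheaper deterministic model instead of the stochastic MDP.

rng = np.random.default_rng(1)

def stochastic_step(s, a):
    # toy stochastic dynamics: drift plus noise
    return s + 0.1 * a + 0.05 * rng.standard_normal()

def deterministic_step(s, a, n_samples=100):
    # deterministic surrogate: average of sampled next states
    return np.mean([stochastic_step(s, a) for _ in range(n_samples)])

def rollout_cost(a, horizon=20):
    # cost of applying a constant action in the deterministic surrogate
    s, cost = 1.0, 0.0
    for _ in range(horizon):
        s = deterministic_step(s, a)
        cost += s**2 + 0.01 * a**2
    return cost

# crude policy search: pick the best constant action on a grid
best_a = min(np.linspace(-1.0, 1.0, 21), key=rollout_cost)
print("best constant action:", best_a)
```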

    Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning

    Full text link
    The stochastic approximation algorithm is a widely used probabilistic method for finding a zero of a vector-valued function when only noisy measurements of the function are available. In the literature to date, one can make a distinction between "synchronous" updating, whereby every component of the current guess is updated at each time instant, and "asynchronous" updating, whereby only one component is updated. In principle, it is also possible to update, at each time instant, some but not all components of θ_t, which might be termed "batch asynchronous stochastic approximation" (BASA). One can also make a distinction between using a "local" clock versus a "global" clock. In this paper, we propose a unified formulation of batch asynchronous stochastic approximation (BASA) algorithms, and develop a general methodology for proving that such algorithms converge, irrespective of whether global or local clocks are used. These convergence proofs make use of weaker hypotheses than existing results. For example, existing convergence proofs when a local clock is used require that the measurement noise is an i.i.d. sequence; here, it is assumed only that the measurement errors form a martingale difference sequence. Also, all results to date assume that the stochastic step sizes satisfy a probabilistic analog of the Robbins-Monro conditions; we replace this by a purely deterministic condition on the irreducibility of the underlying Markov processes. As specific applications to Reinforcement Learning, we introduce "batch" versions of the temporal difference algorithm TD(0) for value iteration and of the Q-learning algorithm for finding the optimal action-value function, and we also permit the use of local clocks instead of a global clock. In all cases, we establish the convergence of these algorithms under milder conditions than in the existing literature. Comment: 27 pages
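
    A minimal tabular sketch of the kind of scheme described above, assuming a toy Markov reward process: at each step only a small batch of components is updated, and each component's step size is driven by its own local clock. The transition matrix, rewards, and batch rule are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Sketch of a "batch asynchronous" TD(0)-style update with local clocks:
# only the states in `batch` are refreshed at each step, and each state's
# step size is 1 / (its own update count).

rng = np.random.default_rng(2)
n_states, gamma = 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)   # toy transition matrix
r = rng.standard_normal(n_states)                      # toy rewards

V = np.zeros(n_states)
local_clock = np.zeros(n_states)        # per-component update counts

s = 0
for t in range(50_000):
    s_next = rng.choice(n_states, p=P[s])
    # update the current state plus two randomly chosen states
    batch = {int(s)} | set(int(i) for i in rng.choice(n_states, size=2, replace=False))
    for i in batch:
        local_clock[i] += 1
        alpha = 1.0 / local_clock[i]                   # local-clock step size
        i_next = rng.choice(n_states, p=P[i])          # sampled transition from i
        td_error = r[i] + gamma * V[i_next] - V[i]
        V[i] += alpha * td_error
    s = s_next

print(V)
```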

    Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint

    Full text link
    The classic objective in a reinforcement learning (RL) problem is to find a policy that minimizes, in expectation, a long-run objective such as the infinite-horizon discounted or long-run average cost. In many practical applications, optimizing the expected value alone is not sufficient, and it may be necessary to include a risk measure in the optimization process, either as the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., the mean-variance tradeoff, exponential utility, percentile performance, value-at-risk, conditional value-at-risk, prospect theory, and its later enhancement, cumulative prospect theory. In this article, we focus on the combination of risk criteria and reinforcement learning in a constrained optimization framework, i.e., a setting where the goal is to find a policy that optimizes the usual objective of infinite-horizon discounted/average cost, while ensuring that an explicit risk constraint is satisfied. We introduce the risk-constrained RL framework, cover popular risk measures based on variance, conditional value-at-risk, and cumulative prospect theory, and present a template for a risk-sensitive RL algorithm. We survey some of our recent work on this topic, covering problems encompassing discounted cost, average cost, and stochastic shortest path settings, together with the aforementioned risk measures in a constrained framework. This non-exhaustive survey is aimed at giving a flavor of the challenges involved in solving a risk-sensitive RL problem, and at outlining some potential future research directions.
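
    As a hedged sketch of the Lagrangian template commonly used for risk-constrained RL, the code below alternates a faster update of a scalar policy parameter with a slower ascent step on a Lagrange multiplier that enforces a risk budget; the cost/risk estimator and the finite-difference gradient are placeholders, not any specific algorithm from the survey.

```python
import numpy as np

# Sketch of a two-timescale Lagrangian scheme for risk-constrained RL:
# minimize J(theta) subject to G(theta) <= risk_budget by descending the
# Lagrangian in theta (fast) and ascending in the multiplier lam (slow).

rng = np.random.default_rng(3)

def estimate_cost_and_risk(theta):
    # placeholder simulator: noisy estimates of expected cost J(theta)
    # and risk measure G(theta) (e.g., variance or CVaR in practice)
    J = (theta - 1.0) ** 2 + 0.05 * rng.standard_normal()
    G = theta ** 2 + 0.05 * rng.standard_normal()
    return J, G

risk_budget = 0.5       # constraint: G(theta) <= risk_budget
theta, lam = 2.0, 0.0
for n in range(1, 20_000):
    J, G = estimate_cost_and_risk(theta)
    # crude finite-difference gradient of L(theta) = J + lam * (G - budget)
    eps = 0.1
    J_p, G_p = estimate_cost_and_risk(theta + eps)
    grad = ((J_p + lam * G_p) - (J + lam * G)) / eps
    theta -= (1.0 / n ** 0.6) * grad                      # faster policy update
    lam = max(0.0, lam + (1.0 / n) * (G - risk_budget))   # slower multiplier ascent

print(theta, lam)
```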

    Diffusion Approximations for Online Principal Component Estimation and Global Convergence

    Full text link
    In this paper, we propose to adopt diffusion approximation tools to study the dynamics of Oja's iteration, which is an online stochastic gradient descent method for principal component analysis. Oja's iteration maintains a running estimate of the true principal component from streaming data and enjoys low time and space complexity. We show that Oja's iteration for the top eigenvector generates a continuous-state, discrete-time Markov chain over the unit sphere. We characterize Oja's iteration in three phases using diffusion approximation and weak convergence tools. Our three-phase analysis further provides a finite-sample error bound for the running estimate, which matches the minimax information lower bound for principal component analysis under the additional assumption of bounded samples. Comment: Appeared in NIPS 201
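
    For concreteness, here is a minimal sketch of Oja's iteration for the top principal component on synthetic streaming data; the data model and step size are illustrative choices, and the diffusion-approximation analysis itself is not reproduced.

```python
import numpy as np

# Oja's iteration: stochastic gradient ascent on the Rayleigh quotient,
# with the running estimate projected back to the unit sphere each step.

rng = np.random.default_rng(4)
d = 10
true_dir = rng.standard_normal(d)
true_dir /= np.linalg.norm(true_dir)

def sample():
    # toy streaming data: strong variance along true_dir plus isotropic noise
    return 3.0 * rng.standard_normal() * true_dir + rng.standard_normal(d)

w = rng.standard_normal(d)
w /= np.linalg.norm(w)
for n in range(1, 100_000):
    x = sample()
    w += (1.0 / n) * x * (x @ w)        # stochastic gradient step
    w /= np.linalg.norm(w)              # project back to the unit sphere

print("alignment with top eigenvector:", abs(w @ true_dir))
```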