
    Finite-Time Analysis of Temporal Difference Learning: Discrete-Time Linear System Perspective

    TD-learning is a fundamental algorithm in the field of reinforcement learning (RL) that is employed to evaluate a given policy by estimating the corresponding value function for a Markov decision process. While significant progress has been made in the theoretical analysis of TD-learning, recent research has established guarantees on its statistical efficiency by developing finite-time error bounds. This paper contributes to the existing body of knowledge by presenting a novel finite-time analysis of tabular temporal difference (TD) learning that makes direct and effective use of discrete-time stochastic linear system models and leverages Schur matrix properties. The proposed analysis covers both on-policy and off-policy settings in a unified manner. By adopting this approach, we hope to offer new and straightforward templates that not only shed further light on the analysis of TD-learning and related RL algorithms but also provide valuable insights for future research in this domain.
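
    The abstract's view of tabular TD(0) as a discrete-time stochastic linear system can be made concrete with a short sketch: each update is affine in the current iterate, which is the property a Schur-matrix (spectral-radius) argument exploits. The Markov reward process, step size, and discount factor below are made up for illustration and are not taken from the paper.

```python
import numpy as np

# Minimal sketch of tabular TD(0) for policy evaluation on a small Markov
# reward process (the transition matrix P and reward vector r are assumed
# for illustration; they are not the paper's setup).
rng = np.random.default_rng(0)

n_states = 3
P = np.array([[0.1, 0.6, 0.3],        # P[s, s'] = probability of moving s -> s'
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
r = np.array([1.0, 0.0, -0.5])        # expected reward received in each state
gamma = 0.9                           # discount factor
alpha = 0.05                          # constant step size

V = np.zeros(n_states)                # value-function estimate
s = 0
for k in range(50_000):
    s_next = rng.choice(n_states, p=P[s])
    # TD(0) update: affine in the current iterate V, i.e. a discrete-time
    # stochastic linear system driven by the sampled transition.
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])
    s = s_next

# Exact solution of the Bellman equation V = r + gamma * P V, for comparison.
V_exact = np.linalg.solve(np.eye(n_states) - gamma * P, r)
print("TD estimate:", np.round(V, 3))
print("exact value:", np.round(V_exact, 3))
```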

    A Finite Time Analysis of Two Time-Scale Actor Critic Methods

    Actor-critic (AC) methods have exhibited great empirical success compared with other reinforcement learning algorithms, where the actor uses the policy gradient to improve the learning policy and the critic uses temporal difference learning to estimate the policy gradient. Under the two time-scale learning rate schedule, the asymptotic convergence of AC has been well studied in the literature. However, the non-asymptotic convergence and finite sample complexity of actor-critic methods are largely open. In this work, we provide a non-asymptotic analysis for two time-scale actor-critic methods in the non-i.i.d. setting. We prove that the actor-critic method is guaranteed to find a first-order stationary point (i.e., $\|\nabla J(\boldsymbol{\theta})\|_2^2 \le \epsilon$) of the non-concave performance function $J(\boldsymbol{\theta})$ with $\tilde{\mathcal{O}}(\epsilon^{-2.5})$ sample complexity. To the best of our knowledge, this is the first work providing a finite-time analysis and sample complexity bound for two time-scale actor-critic methods.
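
    As a rough illustration of the two time-scale structure described above, the sketch below runs a critic update on a fast step-size schedule and an actor (policy-gradient) update on a slower one, on a two-armed bandit. The bandit, the step-size exponents, and the softmax parameterization are assumptions chosen for brevity, not the setting analyzed in the paper.

```python
import numpy as np

# Minimal two time-scale actor-critic sketch on a two-armed bandit
# (a single-state MDP). The key structural point is that the critic's
# step size decays more slowly (fast time scale) than the actor's.
rng = np.random.default_rng(1)

true_means = np.array([1.0, 2.0])   # expected reward of each arm (assumed)
theta = np.zeros(2)                 # actor: softmax policy parameters
v = 0.0                             # critic: baseline estimate of the value

for k in range(1, 20_001):
    alpha_k = 0.5 / k ** 1.0        # slow time scale (actor)
    beta_k = 0.5 / k ** 0.6         # fast time scale (critic)

    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)
    reward = true_means[a] + rng.normal()

    # Critic: temporal-difference-style update of the baseline.
    delta = reward - v
    v += beta_k * delta

    # Actor: policy-gradient step using the critic's estimate as a baseline;
    # for a softmax policy, grad log pi(a) = e_a - pi.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += alpha_k * delta * grad_log_pi

print("learned action probabilities:", np.round(probs, 3))
print("critic baseline estimate:", round(v, 3))
```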

    Search, navigation and foraging: an optimal decision-making perspective

    Behavior in its general form can be defined as a mapping between sensory inputs and a pattern of motor actions used to achieve a goal. In recent years, reinforcement learning has emerged as a general framework for analyzing behavior in this general sense. In this thesis, exploiting the techniques of reinforcement learning, we study several phenomena that can be classified as search, navigation and foraging behaviors. Regarding search, we analyze random walks forced to reach a target in a confined region of space. In this case the problem can be solved analytically, which yields a very efficient way of generating such walks. The navigation problem is inspired by olfactory navigation in homing pigeons; here we propose an algorithm for navigating a noisy environment relying only on local signals. Foraging, in turn, is analyzed starting from the observation that fossil traces show the evolution of foraging strategies towards highly compact, self-avoiding trajectories. We show how this optimal behavior can emerge within the reinforcement learning framework.
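
    As a toy illustration of learning to navigate from purely local, noisy signals, the sketch below trains a tabular Q-learning agent to reach a target cell on a small grid; the grid, the noise level, and the rewards are invented for this example and are not taken from the thesis.

```python
import numpy as np

# Toy sketch: a tabular Q-learning agent learns to reach a target cell on a
# small grid while receiving only a noisy, locally sensed reward signal.
rng = np.random.default_rng(2)

size = 5
goal = (4, 4)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
Q = np.zeros((size, size, len(actions)))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(pos, a):
    """Apply action a, clip to the grid, and return (next state, noisy reward, done)."""
    nxt = (min(max(pos[0] + actions[a][0], 0), size - 1),
           min(max(pos[1] + actions[a][1], 0), size - 1))
    reward = (10.0 if nxt == goal else -1.0) + rng.normal(scale=0.5)
    return nxt, reward, nxt == goal

for episode in range(500):
    pos = (0, 0)
    done = False
    while not done:
        # Epsilon-greedy exploration over the locally available actions.
        a = rng.integers(len(actions)) if rng.random() < eps else int(np.argmax(Q[pos]))
        nxt, reward, done = step(pos, a)
        target = reward if done else reward + gamma * Q[nxt].max()
        Q[pos][a] += alpha * (target - Q[pos][a])   # standard Q-learning update
        pos = nxt

print("greedy action at start:", ["up", "down", "left", "right"][int(np.argmax(Q[(0, 0)]))])
```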