
    The Role of Lookahead and Approximate Policy Evaluation in Reinforcement Learning with Linear Value Function Approximation

    Function approximation is widely used in reinforcement learning to handle the computational difficulties associated with very large state spaces. However, function approximation introduces errors which may lead to instabilities when using approximate dynamic programming (DP) techniques to obtain the optimal policy. Therefore, techniques such as lookahead for policy improvement and m-step rollout for policy evaluation are used in practice to improve the performance of approximate DP with function approximation. We quantitatively characterize, for the first time, the impact of lookahead and m-step rollout on the performance of approximate DP with function approximation: (i) without a sufficient combination of lookahead and m-step rollout, approximate DP may not converge, (ii) both lookahead and m-step rollout improve the convergence rate of approximate DP, and (iii) lookahead helps mitigate the effect of function approximation and the discount factor on the asymptotic performance of the algorithm. Our results are presented for two approximate DP methods: one which uses least-squares regression to perform function approximation and another which performs several steps of gradient descent on the least-squares objective in each iteration. (36 pages, 4 figures)
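
    As a rough illustration of the two ingredients discussed above, the sketch below runs approximate DP on a small random MDP with a state-aggregation (linear) value function fit by least squares, using H-step lookahead for policy improvement and an m-step rollout for policy evaluation. The toy MDP, the feature map, and the choices H=3 and m=5 are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch: approximate DP with lookahead and m-step rollout,
# linear (state-aggregation) value function fit by least squares.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 20, 4, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a] is a distribution over next states
R = rng.uniform(0, 1, size=(S, A))             # expected one-step rewards
Phi = np.zeros((S, 5))
Phi[np.arange(S), np.arange(S) % 5] = 1.0      # aggregation features (illustrative)

def greedy_lookahead(theta, H):
    """H-step lookahead: H Bellman optimality backups on top of Phi @ theta."""
    V = Phi @ theta
    for _ in range(H - 1):
        V = (R + gamma * P @ V).max(axis=1)
    return (R + gamma * P @ V).argmax(axis=1)  # greedy first action of the lookahead

def m_step_rollout_targets(pi, theta, m):
    """Evaluate pi for m steps, then bootstrap with the current linear estimate."""
    P_pi, r_pi = P[np.arange(S), pi], R[np.arange(S), pi]
    V = Phi @ theta
    for _ in range(m):
        V = r_pi + gamma * P_pi @ V
    return V

theta = np.zeros(Phi.shape[1])
for _ in range(30):
    pi = greedy_lookahead(theta, H=3)                        # policy improvement with lookahead
    targets = m_step_rollout_targets(pi, theta, m=5)         # m-step rollout for evaluation
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)    # least-squares regression step
print("final greedy policy:", greedy_lookahead(theta, H=3))
```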

    Dynamic shortest path problem with travel-time-dependent stochastic disruptions: hybrid approximate dynamic programming algorithms with a clustering approach

    We consider a dynamic shortest path problem with stochastic disruptions in the network. We use both historical information and real-time information about the network for the dynamic routing decisions. We model the problem as a discrete-time finite-horizon Markov Decision Process (MDP). For networks with many levels of disruptions, the MDP faces the curses of dimensionality. We first apply an Approximate Dynamic Programming (ADP) algorithm with a standard value function approximation. Then, we improve the ADP algorithm by exploiting the structure of the disruption transition functions. We develop a hybrid ADP with a clustering approach, using both a deterministic lookahead policy and a value function approximation. We develop a test bed of networks to evaluate the quality of the solutions. The hybrid ADP algorithm with the clustering approach significantly reduces the computational time, while still providing good-quality solutions. Keywords: Dynamic shortest path problem, Approximate Dynamic Programming, Disruption handling, Clustering
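
    As a minimal sketch of the deterministic-lookahead ingredient described above, the code below routes greedily through a small network, approximating the cost-to-go from each neighbour by a deterministic shortest path computed on expected arc costs for the current disruption state. The tiny graph, the single disruptable arc, the two-state disruption chain (rather than the paper's travel-time-dependent dynamics), and the omission of the clustering step are all simplifying assumptions.

```python
# Minimal sketch: greedy routing with a deterministic-lookahead value approximation.
import heapq, random

edges = {            # directed graph: node -> {neighbour: nominal travel time} (illustrative)
    "s": {"a": 2.0, "b": 4.0},
    "a": {"c": 2.0, "b": 1.0},
    "b": {"t": 3.0},
    "c": {"t": 1.0},
    "t": {},
}
DISRUPTED_EDGE, EXTRA, P_ON, P_OFF = ("a", "c"), 6.0, 0.3, 0.5  # assumed disruption model

def edge_cost(u, v, disrupted):
    return edges[u][v] + (EXTRA if disrupted and (u, v) == DISRUPTED_EDGE else 0.0)

def expected_edge_cost(u, v, disrupted):
    # one-step expected disruption level, used inside the deterministic lookahead
    p_next = (1 - P_OFF) if disrupted else P_ON
    return edges[u][v] + p_next * (EXTRA if (u, v) == DISRUPTED_EDGE else 0.0)

def det_shortest(src, disrupted):
    """Dijkstra on expected costs: the deterministic-lookahead cost-to-go approximation."""
    dist, pq = {src: 0.0}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v in edges[u]:
            nd = d + expected_edge_cost(u, v, disrupted)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def route(seed=0):
    random.seed(seed)
    node, disrupted, total = "s", False, 0.0
    while node != "t":
        # greedy step: realised arc cost + approximate cost-to-go from the neighbour
        v = min(edges[node], key=lambda w: edge_cost(node, w, disrupted)
                + det_shortest(w, disrupted).get("t", float("inf")))
        total += edge_cost(node, v, disrupted)
        node = v
        disrupted = random.random() < (1 - P_OFF if disrupted else P_ON)  # disruption transition
    return total

print("realised travel time:", route())
```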

    Policy Search: Any Local Optimum Enjoys a Global Performance Guarantee

    Local Policy Search is a popular reinforcement learning approach for handling large state spaces. Formally, it searches locally in a parameterized policy space in order to maximize the associated value function averaged over some predefined distribution. It is probably commonly believed that the best one can hope for in general from such an approach is to get a local optimum of this criterion. In this article, we show the following surprising result: any (approximate) local optimum enjoys a global performance guarantee. We compare this guarantee with the one satisfied by Direct Policy Iteration, an approximate dynamic programming algorithm that does some form of Policy Search: while the approximation error of Local Policy Search may generally be bigger (because local search requires considering a space of stochastic policies), we argue that the concentrability coefficient that appears in the performance bound is much nicer. Finally, we discuss several practical and theoretical consequences of our analysis.
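
    To make the setting concrete, the sketch below performs local policy search on a small random MDP: gradient ascent on the value of a stochastic (softmax) tabular policy averaged over a fixed start-state distribution, using the exact policy gradient. The MDP, the distribution rho, and the step size are illustrative assumptions.

```python
# Minimal sketch: local policy search as gradient ascent on J(theta) = sum_s rho(s) V^pi_theta(s).
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 10, 3, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
R = rng.uniform(0, 1, size=(S, A))
rho = np.full(S, 1.0 / S)                    # distribution the objective averages over

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)  # softmax policy pi[s, a]

def q_values(pi):
    P_pi = np.einsum("sa,sap->sp", pi, P)
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # exact policy evaluation
    return R + gamma * P @ V, V

theta = np.zeros((S, A))
for _ in range(500):
    pi = policy(theta)
    Q, V = q_values(pi)
    P_pi = np.einsum("sa,sap->sp", pi, P)
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)  # discounted visitation
    theta += 0.5 * d[:, None] * pi * (Q - V[:, None])                   # policy-gradient ascent step
print("J(theta) =", rho @ q_values(policy(theta))[1])
```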

    Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL

    Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach and a transformer architecture, showing competitive performance against several benchmarks. However, DT lacks stitching ability -- one of the critical abilities for offline RL to learn the optimal policy from sub-optimal trajectories. This issue becomes particularly significant when the offline dataset only contains sub-optimal trajectories. On the other hand, conventional RL approaches based on Dynamic Programming (such as Q-learning) do not have the same limitation; however, they suffer from unstable learning behaviours, especially when they rely on function approximation in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT) to address the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). QDT uses the Dynamic Programming results to relabel the return-to-go in the training data and then trains the DT with the relabelled data. Our approach efficiently exploits the benefits of the two approaches, with each compensating for the other's shortcomings, to achieve better performance. We demonstrate this empirically in both simple toy environments and the more complex D4RL benchmark, observing competitive performance gains.
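
    The sketch below illustrates the relabelling idea in a deliberately simplified tabular setting: Q-values are learned from the offline data by Q-learning, and each trajectory's return-to-go is then rewritten using those values before it would be fed to a Decision Transformer. The toy dataset and the particular backwards-max relabelling rule are my own simplifications, not necessarily the paper's exact procedure.

```python
# Minimal sketch: relabel return-to-go in offline trajectories with tabular Q-learning results.
import numpy as np

S, A, gamma = 5, 2, 0.99
# Offline trajectories: lists of (state, action, reward) tuples (illustrative data).
trajectories = [
    [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)],
    [(0, 0, 0.0), (3, 1, 0.0), (4, 0, 0.2)],
]

# 1) Dynamic-programming side: tabular Q-learning over the offline transitions.
Q = np.zeros((S, A))
for _ in range(200):
    for traj in trajectories:
        for t, (s, a, r) in enumerate(traj):
            boot = 0.0 if t + 1 == len(traj) else gamma * Q[traj[t + 1][0]].max()
            Q[s, a] += 0.1 * (r + boot - Q[s, a])

# 2) Relabel return-to-go: walk each trajectory backwards, keeping the larger of the
#    observed return-to-go and the learned value estimate at that state.
def relabel(traj):
    rtg, out = 0.0, []
    for s, a, r in reversed(traj):
        rtg = r + gamma * rtg
        rtg = max(rtg, Q[s].max())          # inject the DP estimate where it is larger
        out.append((s, a, r, rtg))
    return list(reversed(out))

for traj in trajectories:
    print(relabel(traj))                    # (state, action, reward, relabelled return-to-go)
```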

    Approximation Benefits of Policy Gradient Methods with Aggregated States

    Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state aggregation, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows that a policy gradient method converges to a policy whose per-period regret is bounded by ε, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as ε/(1−γ), where γ is the discount factor. Theoretical results synthesize recent analysis of policy gradient methods with insights of Van Roy (2006) into the critical role of state-relevance weights in approximate dynamic programming.
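
    The sketch below illustrates the aggregated-state setting: the policy is a softmax over parameters indexed by partition, so it is held constant within each partition, and gradient ascent weights each state by its discounted visitation under a state-relevance distribution. The MDP, the partition map, and the step size are illustrative assumptions.

```python
# Minimal sketch: policy gradient with a state-aggregated softmax policy.
import numpy as np

rng = np.random.default_rng(2)
S, A, K, gamma = 12, 3, 4, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
R = rng.uniform(0, 1, size=(S, A))
phi = np.arange(S) % K                       # partition (aggregated state) of each state
rho = np.full(S, 1.0 / S)                    # state-relevance / start distribution

def policy(theta):                           # pi[s, a] depends on s only through phi[s]
    logits = theta[phi]
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(pi):
    P_pi = np.einsum("sa,sap->sp", pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, (pi * R).sum(axis=1))
    return R + gamma * P @ V, V, P_pi

theta = np.zeros((K, A))
for _ in range(800):
    pi = policy(theta)
    Q, V, P_pi = evaluate(pi)
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)                      # discounted visitation
    g_s = d[:, None] * pi * (Q - V[:, None])                                  # per-state gradient
    theta += 0.5 * np.array([g_s[phi == k].sum(axis=0) for k in range(K)])    # pool per partition
print("average value under rho:", rho @ evaluate(policy(theta))[1])
```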