    On the Convergence of Techniques that Improve Value Iteration

    Prioritisation of Bellman backups and updating only a small subset of actions are important techniques for speeding up planning in MDPs. Recent literature has introduced efficient approaches that exploit these directions: backward value iteration and backing up only the best actions were shown to reduce planning time significantly. This paper conducts a theoretical and empirical analysis of these techniques and presents new proofs. In particular, it (1) identifies weaker requirements for the convergence of backups based on best actions only, (2) presents a new method for evaluating the Bellman error for the update that updates one best action once, (3) presents the theoretical proof of backward value iteration and establishes the required initialisation, and (4) shows that the default state ordering of backups in standard value iteration can significantly influence its performance. Additionally, (5) the existing literature did not compare these methods, either empirically or analytically, against policy iteration. The rigorous empirical and novel theoretical parts of the paper reveal important associations and allow drawing guidelines on which type of value or policy iteration is suitable for a given domain. Finally, our chief message is that standard value iteration can be made far more efficient by simple modifications shown in the paper.
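The effect of state ordering mentioned in point (4) can be illustrated with a minimal sketch. It is not taken from the paper: the chain MDP, the function name `value_iteration`, the discount factor, and the pessimistic initial values are all assumptions chosen to keep the example small. It runs in-place (Gauss-Seidel) value iteration and counts how many sweeps are needed before the Bellman error falls below a threshold.

```python
def value_iteration(n_states, order, gamma=0.95, eps=1e-9):
    """In-place value iteration on a hypothetical deterministic chain MDP.

    States are 0..n_states-1 and the last state is an absorbing goal with
    value 0. Two actions (step left, step right), each with reward -1.
    `order` is the sequence in which states are backed up in every sweep.
    Returns (values, number_of_sweeps).
    """
    goal = n_states - 1
    # Assumed pessimistic initialisation: the value of never reaching the goal.
    V = [-1.0 / (1.0 - gamma)] * goal + [0.0]
    sweeps = 0
    while True:
        sweeps += 1
        bellman_error = 0.0
        for s in order:
            if s == goal:
                continue
            left, right = max(s - 1, 0), s + 1
            # Bellman backup: best one-step lookahead over both actions.
            best = max(-1.0 + gamma * V[left], -1.0 + gamma * V[right])
            bellman_error = max(bellman_error, abs(best - V[s]))
            V[s] = best  # in-place update, visible to later backups in this sweep
        if bellman_error < eps:
            return V, sweeps


n = 20
_, forward_sweeps = value_iteration(n, order=range(n))
_, backward_sweeps = value_iteration(n, order=range(n - 1, -1, -1))
print(forward_sweeps, backward_sweeps)
```

In this toy setting, sweeping from the goal backwards propagates the goal value through the whole chain in one sweep, whereas sweeping in index order moves it only one state per sweep. The gap depends on the assumed initialisation and domain, so it should be read as an illustration rather than a general result.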

    Vision-based reinforcement learning using approximate policy iteration

    A major issue for reinforcement learning (RL) applied to robotics is the time required to learn a new skill. While RL has been used to learn mobile robot control in many simulated domains, applications involving learning on real robots are still relatively rare. In this paper, the Least-Squares Policy Iteration (LSPI) reinforcement learning algorithm and a new model-based algorithm, Least-Squares Policy Iteration with Prioritized Sweeping (LSPI+), are implemented on a mobile robot to acquire new skills quickly and efficiently. LSPI+ combines the benefits of LSPI and prioritized sweeping, which uses all previous experience to focus the computational effort on the most “interesting” or dynamic parts of the state space. The proposed algorithms are tested on a household vacuum cleaner robot for learning a docking task using vision as the only sensor modality. In experiments, these algorithms are compared to other model-based and model-free RL algorithms. The results show that the number of trials required to learn the docking task is significantly reduced using LSPI compared to the other RL algorithms investigated, and that LSPI+ further improves on the performance of LSPI.
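The prioritized sweeping idea referred to above can be sketched in a few lines. This is not LSPI+ and not the paper's implementation: it is plain tabular prioritized sweeping on a hypothetical chain whose model is assumed to be known rather than learned from experience, and the function name `prioritized_sweeping` and all parameters are illustrative. States are backed up in order of how much their value would change, and only the predecessors of a changed state are re-examined.

```python
import heapq


def prioritized_sweeping(n_states, gamma=0.95, theta=1e-9, max_backups=10_000):
    """Tabular prioritized sweeping on a hypothetical deterministic chain.

    States are 0..n_states-1 and the last state is an absorbing goal with
    value 0. Two actions (step left, step right), each with reward -1.
    Returns (values, number_of_backups).
    """
    goal = n_states - 1
    # Assumed pessimistic initialisation: the value of never reaching the goal.
    V = [-1.0 / (1.0 - gamma)] * goal + [0.0]

    def successors(s):
        return (max(s - 1, 0), s + 1)

    def backup_value(s):
        return max(-1.0 + gamma * V[t] for t in successors(s))

    def predecessors(s):
        # Non-goal states that can reach s in one step.
        candidates = (s - 1, s, s + 1)
        return [p for p in candidates if 0 <= p < goal and s in successors(p)]

    def push_if_changed(queue, p):
        # Priority is the size of the value change a backup would cause.
        error = abs(backup_value(p) - V[p])
        if error > theta:
            heapq.heappush(queue, (-error, p))  # negate: heapq is a min-heap

    queue = []
    for p in predecessors(goal):
        push_if_changed(queue, p)

    backups = 0
    while queue and backups < max_backups:
        _, s = heapq.heappop(queue)
        V[s] = backup_value(s)
        backups += 1
        for p in predecessors(s):
            push_if_changed(queue, p)
    return V, backups


values, backups = prioritized_sweeping(20)
print(backups)
```

On this chain every non-goal state is backed up exactly once, because the queue always holds the state whose value is about to change the most. In a learned or stochastic model the same state may be queued several times, so the backup count here reflects only the toy domain.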

    Adapting an MDP planner to time-dependency: case study on a UAV coordination problem

    To allow the temporal coordination of two independent communicating agents, one needs to be able to plan in a time-dependent environment. This paper deals with modeling and solving such problems through the use of Time-dependent Markov Decision Processes (TiMDPs). We provide an analysis of the TiMDP model and exploit its properties to introduce an improved asynchronous value iteration method. Our approach is evaluated on a UAV temporal coordination problem and on the well-known Mars rover domain.

    Reinforcement Learning: A Survey

    This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning. Comment: See http://www.jair.org/ for any accompanying file

    Perseus: Randomized Point-based Value Iteration for POMDPs

    Partially observable Markov decision processes (POMDPs) form an attractive and principled framework for agent planning under uncertainty. Point-based approximate techniques for POMDPs compute a policy based on a finite set of points collected in advance from the agent's belief space. We present a randomized point-based value iteration algorithm called Perseus. The algorithm performs approximate value backup stages, ensuring that in each backup stage the value of each point in the belief set is improved; the key observation is that a single backup may improve the value of many belief points. Contrary to other point-based methods, Perseus backs up only a (randomly selected) subset of points in the belief set, sufficient for improving the value of each belief point in the set. We show how the same idea can be extended to dealing with continuous action spaces. Experimental results show the potential of Perseus in large-scale POMDP problems.

    Prioritized memory access explains planning and hippocampal replay.

    To make decisions, animals must evaluate candidate choices by accessing memories of relevant experiences. Yet little is known about which experiences are considered or ignored during deliberation, which ultimately governs choice. We propose a normative theory predicting which memories should be accessed at each moment to optimize future decisions. Using nonlocal 'replay' of spatial locations in hippocampus as a window into memory access, we simulate a spatial navigation task in which an agent accesses memories of locations sequentially, ordered by utility: how much extra reward would be earned due to better choices. This prioritization balances two desiderata: the need to evaluate imminent choices versus the gain from propagating newly encountered information to preceding locations. Our theory offers a simple explanation for numerous findings about place cells; unifies seemingly disparate proposed functions of replay including planning, learning, and consolidation; and posits a mechanism whose dysfunction may underlie pathologies like rumination and craving.