
    Hierarchical Reinforcement Learning under Mixed Observability

    The framework of mixed observable Markov decision processes (MOMDP) models many robotic domains in which some state variables are fully observable while others are not. In this work, we identify a significant subclass of MOMDPs defined by how actions influence the fully observable components of the state and how those, in turn, influence the partially observable components and the rewards. This unique property allows for a two-level hierarchical approach we call HIerarchical Reinforcement Learning under Mixed Observability (HILMO), which restricts partial observability to the top level while the bottom level remains fully observable, enabling higher learning efficiency. The top level produces desired goals to be reached by the bottom level until the task is solved. We further develop theoretical guarantees to show that our approach can achieve optimal and quasi-optimal behavior under mild assumptions. Empirical results on long-horizon continuous control tasks demonstrate the efficacy and efficiency of our approach in terms of improved success rate, sample efficiency, and wall-clock training time. We also deploy policies learned in simulation on a real robot. Comment: Accepted at the 15th International Workshop on the Algorithmic Foundations of Robotics (WAFR) 2022, University of Maryland, College Park. The first two authors contributed equally.
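
    As a rough illustration of the two-level decomposition this abstract describes (a top level that reasons over the partially observable variables and proposes goals, and a fully observable bottom level that reaches them), the toy sketch below uses a one-dimensional corridor in which the agent's position is observable but the rewarding end is hidden. The environment, policies, and names here are invented for illustration and are not the authors' HILMO implementation.

```python
# Toy sketch of a two-level control loop under mixed observability.
# Hypothetical example; not the HILMO code from the paper.
import random

class ToyMixedObsEnv:
    """1-D corridor: position is fully observable, the rewarding end is hidden
    and only revealed when the agent stands at either end."""
    def __init__(self, length=10):
        self.length = length

    def reset(self):
        self.pos = self.length // 2
        self.hidden_goal = random.choice([0, self.length - 1])
        return self.pos, self._partial_obs()

    def _partial_obs(self):
        if self.pos in (0, self.length - 1):
            return 'left' if self.hidden_goal == 0 else 'right'
        return 'unknown'

    def step(self, action):  # action is -1 or +1
        self.pos = min(max(self.pos + action, 0), self.length - 1)
        done = self.pos == self.hidden_goal
        reward = 1.0 if done else -0.01
        return (self.pos, self._partial_obs()), reward, done

def top_level_goal(history, length):
    """Top level: uses the partially observable history to pick a target position."""
    if 'left' in history:
        return 0
    if 'right' in history:
        return length - 1
    return random.choice([0, length - 1])  # explore an end to gather information

def low_level_action(pos, goal):
    """Bottom level: fully observable greedy controller toward the current goal."""
    return -1 if goal < pos else 1

env = ToyMixedObsEnv()
pos, partial = env.reset()
history, done, ret = [partial], False, 0.0
while not done:
    goal = top_level_goal(history, env.length)       # top level sets a subgoal
    while not done and pos != goal:                  # bottom level pursues it
        (pos, partial), r, done = env.step(low_level_action(pos, goal))
        history.append(partial)
        ret += r
print('episode return:', round(ret, 2))
```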

    Accelerating decision making under partial observability using learned action priors

    Thesis (M.Sc.)--University of the Witwatersrand, Faculty of Science, School of Computer Science and Applied Mathematics, 2017. Partially Observable Markov Decision Processes (POMDPs) provide a principled mathematical framework allowing a robot to reason about the consequences of actions and observations with respect to the agent's limited perception of its environment. They allow an agent to plan and act optimally in uncertain environments. Although they have been successfully applied to various robotic tasks, they are infamous for their high computational cost. This thesis demonstrates the use of knowledge transfer, learned from previous experiences, to accelerate the learning of POMDP tasks. We propose that in order for an agent to learn to solve these tasks more quickly, it must be able to generalise from past behaviours and transfer knowledge, learned from solving multiple tasks, between different circumstances. We present a method for accelerating this learning process by learning the statistics of action choices over the lifetime of an agent, known as action priors. Action priors specify the usefulness of actions in particular situations and allow us to bias exploration, which in turn improves the performance of the learning process. Using navigation domains, we study the degree to which transferring knowledge between tasks in this way results in a considerable speed-up in solution times. This thesis therefore makes the following contributions. We provide an algorithm for learning action priors from a set of approximately optimal value functions, and two approaches with which prior knowledge over actions can be used in a POMDP context. As such, we show that considerable gains in speed can be achieved by learning subsequent tasks using prior knowledge rather than learning from scratch. Learning with action priors can be particularly useful in reducing the cost of exploration in the early stages of the learning process, as the priors act as a mechanism that allows the agent to select more useful actions in particular circumstances. Thus, we demonstrate how the initial losses associated with unguided exploration can be alleviated through the use of action priors, which allow for safer exploration. Additionally, we illustrate that action priors also reduce the computation time needed to learn feasible policies.
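
    The core mechanism described here, counting which actions were useful in which situations across previously solved tasks and using those counts to bias exploration on a new task, can be sketched as follows. The class and function names, and the Dirichlet-style pseudo-count, are illustrative assumptions rather than the thesis implementation.

```python
# Sketch of learned action priors used to bias exploration (illustrative only).
import random
from collections import defaultdict

class ActionPrior:
    def __init__(self, actions, alpha=1.0):
        self.actions = actions
        self.alpha = alpha                                   # Dirichlet pseudo-count
        self.counts = defaultdict(lambda: defaultdict(float))

    def update_from_policy(self, policy):
        """policy: dict mapping observation -> greedy action of a solved task."""
        for obs, action in policy.items():
            self.counts[obs][action] += 1.0

    def sample_exploratory_action(self, obs):
        """Sample exploration actions in proportion to how often each action
        was chosen in this situation across earlier tasks."""
        weights = [self.counts[obs][a] + self.alpha for a in self.actions]
        return random.choices(self.actions, weights=weights, k=1)[0]

def epsilon_greedy_with_prior(q_values, obs, prior, epsilon=0.1):
    """Epsilon-greedy selection whose exploratory branch is guided by the
    action prior instead of a uniform distribution.
    q_values[obs] is assumed to be a dict mapping action -> value."""
    if random.random() < epsilon:
        return prior.sample_exploratory_action(obs)
    return max(q_values[obs], key=q_values[obs].get)         # greedy w.r.t. Q
```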

    Reinforcement Learning in Robotic Task Domains with Deictic Descriptor Representation

    In the field of reinforcement learning, robot task learning in a specific environment with a Markov decision process backdrop has seen much success. But extending these results to learning a task for an environment domain has not been as fruitful, even for advanced methodologies such as relational reinforcement learning. In our research into robot learning in environment domains, we utilize a form of deictic representation for the robot's description of the task environment. However, the non-Markovian nature of the deictic representation leads to perceptual aliasing and conflicting actions, invalidating standard reinforcement learning algorithms. To circumvent this difficulty, several past studies have modified and extended the Q-learning algorithm to the deictic representation case, with mixed results. Taking a different tack, we introduce a learning algorithm that searches deictic policy space directly, abandoning indirect value-based methods. We apply the policy learning algorithm to several different tasks in environment domains. The results compare favorably with value-based learners and existing results in the literature.
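
    Since the learner described here searches deictic policy space directly rather than estimating values, a generic form of such a search is sketched below: stochastic hill-climbing over a deterministic observation-to-action table scored by Monte Carlo rollouts. The environment interface and the hill-climbing scheme are assumptions for illustration; the paper's actual search procedure may differ.

```python
# Generic direct policy search over a reactive (observation -> action) table.
# Assumes an env with reset() -> obs and step(action) -> (obs, reward, done).
import random

def evaluate(policy, env, episodes=20, horizon=100):
    """Average return of a deterministic reactive policy over sampled rollouts."""
    total = 0.0
    for _ in range(episodes):
        obs = env.reset()
        for _ in range(horizon):
            obs, reward, done = env.step(policy[obs])
            total += reward
            if done:
                break
    return total / episodes

def hill_climb(env, observations, actions, iters=500):
    """Start from a random policy table and keep single-entry changes that
    do not decrease the estimated return."""
    policy = {o: random.choice(actions) for o in observations}
    best = evaluate(policy, env)
    for _ in range(iters):
        o = random.choice(observations)                 # perturb one entry
        old = policy[o]
        policy[o] = random.choice([a for a in actions if a != old])
        score = evaluate(policy, env)
        if score >= best:
            best = score                                # keep the improvement
        else:
            policy[o] = old                             # revert the change
    return policy, best
```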

    Hierarchical reinforcement learning for trading agents

    Autonomous software agents, the use of which has increased due to the recent growth in computer power, have considerably improved electronic commerce processes by facilitating automated trading actions between market participants (sellers, brokers and buyers). The rapidly changing market environments pose challenges to the performance of such agents, which are generally developed for specific market settings. To this end, this thesis is concerned with designing agents that can gradually adapt to variable, dynamic and uncertain markets and that are able to reuse the acquired trading skills in new markets. This thesis proposes the use of reinforcement learning techniques to develop adaptive trading agents and puts forward a novel software architecture based on the semi-Markov decision process and on an innovative knowledge transfer framework. To evaluate the proposed approach, the developed trading agents are tested in internationally well-known market simulations and their behaviours when buying and/or selling in the retail and wholesale markets are analysed. The proposed approach has been shown to improve the adaptation of the trading agent in a specific market as well as to enable the portability of its knowledge to new markets.
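
    The semi-Markov decision process mentioned here underlies learning with temporally extended actions (options): an option runs for several primitive steps, and its value update discounts by the option's duration. A minimal, generic version of that update is sketched below; the trading-specific architecture and the option names used in the example are assumptions, not the thesis code.

```python
# Generic SMDP-style Q-learning update for a temporally extended action.
from collections import defaultdict

def smdp_q_update(Q, state, option, rewards, next_state, options,
                  alpha=0.1, gamma=0.99):
    """Update Q[(state, option)] after the option ran for len(rewards)
    primitive steps and collected the listed per-step rewards."""
    tau = len(rewards)
    discounted_return = sum((gamma ** k) * r for k, r in enumerate(rewards))
    best_next = max(Q[(next_state, o)] for o in options)
    target = discounted_return + (gamma ** tau) * best_next
    Q[(state, option)] += alpha * (target - Q[(state, option)])

# Example with hypothetical trading options; names are placeholders.
Q = defaultdict(float)
options = ['negotiate_bulk_purchase', 'post_retail_offer', 'hold']
smdp_q_update(Q, state='low_inventory', option='negotiate_bulk_purchase',
              rewards=[-1.0, -1.0, 12.0], next_state='high_inventory',
              options=options)
print(Q[('low_inventory', 'negotiate_bulk_purchase')])
```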

    Learning in a State of Confusion: Employing active perception and reinforcement learning in partially observable worlds

    Institute of Perception, Action and Behaviour. In applying reinforcement learning to agents acting in the real world we are often faced with tasks that are non-Markovian in nature. Much work has been done using state estimation algorithms to try to uncover Markovian models of tasks in order to allow the learning of optimal solutions using reinforcement learning. Unfortunately, these algorithms, which attempt to simultaneously learn a Markov model of the world and how to act, have proved very brittle. Our focus differs. In considering embodied, embedded and situated agents we have a preference for simple learning algorithms which reliably learn satisficing policies. The learning algorithms we consider do not try to uncover the underlying Markovian states; instead they aim to learn successful deterministic reactive policies such that the agents' actions are based directly upon the observations provided by their sensors. Existing results have shown that such reactive policies can be arbitrarily worse than a policy that has access to the underlying Markov process, and that in some cases no satisficing reactive policy can exist. Our first contribution is to show that providing agents with alternative actions and viewpoints on the task, through the addition of active perception, can provide a practical solution in such circumstances. We demonstrate empirically that: (i) adding arbitrary active perception actions to agents which can only learn deterministic reactive policies can allow the learning of satisficing policies where none were originally possible; (ii) active perception actions allow the learning of better satisficing policies than those that existed previously; and (iii) our approach converges more reliably to satisficing solutions than existing state estimation algorithms such as U-Tree and the Lion Algorithm. Our other contributions focus on issues which affect the reliability with which deterministic reactive satisficing policies can be learnt in non-Markovian environments. We show that greedy action selection may be a necessary condition for the existence of stable deterministic reactive policies on partially observable Markov decision processes (POMDPs). We also set out the concept of Consistent Exploration: the idea of estimating state-action values by acting as though the policy has been changed to incorporate the action being explored. We demonstrate that this concept can be used to develop better algorithms for learning reactive policies for POMDPs by presenting a new reinforcement learning algorithm, the Consistent Exploration Q(λ) algorithm (CEQ(λ)). We demonstrate on a significant number of problems that CEQ(λ) is more reliable at learning satisficing solutions than the algorithm currently regarded as the best for learning deterministic reactive policies, SARSA(λ).
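
    The Consistent Exploration idea, estimating state-action values by acting as though the policy had been changed to include the explored action, can be illustrated with the action-selection wrapper below. It shows only the selection rule (committing to an exploratory choice whenever the same observation recurs within the episode); the full CEQ(λ) algorithm, including eligibility traces, is not reproduced here, and the class and parameter names are invented for illustration.

```python
# Illustrative action-selection rule in the spirit of Consistent Exploration.
import random

class ConsistentExplorer:
    def __init__(self, actions, epsilon=0.1):
        self.actions = actions
        self.epsilon = epsilon
        self.episode_overrides = {}     # observation -> explored action, this episode

    def start_episode(self):
        self.episode_overrides.clear()

    def select(self, obs, q_values):
        """q_values is assumed to be a dict keyed by (observation, action)."""
        if obs in self.episode_overrides:              # stay consistent with
            return self.episode_overrides[obs]         # the explored choice
        if random.random() < self.epsilon:
            a = random.choice(self.actions)
            self.episode_overrides[obs] = a            # commit for the episode
            return a
        return max(self.actions, key=lambda a: q_values[(obs, a)])
```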

    Reinforcement learning approaches to the analysis of the emergence of goal-directed behaviour

    Over recent decades, theoretical neuroscience, helped by computational methods such as Reinforcement Learning (RL), has provided detailed descriptions of the psychology and neurobiology of decision-making. RL has provided many insights into the mechanisms underlying decision-making processes from neuronal to behavioral levels. In this work, we attempt to demonstrate the effectiveness of RL methods in explaining behavior in a normative setting through three main case studies. Evidence from the literature shows that, apart from the commonly discussed cognitive search process that governs the solution procedure of a planning task, there is an online perceptual process that directs action selection towards moves that appear more 'natural' in a given configuration of the task. These two processes can be partially dissociated through developmental studies, with perceptual processes apparently more dominant in the planning of younger children, prior to the maturation of the executive functions required for the control of search. We therefore present a formalization of planning processes that accounts for perceptual features of the task, and relate it to human data. Although young children are able to demonstrate their preferences by using physical actions, infants are restricted because of their as-yet-undeveloped motor skills. Eye-tracking methods have been employed to tackle this difficulty. In a second case study, exploring different model-free RL algorithms and their possible cognitive realizations in decision making, we demonstrate behavioral signatures of decision-making processes in eye-movement data and provide a potential framework for integrating eye-movement patterns with behavioral patterns. Finally, in a third project, we examine how uncertainty in choices might guide exploration in 10-year-olds, using an abstract RL-based mathematical model. Throughout, aspects of action selection are seen as emerging from the RL computational framework. We thus conclude that computational descriptions of the developing decision-making functions provide one plausible avenue by which to normatively characterize and define the functions that control action selection.
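
    One common way models of this kind formalize uncertainty-guided exploration is to add an uncertainty bonus to each option's value inside a softmax choice rule, so that equally valued but less certain options are sampled more often. The sketch below shows that generic formulation; the parameters and functional form are assumptions, not the specific model fitted in the thesis.

```python
# Generic softmax choice rule with an uncertainty bonus (illustrative only).
import math

def choice_probabilities(values, uncertainties, beta=3.0, phi=1.0):
    """Softmax over Q + phi * uncertainty; beta is the inverse temperature."""
    scores = [beta * (q + phi * u) for q, u in zip(values, uncertainties)]
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Two options with equal value but different uncertainty: the less certain
# option receives the higher choice probability.
print(choice_probabilities([0.5, 0.5], [0.1, 0.4]))
```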