Hierarchical Reinforcement Learning under Mixed Observability
The framework of mixed observable Markov decision processes (MOMDP) models
many robotic domains in which some state variables are fully observable while
others are not. In this work, we identify a significant subclass of MOMDPs
defined by how actions influence the fully observable components of the state
and how those, in turn, influence the partially observable components and the
rewards. This unique property allows for a two-level hierarchical approach we
call HIerarchical Reinforcement Learning under Mixed Observability (HILMO),
which restricts partial observability to the top level while the bottom level
remains fully observable, enabling higher learning efficiency. The top level
produces desired goals to be reached by the bottom level until the task is
solved. We further develop theoretical guarantees to show that our approach can
achieve optimal and quasi-optimal behavior under mild assumptions. Empirical
results on long-horizon continuous control tasks demonstrate the efficacy and
efficiency of our approach in terms of improved success rate, sample
efficiency, and wall-clock training time. We also deploy policies learned in
simulation on a real robot.
Comment: Accepted at the 15th International Workshop on the Algorithmic
Foundations of Robotics (WAFR) 2022, University of Maryland, College Park.
The first two authors contributed equally.
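The two-level loop described in this abstract can be roughly illustrated as follows. This is a minimal sketch under my own assumptions (a 1-D chain world, a simple belief over a hidden target, and toy policies), not the paper's implementation: a top-level policy reasons under partial observability and proposes goals, while a fully observable bottom-level controller reaches them.

```python
# Hypothetical sketch of a HILMO-style two-level loop; the 1-D chain world,
# belief update, and policies below are illustrative assumptions only.

def top_level_policy(belief):
    """Top level (partially observable): pick the most likely hidden target as a goal."""
    return max(belief, key=belief.get)

def bottom_level_policy(state, goal):
    """Bottom level (fully observable): step toward the goal on a 1-D line."""
    return 1 if goal > state else -1 if goal < state else 0

def run_episode(target=7, horizon=30):
    state = 0
    belief = {3: 0.2, 7: 0.5, 9: 0.3}          # belief over the hidden target
    for _ in range(horizon):
        goal = top_level_policy(belief)         # top level proposes a subgoal
        while state != goal:                    # bottom level reaches it
            state += bottom_level_policy(state, goal)
        if state == target:                     # success: hidden target found
            return state
        belief = {k: (0.0 if k == goal else v) for k, v in belief.items()}
        z = sum(belief.values()) or 1.0         # rule out the failed goal
        belief = {k: v / z for k, v in belief.items()}
    return state
```

The point of the hierarchy is that only the goal-selection level has to cope with partial observability; the goal-reaching level faces a fully observable problem, which is where the learning-efficiency gain comes from.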
Accelerating decision making under partial observability using learned action priors
Thesis (M.Sc.)--University of the Witwatersrand, Faculty of Science, School of Computer Science and Applied Mathematics, 2017.
Partially Observable Markov Decision Processes (POMDPs) provide a principled mathematical
framework allowing a robot to reason about the consequences of actions and
observations with respect to the agent's limited perception of its environment. They
allow an agent to plan and act optimally in uncertain environments. Although they
have been successfully applied to various robotic tasks, they are infamous for their high
computational cost. This thesis demonstrates the use of knowledge transfer, learned
from previous experiences, to accelerate the learning of POMDP tasks. We propose
that in order for an agent to learn to solve these tasks more quickly, it must be able to generalise
from past behaviours and transfer knowledge, learned from solving multiple tasks,
between different circumstances. We present a method for accelerating this learning
process by learning the statistics of action choices over the lifetime of an agent, known
as action priors. Action priors specify the usefulness of actions in situations and allow
us to bias exploration, which in turn improves the performance of the learning process.
Using navigation domains, we study how transferring knowledge
between tasks in this way speeds up solution times.
This thesis therefore makes the following contributions. We provide an algorithm
for learning action priors from a set of approximately optimal value functions and two
approaches by which prior knowledge over actions can be used in a POMDP context.
As such, we show that considerable gains in speed can be achieved in learning subsequent
tasks using prior knowledge rather than learning from scratch. Learning with
action priors can be particularly useful in reducing the cost of exploration in the early
stages of the learning process, as the priors act as a mechanism that allows the agent
to select more useful actions in particular circumstances. Thus, we demonstrate how
the initial losses associated with unguided exploration can be alleviated through the
use of action priors, which allow for safer exploration. Additionally, we illustrate that
action priors can reduce the computation time needed to learn feasible policies.
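The core idea of action priors described above might be sketched as follows. This is an illustrative reading, not the thesis's exact algorithm: count how often each action was chosen across previously solved tasks, normalize the (smoothed) counts into a prior, and sample exploratory actions in proportion to that prior. The toy policies and action names are my assumptions.

```python
import random
from collections import Counter

# Illustrative sketch (not the thesis's exact algorithm): learn action
# priors as smoothed counts of action choices across previously solved
# tasks, then use them to bias exploratory action selection in a new task.

def learn_action_priors(solved_policies, actions, alpha=1.0):
    """Count how often each action was chosen across past (near-)optimal policies."""
    counts = Counter({a: alpha for a in actions})   # Dirichlet-style smoothing
    for policy in solved_policies:                  # policy: state -> action
        counts.update(policy.values())
    total = sum(counts.values())
    return {a: counts[a] / total for a in actions}

def biased_explore(priors, rng=random.Random(0)):
    """Sample an exploratory action in proportion to its learned prior."""
    acts, weights = zip(*priors.items())
    return rng.choices(acts, weights=weights, k=1)[0]

actions = ["up", "down", "left", "right"]
past = [{0: "up", 1: "up", 2: "right"},
        {0: "up", 1: "right", 2: "right"}]
priors = learn_action_priors(past, actions)   # "up" and "right" dominate
```

Replacing uniform random exploration with `biased_explore` is what "biasing exploration" means here: actions that were useful in past tasks are tried first, which is where the early-learning savings come from.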
Reinforcement Learning in Robotic Task Domains with Deictic Descriptor Representation
In the field of reinforcement learning, robot task learning in a specific environment with a Markov decision process backdrop has seen much success. However, extending these results to learning a task across an environment domain has not been as fruitful, even for advanced methodologies such as relational reinforcement learning. In our research into robot learning in environment domains, we utilize a form of deictic representation for the robot's description of the task environment. However, the non-Markovian nature of the deictic representation leads to perceptual aliasing and conflicting actions, invalidating standard reinforcement learning algorithms. To circumvent this difficulty, several past research studies have modified and extended the Q-learning algorithm to the deictic representation case, with mixed results. Taking a different tack, we introduce a learning algorithm which searches deictic policy space directly, abandoning the indirect value-based methods. We apply the policy learning algorithm to several different tasks in environment domains. The results compare favorably with value-based learners and existing literature results.
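Direct policy-space search of the kind this abstract describes can be sketched as simple local search over deterministic observation-to-action mappings, scored by rollout return rather than a value function. The toy chain environment and aliased observations below are my assumptions for illustration only, not the paper's task domains.

```python
import random

# Illustrative sketch of searching policy space directly: local search over
# deterministic obs -> action policies scored by rollout return, with no
# value function. The toy chain and aliased observations are assumptions.

def evaluate(policy, episodes=5, horizon=20):
    """Average return of a deterministic obs -> action policy on a toy chain."""
    total = 0.0
    for _ in range(episodes):
        state = 0
        for _ in range(horizon):
            obs = state % 2                # aliased observation (non-Markovian)
            state = min(state + policy[obs], 4)
            if state == 4:                 # goal reached
                total += 1.0
                break
    return total / episodes

def hill_climb(n_obs=2, actions=(0, 1), iters=50, seed=2):
    """Mutate one obs -> action mapping at a time; keep changes that don't hurt."""
    rng = random.Random(seed)
    policy = {o: rng.choice(actions) for o in range(n_obs)}
    best = evaluate(policy)
    for _ in range(iters):
        cand = dict(policy)
        cand[rng.randrange(n_obs)] = rng.choice(actions)
        score = evaluate(cand)
        if score >= best:
            policy, best = cand, score
    return policy, best

policy, score = hill_climb()
```

Because the search only ever compares whole-policy returns, perceptual aliasing causes no conflicting value updates, which is the appeal of abandoning indirect value-based methods here.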
Hierarchical reinforcement learning for trading agents
Autonomous software agents, the use of which has increased due to the recent growth in computer power, have considerably improved electronic commerce processes by facilitating automated trading actions between market participants (sellers, brokers and buyers). The rapidly changing market environments pose challenges to the performance of such agents, which are generally developed for specific market settings. To this end, this thesis is concerned with designing agents that can gradually adapt to variable, dynamic and uncertain markets and that are able to reuse the acquired trading skills in new markets. This thesis proposes the use of reinforcement learning techniques to develop adaptive trading agents and puts forward a novel software architecture based on the semi-Markov decision process and on an innovative knowledge transfer framework. To evaluate my approach, the developed trading agents are tested in internationally well-known market simulations and their behaviours when buying and/or selling in the retail and wholesale markets are analysed. The proposed approach has been shown to improve the adaptation of the trading agent in a specific market as well as to enable the portability of its knowledge to new markets.
Learning in a State of Confusion: Employing active perception and reinforcement learning in partially observable worlds
Institute of Perception, Action and Behaviour
In applying reinforcement learning to agents acting in the real world we are often faced
with tasks that are non-Markovian in nature. Much work has been done using state estimation
algorithms to try to uncover Markovian models of tasks in order to allow the
learning of optimal solutions using reinforcement learning. Unfortunately, these algorithms,
which attempt to simultaneously learn a Markov model of the world and how
to act, have proved very brittle. Our focus differs. In considering embodied, embedded
and situated agents we have a preference for simple learning algorithms which reliably
learn satisficing policies. The learning algorithms we consider do not try to uncover the
underlying Markovian states, instead they aim to learn successful deterministic reactive
policies such that an agent's actions are based directly upon the observations provided
by their sensors.
Existing results have shown that such reactive policies can be arbitrarily worse than a
policy that has access to the underlying Markov process and in some cases no satisficing
reactive policy can exist. Our first contribution is to show that providing agents
with alternative actions and viewpoints on the task through the addition of active perception
can provide a practical solution in such circumstances. We demonstrate empirically
that: (i) adding arbitrary active perception actions to agents which can only
learn deterministic reactive policies can allow the learning of satisficing policies where
none were originally possible; (ii) active perception actions allow the learning of better
satisficing policies than those that existed previously; and (iii) our approach converges
more reliably to satisficing solutions than existing state estimation algorithms such as
U-Tree and the Lion Algorithm.
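The mechanism in contribution (i) above, adding a perceptual action that changes the viewpoint without changing the world, can be illustrated with a toy example. The two-state world, observation names, and policy below are my assumptions, not the thesis's domains; the point is only how an active perception action lets a deterministic reactive policy succeed where aliasing would otherwise defeat it.

```python
# Toy illustration (assumptions mine, not the thesis's domains) of how an
# active perception action can make a satisficing reactive policy possible.
# Two world states alias to the same observation; a "look" action changes
# the viewpoint, yielding distinct observations that a deterministic
# reactive policy can then map to the correct physical action.

def observe(state, viewpoint):
    """Hidden state is "A" or "B"; the default viewpoint cannot tell them apart."""
    if viewpoint == "default":
        return "ambiguous"                        # both states look the same
    return "sawA" if state == "A" else "sawB"     # active view disambiguates

# Deterministic reactive policy over observations, including a perception action.
reactive_policy = {
    "ambiguous": "look",    # perceptual action: switch viewpoint
    "sawA": "left",         # correct physical action in state A
    "sawB": "right",        # correct physical action in state B
}

def act(state):
    viewpoint = "default"
    for _ in range(3):                # tiny reactive control loop
        obs = observe(state, viewpoint)
        action = reactive_policy[obs]
        if action == "look":
            viewpoint = "active"      # perception changes the observation, not the world
        else:
            return action
```

Without the `look` action no single mapping from "ambiguous" to a physical action is correct in both states; with it, a plain observation-to-action table suffices.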
Our other contributions focus on issues which affect the reliability with which deterministic
reactive satisficing policies can be learnt in non-Markovian environments. We
show that greedy action selection may be a necessary condition for the existence
of stable deterministic reactive policies on partially observable Markov decision processes
(POMDPs). We also set out the concept of Consistent Exploration. This is the
idea of estimating state-action values by acting as though the policy has been changed
to incorporate the action being explored. We demonstrate that this concept can be used
to develop better algorithms for learning reactive policies to POMDPs by presenting
a new reinforcement learning algorithm: the Consistent Exploration Q(λ) algorithm
(CEQ(λ)). We demonstrate on a significant number of problems that CEQ(λ) is more
reliable at learning satisficing solutions than the algorithm currently regarded as the
best for learning deterministic reactive policies, that of SARSA(λ).
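The Consistent Exploration idea described above can be sketched in tabular form. This is a hedged one-step illustration under my own assumptions (a toy chain environment and a simple Q update), not the thesis's CEQ(λ) algorithm with eligibility traces: when an exploratory action is tried in an observation, the agent acts for the rest of the episode as though its deterministic reactive policy had been changed to include that action, instead of reverting to the old policy.

```python
import random
from collections import defaultdict

# Hedged sketch of Consistent Exploration: an explored action is committed
# into an episode-local copy of the policy, so value estimates reflect a
# policy the agent actually followed. Toy chain world; one-step Q update,
# not the thesis's CEQ(lambda) with eligibility traces.

def env_step(obs, action):
    """Toy chain: action 1 moves right, 0 stays; reward on reaching obs 3."""
    nxt = min(obs + action, 3)
    return nxt, (1.0 if nxt == 3 else 0.0), nxt == 3

def ce_episode(policy, q, epsilon=0.1, alpha=0.1, gamma=0.9,
               horizon=50, rng=random.Random(0)):
    trial = dict(policy)                     # episode-local policy copy
    obs, done, t = 0, False, 0
    while not done and t < horizon:
        if rng.random() < epsilon:
            trial[obs] = rng.choice([0, 1])  # commit the explored change
        action = trial[obs]                  # act consistently with it
        nxt, reward, done = env_step(obs, action)
        target = reward + (0.0 if done else gamma * q[(nxt, trial[nxt])])
        q[(obs, action)] += alpha * (target - q[(obs, action)])
        obs, t = nxt, t + 1
    return q

q = defaultdict(float)
always_right = {o: 1 for o in range(4)}
ce_episode(always_right, q, epsilon=0.0)     # deterministic pass: 0 -> 1 -> 2 -> 3
```

The contrast with ordinary epsilon-greedy SARSA-style updates is that here the bootstrap target uses the trial policy's own next action, so exploration never mixes values from two different policies.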
Reinforcement learning approaches to the analysis of the emergence of goal-directed behaviour
Over recent decades, theoretical neuroscience, helped by computational methods
such as Reinforcement Learning (RL), has provided detailed descriptions of the
psychology and neurobiology of decision-making. RL has provided many insights
into the mechanisms underlying decision-making processes from neuronal to behavioral
levels. In this work, we attempt to demonstrate the effectiveness of RL
methods in explaining behavior in a normative setting through three main case
studies.
Evidence from the literature shows that, apart from the commonly discussed cognitive
search process that governs the solution procedure of a planning task, there
is an online perceptual process that directs action selection towards moves that
appear more "natural" at a given configuration of a task. These two processes can
be partially dissociated through developmental studies, with perceptual processes
apparently more dominant in the planning of younger children, prior to the maturation
of executive functions required for the control of search. Therefore, we
present a formalization of planning processes to account for perceptual features of
the task, and relate it to human data.
Although young children are able to demonstrate their preferences by using
physical actions, infants are restricted because of their as-yet-undeveloped motor
skills. Eye-tracking methods have been employed to tackle this difficulty. Exploring
different model-free RL algorithms and their possible cognitive realizations in
decision making, in a second case study, we demonstrate behavioral signatures of
decision making processes in eye-movement data and provide a potential framework
for integrating eye-movement patterns with behavioral patterns.
Finally, in a third project we examine how uncertainty in choices might guide exploration
in 10-year-olds, using an abstract RL-based mathematical model. Throughout,
aspects of action selection are seen as emerging from the RL computational
framework. We thus conclude that computational descriptions of the developing
decision-making functions provide one plausible avenue by which to normatively characterize and define the functions that control action selection.