Bayesian Policy Reuse
A long-lived autonomous agent should be able to respond online to novel
instances of tasks from a familiar domain. Acting online requires 'fast'
responses, in terms of rapid convergence, especially when the task instance has
a short duration, such as in applications involving interactions with humans.
These requirements can be problematic for many established methods for learning
to act. In domains where the agent knows that the task instance is drawn from a
family of related tasks, albeit without access to the label of any given
instance, it can choose to act through a process of policy reuse from a
library, rather than policy learning from scratch. In policy reuse, the agent
has prior knowledge of the class of tasks in the form of a library of policies
that were learnt from sample task instances during an offline training phase.
We formalise the problem of policy reuse, and present an algorithm for
efficiently responding to a novel task instance by reusing a policy from the
library of existing policies, where the choice is based on observed 'signals'
which correlate with policy performance. We achieve this by posing the problem as
a Bayesian choice problem with a corresponding notion of an optimal response,
but the computation of that response is in many cases intractable. Therefore,
to reduce the computation cost of the posterior, we follow a Bayesian
optimisation approach and define a set of policy selection functions, which
balance exploration in the policy library against exploitation of previously
tried policies, together with a model of expected performance of the policy
library on their corresponding task instances. We validate our method in
several simulated domains of interactive, short-duration episodic tasks,
showing rapid convergence in unknown task variations.
Comment: 32 pages, submitted to the Machine Learning Journal
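To make the belief-update idea in this abstract concrete, here is a minimal sketch. It is not the authors' algorithm: the Gaussian observation model, the toy performance table, and every name (`gaussian_pdf`, `update_belief`, `select_policy`) are assumptions made for illustration. The abstract describes selection functions that balance exploration and exploitation, whereas this sketch uses the simplest greedy rule (highest expected performance under the current belief).

```python
import math


def gaussian_pdf(mean, std):
    """Return a density N(x; mean, std), used here as P(signal | task, policy)."""
    def pdf(x):
        z = (x - mean) / std
        return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))
    return pdf


def update_belief(belief, policy, signal, obs_model):
    """Bayes rule: posterior(task) is proportional to P(signal | task, policy) * prior(task)."""
    unnorm = {task: obs_model[(task, policy)](signal) * p
              for task, p in belief.items()}
    total = sum(unnorm.values())
    if total == 0.0:
        # Signal has zero likelihood under every task: keep the prior.
        return dict(belief)
    return {task: w / total for task, w in unnorm.items()}


def select_policy(belief, performance, policies):
    """Greedy choice: the policy with the highest expected performance under the belief."""
    def expected(pi):
        return sum(belief[t] * performance[(t, pi)] for t in belief)
    return max(policies, key=expected)


# Hypothetical toy library: two known task types and one policy learnt for each.
tasks = ["A", "B"]
policies = ["pi_A", "pi_B"]
performance = {("A", "pi_A"): 1.0, ("A", "pi_B"): 0.2,
               ("B", "pi_A"): 0.1, ("B", "pi_B"): 0.8}
# Assumed observation model: the signal is the return, Gaussian around the mean.
obs_model = {key: gaussian_pdf(mean, 0.3) for key, mean in performance.items()}

belief = {t: 1.0 / len(tasks) for t in tasks}        # uniform prior over task types
first = select_policy(belief, performance, policies)  # policy tried in episode 1
belief = update_belief(belief, first, 0.15, obs_model)  # a poor return was observed
second = select_policy(belief, performance, policies)   # policy tried in episode 2
```

In this toy run the low return observed under the first policy shifts the belief toward task type B, so the second episode switches policies without any learning from scratch.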
Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning
Bayesian policy reuse (BPR) is a general policy transfer framework for
selecting a source policy from an offline library by inferring the task belief
based on some observation signals and a trained observation model. In this
paper, we propose an improved BPR method to achieve more efficient policy
transfer in deep reinforcement learning (DRL). First, most BPR algorithms use
the episodic return as the observation signal, which contains limited information
and cannot be obtained until the end of an episode. Instead, we employ the
state transition sample, which is informative and instantaneous, as the
observation signal for faster and more accurate task inference. Second, BPR
algorithms usually require numerous samples to estimate the probability
distribution of the tabular-based observation model, which may be expensive and
even infeasible to learn and maintain, especially when using the state
transition sample as the signal. Hence, we propose a scalable observation model
based on fitting state transition functions of source tasks from only a small
number of samples, which can generalize to any signals observed in the target
task. Moreover, we extend the offline-mode BPR to the continual learning
setting by expanding the scalable observation model in a plug-and-play fashion,
which can avoid negative transfer when faced with new unknown tasks.
Experimental results show that our method can consistently facilitate faster
and more efficient policy transfer.
Comment: 16 pages, 6 figures, under review
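The idea of using a single state transition as the observation signal can be sketched as follows. This is an illustration under stated assumptions, not the paper's model: states and actions are scalars, each source task has a hypothetical fitted transition function, and the likelihood of an observed transition is assumed to be Gaussian in the prediction error with a shared noise level. All names are invented for the example.

```python
import math


def transition_likelihood(model, s, a, s_next, noise_std=0.1):
    """Unnormalised Gaussian likelihood of s_next given the model's prediction.

    The normalising constant is omitted because it is the same for every
    task (shared noise_std) and cancels when the belief is normalised.
    """
    err = s_next - model(s, a)
    return math.exp(-0.5 * (err / noise_std) ** 2)


def stepwise_belief_update(belief, models, transition, noise_std=0.1):
    """Update the task belief from one (s, a, s_next) sample, mid-episode."""
    s, a, s_next = transition
    unnorm = {task: belief[task]
              * transition_likelihood(models[task], s, a, s_next, noise_std)
              for task in belief}
    total = sum(unnorm.values())
    if total == 0.0:
        # No source model explains the sample at all: keep the prior.
        return dict(belief)
    return {task: w / total for task, w in unnorm.items()}


# Hypothetical fitted transition functions for two source tasks.
models = {
    "slow": lambda s, a: s + 0.5 * a,
    "fast": lambda s, a: s + 1.0 * a,
}

belief = {"slow": 0.5, "fast": 0.5}
# Two transitions observed in the target task (generated by "fast" dynamics).
for sample in [(0.0, 1.0, 1.0), (1.0, 1.0, 2.02)]:
    belief = stepwise_belief_update(belief, models, sample)
```

Because the update runs on every step, the belief can concentrate on the matching source task before the episode ends, which is the advantage the abstract claims over waiting for the episodic return.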
An Optimal Online Method of Selecting Source Policies for Reinforcement Learning
Transfer learning significantly accelerates the reinforcement learning
process by exploiting relevant knowledge from previous experiences. The problem
of optimally selecting source policies during the learning process is of great
importance yet challenging. There has been little theoretical analysis of this
problem. In this paper, we develop an optimal online method to select source
policies for reinforcement learning. This method formulates online source
policy selection as a multi-armed bandit problem and augments Q-learning with
policy reuse. We provide theoretical guarantees of the optimal selection
process and convergence to the optimal policy. In addition, we conduct
experiments on a grid-based robot navigation domain to demonstrate its
efficiency and robustness by comparing to the state-of-the-art transfer
learning method
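Treating source-policy selection as a multi-armed bandit can be sketched as below. The abstract does not say which bandit rule the method uses, so UCB1 is swapped in here as an assumption. The class name, the fixed per-policy returns, and the round count are hypothetical, and the coupling with Q-learning is left out.

```python
import math


class UCB1Selector:
    """Each arm is one source policy; the reward is the return from reusing it."""

    def __init__(self, n_policies):
        self.counts = [0] * n_policies    # times each policy was reused
        self.means = [0.0] * n_policies   # running mean return per policy

    def select(self):
        # Try every policy once before using the confidence bound.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        t = sum(self.counts)
        scores = [m + math.sqrt(2.0 * math.log(t) / c)
                  for m, c in zip(self.means, self.counts)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        self.counts[i] += 1
        # Incremental mean update.
        self.means[i] += (reward - self.means[i]) / self.counts[i]


# Hypothetical fixed returns (in [0, 1]) obtained by reusing each source policy.
returns = [0.1, 0.9, 0.4]
selector = UCB1Selector(len(returns))
for _ in range(500):
    arm = selector.select()
    selector.update(arm, returns[arm])
```

After the loop, `selector.counts` shows how often each source policy was reused; the policy with the highest return accumulates most of the pulls, while the others are still revisited occasionally because of the exploration bonus.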
Learning domain abstractions for long lived robots
Recent trends in robotics have seen more general-purpose robots being deployed in
unstructured environments for prolonged periods of time. Such robots are expected to
adapt to different environmental conditions, and ultimately take on a broader range of
responsibilities, the specifications of which may change online after the robot has been
deployed.
We propose that in order for a robot to be generally capable in an online sense
when it encounters a range of unknown tasks, it must have the ability to continually
learn from a lifetime of experience. Key to this is the ability to generalise from experiences
and form representations which facilitate faster learning of new tasks, as well as
the transfer of knowledge between different situations. However, experience cannot be
managed naïvely: one does not want constantly expanding tables of data, but instead
continually refined abstractions of the data – much like humans seem to abstract and
organise knowledge. If this agent is active in the same, or similar, classes of environments
for a prolonged period of time, it is provided with the opportunity to build
abstract representations in order to simplify the learning of future tasks. The domain
is a common structure underlying large families of tasks, and exploiting this affords
the agent the potential to not only minimise relearning from scratch, but over time to
build better models of the environment. We propose to learn such regularities from the
environment, and extract the commonalities between tasks.
This thesis aims to address the major question: what are the domain invariances
which should be learnt by a long lived agent which encounters a range of different
tasks? This question can be decomposed into three dimensions for learning invariances,
based on perception, action and interaction. We present novel algorithms for
dealing with each of these three factors.
Firstly, how does the agent learn to represent the structure of the world? We focus
here on learning inter-object relationships from depth information as a concise
representation of the structure of the domain. To this end we introduce contact point
networks as a topological abstraction of a scene, and present an algorithm based on
support vector machine decision boundaries for extracting these from three dimensional
point clouds obtained from the agent’s experience of a domain. By reducing the
specific geometry of an environment into general skeletons based on contact between
different objects, we can autonomously learn predicates describing spatial relationships.
Secondly, how does the agent learn to acquire general domain knowledge? While
the agent attempts new tasks, it requires a mechanism to control exploration, particularly
when it has many courses of action available to it. To this end we draw on the fact
that many local behaviours are common to different tasks. Identifying these amounts
to learning “common sense” behavioural invariances across multiple tasks. This principle
leads to our concept of action priors, which are defined as Dirichlet distributions
over the action set of the agent. These are learnt from previous behaviours, and expressed
as the prior probability of selecting each action in a state, and are used to guide
the learning of novel tasks as an exploration policy within a reinforcement learning
framework.
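One way to picture action priors is as per-state Dirichlet pseudo-counts accumulated from previously learnt policies. The sketch below is an illustration only: it assumes tabular policies given as state-to-action dictionaries and a symmetric initial pseudo-count `alpha0`, and all function names are hypothetical rather than taken from the thesis.

```python
import random


def learn_action_priors(policies, actions, alpha0=1.0):
    """Accumulate Dirichlet pseudo-counts: one count per (state, action) chosen."""
    counts = {}
    for policy in policies:
        for state, action in policy.items():
            if state not in counts:
                counts[state] = {a: alpha0 for a in actions}
            counts[state][action] += 1.0
    return counts


def prior_probabilities(counts, state, actions, alpha0=1.0):
    """Mean of the Dirichlet for this state; uniform if the state is unseen."""
    alphas = counts.get(state, {a: alpha0 for a in actions})
    total = sum(alphas.values())
    return {a: alphas[a] / total for a in actions}


def explore(counts, state, actions, rng, alpha0=1.0):
    """Draw an exploratory action from the prior instead of uniformly at random."""
    probs = prior_probabilities(counts, state, actions, alpha0)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical policies from three earlier tasks in the same domain.
actions = ["open", "left", "right"]
old_policies = [
    {"door": "open", "hall": "left"},
    {"door": "open", "hall": "right"},
    {"door": "open"},
]
priors = learn_action_priors(old_policies, actions)
door_prior = prior_probabilities(priors, "door", actions)
unseen_prior = prior_probabilities(priors, "cellar", actions)
sampled = explore(priors, "door", actions, random.Random(0))
```

In a reinforcement learning loop, `explore` would replace the uniform random draw in an epsilon-greedy step, so that exploration favours actions that earlier tasks found useful in the same state.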
Finally, how can the agent react online with sparse information? There are times
when an agent is required to respond fast to some interactive setting, when it may have
encountered similar tasks previously. To address this problem, we introduce the notion
of types, being a latent class variable describing related problem instances. The agent
is required to learn, identify and respond to these different types in online interactive
scenarios. We then introduce Bayesian policy reuse as an algorithm that involves maintaining
beliefs over the current task instance, updating these from sparse signals, and
selecting and instantiating an optimal response from a behaviour library.
This thesis therefore makes the following contributions. We provide the first algorithm
for autonomously learning spatial relationships between objects from point
cloud data. We then provide an algorithm for extracting action priors from a set of
policies, and show that considerable gains in speed can be achieved in learning subsequent
tasks over learning from scratch, particularly in reducing the initial losses associated
with unguided exploration. Additionally, we demonstrate how these action priors
allow for safe exploration, feature selection, and a method for analysing and advising
other agents’ movement through a domain. Finally, we introduce Bayesian policy
reuse which allows an agent to quickly draw on a library of policies and instantiate the
correct one, enabling rapid online responses to adversarial conditions