Learning the Structure of Continuous Markov Decision Processes
There is growing interest in artificial intelligent agents that can operate autonomously for extended periods of time in complex environments and fulfill a variety of different tasks. Such agents will face problems during their lifetime which may not be foreseeable at the time of their deployment. The capacity for lifelong learning of new behaviors is therefore an essential prerequisite for such agents, as it enables them to deal with unforeseen situations. However, learning every complex behavior anew from scratch would be cumbersome. It is more plausible to consider behavior to be modular and to let the agent acquire a set of reusable building blocks for behavior, so-called skills. Once acquired, these skills can facilitate fast learning and adaptation of behavior to new situations. This work focuses on computational approaches for skill acquisition, namely which kinds of skills shall be acquired and how to acquire them. The former is commonly denoted as skill discovery and the latter as skill learning. The main contribution of this thesis is a novel incremental skill acquisition approach suited for lifelong learning. In this approach, the agent incrementally learns a graph-based representation of a domain and exploits certain properties of this graph, such as its bottlenecks, for skill discovery. The thesis proposes a novel approach for learning a graph-based representation of continuous domains by formalizing the problem as a probabilistic generative model. Furthermore, a new incremental agglomerative clustering approach for identifying bottlenecks of such graphs is presented. Thereupon, the thesis proposes a novel intrinsic motivation system which enables an agent to intelligently allocate time between skill discovery and skill learning in developmental settings, where the agent is not constrained by external tasks. The results show that the resulting skill acquisition approach is suited for continuous domains and can deal with domain stochasticity and with different explorative behavior of the agent. The acquired skills are reusable and versatile and can be used in multi-task and lifelong learning settings in high-dimensional problems.
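To make the bottleneck idea concrete, here is a minimal Python sketch of bottleneck-based skill discovery on a transition graph. It uses betweenness centrality from networkx as a simple stand-in for the thesis's probabilistic generative model and incremental agglomerative clustering; the function name and the toy two-room graph are illustrative assumptions, not the thesis's method.

```python
# Minimal sketch: find graph "bottlenecks" as skill-discovery targets.
# The thesis learns the graph from continuous experience; here we assume
# a discrete transition graph is already available and use betweenness
# centrality as a crude stand-in for a bottleneck score.
import networkx as nx

def bottleneck_candidates(transitions, top_k=3):
    """transitions: iterable of (state, next_state) pairs."""
    g = nx.Graph()
    g.add_edges_from(transitions)
    # States that many shortest paths pass through are bottleneck-like;
    # they make natural subgoals (termination states) for skills/options.
    centrality = nx.betweenness_centrality(g)
    return sorted(centrality, key=centrality.get, reverse=True)[:top_k]

# Toy two-room layout joined by a single "doorway" state 4.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3),   # room A
         (5, 6), (5, 7), (6, 7), (6, 8), (7, 8),   # room B
         (3, 4), (4, 5)]                            # doorway state 4
print(bottleneck_candidates(edges))  # the doorway ranks highest
```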
Intentions and Creative Insights: a Reinforcement Learning Study of Creative Exploration in Problem-Solving
Insight is perhaps the cognitive phenomenon most closely associated with creativity. People engaged in problem-solving sometimes experience a sudden transformation: they see the problem in a radically different manner, and simultaneously feel with great certainty that they have found the right solution. The change of problem representation is called "restructuring", and the affective changes associated with sudden progress are called the "Aha!" experience. Together, restructuring and the "Aha!" experience characterize insight.
Reinforcement Learning is both a theory of biological learning and a subfield of machine learning. In its psychological and neuroscientific guise, it is used to model habit formation, and, increasingly, executive function. In its artificial intelligence guise, it is currently the favored paradigm for modeling agents interacting with an environment. Reinforcement learning, I argue, can serve as a model of insight: its foundation in learning coincides with the role of experience in insight problem-solving; its use of an explicit "value" provides the basis for the "Aha!" experience; and finally, in a hierarchical form, it can achieve a sudden change of representation resembling restructuring.
An experiment helps confirm some parallels between reinforcement learning and insight. It shows how transfer from prior tasks results in considerably accelerated learning, and how the value function increase resembles the sense of progress corresponding to the "Aha!"-moment. However, a model of insight on the basis of hierarchical reinforcement learning did not display the expected "insightful" behavior.
A second model of insight is presented, in which temporal abstraction is based on self-prediction: by predicting its own future decisions, an agent adjusts its course of action on the basis of unexpected events. This kind of temporal abstraction, I argue, corresponds to what we call "intentions", and offers a promising model for biological insight. It explains the "Aha!" experience as resulting from a temporal difference error, whereas restructuring results from an adjustment of the agent's internal state on the basis of either new information or a stochastic interpretation of stimuli. The model is called the actor-critic-intention (ACI) architecture.
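The link drawn here between the "Aha!" experience and a temporal difference error can be made concrete in a few lines of Python. This is a minimal illustration of the standard TD error, not the ACI architecture itself; the numbers are invented to mimic a sudden jump in the critic's value estimate.

```python
# Minimal sketch of the temporal-difference error that, on this account,
# underlies the "Aha!" experience: a large positive delta signals
# unexpected progress. All values here are illustrative only.
def td_error(reward, v_next, v_current, gamma=0.99):
    return reward + gamma * v_next - v_current

# An agent that suddenly "sees" a path to the goal: the value estimate
# for the next state jumps, producing a large positive TD error.
print(td_error(reward=0.0, v_next=9.0, v_current=1.0))  # 7.91
```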
Finally, the relationship between intentions, insight, and creativity is extensively discussed in light of these models: other works in the philosophical and scientific literature are related to, and sometimes illuminated by the ACI architecture
Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning
As a pivotal component to attaining generalizable solutions in human intelligence, reasoning provides great potential for reinforcement learning (RL) agents to generalize towards varied goals by summarizing part-to-whole arguments and discovering cause-and-effect relations. However, how to discover and represent causalities remains a major gap that hinders the development of causal RL. In this paper, we augment Goal-Conditioned RL (GCRL) with a Causal Graph (CG), a structure built upon the relations between objects and events. We formulate the GCRL problem as variational likelihood maximization with the CG as a latent variable. To optimize the derived objective, we propose a framework with theoretical performance guarantees that alternates between two steps: using interventional data to estimate the posterior of the CG, and using the CG to learn generalizable models and interpretable policies. Due to the lack of public benchmarks that verify generalization capability under reasoning, we design nine tasks and empirically show the effectiveness of the proposed method against five baselines on these tasks. Further theoretical analysis shows that our performance improvement is attributed to the virtuous cycle of causal discovery, transition modeling, and policy training, which aligns with the experimental evidence from extensive ablation studies.
Comment: 28 pages, 5 figures, under review
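A schematic of the alternating scheme described in the abstract, sketched in Python. Every function body below is a placeholder standing in for the paper's actual estimators; only the two-step loop structure (update the causal-graph posterior from interventional data, then refit the model and policy under the sampled graph) is taken from the text.

```python
# Schematic of the alternating optimization, as read from the abstract:
# (i) use interventional data to update a posterior over causal-graph
# edges; (ii) use the sampled graph to fit the transition model and
# policy. All bodies are placeholders, not the authors' algorithm.
import numpy as np

rng = np.random.default_rng(0)
n_vars = 4
edge_prob = np.full((n_vars, n_vars), 0.5)  # posterior over CG edges

def collect_interventional_data(policy):
    # Placeholder: roll out `policy`, intervening on object variables.
    return rng.normal(size=(32, n_vars))

def update_posterior(edge_prob, data):
    # Placeholder: nudge edge beliefs toward correlations in the data.
    corr = np.corrcoef(data, rowvar=False)
    return 0.9 * edge_prob + 0.1 * (np.abs(corr) > 0.3)

def fit_model_and_policy(edge_prob):
    # Placeholder: mask the transition model with a sampled graph.
    graph = rng.random((n_vars, n_vars)) < edge_prob
    return {"mask": graph}  # stands in for model + policy

policy = fit_model_and_policy(edge_prob)
for _ in range(10):  # alternate the two steps
    data = collect_interventional_data(policy)
    edge_prob = update_posterior(edge_prob, data)
    policy = fit_model_and_policy(edge_prob)
```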
Policy space abstraction for a lifelong learning agent
This thesis is concerned with policy space abstractions that concisely encode alternative ways of making decisions, covering the discovery, learning, adaptation and use of these abstractions. This work is motivated by the problem faced by autonomous agents that operate within a domain for long periods of time and hence have to learn to solve many different task instances that share some structural attributes. An example of such a domain is an autonomous robot in a dynamic domestic environment. Such environments raise the need for transfer of knowledge, so as to eliminate the need for long learning trials after deployment.
Typically, these tasks would be modelled as sequential decision making problems, including path optimisation for navigation tasks, or Markov Decision Process models for more general tasks. Learning within such models often takes the form of online learning or reinforcement learning. However, handling issues such as knowledge transfer and multiple task instances requires notions of structure and hierarchy, and this raises several questions that form the topic of this thesis: (a) can an agent acquire such hierarchies in policies in an online, incremental manner; (b) can we devise mathematically rigorous ways to abstract policies based on qualitative attributes; and (c) when it is inconvenient to employ prolonged trial-and-error learning, can we devise alternative algorithmic methods for decision making in a lifelong setting?
The first contribution of this thesis is an algorithmic method for incrementally acquiring hierarchical policies. Working within the framework of options (temporally extended actions) in reinforcement learning, we present a method for discovering persistent subtasks that define useful options for a particular domain. Our algorithm builds on a probabilistic mixture model in state space to define a generalised and persistent form of 'bottlenecks', and suggests suitable policy fragments to make options. In order to continuously update this hierarchy, we devise an incremental process which runs in the background and takes care of proposing and forgetting options. We evaluate this framework in simulated worlds, including the RoboCup 2D simulation league domain.
The second contribution of this thesis is in defining abstractions in terms of equivalence classes of trajectories. Utilising recently developed techniques from computational topology, in particular the concept of persistent homology, we show that a library of feasible trajectories can be retracted to representative paths that may be sufficient for reasoning about plans at the abstract level. We present a complete framework, starting from a novel construction of a simplicial complex that describes higher-order connectivity properties of a spatial domain, to methods for computing the homology of this complex at varying resolutions. The resulting abstractions are motion primitives that may be used as topological options, contributing a novel criterion for option discovery. This is validated by experiments in simulated 2D robot navigation, and in manipulation using a physical robot platform.
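The persistent-homology step can be illustrated with a short Python sketch. It uses the gudhi library and a Vietoris-Rips complex as a stand-in for the thesis's custom simplicial construction; the circle of sample points is a toy proxy for trajectories around a single obstacle.

```python
# Minimal sketch of the idea behind topological trajectory abstraction:
# compute persistent homology of sampled trajectory points and read off
# long-lived 1-dimensional features (loops), which correspond to
# distinct classes of paths around obstacles.
import numpy as np
import gudhi

# Points sampled along trajectories that circle an obstacle at the origin.
theta = np.linspace(0, 2 * np.pi, 40, endpoint=False)
points = np.c_[np.cos(theta), np.sin(theta)]

rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)
for dim, (birth, death) in st.persistence():
    if dim == 1 and death - birth > 0.5:  # a persistent loop
        print(f"H1 feature: born {birth:.2f}, dies {death:.2f}")
```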
Finally, we develop techniques for solving a family of related, but different, problem instances through policy reuse from a finite policy library acquired over the agent's lifetime. This represents an alternative approach when traditional methods such as hierarchical reinforcement learning are not computationally feasible. We abstract the policy space using a non-parametric model of the performance of policies across multiple task instances, so that decision making is posed as a Bayesian choice regarding what to reuse. This is one approach to transfer learning that is motivated by the needs of practical long-lived systems. We show the merits of such Bayesian policy reuse in simulated real-time interactive systems, including online personalisation and surveillance.
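A minimal Python sketch of the Bayesian policy reuse loop described above: maintain a belief over task types, pick the library policy with the best expected performance under that belief, and update the belief from observed returns. The Gaussian performance model and all numbers are illustrative assumptions, not the thesis's non-parametric model.

```python
# Minimal sketch of Bayesian policy reuse over a small policy library.
import numpy as np

# perf[i, j]: expected return of library policy i on task type j.
perf = np.array([[0.9, 0.2],
                 [0.3, 0.8]])
belief = np.array([0.5, 0.5])  # prior over the two task types

def choose_policy(belief):
    # Bayes-optimal reuse choice: maximize expected performance.
    return int(np.argmax(perf @ belief))

def update_belief(belief, policy, observed_return, sigma=0.2):
    # Likelihood of the observed return under each task hypothesis.
    lik = np.exp(-0.5 * ((observed_return - perf[policy]) / sigma) ** 2)
    post = belief * lik
    return post / post.sum()

rng = np.random.default_rng(1)
true_task = 1
for episode in range(5):
    p = choose_policy(belief)
    ret = rng.normal(perf[p, true_task], 0.2)
    belief = update_belief(belief, p, ret)
print(belief)  # mass concentrates on the true task type
```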
What Went Wrong? Closing the Sim-to-Real Gap via Differentiable Causal Discovery
Training control policies in simulation is more appealing than training on real robots directly, as it allows for exploring diverse states in a safe and efficient manner. Yet, robot simulators inevitably exhibit disparities from the real world, yielding inaccuracies that manifest as the simulation-to-real gap. Existing literature has proposed to close this gap by actively modifying specific simulator parameters to align the simulated data with real-world observations. However, the set of tunable parameters is usually manually selected to reduce the search space in a case-by-case manner, which is hard to scale up for complex systems and requires extensive domain knowledge. To address the scalability issue and automate the parameter-tuning process, we introduce an approach that aligns the simulator with the real world by discovering the causal relationships between the environment parameters and the sim-to-real gap. Concretely, our method learns a differentiable mapping from the environment parameters to the differences between simulated and real-world robot-object trajectories. This mapping is governed by a simultaneously learned causal graph that helps prune the search space of parameters, provides better interpretability, and improves generalization. We perform experiments to achieve both sim-to-sim and sim-to-real transfer, and show that our method achieves significant improvements in trajectory alignment and task success rate over strong baselines in a challenging manipulation task.
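As a rough illustration of the mechanism described in the abstract, the following PyTorch sketch learns a differentiable map from simulator parameters to a trajectory gap, gated by a soft, sparsity-regularized mask that plays the role of the learned causal graph. The architecture, loss weights, and toy data are assumptions, not the paper's exact model.

```python
# Sketch: differentiable map from simulator parameters to the sim-to-real
# trajectory gap, gated by a learnable soft "causal" mask pushed toward
# sparsity. Everything below is illustrative, not the paper's model.
import torch
import torch.nn as nn

class CausalGapModel(nn.Module):
    def __init__(self, n_params, gap_dim, hidden=64):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(n_params))
        self.net = nn.Sequential(
            nn.Linear(n_params, hidden), nn.ReLU(),
            nn.Linear(hidden, gap_dim))

    def forward(self, env_params):
        mask = torch.sigmoid(self.mask_logits)  # soft causal-edge gates
        return self.net(env_params * mask), mask

model = CausalGapModel(n_params=8, gap_dim=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
params = torch.randn(256, 8)   # sampled simulator parameters
gap = params[:, :3] * 2.0      # toy gap: depends on first 3 params only
for _ in range(300):
    pred, mask = model(params)
    loss = ((pred - gap) ** 2).mean() + 1e-2 * mask.sum()  # fit + sparsity
    opt.zero_grad(); loss.backward(); opt.step()
print(torch.sigmoid(model.mask_logits).detach())  # first 3 gates stay high
```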
Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies
In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or in the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond only worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic hardness: sublinear regret is unachievable when the baseline policy comes from a general continuous policy class $\Pi$. This finding prompts us to refine the baseline policy class prior to test time, aiming for efficient adaptation within a finite policy class $\tilde{\Pi}$, which can resort to an adversarial bandit subroutine. In light of the importance of a small, finite $\tilde{\Pi}$, we propose a novel training-time algorithm that iteratively discovers non-dominated policies, forming a near-optimal and minimal $\tilde{\Pi}$, thereby ensuring both robustness and test-time efficiency. Empirical validation on MuJoCo corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios.
Comment: International Conference on Learning Representations (ICLR) 2024, spotlight
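The test-time adaptation step, which the abstract says can resort to an adversarial bandit subroutine, can be sketched with EXP3 over the finite policy class. The sketch below treats each candidate policy as a bandit arm; the simulated returns are illustrative, and the training-time discovery of non-dominated policies is not implemented here.

```python
# Minimal EXP3 sketch for test-time adaptation over a finite policy
# class: each arm is a candidate policy, and the agent adapts online to
# whatever (unknown) attack is actually deployed.
import numpy as np

rng = np.random.default_rng(0)
K, T, eta = 4, 2000, 0.05
weights = np.ones(K)
# Toy per-episode returns in [0, 1] for each candidate policy under the
# attack present at test time.
mean_return = np.array([0.3, 0.7, 0.5, 0.4])

for t in range(T):
    probs = (1 - eta) * weights / weights.sum() + eta / K
    arm = rng.choice(K, p=probs)
    reward = np.clip(rng.normal(mean_return[arm], 0.1), 0, 1)
    # Importance-weighted update keeps the reward estimator unbiased.
    weights[arm] *= np.exp(eta * reward / (K * probs[arm]))

print(np.argmax(weights))  # concentrates on the best policy (arm 1)
```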
New Fundamental Technologies in Data Mining
The progress of data mining technology and its broad public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to covering each topic in depth, the two books present useful hints and strategies for solving problems in the subsequent chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant developments in the field of data mining.
Relational knowledge and representation for reinforcement learning
In reinforcement learning, an agent interacts with the environment, learns from feedback about the quality of its actions, and improves its behaviour or policy in order to maximise its expected utility. Learning efficiently in large-scale problems is a major challenge. State aggregation is possible in problems with a first-order structure, allowing the agent to learn in an abstraction of the original problem which is of considerably smaller scale. One approach is to learn the Q-values of actions which are approximated by a relational function approximator. This is the basis for relational reinforcement learning (RRL). We abstract the state with first-order features which consist of only variables, thereby aggregating similar states from all problems of the same domain into abstract states. We study the limitations of RRL due to this abstraction and introduce the concepts of consistent abstraction, subsumption of problems, and abstract-equivalent problems. We propose three methods to overcome these limitations, extending the types of problems our RRL method can solve. Next, to further improve learning efficiency, we propose to learn different types of generalised knowledge. The policy is influenced by directed exploration based on multiple types of intrinsic rewards and avoids previously encountered dead ends. In addition, we incorporate model-based techniques to provide better-quality estimates of the Q-values. Transfer learning is possible by directly leveraging the generalised knowledge to accelerate learning in a new problem. Lastly, we introduce a new class of problems which considers dynamic objects and time-bounded goals. We discuss the complications these bring to RRL and present some solutions. We also propose a framework for multi-agent coordination to achieve joint goals represented by time-bounded goals, by decomposing a multi-agent problem into single-agent problems. We evaluate our work empirically in six domains to demonstrate its efficacy in solving large-scale problems and in transfer learning.
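The variables-only state abstraction at the heart of this approach can be illustrated with a small Python sketch: ground atoms are lifted by replacing constants with canonical variables, so structurally identical states share one Q-value. The lifting function below is a crude canonicalization for illustration, not the thesis's consistent-abstraction machinery.

```python
# Minimal sketch of relational state aggregation: ground states that
# share relational structure collapse to one abstract state with a
# shared Q-value.
from collections import defaultdict

def lift(ground_state):
    """Replace constants with canonical variables, preserving bindings."""
    binding, out = {}, []
    for pred, *args in sorted(ground_state):
        lifted = []
        for a in args:
            if a not in binding:
                binding[a] = f"X{len(binding)}"
            lifted.append(binding[a])
        out.append((pred, *lifted))
    return tuple(out)

q = defaultdict(float)  # Q over (abstract state, action) pairs

s1 = [("on", "a", "b"), ("clear", "a")]
s2 = [("on", "c", "d"), ("clear", "c")]   # same structure, new objects
q[(lift(s1), "unstack")] = 1.5
print(q[(lift(s2), "unstack")])           # 1.5: both states aggregate
```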