66 research outputs found

    Learning the Structure of Continuous Markov Decision Processes

    Get PDF
    There is growing interest in artificial, intelligent agents which can operate autonomously for an extended period of time in complex environments and fulfill a variety of different tasks. Such agents will face different problems during their lifetime which may not be foreseeable at the time of their deployment. Thus, the capacity for lifelong learning of new behaviors is an essential prerequisite for this kind of agents as it enables them to deal with unforeseen situations. However, learning every complex behavior anew from scratch would be cumbersome for the agent. It is more plausible to consider behavior to be modular and let the agent acquire a set of reusable building blocks for behavior, the so-called skills. These skills might, once acquired, facilitate fast learning and adaptation of behavior to new situations. This work focuses on computational approaches for skill acquisition, namely which kind of skills shall be acquired and how to acquire them. The former is commonly denoted as skill discovery and the latter as skill learning . The main contribution of this thesis is a novel incremental skill acquisition approach which is suited for lifelong learning. In this approach, the agent learns incrementally a graph-based representation of a domain and exploits certain properties of this graph such as its bottlenecks for skill discovery. This thesis proposes a novel approach for learning a graph-based representation of continuous domains based on formalizing the problem as a probabilistic generative model. Furthermore, a new incremental agglomerative clustering approach for identifying bottlenecks of such graphs is presented. Thereupon, the thesis proposes a novel intrinsic motivation system which enables an agent to intelligently allocate time between skill discovery and skill learning in developmental settings, where the agent is not constrained by external tasks. The results of this thesis show that the resulting skill acquisition approach is suited for continuous domains and can deal with domain stochasticity and different explorative behavior of the agent. The acquired skills are reusable and versatile and can be used in multi-task and lifelong learning settings in high-dimensional problems

    Intentions and Creative Insights: a Reinforcement Learning Study of Creative Exploration in Problem-Solving

    Get PDF
    Insight is perhaps the cognitive phenomenon most closely associated with creativity. People engaged in problem-solving sometimes experience a sudden transformation: they see the problem in a radically different manner, and simultaneously feel with great certainty that they have found the right solution. The change of problem representation is called "restructuring", and the affective changes associated with sudden progress are called the "Aha!" experience. Together, restructuring and the "Aha!" experience characterize insight. Reinforcement Learning is both a theory of biological learning and a subfield of machine learning. In its psychological and neuroscientific guise, it is used to model habit formation, and, increasingly, executive function. In its artificial intelligence guise, it is currently the favored paradigm for modeling agents interacting with an environment. Reinforcement learning, I argue, can serve as a model of insight: its foundation in learning coincides with the role of experience in insight problem-solving; its use of an explicit "value" provides the basis for the "Aha!" experience; and finally, in a hierarchical form, it can achieve a sudden change of representation resembling restructuring. An experiment helps confirm some parallels between reinforcement learning and insight. It shows how transfer from prior tasks results in considerably accelerated learning, and how the value function increase resembles the sense of progress corresponding to the "Aha!"-moment. However, a model of insight on the basis of hierarchical reinforcement learning did not display the expected "insightful" behavior. A second model of insight is presented, in which temporal abstraction is based on self-prediction: by predicting its own future decisions, an agent adjusts its course of action on the basis of unexpected events. This kind of temporal abstraction, I argue, corresponds to what we call "intentions", and offers a promising model for biological insight. It explains the "Aha!" experience as resulting from a temporal difference error, whereas restructuring results from an adjustment of the agent's internal state on the basis of either new information or a stochastic interpretation of stimuli. The model is called the actor-critic-intention (ACI) architecture. Finally, the relationship between intentions, insight, and creativity is extensively discussed in light of these models: other works in the philosophical and scientific literature are related to, and sometimes illuminated by the ACI architecture

    Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning

    Full text link
    As a pivotal component to attaining generalizable solutions in human intelligence, reasoning provides great potential for reinforcement learning (RL) agents' generalization towards varied goals by summarizing part-to-whole arguments and discovering cause-and-effect relations. However, how to discover and represent causalities remains a huge gap that hinders the development of causal RL. In this paper, we augment Goal-Conditioned RL (GCRL) with Causal Graph (CG), a structure built upon the relation between objects and events. We novelly formulate the GCRL problem into variational likelihood maximization with CG as latent variables. To optimize the derived objective, we propose a framework with theoretical performance guarantees that alternates between two steps: using interventional data to estimate the posterior of CG; using CG to learn generalizable models and interpretable policies. Due to the lack of public benchmarks that verify generalization capability under reasoning, we design nine tasks and then empirically show the effectiveness of the proposed method against five baselines on these tasks. Further theoretical analysis shows that our performance improvement is attributed to the virtuous cycle of causal discovery, transition modeling, and policy training, which aligns with the experimental evidence in extensive ablation studies.Comment: 28 pages, 5 figures, under revie

    Policy space abstraction for a lifelong learning agent

    Get PDF
    This thesis is concerned with policy space abstractions that concisely encode alternative ways of making decisions; dealing with discovery, learning, adaptation and use of these abstractions. This work is motivated by the problem faced by autonomous agents that operate within a domain for long periods of time, hence having to learn to solve many different task instances that share some structural attributes. An example of such a domain is an autonomous robot in a dynamic domestic environment. Such environments raise the need for transfer of knowledge, so as to eliminate the need for long learning trials after deployment. Typically, these tasks would be modelled as sequential decision making problems, including path optimisation for navigation tasks, or Markov Decision Process models for more general tasks. Learning within such models often takes the form of online learning or reinforcement learning. However, handling issues such as knowledge transfer and multiple task instances requires notions of structure and hierarchy, and that raises several questions that form the topic of this thesis – (a) can an agent acquire such hierarchies in policies in an online, incremental manner, (b) can we devise mathematically rigorous ways to abstract policies based on qualitative attributes, (c) when it is inconvenient to employ prolonged trial and error learning, can we devise alternate algorithmic methods for decision making in a lifelong setting? The first contribution of this thesis is an algorithmic method for incrementally acquiring hierarchical policies. Working with the framework of options - temporally extended actions - in reinforcement learning, we present a method for discovering persistent subtasks that define useful options for a particular domain. Our algorithm builds on a probabilistic mixture model in state space to define a generalised and persistent form of ‘bottlenecks’, and suggests suitable policy fragments to make options. In order to continuously update this hierarchy, we devise an incremental process which runs in the background and takes care of proposing and forgetting options. We evaluate this framework in simulated worlds, including the RoboCup 2D simulation league domain. The second contribution of this thesis is in defining abstractions in terms of equivalence classes of trajectories. Utilising recently developed techniques from computational topology, in particular the concept of persistent homology, we show that a library of feasible trajectories could be retracted to representative paths that may be sufficient for reasoning about plans at the abstract level. We present a complete framework, starting from a novel construction of a simplicial complex that describes higher-order connectivity properties of a spatial domain, to methods for computing the homology of this complex at varying resolutions. The resulting abstractions are motion primitives that may be used as topological options, contributing a novel criterion for option discovery. This is validated by experiments in simulated 2D robot navigation, and in manipulation using a physical robot platform. Finally, we develop techniques for solving a family of related, but different, problem instances through policy reuse of a finite policy library acquired over the agent’s lifetime. This represents an alternative approach when traditional methods such as hierarchical reinforcement learning are not computationally feasible. We abstract the policy space using a non-parametric model of performance of policies in multiple task instances, so that decision making is posed as a Bayesian choice regarding what to reuse. This is one approach to transfer learning that is motivated by the needs of practical long-lived systems. We show the merits of such Bayesian policy reuse in simulated real-time interactive systems, including online personalisation and surveillance

    What Went Wrong? Closing the Sim-to-Real Gap via Differentiable Causal Discovery

    Full text link
    Training control policies in simulation is more appealing than on real robots directly, as it allows for exploring diverse states in a safe and efficient manner. Yet, robot simulators inevitably exhibit disparities from the real world, yielding inaccuracies that manifest as the simulation-to-real gap. Existing literature has proposed to close this gap by actively modifying specific simulator parameters to align the simulated data with real-world observations. However, the set of tunable parameters is usually manually selected to reduce the search space in a case-by-case manner, which is hard to scale up for complex systems and requires extensive domain knowledge. To address the scalability issue and automate the parameter-tuning process, we introduce an approach that aligns the simulator with the real world by discovering the causal relationship between the environment parameters and the sim-to-real gap. Concretely, our method learns a differentiable mapping from the environment parameters to the differences between simulated and real-world robot-object trajectories. This mapping is governed by a simultaneously-learned causal graph to help prune the search space of parameters, provide better interpretability, and improve generalization. We perform experiments to achieve both sim-to-sim and sim-to-real transfer, and show that our method has significant improvements in trajectory alignment and task success rate over strong baselines in a challenging manipulation task

    Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies

    Full text link
    In light of the burgeoning success of reinforcement learning (RL) in diverse real-world applications, considerable focus has been directed towards ensuring RL policies are robust to adversarial attacks during test time. Current approaches largely revolve around solving a minimax problem to prepare for potential worst-case scenarios. While effective against strong attacks, these methods often compromise performance in the absence of attacks or the presence of only weak attacks. To address this, we study policy robustness under the well-accepted state-adversarial attack model, extending our focus beyond only worst-case attacks. We first formalize this task at test time as a regret minimization problem and establish its intrinsic hardness in achieving sublinear regret when the baseline policy is from a general continuous policy class, Π\Pi. This finding prompts us to \textit{refine} the baseline policy class Π\Pi prior to test time, aiming for efficient adaptation within a finite policy class \Tilde{\Pi}, which can resort to an adversarial bandit subroutine. In light of the importance of a small, finite \Tilde{\Pi}, we propose a novel training-time algorithm to iteratively discover \textit{non-dominated policies}, forming a near-optimal and minimal \Tilde{\Pi}, thereby ensuring both robustness and test-time efficiency. Empirical validation on the Mujoco corroborates the superiority of our approach in terms of natural and robust performance, as well as adaptability to various attack scenarios.Comment: International Conference on Learning Representations (ICLR) 2024, spotligh

    New Fundamental Technologies in Data Mining

    Get PDF
    The progress of data mining technology and large public popularity establish a need for a comprehensive text on the subject. The series of books entitled by "Data Mining" address the need by presenting in-depth description of novel mining algorithms and many useful applications. In addition to understanding each section deeply, the two books present useful hints and strategies to solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence will lead to significant development in the field of data mining

    Relational knowledge and representation for reinforcement learning

    Get PDF
    In reinforcement learning, an agent interacts with the environment, learns from feedback about the quality of its actions, and improves its behaviour or policy in order to maximise its expected utility. Learning efficiently in large scale problems is a major challenge. State aggregation is possible in problems with a first-order structure, allowing the agent to learn in an abstraction of the original problem which is of considerably smaller scale. One approach is to learn the Q-values of actions which are approximated by a relational function approximator. This is the basis for relational reinforcement learning (RRL). We abstract the state with first-order features which consist of only variables, thereby aggregating similar states from all problems of the same domain to abstract states. We study the limitations of RRL due to this abstraction and introduce the concepts of consistent abstraction, subsumption of problems, and abstract-equivalent problems. We propose three methods to overcome the limitations, extending the types of problems our RRL method can solve. Next, to further improve the learning efficiency, we propose to learn different types of generalised knowledge. The policy is influenced by directed exploration based on multiple types of intrinsic rewards and avoids previously encountered dead ends. In addition, we incorporate model-based techniques to provide better quality estimates of the Q-values. Transfer learning is possible by directly leveraging the generalised knowledge to accelerate learning in a new problem. Lastly, we introduce a new class of problems which considers dynamic objects and time-bounded goals. We discuss the complications these bring to RRL and present some solutions. We also propose a framework for multi-agent coordination to achieve joint goals represented by time-bounded goals by decomposing a multi-agent problem into single-agent problems. We evaluate our work empirically in six domains to demonstrate its efficacy in solving large scale problems and transfer learning

    Relational knowledge and representation for reinforcement learning

    Get PDF
    In reinforcement learning, an agent interacts with the environment, learns from feedback about the quality of its actions, and improves its behaviour or policy in order to maximise its expected utility. Learning efficiently in large scale problems is a major challenge. State aggregation is possible in problems with a first-order structure, allowing the agent to learn in an abstraction of the original problem which is of considerably smaller scale. One approach is to learn the Q-values of actions which are approximated by a relational function approximator. This is the basis for relational reinforcement learning (RRL). We abstract the state with first-order features which consist of only variables, thereby aggregating similar states from all problems of the same domain to abstract states. We study the limitations of RRL due to this abstraction and introduce the concepts of consistent abstraction, subsumption of problems, and abstract-equivalent problems. We propose three methods to overcome the limitations, extending the types of problems our RRL method can solve. Next, to further improve the learning efficiency, we propose to learn different types of generalised knowledge. The policy is influenced by directed exploration based on multiple types of intrinsic rewards and avoids previously encountered dead ends. In addition, we incorporate model-based techniques to provide better quality estimates of the Q-values. Transfer learning is possible by directly leveraging the generalised knowledge to accelerate learning in a new problem. Lastly, we introduce a new class of problems which considers dynamic objects and time-bounded goals. We discuss the complications these bring to RRL and present some solutions. We also propose a framework for multi-agent coordination to achieve joint goals represented by time-bounded goals by decomposing a multi-agent problem into single-agent problems. We evaluate our work empirically in six domains to demonstrate its efficacy in solving large scale problems and transfer learning
    corecore