
    An Echo State Model of Non-Markovian Reinforcement Learning

    Department Head: Dale H. Grit. 2008 Spring. Includes bibliographical references (pages 137-142). There exists a growing need for intelligent, autonomous control strategies that operate in real-world domains. Theoretically, the state-action space must exhibit the Markov property in order for reinforcement learning to be applicable. Empirical evidence, however, suggests that reinforcement learning also applies to domains where the state-action space is only approximately Markovian, a relaxation required to cover the overwhelming majority of real-world domains. These domains, termed non-Markovian reinforcement learning domains, raise a unique set of practical challenges. The reconstruction dimension required to approximate a Markovian state-space is unknown a priori and can potentially be large. Further, the spatial complexity of local function approximation of the reinforcement learning domain grows exponentially with the reconstruction dimension. Parameterized dynamic systems alleviate both embedding-length and state-space dimensionality concerns by reconstructing an approximate Markovian state-space via a compact, recurrent representation. Yet this representation exacts a cost: modeling reinforcement learning domains via adaptive, parameterized dynamic systems is characterized by instability, slow convergence, and high computational or spatial training complexity. The objectives of this research are to demonstrate a stable, convergent, accurate, and scalable model of non-Markovian reinforcement learning domains. These objectives are fulfilled via fixed point analysis of the dynamics underlying the reinforcement learning domain and the Echo State Network, a class of parameterized dynamic system. Understanding models of non-Markovian reinforcement learning domains requires understanding the interactions between learning domains and their models. Fixed point analysis of the Mountain Car Problem reinforcement learning domain, for both local and nonlocal function approximations, suggests a close relationship between the locality of the approximation and the number and severity of bifurcations of the fixed point structure. This research suggests the likely cause of this relationship: reinforcement learning domains exist within a dynamic feature space in which trajectories are analogous to states, and the fixed point structure maps dynamic space onto state-space. This explanation suggests two testable hypotheses: first, that reinforcement learning is sensitive to state-space locality because states cluster as trajectories in time rather than in space; second, that models using trajectory-based features should exhibit good modeling performance and few changes in fixed point structure. Analysis of the performance of lookup table, feedforward neural network, and Echo State Network (ESN) models on the Mountain Car Problem reinforcement learning domain confirms these hypotheses. The ESN is a large, sparse, randomly generated, unadapted recurrent neural network, which adapts a linear projection of the target domain onto the hidden layer. ESN modeling results on reinforcement learning domains show that it achieves performance comparable to lookup table and neural network architectures on the Mountain Car Problem with minimal changes to the fixed point structure. The ESN also achieves lookup-table-caliber performance when modeling Acrobot, a four-dimensional control problem, but is less successful modeling the lower-dimensional Modified Mountain Car Problem. These performance discrepancies are attributed to the ESN's excellent ability to represent complex short-term dynamics and its inability to consolidate long temporal dependencies into a static memory. Without memory consolidation, reinforcement learning domains exhibiting attractors with multiple dynamic scales are unlikely to be well modeled via ESN. To mitigate this problem, a simple ESN memory consolidation method is presented and tested on stationary dynamic systems. These results indicate the potential to improve modeling performance in reinforcement learning domains via memory consolidation.
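
    To make the ESN architecture described above concrete, the following is a minimal sketch in Python (NumPy only): a fixed, sparse, randomly generated reservoir is driven by the input sequence, and only a linear readout is fitted by ridge regression. The reservoir size, sparsity, spectral-radius scaling, regularisation value, and the placeholder trajectory and targets are illustrative assumptions, not parameters taken from the thesis.

import numpy as np

# Minimal Echo State Network sketch: a fixed, sparse, random reservoir is
# driven by the inputs; only the linear readout is trained (ridge regression).
rng = np.random.default_rng(0)
n_in, n_res = 2, 300                              # e.g. Mountain Car state: position, velocity

W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))      # fixed input weights
W = rng.uniform(-0.5, 0.5, (n_res, n_res))        # fixed recurrent weights
W[rng.random((n_res, n_res)) > 0.1] = 0.0         # keep the reservoir sparse (~10% connectivity)
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius below 1 for echo states

def run_reservoir(inputs):
    """Drive the unadapted reservoir with an input sequence and collect its states."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)             # recurrent update; these weights are never trained
        states.append(x.copy())
    return np.array(states)

def train_readout(states, targets, ridge=1e-6):
    """Fit the linear readout, the only adapted part of the ESN, via ridge regression."""
    return np.linalg.solve(states.T @ states + ridge * np.eye(n_res), states.T @ targets)

# Placeholder trajectory and value targets, for illustration only.
inputs = rng.uniform(-1.0, 1.0, (500, n_in))
targets = rng.uniform(-1.0, 1.0, (500, 1))
W_out = train_readout(run_reservoir(inputs), targets)
predictions = run_reservoir(inputs) @ W_out       # readout applied to the reservoir states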

    Learning in a State of Confusion: Employing active perception and reinforcement learning in partially observable worlds

    Institute of Perception, Action and Behaviour. In applying reinforcement learning to agents acting in the real world we are often faced with tasks that are non-Markovian in nature. Much work has been done using state estimation algorithms to try to uncover Markovian models of tasks in order to allow the learning of optimal solutions using reinforcement learning. Unfortunately, these algorithms, which attempt to simultaneously learn a Markov model of the world and how to act, have proved very brittle. Our focus differs. In considering embodied, embedded and situated agents we have a preference for simple learning algorithms which reliably learn satisficing policies. The learning algorithms we consider do not try to uncover the underlying Markovian states; instead they aim to learn successful deterministic reactive policies such that agents' actions are based directly upon the observations provided by their sensors. Existing results have shown that such reactive policies can be arbitrarily worse than a policy that has access to the underlying Markov process, and in some cases no satisficing reactive policy can exist. Our first contribution is to show that providing agents with alternative actions and viewpoints on the task through the addition of active perception can provide a practical solution in such circumstances. We demonstrate empirically that: (i) adding arbitrary active perception actions to agents which can only learn deterministic reactive policies can allow the learning of satisficing policies where none were originally possible; (ii) active perception actions allow the learning of better satisficing policies than those that existed previously; and (iii) our approach converges more reliably to satisficing solutions than existing state estimation algorithms such as U-Tree and the Lion Algorithm. Our other contributions focus on issues which affect the reliability with which deterministic reactive satisficing policies can be learnt in non-Markovian environments. We show that greedy action selection may be a necessary condition for the existence of stable deterministic reactive policies on partially observable Markov decision processes (POMDPs). We also set out the concept of Consistent Exploration: the idea of estimating state-action values by acting as though the policy has been changed to incorporate the action being explored. We demonstrate that this concept can be used to develop better algorithms for learning reactive policies for POMDPs by presenting a new reinforcement learning algorithm, the Consistent Exploration Q(λ) algorithm (CEQ(λ)). We demonstrate on a significant number of problems that CEQ(λ) is more reliable at learning satisficing solutions than the algorithm currently regarded as the best for learning deterministic reactive policies, that of SARSA(λ).
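
    As an illustration of the Consistent Exploration idea set out above, the sketch below shows a simplified tabular, episode-based variant: when the agent explores an action for an observation, its episode-local policy is changed so that the observation maps to that action, and value estimates are updated as though that changed policy were being followed. It is a stand-in for the idea only, not the thesis's CEQ(λ) algorithm; eligibility traces are omitted and the environment interface (reset, step, actions) is an assumption.

import random
from collections import defaultdict

def consistent_exploration_episode(env, Q, policy, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Run one episode with Consistent Exploration-style updates (no eligibility traces).

    When an exploratory action is chosen for an observation, the episode-local
    policy is changed to map that observation to the explored action, and the
    agent keeps acting and bootstrapping consistently with that change.
    """
    trial_policy = dict(policy)          # episode-local copy of the deterministic reactive policy
    visited = set()                      # observations already committed to during this episode
    obs = env.reset()
    done = False
    while not done:
        if obs not in visited and (obs not in trial_policy or random.random() < epsilon):
            trial_policy[obs] = random.choice(env.actions)   # explore, then stay consistent with it
        visited.add(obs)
        action = trial_policy[obs]
        next_obs, reward, done = env.step(action)
        if done:
            target = reward
        else:
            next_action = trial_policy.get(next_obs, random.choice(env.actions))
            trial_policy.setdefault(next_obs, next_action)   # bootstrap consistently with the trial policy
            target = reward + gamma * Q[(next_obs, next_action)]
        Q[(obs, action)] += alpha * (target - Q[(obs, action)])
        obs = next_obs
    return trial_policy

# Usage sketch (the environment and its observation/action spaces are assumed):
# Q = defaultdict(float)
# policy = {}
# for _ in range(1000):
#     policy = consistent_exploration_episode(env, Q, policy)  # adopt the episode's trial policy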

    Multi-Agent Learning for Security and Sustainability

    This thesis studies the application of multi-agent learning in complex domains where safety and sustainability are crucial. We target some of the main obstacles to the deployment of multi-agent learning techniques in such domains. These obstacles consist of modelling complex environments with multi-agent interaction, designing robust learning processes, and modelling adversarial agents. The main goal of using modern multi-agent learning methods is to improve the effectiveness of behaviour in such domains, and hence increase sustainability and security. This thesis investigates three complex real-world domains: space debris removal, critical domains with risky states, and spatial security domains such as illegal rhino poaching. We first tackle the challenge of modelling a complex multi-agent environment. The focus is on the space debris removal problem, which poses a major threat to the sustainability of Earth orbit. We develop a high-fidelity space debris simulator that allows us to simulate the future evolution of the space debris environment. Using data from the simulator we propose a surrogate model, which enables fast evaluation of different strategies chosen by the space actors. We then analyse the dynamics of strategic decision making among multiple space actors, comparing different models of agent interaction: static vs. dynamic and centralised vs. decentralised. The outcome of our work can help future decision makers design debris removal strategies, and consequently mitigate the threat of space debris. Next, we study how to design a robust learning process in critical domains with risky states, where destabilisation of local components can lead to severe impact on the whole network. We propose a novel robust operator, κ, which can be combined with reinforcement learning methods to learn safe policies that mitigate the threat of external attack or failure within the system. Finally, we investigate the challenge of learning effective behaviour while facing adversarial attackers in spatial security domains such as illegal rhino poaching. We assume that such attackers can be occasionally observed. Our approach combines Bayesian inference with temporal difference learning in order to build a model of the attacker's behaviour. Our method can make effective use of partial observations of the attacker's location and approximate the performance of the fully observable case. This thesis therefore presents novel methods that tackle several important obstacles to deploying multi-agent learning algorithms in the real world, further narrowing the reality gap between theoretical models and real-world applications.
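
    Purely as an illustration of how Bayesian inference over an occasionally observed attacker might be combined with temporal difference learning, the sketch below maintains a discrete belief over the attacker's grid cell, conditions it on noisy observations when they arrive, and uses the belief to weight a TD(0) update of a per-cell value estimate. The grid size, likelihood model, and learning parameters are assumptions for illustration, not the method developed in the thesis.

import numpy as np

n_cells = 25                                   # assumed discretisation of the protected area
belief = np.full(n_cells, 1.0 / n_cells)       # uniform prior over the attacker's location
V = np.zeros(n_cells)                          # TD estimate of expected damage per cell

def bayes_update(belief, likelihood):
    """Condition the location belief on an (occasional) noisy observation."""
    posterior = belief * likelihood
    return posterior / posterior.sum()

def td_update(V, belief, reward, next_belief, alpha=0.1, gamma=0.9):
    """Belief-weighted TD(0) step: credit cells in proportion to the current belief."""
    td_error = reward + gamma * next_belief @ V - belief @ V
    return V + alpha * td_error * belief

# Example step: a ranger report makes cell 12 much more likely, then a poaching
# incident with cost 1.0 is credited to cells according to the belief.
likelihood = np.full(n_cells, 0.01)
likelihood[12] = 0.5
next_belief = bayes_update(belief, likelihood)
V = td_update(V, belief, reward=1.0, next_belief=next_belief)
belief = next_belief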