The MADP Toolbox: An Open-Source Library for Planning and Learning in (Multi-)Agent Systems
This article describes the MultiAgent Decision Process (MADP) toolbox, a software library to support planning and learning for intelligent agents and multiagent systems in uncertain environments. Some of its key features are that it supports partially observable environments and stochastic transition models; has unified support for single- and multiagent systems; provides a large number of models for decision-theoretic decision making, including one-shot decision making (e.g., Bayesian games) and sequential decision making under various assumptions of observability and cooperation, such as Dec-POMDPs and POSGs; provides tools and parsers to quickly prototype new problems; provides an extensive range of planning and learning algorithms for single- and multiagent systems; and is written in C++ and designed to be extensible via the object-oriented paradigm.
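To make the kind of model such a library must represent concrete, the following is a minimal Python sketch of the Dec-POMDP tuple (agents, states, joint actions and observations, transition, observation, and reward functions). It is purely illustrative, deliberately not the MADP toolbox's own C++ class hierarchy, and all names in it are hypothetical.

    # Hypothetical sketch of the Dec-POMDP tuple <n, S, {A_i}, T, R, {O_i}, O>;
    # illustrative only -- the actual MADP toolbox exposes a C++ class hierarchy.
    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class DecPOMDP:
        n_agents: int
        states: List[str]
        joint_actions: List[Tuple[str, ...]]         # one action per agent
        joint_observations: List[Tuple[str, ...]]    # one observation per agent
        transition: Dict[Tuple[str, Tuple[str, ...]], Dict[str, float]]              # P(s' | s, a)
        observation: Dict[Tuple[Tuple[str, ...], str], Dict[Tuple[str, ...], float]] # P(o | a, s')
        reward: Callable[[str, Tuple[str, ...]], float]                              # shared team reward R(s, a)
        horizon: int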
Local Communication Protocols for Learning Complex Swarm Behaviors with Deep Reinforcement Learning
Swarm systems constitute a challenging problem for reinforcement learning
(RL) as the algorithm needs to learn decentralized control policies that can
cope with limited local sensing and communication abilities of the agents.
While it is often difficult to directly define the behavior of the agents,
simple communication protocols can be defined more easily using prior knowledge
about the given task. In this paper, we propose a number of simple
communication protocols that can be exploited by deep reinforcement learning to
find decentralized control policies in a multi-robot swarm environment. The
protocols are based on histograms that encode the local neighborhood relations
of the agents and can also transmit task-specific information, such as the
shortest distance and direction to a desired target. In our framework, we use
an adaptation of Trust Region Policy Optimization to learn complex
collaborative tasks, such as formation building and building a communication
link. We evaluate our findings in a simulated 2D-physics environment, and
compare the implications of different communication protocols.
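The histogram-based protocols lend themselves to a compact observation function. Below is a minimal sketch, assuming 2D agent positions, a fixed communication radius, and a fixed number of distance bins; the function name and parameters are hypothetical, and this is not the authors' implementation.

    import numpy as np

    def neighborhood_histogram(positions, agent_idx, comm_radius=1.0, n_bins=8):
        """Histogram of distances to neighbors within the communication radius.

        A crude stand-in for the histogram-based protocols described above:
        each agent only sees an aggregate (binned) view of its neighborhood,
        so the observation size is independent of the number of agents.
        """
        me = positions[agent_idx]
        others = np.delete(positions, agent_idx, axis=0)
        dists = np.linalg.norm(others - me, axis=1)
        dists = dists[dists <= comm_radius]           # limited local sensing
        hist, _ = np.histogram(dists, bins=n_bins, range=(0.0, comm_radius))
        return hist / max(len(dists), 1)              # normalize for scale invariance

    # Example: 10 agents at random 2D positions, observation for agent 0.
    obs = neighborhood_histogram(np.random.rand(10, 2), agent_idx=0)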
LiftUpp: Support to Develop Learner Performance
Various motivations exist to move away from the simple assessment of
knowledge towards the more complex assessment and development of competence.
However, to accommodate such a change, high demands are put on the supporting
e-infrastructure in terms of intelligently collecting and analysing data. In
this paper, we discuss these challenges and how they are being addressed by
LiftUpp, a system that is now used in 70% of UK dental schools, and is finding
wider applications in physiotherapy, medicine and veterinary science. We
describe how data is collected for workplace-based development in dentistry
using a dedicated iPad app, which enables an integrated approach to linking and
assessing work flows, skills and learning outcomes. Furthermore, we detail how
the various forms of collected data can be fused, visualized and integrated
with conventional forms of assessment. This enables curriculum integration,
improved real-time student feedback, support for administration, and informed
instructional planning. Together these facets contribute to better support for
the development of learners' competence in situated learning settings, as well
as an improved experience. Finally, we discuss several directions for future
research on intelligent teaching systems that are afforded by using the design
present within LiftUpp.
PAC greedy maximization with efficient bounds on information gain for sensor selection
Submodular function maximization finds application in a variety of real-world decision-making problems. However, most existing methods, based on greedy maximization, assume it is computationally feasible to evaluate F, the function being maximized. Unfortunately, in many realistic settings F is too expensive to evaluate exactly even once. We present probably approximately correct greedy maximization, which requires access only to cheap anytime confidence bounds on F and uses them to prune elements. We show that, with high probability, our method returns an approximately optimal set. We also propose novel, cheap confidence bounds for conditional entropy, which appears in many common choices of F and for which it is difficult to find unbiased or bounded estimates. Finally, results on a real-world dataset from a multicamera tracking system in a shopping mall demonstrate that our approach performs comparably to existing methods, but at a fraction of the computational cost.
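The pruning idea can be sketched in a few lines: keep only candidates whose optimistic (upper) bound could still beat the best pessimistic (lower) bound, tightening bounds as needed. This is an illustrative simplification under assumed callables lower_bound, upper_bound, and refine, not the paper's exact algorithm or its PAC guarantee.

    def pac_greedy_step(candidates, selected, lower_bound, upper_bound, refine, max_rounds=50):
        """One greedy step using confidence bounds on marginal gain instead of exact F.

        lower_bound(e, selected) / upper_bound(e, selected) are cheap anytime bounds
        on the gain of adding element e; refine(e, selected) tightens them.
        Hypothetical sketch only.
        """
        active = set(candidates)
        for _ in range(max_rounds):
            if len(active) <= 1:
                break
            best_lb = max(lower_bound(e, selected) for e in active)
            # Prune elements whose optimistic value cannot beat the best pessimistic one.
            active = {e for e in active if upper_bound(e, selected) >= best_lb}
            for e in active:
                refine(e, selected)   # spend a little more compute tightening bounds
        # Fall back to the best lower bound if several candidates remain.
        return max(active, key=lambda e: lower_bound(e, selected))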
MORAL: Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning
Inferring reward functions from demonstrations and pairwise preferences are auspicious approaches for aligning Reinforcement Learning (RL) agents with human intentions. However, state-of-the-art methods typically focus on learning a single reward model, thus rendering it difficult to trade off different reward functions from multiple experts. We propose Multi-Objective Reinforced Active Learning (MORAL), a novel method for combining diverse demonstrations of social norms into a Pareto-optimal policy. Through maintaining a distribution over scalarization weights, our approach is able to interactively tune a deep RL agent towards a variety of preferences, while eliminating the need for computing multiple policies. We empirically demonstrate the effectiveness of MORAL in two scenarios, which model a delivery and an emergency task that require an agent to act in the presence of normative conflicts. Overall, we consider our research a step towards multi-objective RL with learned rewards, bridging the gap between current reward learning and machine ethics literature.
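The core mechanism, trading off several learned reward functions through a scalarization weight vector, can be illustrated with a small sketch; the reward models and weights below are toy stand-ins, not MORAL's learned components.

    import numpy as np

    def scalarized_reward(reward_models, weights, state, action):
        """Combine several reward functions with a scalarization weight vector.

        reward_models: list of callables r_i(state, action); weights: simplex vector.
        A minimal stand-in for trading off multiple experts' reward functions,
        not the authors' implementation.
        """
        rewards = np.array([r(state, action) for r in reward_models])
        return float(np.dot(weights, rewards))

    # Example with two toy reward functions and equal weights.
    r_delivery = lambda s, a: float(a == "deliver")
    r_norm = lambda s, a: -1.0 if a == "trespass" else 0.0
    print(scalarized_reward([r_delivery, r_norm], np.array([0.5, 0.5]), None, "deliver"))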
Analysing factorizations of action-value networks for cooperative multi-agent reinforcement learning
Recent years have seen the application of deep reinforcement learning techniques to cooperative multi-agent systems, with great empirical success. However, given the lack of theoretical insight, it remains unclear what the employed neural networks are learning, or how we should enhance their learning power to address the problems on which they fail. In this work, we empirically investigate the learning power of various network architectures on a series of one-shot games. Despite their simplicity, these games capture many of the crucial problems that arise in the multi-agent setting, such as an exponential number of joint actions or the lack of an explicit coordination mechanism. Our results extend those in Castellini et al. (Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS’19, International Foundation for Autonomous Agents and Multiagent Systems, pp 1862–1864, 2019) and quantify how well various approaches can represent the requisite value functions, and help us identify the reasons that can impede good performance, such as sparsity of the values or overly tight coordination requirements.
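One of the simplest factorizations examined in this line of work approximates the joint action value by a sum of per-agent utilities. The sketch below fits such an additive factorization to a random one-shot payoff matrix and reports its representation error; it is a toy illustration of the representational limits discussed above, not the paper's experimental setup.

    import numpy as np

    # One-shot 2-agent game: tabular "factored" Q as a sum of per-agent utilities,
    # Q(a1, a2) ~= u1[a1] + u2[a2]. Such an additive factorization cannot represent
    # arbitrary payoff matrices (e.g. tight coordination games). Illustrative only.
    n_actions = 3
    payoff = np.random.randn(n_actions, n_actions)   # true joint action values

    u1 = np.zeros(n_actions)
    u2 = np.zeros(n_actions)
    lr = 0.1
    for _ in range(2000):                            # fit by simple stochastic gradient descent
        a1, a2 = np.random.randint(n_actions, size=2)
        err = (u1[a1] + u2[a2]) - payoff[a1, a2]
        u1[a1] -= lr * err
        u2[a2] -= lr * err

    approx = u1[:, None] + u2[None, :]
    print("max abs error of additive factorization:", np.abs(approx - payoff).max())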
Exploiting submodular value functions for scaling up active perception
In active perception tasks, an agent aims to select sensory actions that reduce its uncertainty about one or more hidden variables. For example, a mobile robot takes sensory actions to efficiently navigate in a new environment. While partially observable Markov decision processes (POMDPs) provide a natural model for such problems, reward functions that directly penalize uncertainty in the agent’s belief can remove the piecewise-linear and convex (PWLC) property of the value function required by most POMDP planners. Furthermore, as the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially with it, making POMDP planning infeasible with traditional methods. In this article, we address the twofold challenge of modeling and planning for active perception tasks. We analyze ρPOMDP and POMDP-IR, two frameworks for modeling active perception tasks that restore the PWLC property of the value function. We show the mathematical equivalence of these two frameworks by showing that, given a ρPOMDP along with a policy, they can be reduced to a POMDP-IR and an equivalent policy (and vice versa). We prove that the value function for the given ρPOMDP (and the given policy) and the reduced POMDP-IR (and the reduced policy) is the same. To efficiently plan for active perception tasks, we identify and exploit the independence properties of POMDP-IR to reduce the computational cost of solving POMDP-IR (and ρPOMDP). We propose greedy point-based value iteration (PBVI), a new POMDP planning method that uses greedy maximization to greatly improve scalability in the action space of an active perception POMDP. Furthermore, we show that, under certain conditions, including submodularity, the value function computed using greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. We establish the conditions under which the value function of an active perception POMDP is guaranteed to be submodular. Finally, we present a detailed empirical analysis on a dataset collected from a multicamera tracking system employed in a shopping mall. Our method achieves similar performance to existing methods but at a fraction of the computational cost, leading to better scalability for solving active perception tasks.
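The greedy maximization underlying greedy PBVI can be illustrated independently of the POMDP machinery: sensors are selected one at a time by largest marginal gain of a set function. The sketch below assumes a hypothetical gain callable and is not the article's planner; it only shows the greedy step to which the submodularity-based error bound applies.

    def greedy_select(sensors, gain, budget):
        """Greedy maximization of a set function (e.g. expected information gain).

        gain(subset) returns the value of activating that set of sensors; if gain is
        monotone and submodular, the greedy result is within (1 - 1/e) of optimal,
        which is the kind of guarantee greedy PBVI builds on. Sketch only.
        """
        selected = []
        for _ in range(budget):
            remaining = [s for s in sensors if s not in selected]
            if not remaining:
                break
            best = max(remaining, key=lambda s: gain(selected + [s]) - gain(selected))
            selected.append(best)
        return selected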
Maximizing information gain in partially observable environments via prediction rewards
Information gathering in a partially observable environment can be formulated as a reinforcement learning (RL) problem where the reward depends on the agent's uncertainty. For example, the reward can be the negative entropy of the agent's belief over an unknown (or hidden) variable. Typically, the rewards of an RL agent are defined as a function of the state-action pairs and not as a function of the belief of the agent; this hinders the direct application of deep RL methods for such tasks. This paper tackles the challenge of using belief-based rewards for a deep RL agent, by offering a simple insight that maximizing any convex function of the belief of the agent can be approximated by instead maximizing a prediction reward: a reward based on prediction accuracy. In particular, we derive the exact error between negative entropy and the expected prediction reward. This insight provides theoretical motivation for several fields using prediction rewards, namely visual attention, question answering systems, and intrinsic motivation, and highlights their connection to the usually distinct fields of active perception, active sensing, and sensor placement. Based on this insight we present deep anticipatory networks (DANs), which enable an agent to take actions to reduce its uncertainty without performing explicit belief inference. We present two applications of DANs: building a sensor selection system for tracking people in a shopping mall and learning discrete models of attention on fashion MNIST and MNIST digit classification.
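The central insight, that a belief-based objective such as negative entropy can be replaced by an expected prediction reward, can be checked on toy beliefs: predicting the most likely value of the hidden variable and receiving reward 1 when correct gives an expected reward of max_x b(x). The sketch below compares the two quantities on a uniform and a peaked belief; it is illustrative only and not the DAN architecture.

    import numpy as np

    def negative_entropy(belief):
        # Negative Shannon entropy of a discrete belief: sum_x b(x) log b(x).
        b = np.clip(belief, 1e-12, 1.0)
        return float(np.sum(b * np.log(b)))

    def expected_prediction_reward(belief):
        # Predicting the most likely hidden value and getting reward 1 when correct
        # yields expected reward max_x b(x), a belief-based stand-in for negative entropy.
        return float(np.max(belief))

    # Both objectives prefer the peaked belief over the uniform one.
    for b in (np.array([0.25, 0.25, 0.25, 0.25]), np.array([0.85, 0.05, 0.05, 0.05])):
        print(b, negative_entropy(b), expected_prediction_reward(b))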
An interactive, web-based tool for genealogical entity resolution
We demonstrate an interactive, web-based tool which helps historians to do Genealogical Entity Resolution. This work has two main goals. First, it uses Machine Learning (ML) algorithms to assist humanities researchers in performing Genealogical Entity Resolution. Second, it facilitates the generation of benchmark data for computer scientists to improve available ML-based Entity Resolution techniques.