Decentralized multi-agent reinforcement learning in average-reward dynamic DCOPs
Researchers have introduced the Dynamic Distributed Constraint Optimization Problem (Dynamic DCOP) formulation to model dynamically changing multi-agent coordination problems, where a Dynamic DCOP is a sequence of (static canonical) DCOPs, each partially different from the DCOP preceding it. Existing work typically assumes that the problem in each time step is decoupled from the problems in other time steps, which might not hold in some applications. Therefore, in this paper, we make the following contributions: (i) we introduce a new model, called Markovian Dynamic DCOPs (MD-DCOPs), where the DCOP in the next time step is a function of the value assignments in the current time step; (ii) we introduce two distributed reinforcement learning algorithms, the Distributed RVI Q-learning algorithm and the Distributed R-learning algorithm, that balance exploration and exploitation to solve MD-DCOPs in an online manner; and (iii) we empirically evaluate them against an existing multi-armed bandit DCOP algorithm on dynamic DCOPs.
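For intuition, the following is a minimal single-agent sketch of the R-learning update (Schwartz's average-reward rule) that a distributed R-learning algorithm of this kind builds on; the multi-agent coordination layer and the DCOP structure are not shown, and the state/action encoding is an illustrative placeholder.

from collections import defaultdict

class RLearner:
    def __init__(self, alpha=0.1, beta=0.01):
        self.Q = defaultdict(float)   # Q[(state, action)] value estimates
        self.rho = 0.0                # running estimate of the average reward
        self.alpha, self.beta = alpha, beta

    def best(self, state, actions):
        return max(actions, key=lambda a: self.Q[(state, a)])

    def update(self, s, a, r, s_next, actions):
        q_next = self.Q[(s_next, self.best(s_next, actions))]
        greedy = a == self.best(s, actions)
        q_best = self.Q[(s, self.best(s, actions))]
        # Average-reward TD error: rewards are measured relative to rho
        # rather than being geometrically discounted.
        delta = r - self.rho + q_next - self.Q[(s, a)]
        self.Q[(s, a)] += self.alpha * delta
        # rho is only updated on greedy steps, as in standard R-learning.
        if greedy:
            self.rho += self.beta * (r + q_next - q_best - self.rho)

In the average-reward setting, rho plays the role the discount factor plays elsewhere: the temporal-difference error measures reward relative to the long-run average instead of shrinking future values.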
Coordinating decentralized learning and conflict resolution across agent boundaries
It is crucial for embedded systems to adapt to the dynamics of open environments. This adaptation process becomes especially challenging in the context of multiagent systems because of scalability, partial observability of information, and complex interactions among agents. It is challenging for agents to learn good policies when they need to plan and coordinate in uncertain, dynamic environments, especially when they have large state spaces. It is also critical for agents operating in a multiagent system (MAS) to resolve conflicts among the learned policies of different agents, since such conflicts may have a detrimental influence on overall performance.
The focus of this research is to use a reinforcement-learning-based local optimization algorithm within each agent to learn multiagent policies in a decentralized fashion. These policies allow each agent to adapt to changes in environmental conditions while reorganizing the underlying multiagent network when needed. The research takes an adaptive approach to resolving conflicts that can arise between locally optimal agent policies. First, an algorithm that uses heuristic rules to locally resolve simple conflicts is presented. When the environment is more dynamic and uncertain, a mediator-based mechanism is employed to resolve more complicated conflicts and to selectively expand the agents' state space during the learning process. For scenarios where mediator-based mechanisms with partially global views are ineffective, a more rigorous approach to global conflict resolution that synthesizes multiagent reinforcement learning (MARL) and distributed constraint optimization (DCOP) is developed. These mechanisms are evaluated in the context of a multiagent tornado-tracking application called NetRads. Empirical results show that these mechanisms significantly improve the performance of the tornado-tracking network for a variety of weather scenarios.
The major contributions of this work are: a state-of-the-art decentralized learning approach that supports agent interactions and reorganizes the underlying network when needed; the use of abstract classes of scenarios/states/actions that efficiently manages the exploration of the search space; novel conflict resolution algorithms of increasing complexity that use heuristic rules, sophisticated automated negotiation mechanisms, and distributed constraint optimization methods, respectively; and finally, a rigorous study of the interplay between two popular theories used to solve multiagent problems, namely decentralized Markov decision processes and distributed constraint optimization.
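As an illustration of the first, heuristic tier of conflict resolution described above, the sketch below resolves competing claims of locally learned policies on a shared resource with a simple priority rule (higher expected local utility wins, ties broken by agent id); the rule itself is an assumption for illustration, not the dissertation's exact rule set.

def resolve_conflicts(proposals):
    """proposals: list of (agent_id, resource, expected_utility)."""
    winner = {}  # resource -> (agent_id, expected_utility)
    for agent_id, resource, utility in proposals:
        cur = winner.get(resource)
        # Higher expected utility wins; ties go to the lower agent id.
        if cur is None or (utility, -agent_id) > (cur[1], -cur[0]):
            winner[resource] = (agent_id, utility)
    # Losing agents would fall back to their next-best action (not shown).
    return {res: agent for res, (agent, _) in winner.items()}

# Example: agents 1 and 2 both claim radar sector 'sector-3'.
print(resolve_conflicts([(1, "sector-3", 0.8), (2, "sector-3", 0.6)]))
# -> {'sector-3': 1}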
Distributed Online Learning via Cooperative Contextual Bandits
In this paper we propose a novel framework for decentralized, online learning by many learners. At each moment of time, an instance characterized by a certain context may arrive to each learner; based on the context, the learner can select one of its own actions (which gives a reward and provides information) or request assistance from another learner. In the latter case, the requester pays a cost and receives the reward, but the provider learns the information. In our framework, learners are modeled as cooperative contextual bandits. Each learner seeks to maximize the expected reward from its arrivals, which involves trading off the reward received from its own actions, the information learned from its own actions, the reward received from the actions requested of others, and the cost paid for these actions, taking into account what it has learned about the value of assistance from each other learner. We develop distributed online learning algorithms and provide analytic bounds that compare their efficiency with a complete-knowledge (oracle) benchmark, in which the expected reward of every action in every context is known by every learner. These bounds show that the regret (the loss incurred by the algorithm) is sublinear in time. Our theoretical framework can be used in many practical applications, including Big Data mining, event detection in surveillance sensor networks, and distributed online recommendation systems.
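To make the trade-off concrete, here is a hedged sketch of a single learner in this setting: per (discretized) context it keeps sample-mean reward estimates for its own arms and for "ask learner j" options, whose value is reduced by the assistance cost, and it chooses among them with a UCB-style exploration bonus. The uniform context discretization, the fixed ask_cost, and the bonus form are illustrative simplifications, not the paper's exact partition-based algorithms.

import math
from collections import defaultdict

class CooperativeLearner:
    def __init__(self, own_arms, neighbors, ask_cost=0.1):
        # An option is either one of our own arms or a request to another learner.
        self.options = [("own", a) for a in own_arms] + \
                       [("ask", j) for j in neighbors]
        self.ask_cost = ask_cost
        self.counts = defaultdict(int)    # (context_bin, option) -> #pulls
        self.means = defaultdict(float)   # (context_bin, option) -> mean reward

    def _bin(self, context, bins=10):
        # Assumes a scalar context in [0, 1); real algorithms partition adaptively.
        return min(int(context * bins), bins - 1)

    def choose(self, context, t):
        c = self._bin(context)
        def score(opt):
            n = self.counts[(c, opt)]
            if n == 0:
                return float("inf")           # try every option once per bin
            bonus = math.sqrt(2 * math.log(t + 1) / n)
            cost = self.ask_cost if opt[0] == "ask" else 0.0
            return self.means[(c, opt)] - cost + bonus
        return max(self.options, key=score)

    def observe(self, context, option, reward):
        c = self._bin(context)
        self.counts[(c, option)] += 1
        n = self.counts[(c, option)]
        self.means[(c, option)] += (reward - self.means[(c, option)]) / n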
Multiagent systems: games and learning from structures
Multiple agents are increasingly utilized in various fields, as both physical robots and software agents: search-and-rescue robots, automated driving, auction and electronic-commerce agents, and so on. In multiagent domains, agents interact and co-adapt with other agents. Each agent's choice of policy depends on the others' joint policy to achieve the best available performance. During this process, the environment evolves and is no longer stationary, as each agent adapts to proceed towards its target. Each micro-level step in time may present a different learning problem that needs to be addressed. However, in this non-stationary environment, a holistic phenomenon forms along with the rational strategies of all players; we define this phenomenon as structural properties.
In our research, we present the importance of analyzing structural properties and how to extract them in multiagent environments. According to the agents' objectives, a multiagent environment can be classified as self-interested, cooperative, or competitive. We examine the structure of three general multiagent environments: self-interested random graphical game playing, distributed cooperative team playing, and competitive group survival. In each scenario, we analyze the structure of the environmental setting and demonstrate the learned structure as a comprehensive representation: the structure of players' action influence, the structure of constraints in teamwork communication, and the structure of inter-connections among strategies. This structure represents macro-level knowledge arising in a multiagent system and provides critical, holistic information for each problem domain. Lastly, we present some open issues and point toward future research.
Dynamic Continuous Distributed Constraint Optimization Problems
The Distributed Constraint Optimization Problem (DCOP) formulation is a powerful tool to model multi-agent coordination problems that are distributed by nature. The formulation is suitable for problems where the environment does not change over time and where agents select their value assignments from discrete domains. However, in many real-world applications, agents often interact in a more dynamic environment and their variables usually require a more complex domain. Thus, the DCOP formulation lacks the capability to model problems in such dynamic and complex environments. To address these limitations, researchers have proposed Dynamic DCOPs (D-DCOPs) to model how DCOPs change over time and Continuous DCOPs (C-DCOPs) to model DCOPs with continuous variables. The two models address the limitations of DCOPs, but in isolation, and thus it remains a challenge to model problems that have continuous variables in a dynamic environment. Therefore, this dissertation investigates a novel formulation that addresses the two limitations of DCOPs together by modeling both the dynamic nature of the environment and the continuous nature of the variables. Firstly, we propose Proactive Dynamic DCOPs (PD-DCOPs), which model and solve DCOPs in dynamic environments in a proactive manner. Secondly, we propose several efficient C-DCOP algorithms and provide quality guarantees on their solutions. Finally, we propose Dynamic Continuous DCOPs (DC-DCOPs), a novel formulation that models DCOPs with continuous variables in a dynamic environment.
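As a concrete illustration of the continuous half of this formulation, the sketch below represents a C-DCOP instance as interval-domain variables with cost functions given as callables and minimizes it with coordinate-wise numerical gradient descent; the solver is an illustrative stand-in for the dissertation's C-DCOP algorithms, and the dynamic (time-indexed) dimension of DC-DCOPs is not modeled here.

def total_cost(assign, constraints):
    # Sum binary cost functions over the constrained variable pairs.
    return sum(f(assign[i], assign[j]) for (i, j), f in constraints.items())

def local_descent(domains, constraints, steps=200, lr=0.05, eps=1e-4):
    # Start each variable at the midpoint of its interval domain.
    assign = {v: (lo + hi) / 2 for v, (lo, hi) in domains.items()}
    for _ in range(steps):
        for v, (lo, hi) in domains.items():
            base = total_cost(assign, constraints)
            assign[v] += eps
            grad = (total_cost(assign, constraints) - base) / eps
            assign[v] -= eps
            # Gradient step, projected back onto the interval domain.
            assign[v] = min(hi, max(lo, assign[v] - lr * grad))
    return assign

# Example: two variables coupled by a quadratic cost, minimized over [0, 10].
domains = {"x1": (0.0, 10.0), "x2": (0.0, 10.0)}
constraints = {("x1", "x2"): lambda a, b: (a - b - 2) ** 2 + (a - 5) ** 2}
print(local_descent(domains, constraints))  # converges near x1=5, x2=3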
Using RUBI to partition agents in air traffic problems with hard constraints and reward shaping
Air traffic flow management over the U.S. airspace is a difficult problem. Current management approaches lead to hundreds of thousands of hours of delay, costing billions of dollars annually. Weather and airport conditions may instigate this delay, but routing decisions balancing delay with congestion contribute significantly to the propagation of delays throughout the U.S. airspace. The task of managing delay may be seen as a multiagent congestion problem.
In this problem there are many tightly coupled agents whose actions collectively impact the system, making it difficult for agents to learn how they individually affect the system. Reward shaping is effective at improving agent learning for soft-constraint problems by reducing the noise caused by interactions with other agents, so we extend those results to hard constraints that cannot be easily learned and must be enforced algorithmically. Additionally, congestion must be removed from the system to ensure a safe environment for all aircraft. This can be done with a greedy scheduler that enforces a ground delay on each plane, removing congestion at the cost of added delay.
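One common shaping signal in this line of air traffic work is the difference reward, which scores each agent by the system utility minus the utility of a counterfactual system with that agent removed, isolating the agent's individual contribution. The sketch below assumes a hypothetical congestion-plus-delay utility; it is not the paper's exact system evaluation.

def system_utility(actions, capacity=10):
    # actions: per-agent chosen ground delay (hours). The utility penalizes
    # total delay plus heavy congestion beyond the sector capacity.
    congestion = max(0, sum(1 for d in actions.values() if d == 0) - capacity)
    return -(sum(actions.values()) + 100 * congestion)

def difference_reward(agent, actions):
    # D_i = G(z) - G(z without agent i): the agent's marginal contribution.
    g = system_utility(actions)
    counterfactual = {a: d for a, d in actions.items() if a != agent}
    return g - system_utility(counterfactual)

# Example: 12 aircraft all choosing zero ground delay, over capacity 10.
actions = {i: 0 for i in range(12)}
print(difference_reward(0, actions))  # negative: agent 0 adds to congestion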
Reward shaping reduces the noise caused by agent interactions, and a greedy scheduler removes congestion from the system, but these two approaches cannot simply be combined without a large increase in computational complexity. Agent partitioning can alleviate this complexity by treating each partition of agents as independent of the other partitions, thus simplifying the computation during each learning step.
Our approach combines three methods to perform hard-constraint optimization: a greedy scheduler, reward shaping, and agent partitioning. We present two agent partitioning algorithms in conjunction with this approach to simplify the learning domain. The first partitioning algorithm uses system features to compute similarities between agents, which are then used to partition the agents into small subsets. To remove the assumption of having domain knowledge, we introduce the Reward/Utility Based Impact (RUBI) algorithm, which develops an effective similarity matrix while requiring no prior domain knowledge. This similarity matrix can then be used in any similarity-based partitioning algorithm. Our results show that autonomous partitioning of the agents using system features leads to up to a 450x speedup over simple hard-constraint enforcement, as well as up to a 37% improvement in performance over a greedy scheduling solution, while using RUBI to partition agents led to a better partitioning of the agents as well as a 510x speedup. Both results correspond to hundreds of hours of delay saved in a single day.
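The following is a hedged sketch of RUBI-style, domain-knowledge-free similarity estimation: agent i's impact on agent j is approximated by how much j's reward changes when i's action is replaced by a default null action, and the resulting matrix drives a simple greedy partitioning. The reward_fn interface, the null action, and the clustering rule are assumptions for illustration; the paper's sampling procedure and partitioner may differ.

import itertools

def impact_matrix(agents, actions, reward_fn, null_action=0):
    # impact[(i, j)]: change in agent j's reward when agent i's action
    # is replaced by the null action, sampled from one joint action.
    impact = {(i, j): 0.0 for i, j in itertools.product(agents, repeat=2)}
    base = {j: reward_fn(j, actions) for j in agents}
    for i in agents:
        perturbed = dict(actions, **{i: null_action})
        for j in agents:
            if j != i:
                impact[(i, j)] = abs(reward_fn(j, perturbed) - base[j])
    return impact

def partition(agents, impact, threshold):
    # Greedy clustering: an agent joins the first group containing some
    # member with mutual impact above the threshold, else starts a new group.
    groups = []
    for a in agents:
        for g in groups:
            if any(impact[(a, b)] + impact[(b, a)] > threshold for b in g):
                g.append(a)
                break
        else:
            groups.append([a])
    return groups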
Distributed Gibbs: A linear-space sampling-based DCOP algorithm
National Research Foundation (NRF) Singapore under International Research Centres in Singapore Funding Initiative