Learning with Opponent-Learning Awareness
Multi-agent settings are quickly gathering importance in machine learning.
This includes a plethora of recent work on deep multi-agent reinforcement
learning, but can also be extended to hierarchical RL, generative adversarial
networks, and decentralised optimisation. In all these settings the presence of
multiple learning agents renders the training problem non-stationary and often
leads to unstable training or undesired final results. We present Learning with
Opponent-Learning Awareness (LOLA), a method in which each agent shapes the
anticipated learning of the other agents in the environment. The LOLA learning
rule includes a term that accounts for the impact of one agent's policy on the
anticipated parameter update of the other agents. Results show that the
encounter of two LOLA agents leads to the emergence of tit-for-tat and
therefore cooperation in the iterated prisoners' dilemma, while independent
learning does not. In this domain, LOLA also receives higher payouts compared
to a naive learner, and is robust against exploitation by higher order
gradient-based methods. Applied to repeated matching pennies, LOLA agents
converge to the Nash equilibrium. In a round-robin tournament we show that LOLA
agents successfully shape the learning of a range of multi-agent learning
algorithms from the literature, resulting in the highest average returns on the
IPD. We also show that the LOLA update rule can be efficiently calculated using
an extension of the policy gradient estimator, making the method suitable for
model-free RL. The method thus scales to large parameter and input spaces and
nonlinear function approximators. We apply LOLA to a grid world task with an
embedded social dilemma using recurrent policies and opponent modelling. By
explicitly considering the learning of the other agent, LOLA agents learn to
cooperate out of self-interest. The code is at github.com/alshedivat/lola
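The shaping term in the LOLA rule can be illustrated on the matching pennies game the abstract mentions, where every gradient is available in closed form. The sketch below is not the paper's policy-gradient estimator; it is a minimal exact-gradient version for two sigmoid-parameterised players, with step sizes alpha and eta chosen arbitrarily for the illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lola_step(t1, t2, alpha=0.2, eta=2.0):
    """One LOLA-style update for both players of matching pennies.

    Player 1 plays heads with probability p = sigmoid(t1), player 2 with
    q = sigmoid(t2).  The game is zero-sum: V1 = (2p-1)(2q-1), V2 = -V1,
    so all first and second derivatives are closed-form.
    """
    p, q = sigmoid(t1), sigmoid(t2)
    dp, dq = p * (1 - p), q * (1 - q)           # sigmoid derivatives
    # Naive gradients dV_i/dtheta_i
    g1 = 2 * (2 * q - 1) * dp
    g2 = -2 * (2 * p - 1) * dq
    # Cross gradients dV_i/dtheta_j and mixed second derivatives
    dV1_dt2 = 2 * (2 * p - 1) * dq
    dV2_dt1 = -2 * (2 * q - 1) * dp
    d2V2_dt1dt2 = -4 * dp * dq
    d2V1_dt2dt1 = 4 * dp * dq
    # LOLA: add the term that differentiates through the opponent's
    # anticipated naive update (the "shaping" correction)
    t1 += alpha * (g1 + eta * dV1_dt2 * d2V2_dt1dt2)
    t2 += alpha * (g2 + eta * dV2_dt1 * d2V1_dt2dt1)
    return t1, t2

t1, t2 = 0.8, -0.5
for _ in range(20000):
    t1, t2 = lola_step(t1, t2)
# Both mixed strategies approach the Nash equilibrium p = q = 1/2,
# whereas the naive dynamics (eta = 0) cycle around it.
```

Dropping the correction term (eta = 0) recovers the independent naive learner, which is exactly the contrast the abstract draws.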
Planning for Decentralized Control of Multiple Robots Under Uncertainty
We describe a probabilistic framework for synthesizing control policies for
general multi-robot systems, given environment and sensor models and a cost
function. Decentralized, partially observable Markov decision processes
(Dec-POMDPs) are a general model of decision processes where a team of agents
must cooperate to optimize some objective (specified by a shared reward or cost
function) in the presence of uncertainty, but where communication limitations
mean that the agents cannot share their state, so execution must proceed in a
decentralized fashion. While Dec-POMDPs are typically intractable to solve for
real-world problems, recent research on the use of macro-actions in Dec-POMDPs
has significantly increased the size of problem that can be practically solved
as a Dec-POMDP. We describe this general model, and show how, in contrast to
most existing methods that are specialized to a particular problem class, it
can synthesize control policies that use whatever opportunities for
coordination are present in the problem, while balancing off uncertainty in
outcomes, sensor information, and information about other agents. We use three
variations on a warehouse task to show that a single planner of this type can
generate cooperative behavior using task allocation, direct communication, and
signaling, as appropriate.
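The Dec-POMDP model described above can be made concrete as a data structure. The following is a minimal sketch, not the authors' framework: a container for the standard tuple plus a toy two-robot problem (the state, action, and reward names are invented for illustration):

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class DecPOMDP:
    """Minimal container for the Dec-POMDP tuple: states, joint actions,
    transition model T, shared reward R, joint observations, and
    observation model Z."""
    states: list
    joint_actions: list        # tuples, one action per agent
    T: dict                    # (s, joint_a) -> {s_next: prob}
    R: dict                    # (s, joint_a) -> shared team reward
    joint_observations: list   # tuples, one observation per agent
    Z: dict                    # (joint_a, s_next) -> {joint_o: prob}

# Toy problem: two robots must choose 'push' simultaneously to move a box.
states = ['box_left', 'box_right']
joint_actions = list(product(['push', 'wait'], repeat=2))
T = {(s, a): ({'box_right': 1.0} if a == ('push', 'push') else {s: 1.0})
     for s in states for a in joint_actions}
R = {(s, a): (1.0 if a == ('push', 'push') and s == 'box_left' else 0.0)
     for s in states for a in joint_actions}
# Deterministic observations: each robot sees where the box ended up.
joint_obs = [('see_' + s, 'see_' + s) for s in states]
Z = {(a, s2): {('see_' + s2, 'see_' + s2): 1.0}
     for a in joint_actions for s2 in states}
problem = DecPOMDP(states, joint_actions, T, R, joint_obs, Z)
```

Note what makes this decentralized: each robot conditions only on its own entry of the joint observation at execution time, even though planning reasons over the joint model.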
Scale-free memory model for multiagent reinforcement learning. Mean field approximation and rock-paper-scissors dynamics
A continuous time model for multiagent systems governed by reinforcement
learning with scale-free memory is developed. The agents are assumed to act
independently of one another in optimizing their choice of possible actions via
trial-and-error search. To gain awareness about the action value the agents
accumulate in their memory the rewards obtained from taking a specific action
at each moment of time. The contribution of past rewards to the agent's
current perception of action value is described by an integral operator
with a power-law kernel. Finally a fractional differential equation governing
the system dynamics is obtained. The agents are considered to interact with one
another implicitly via the reward of one agent depending on the choice of the
other agents. The pairwise interaction model is adopted to describe this
effect. As a specific example of systems with non-transitive interactions,
two-agent and three-agent systems of the rock-paper-scissors type are analyzed
in detail, including stability analysis and numerical simulation.
Scale-free memory is demonstrated to cause complex dynamics of the systems at
hand. In particular, it is shown that there can be simultaneously two modes of
the system instability undergoing subcritical and supercritical bifurcation,
with the latter one exhibiting anomalous oscillations with the amplitude and
period growing with time. Besides, the instability onset via this supercritical
mode may be regarded as "altruism self-organization". For the three-agent
system the instability dynamics are found to be rather irregular, composed of
alternating fragments of oscillations that differ in their properties.
Comment: 17 pages, 7 figures
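The qualitative difference between scale-free and ordinary (exponential) memory can be seen in a discrete sketch. This is not the paper's fractional-differential model; it merely compares normalized weights w_k over rewards k steps in the past, with a power-law kernel (k+1)^(-alpha) versus a geometric discount (the exponents alpha and gamma are arbitrary choices):

```python
def power_law_weights(n, alpha=0.8):
    """Scale-free memory: weight of a reward k steps in the past
    decays as (k+1)**(-alpha), normalized to sum to 1."""
    w = [(k + 1) ** -alpha for k in range(n)]
    s = sum(w)
    return [x / s for x in w]

def exponential_weights(n, gamma=0.9):
    """Ordinary discounted memory: geometric decay gamma**k."""
    w = [gamma ** k for k in range(n)]
    s = sum(w)
    return [x / s for x in w]

def perceived_value(rewards, weights):
    """Agent's current estimate of an action's value: weighted sum of
    past rewards, most recent first."""
    return sum(w * r for w, r in zip(weights, rewards))
```

The power-law tail keeps distant rewards relevant: at lag 100 the power-law weight retains a few percent of its lag-0 value, while the geometric weight is effectively zero. That long tail is what turns the continuous-time limit into a fractional differential equation.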
A hybrid cross entropy algorithm for solving dynamic transit network design problem
This paper proposes a hybrid multiagent learning algorithm for solving the
dynamic simulation-based bilevel network design problem. The objective is to
determine the optimal frequency of a multimodal transit network, which
minimizes total users' travel cost and operation cost of transit lines. The
problem is formulated as a bilevel programming problem with equilibrium
constraints describing non-cooperative Nash equilibrium in a dynamic
simulation-based transit assignment context. A hybrid algorithm combining the
cross entropy multiagent learning algorithm and the Hooke-Jeeves algorithm is
proposed. Computational results are provided on the Sioux Falls network to
illustrate the performance of the proposed algorithm.
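The cross entropy component of such a hybrid can be sketched in a few lines. This is a generic one-dimensional cross-entropy optimizer, not the paper's bilevel multiagent algorithm (the test function and all parameters are invented for illustration; the Hooke-Jeeves pattern search the paper pairs it with would refine the returned point locally):

```python
import random

def cross_entropy_minimize(f, mu=0.0, sigma=5.0, n_samples=50,
                           n_elite=10, iters=100, seed=0):
    """1-D cross-entropy method: sample candidates from a Gaussian,
    keep the elite fraction, refit the Gaussian to the elites, repeat."""
    rng = random.Random(seed)
    for _ in range(iters):
        xs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        elite = sorted(xs, key=f)[:n_elite]       # lowest costs survive
        mu = sum(elite) / n_elite
        var = sum((x - mu) ** 2 for x in elite) / n_elite
        sigma = max(var ** 0.5, 1e-3)             # floor keeps exploring
    return mu

# Minimize a toy cost with optimum at x = 3.
best = cross_entropy_minimize(lambda x: (x - 3.0) ** 2)
```

The appeal for simulation-based design problems is that the method only needs cost evaluations, never gradients, so the lower-level transit assignment can remain a black-box simulation.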