Multi-Agent Deep Reinforcement Learning for Large-scale Traffic Signal Control
Reinforcement learning (RL) is a promising data-driven approach for adaptive
traffic signal control (ATSC) in complex urban traffic networks, and deep
neural networks further enhance its learning power. However, centralized RL is
infeasible for large-scale ATSC due to the extremely high dimension of the
joint action space. Multi-agent RL (MARL) overcomes the scalability issue by
distributing the global control to each local RL agent, but it introduces new
challenges: now the environment becomes partially observable from the viewpoint
of each local agent due to limited communication among agents. Most existing
studies in MARL focus on designing efficient communication and coordination
among traditional Q-learning agents. This paper presents, for the first time, a
fully scalable and decentralized MARL algorithm for the state-of-the-art deep
RL agent, advantage actor-critic (A2C), within the context of ATSC. In
particular, two methods are proposed to stabilize the learning procedure, by
improving the observability and reducing the learning difficulty of each local
agent. The proposed multi-agent A2C is compared against independent A2C and
independent Q-learning algorithms, in both a large synthetic traffic grid and a
large real-world traffic network of Monaco city, under simulated peak-hour
traffic dynamics. Results demonstrate its optimality, robustness, and sample
efficiency over other state-of-the-art decentralized MARL algorithms.
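As a rough illustration of the decentralized setting the abstract describes, here is a minimal PyTorch sketch of one local actor-critic agent whose input is augmented with its neighbors' observations to improve local observability; the names, dimensions, and loss form are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LocalA2CAgent(nn.Module):
    """One decentralized agent; sees its own observation plus its neighbors'."""
    def __init__(self, obs_dim, n_neighbors, n_actions, hidden=64):
        super().__init__()
        in_dim = obs_dim * (1 + n_neighbors)  # own observation + neighbors'
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.pi = nn.Linear(hidden, n_actions)  # actor head
        self.v = nn.Linear(hidden, 1)           # critic head

    def forward(self, own_obs, neighbor_obs):
        x = torch.cat([own_obs] + list(neighbor_obs), dim=-1)
        h = self.body(x)
        dist = torch.distributions.Categorical(logits=self.pi(h))
        return dist, self.v(h).squeeze(-1)

def a2c_loss(dist, value, action, ret, entropy_coef=0.01):
    # Policy gradient weighted by the advantage, value regression, and an
    # entropy bonus that keeps exploration alive during training.
    advantage = (ret - value).detach()
    return (-dist.log_prob(action) * advantage
            + 0.5 * (ret - value).pow(2)
            - entropy_coef * dist.entropy()).mean()
```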
Adaptive Variance for Changing Sparse-Reward Environments
Robots that are trained to perform a task in a fixed environment often fail
when facing unexpected changes to the environment due to a lack of exploration.
We propose a principled way to adapt the policy for better exploration in
changing sparse-reward environments. Unlike previous works which explicitly
model environmental changes, we analyze the relationship between the value
function and the optimal exploration for a Gaussian-parameterized policy and
show that our theory leads to an effective strategy for adjusting the variance
of the policy, enabling fast adaptation to changes in a variety of
sparse-reward environments.
Comment: Accepted as a conference paper at the International Conference on Robotics and Automation (ICRA) 201
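One plausible reading of such a variance-adaptation rule, sketched below purely for illustration (the trigger condition and constants are assumptions, not the authors' derivation): widen the Gaussian policy's standard deviation when observed returns fall below what the value function predicts, since that gap suggests the environment has changed.

```python
import numpy as np

def adapt_sigma(sigma, predicted_value, observed_return,
                grow=1.5, shrink=0.99, sigma_min=0.05, sigma_max=2.0):
    """Adjust the std of a Gaussian policy (hypothetical rule, for illustration)."""
    if observed_return < predicted_value:
        sigma *= grow    # value estimate broke down: widen exploration
    else:
        sigma *= shrink  # environment looks stable: anneal toward exploitation
    return float(np.clip(sigma, sigma_min, sigma_max))
```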
A Survey and Critique of Multiagent Deep Reinforcement Learning
Deep reinforcement learning (RL) has achieved outstanding results in recent
years. This has led to a dramatic increase in the number of applications and
methods. Recent works have explored learning beyond single-agent scenarios and
have considered multiagent learning (MAL) scenarios. Initial results report
successes in complex multiagent domains, although there are several challenges
to be addressed. The primary goal of this article is to provide a clear
overview of current multiagent deep reinforcement learning (MDRL) literature.
Additionally, we complement the overview with a broader analysis: (i) we
revisit previous key components, originally presented in MAL and RL, and
highlight how they have been adapted to multiagent deep reinforcement learning
settings. (ii) We provide general guidelines to new practitioners in the area:
describing lessons learned from MDRL works, pointing to recent benchmarks, and
outlining open avenues of research. (iii) We take a more critical tone, raising
practical challenges of MDRL (e.g., implementation and computational demands).
We expect this article will help unify and motivate future research to take
advantage of the abundant literature that exists (e.g., RL and MAL) in a joint
effort to promote fruitful research in the multiagent community.
Comment: Under review since Oct 2018. Earlier versions of this work had the title: "Is multiagent deep reinforcement learning the answer or the question? A brief survey"
Managing engineering systems with large state and action spaces through deep reinforcement learning
Decision-making for engineering systems can be efficiently formulated as a
Markov Decision Process (MDP) or a Partially Observable MDP (POMDP). Typical
MDP and POMDP solution procedures utilize offline knowledge about the
environment and provide detailed policies for relatively small systems with
tractable state and action spaces. However, in large multi-component systems
the sizes of these spaces easily explode, as system states and actions scale
exponentially with the number of components, whereas environment dynamics are
difficult to describe in explicit form for the entire system and may only
be accessible through numerical simulators. In this work, to address these
issues, an integrated Deep Reinforcement Learning (DRL) framework is
introduced. The Deep Centralized Multi-agent Actor Critic (DCMAC) is developed,
an off-policy actor-critic DRL approach, providing efficient life-cycle
policies for large multi-component systems operating in high-dimensional
spaces. Apart from deep function approximations that parametrize large state
spaces, DCMAC also adopts a factorized representation of the system actions,
being able to designate individualized component- and subsystem-level
decisions, while maintaining a centralized value function for the entire
system. DCMAC compares well against Deep Q-Network (DQN) solutions and exact
policies, where applicable, and outperforms optimized baselines that are based
on time-based, condition-based, and periodic policies.
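The key structural idea, a factorized action representation under a centralized value function, can be sketched as follows; this is an illustrative reading with assumed names and shapes, not the published DCMAC code.

```python
import torch
import torch.nn as nn

class FactorizedActorCritic(nn.Module):
    """Per-component actor heads with a single system-level critic."""
    def __init__(self, state_dim, n_components, actions_per_component, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # N small heads of size A replace one head over the joint space of size A**N.
        self.heads = nn.ModuleList(
            nn.Linear(hidden, actions_per_component) for _ in range(n_components))
        self.value = nn.Linear(hidden, 1)  # centralized value for the whole system

    def forward(self, state):
        h = self.encoder(state)
        component_logits = [head(h) for head in self.heads]  # one per component
        return component_logits, self.value(h).squeeze(-1)
```

The point of the factorization is visible in the head sizes: the output dimension grows linearly with the number of components rather than exponentially with the joint action space.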
Stein Variational Policy Gradient
Policy gradient methods have been successfully applied to many complex
reinforcement learning problems. However, policy gradient methods suffer from
high variance, slow convergence, and inefficient exploration. In this work, we
introduce a maximum entropy policy optimization framework which explicitly
encourages parameter exploration, and show that this framework can be reduced
to a Bayesian inference problem. We then propose a novel Stein variational
policy gradient method (SVPG) which combines existing policy gradient methods
and a repulsive functional to generate a set of diverse but well-behaved
policies. SVPG is robust to initialization and can easily be implemented in a
parallel manner. On continuous control problems, we find that implementing SVPG
on top of REINFORCE and advantage actor-critic algorithms improves both average
return and data efficiency.
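The Stein-variational update at the heart of this approach can be written compactly; the sketch below is a generic update over a population of flattened policy parameter vectors, with assumed shapes and a fixed RBF bandwidth rather than anything tuned as in the paper.

```python
import numpy as np

def rbf_kernel(theta, h=1.0):
    # theta: (n, d) population of flattened policy parameter vectors.
    diff = theta[:, None, :] - theta[None, :, :]  # theta_j - theta_i
    k = np.exp(-(diff ** 2).sum(-1) / (2 * h))    # k(theta_j, theta_i)
    grad_k = -diff / h * k[..., None]             # d k / d theta_j
    return k, grad_k

def svpg_step(theta, logp_grad, step=1e-2, temperature=1.0):
    # logp_grad: (n, d) gradient of each particle's (return + entropy) objective.
    k, grad_k = rbf_kernel(theta)
    n = theta.shape[0]
    # Kernel-weighted gradients drive particles uphill; the summed kernel
    # gradients act as a repulsive force that keeps the policies diverse.
    phi = (k @ logp_grad / temperature + grad_k.sum(axis=0)) / n
    return theta + step * phi
```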
Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning
Although reinforcement learning methods can achieve impressive results in
simulation, the real world presents two major challenges: generating samples is
exceedingly expensive, and unexpected perturbations or unseen situations cause
proficient but specialized policies to fail at test time. Given that it is
impractical to train separate policies to accommodate all situations the agent
may see in the real world, this work proposes to learn how to quickly and
effectively adapt online to new tasks. To enable sample-efficient learning, we
consider learning online adaptation in the context of model-based reinforcement
learning. Our approach uses meta-learning to train a dynamics model prior such
that, when combined with recent data, this prior can be rapidly adapted to the
local context. Our experiments demonstrate online adaptation for continuous
control tasks on both simulated and real-world agents. We first show simulated
agents adapting their behavior online to novel terrains, crippled body parts,
and highly-dynamic environments. We also illustrate the importance of
incorporating online adaptation into autonomous agents that operate in the real
world by applying our method to a real dynamic legged millirobot. We
demonstrate the agent's learned ability to quickly adapt online to a missing
leg, adjust to novel terrains and slopes, account for miscalibration or errors
in pose estimation, and compensate for pulling payloads.
Comment: First 2 authors contributed equally. Website: https://sites.google.com/berkeley.edu/metaadaptivecontro
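The online-adaptation step described here, conditioning a meta-learned dynamics prior on recent data, might look roughly like the following; the functional form, the context window handling, and the assumption that the model consumes a concatenated state-action tensor are all illustrative, not the authors' code.

```python
import torch
from torch.func import functional_call

def adapt_dynamics(model, ctx_s, ctx_a, ctx_s_next, inner_lr=0.01, steps=1):
    """Adapt a meta-learned dynamics model to a short window of recent transitions."""
    # Start from the meta-learned prior and take a few gradient steps on
    # one-step prediction error over the recent context.
    params = {name: p.clone() for name, p in model.named_parameters()}
    for _ in range(steps):
        pred = functional_call(model, params, (torch.cat([ctx_s, ctx_a], dim=-1),))
        loss = ((pred - ctx_s_next) ** 2).mean()
        grads = torch.autograd.grad(loss, list(params.values()))
        params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}
    return params  # locally adapted dynamics, used for planning until the next window
```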
Continuous control with deep reinforcement learning
We adapt the ideas underlying the success of Deep Q-Learning to the
continuous action domain. We present an actor-critic, model-free algorithm
based on the deterministic policy gradient that can operate over continuous
action spaces. Using the same learning algorithm, network architecture and
hyper-parameters, our algorithm robustly solves more than 20 simulated physics
tasks, including classic problems such as cartpole swing-up, dexterous
manipulation, legged locomotion and car driving. Our algorithm is able to find
policies whose performance is competitive with those found by a planning
algorithm with full access to the dynamics of the domain and its derivatives.
We further demonstrate that for many of the tasks the algorithm can learn
policies end-to-end: directly from raw pixel inputs.
Comment: 10 pages + supplementary
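This abstract describes what is now widely known as DDPG. A schematic rendering of its core update, with assumed network and optimizer objects and standard hyper-parameter values rather than the paper's exact code, is sketched below.

```python
import torch

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s2, done = batch  # tensors sampled from a replay buffer
    with torch.no_grad():
        # Bootstrapped target uses slow-moving target networks for stability.
        y = r + gamma * (1 - done) * target_critic(s2, target_actor(s2))
    critic_loss = ((critic(s, a) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Deterministic policy gradient: move actions toward higher Q-values.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak-average the target networks toward the online networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1 - tau).add_(tau * p.data)
```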
Neuronal Circuit Policies
We propose an effective way to create interpretable control agents, by
re-purposing the function of a biological neural circuit model, to govern
simulated and real world reinforcement learning (RL) test-beds. We model the
tap-withdrawal (TW) neural circuit of the nematode, C. elegans, a circuit
responsible for the worm's reflexive response to external mechanical touch
stimulations, and learn its synaptic and neuronal parameters as a policy for
controlling basic RL tasks. We also autonomously park a real rover robot on a
pre-defined trajectory, by deploying such neuronal circuit policies learned in
a simulated environment. For reconfiguration of the purpose of the TW neural
circuit, we adopt a search-based RL algorithm. We show that our neuronal
policies perform as well as deep neural network policies, with the advantage of
realizing interpretable dynamics at the cell level.
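The abstract names a search-based RL algorithm without detail; as a toy stand-in, the hill-climbing loop below perturbs the circuit's parameter vector and keeps improvements (the interface and constants are assumptions for illustration).

```python
import numpy as np

def random_search(evaluate_return, theta, iters=200, sigma=0.1, seed=0):
    """Hill-climb a parameter vector; evaluate_return(theta) -> episodic return."""
    rng = np.random.default_rng(seed)
    best = evaluate_return(theta)
    for _ in range(iters):
        candidate = theta + sigma * rng.standard_normal(theta.shape)
        score = evaluate_return(candidate)
        if score > best:  # keep perturbations that improve the return
            theta, best = candidate, score
    return theta, best
```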
Multi-Agent Actor-Critic with Generative Cooperative Policy Network
We propose an efficient multi-agent reinforcement learning approach to derive
equilibrium strategies for multi-agents who are participating in a Markov game.
Mainly, we are focused on obtaining decentralized policies for agents to
maximize the performance of a collaborative task by all the agents, which is
similar to solving a decentralized Markov decision process. We propose to use
two different policy networks: (1) decentralized greedy policy network used to
generate greedy action during training and execution period and (2) generative
cooperative policy network (GCPN) used to generate action samples to make other
agents improve their objectives during training period. We show that the
samples generated by GCPN enable other agents to explore the policy space more
effectively and favorably to reach a better policy in terms of achieving the
collaborative tasks.
Comment: 10 pages, 9 figures in total including all sub-figures
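Structurally, the two-network scheme could be rendered as below: during training the learning agent acts greedily while the other agents' actions are sampled from their cooperative policy networks. All interfaces here are assumed, not the authors' implementation.

```python
import torch

def joint_actions(obs_per_agent, greedy_nets, coop_nets, learner_idx):
    """Mix greedy and cooperative-policy actions across agents during training."""
    actions = []
    for i, obs in enumerate(obs_per_agent):
        if i == learner_idx:
            actions.append(greedy_nets[i](obs).argmax(dim=-1))  # greedy action
        else:
            # Sampling from the generative cooperative policy broadens the
            # joint exploration seen by the learning agent.
            dist = torch.distributions.Categorical(logits=coop_nets[i](obs))
            actions.append(dist.sample())
    return actions
```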
Bayesian Transfer Reinforcement Learning with Prior Knowledge Rules
We propose a probabilistic framework to directly insert prior knowledge in
reinforcement learning (RL) algorithms by defining the behaviour policy as a
Bayesian posterior distribution. Such a posterior combines task specific
information with prior knowledge, thus enabling transfer learning across
tasks. The resulting method is flexible, and it can be easily
incorporated into any standard off-policy or on-policy algorithm, such as those
based on temporal differences and policy gradients. We develop a specific
instance of this Bayesian transfer RL framework by expressing prior knowledge
as general deterministic rules that can be useful in a large variety of tasks,
such as navigation tasks. We also build on recent probabilistic and
entropy-regularised RL by developing a novel temporal learning algorithm and
show how to combine it with Bayesian transfer RL. Finally, we demonstrate our
method for solving mazes and show that significant speed-ups can be obtained.
Comment: 11 pages, 2 figures
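A minimal sketch of the core combination, assuming the behaviour policy is simply a normalized product of the task policy and a rule-based prior over actions (the rule and the numbers below are invented for illustration):

```python
import numpy as np

def posterior_policy(task_probs, prior_probs, eps=1e-8):
    """Bayes-style product of a task policy and a rule-based prior over actions."""
    post = task_probs * (prior_probs + eps)  # combine task evidence with the prior
    return post / post.sum()                 # renormalize over the action set

task = np.array([0.25, 0.25, 0.25, 0.25])  # uninformed task policy, 4 actions
prior = np.array([0.4, 0.4, 0.1, 0.1])     # rule: favour goal-directed moves
print(posterior_policy(task, prior))        # exploration biased by the prior
```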