Scalable Multi-Agent Reinforcement Learning for Networked Systems with Average Reward
It has long been recognized that multi-agent reinforcement learning (MARL)
faces significant scalability issues because the sizes of the state
and action spaces are exponentially large in the number of agents. In this
paper, we identify a rich class of networked MARL problems where the model
exhibits a local dependence structure that allows it to be solved in a scalable
manner. Specifically, we propose a Scalable Actor-Critic (SAC) method that can
learn a near-optimal localized policy for optimizing the average reward with
complexity scaling with the state-action space size of local neighborhoods, as
opposed to the entire network. Our result centers around identifying and
exploiting an exponential decay property that ensures the effect of agents on
each other decays exponentially fast in their graph distance.
Learning Scheduling Algorithms for Data Processing Clusters
Efficiently scheduling data processing jobs on distributed compute clusters
requires complex algorithms. Current systems, however, use simple generalized
heuristics and ignore workload characteristics, since developing and tuning a
scheduling policy for each workload is infeasible. In this paper, we show that
modern machine learning techniques can generate highly efficient policies
automatically. Decima uses reinforcement learning (RL) and neural networks to
learn workload-specific scheduling algorithms without any human instruction
beyond a high-level objective such as minimizing average job completion time.
Off-the-shelf RL techniques, however, cannot handle the complexity and scale of
the scheduling problem. To build Decima, we had to develop new representations
for jobs' dependency graphs, design scalable RL models, and invent RL training
methods for dealing with continuous stochastic job arrivals. Our prototype
integration with Spark on a 25-node cluster shows that Decima improves the
average job completion time over hand-tuned scheduling heuristics by at least
21%, achieving up to 2x improvement during periods of high cluster load.
Scalable Reinforcement Learning of Localized Policies for Multi-Agent Networked Systems
We study reinforcement learning (RL) in a setting with a network of agents whose states and actions interact in a local manner, where the objective is to find localized policies such that the (discounted) global reward is maximized. A fundamental challenge in this setting is that the state-action space size scales exponentially in the number of agents, rendering the problem intractable for large networks. In this paper, we propose a Scalable Actor-Critic (SAC) framework that exploits the network structure and finds a localized policy that is an O(ρ^(κ+1))-approximation of a stationary point of the objective for some ρ ∈ (0,1), with complexity that scales with the local state-action space size of the largest κ-hop neighborhood of the network.
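To make the κ-hop neighborhood and the scaling claim concrete, here is a minimal sketch. It is not taken from the paper: the function names, the line-graph topology, and the assumption that every agent has the same number of local states and actions are all hypothetical choices made for illustration. It compares the joint state-action space size of one agent's κ-hop neighborhood with that of the whole network.

```python
from collections import deque


def k_hop_neighborhood(adjacency, agent, kappa):
    """Return the set of agents within kappa hops of `agent`, itself included.

    `adjacency` maps each agent to a list of its direct neighbors
    (hypothetical representation chosen for this sketch).
    """
    seen = {agent}
    frontier = deque([(agent, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            continue  # do not expand past kappa hops
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen


def joint_space_size(num_agents, n_states, n_actions):
    """Joint state-action space size for `num_agents` agents.

    Assumes (for illustration only) identical local state and action
    counts for every agent.
    """
    return (n_states * n_actions) ** num_agents


# Hypothetical example: 10 agents on a line, 2 local states and 2 local actions each.
n = 10
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

hood = k_hop_neighborhood(adjacency, agent=5, kappa=2)
local_size = joint_space_size(len(hood), n_states=2, n_actions=2)
global_size = joint_space_size(n, n_states=2, n_actions=2)

print(sorted(hood), local_size, global_size)
```

In this toy setting the 2-hop neighborhood of agent 5 contains five agents, so the local space is far smaller than the global one. The gap widens as the network grows while κ stays fixed.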
Scalable Model-based Policy Optimization for Decentralized Networked Systems
Reinforcement learning algorithms require a large number of samples; this often limits their real-world application, even on simple tasks. The challenge is more pronounced in multi-agent tasks, as each step of operation is more costly, requiring communication or shifting of resources. This work aims to improve the data efficiency of multi-agent control through model-based learning. We consider networked systems where agents are cooperative and communicate only locally with their neighbors, and propose the decentralized model-based policy optimization framework (DMPO). In our method, each agent learns a dynamics model to predict future states and broadcasts its predictions through communication, and the policies are then trained on the model rollouts. To alleviate the bias of model-generated data, we restrict model usage to generating myopic rollouts, thus reducing the compounding error of model generation. To preserve the independence of policy updates, we introduce an extended value function and theoretically prove that the resulting policy gradient is a close approximation to the true policy gradient. We evaluate our algorithm on several benchmarks for intelligent transportation systems: connected autonomous vehicle control tasks (Flow and CACC) and adaptive traffic signal control (ATSC). Empirical results show that our method achieves superior data efficiency and matches the performance of model-free methods using true models. The source code of our algorithm and baselines can be found at https://github.com/PKU-MARL/Model-Based-MARL
Scalable and Sample Efficient Distributed Policy Gradient Algorithms in Multi-Agent Networked Systems
This paper studies a class of multi-agent reinforcement learning (MARL)
problems where the reward that an agent receives depends on the states of other
agents, but the next state only depends on the agent's own current state and
action. We name it REC-MARL standing for REward-Coupled Multi-Agent
Reinforcement Learning. REC-MARL has a range of important applications such as
real-time access control and distributed power control in wireless networks.
This paper presents a distributed and optimal policy gradient algorithm for
REC-MARL. The proposed algorithm is distributed in two aspects: (i) the learned
policy is a distributed policy that maps a local state of an agent to its local
action and (ii) the learning/training is distributed, during which each agent
updates its policy based on its own and neighbors' information. The learned
policy is provably optimal among all local policies and its regret bounds
depend on the dimension of local states and actions. This distinguishes our
result from most existing results on MARL, which often obtain stationary-point
policies. The experimental results of our algorithm for the real-time access
control and power control in wireless networks show that our policy
significantly outperforms the state-of-the-art algorithms and well-known
benchmarks.