67 research outputs found
Scalable Multi-Agent Reinforcement Learning for Networked Systems with Average Reward
It has long been recognized that multi-agent reinforcement learning (MARL)
faces significant scalability issues due to the fact that the size of the state
and action spaces are exponentially large in the number of agents. In this
paper, we identify a rich class of networked MARL problems where the model
exhibits a local dependence structure that allows it to be solved in a scalable
manner. Specifically, we propose a Scalable Actor-Critic (SAC) method that can
learn a near optimal localized policy for optimizing the average reward with
complexity scaling with the state-action space size of local neighborhoods, as
opposed to the entire network. Our result centers around identifying and
exploiting an exponential decay property that ensures the effect of agents on
each other decays exponentially fast in their graph distance
Scalable Reinforcement Learning of Localized Policies for Multi-Agent Networked Systems
We study reinforcement learning (RL) in a setting with a network of agents whose states and actions interact in a local manner where the objective is to find localized policies such that the (discounted) global reward is maximized. A fundamental challenge in this setting is that the state-action space size scales exponentially in the number of agents, rendering the problem intractable for large networks. In this paper, we propose a Scalable Actor-Critic (SAC) framework that exploits the network structure and finds a localized policy that is a O(Ï^(Îș+1))-approximation of a stationary point of the objective for some Ï â (0,1), with complexity that scales with the local state-action space size of the largest Îș-hop neighborhood of the network
Distributed Reinforcement Learning in Multi-Agent Networked Systems
We study distributed reinforcement learning (RL) for a network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are local, e.g., between neighbors. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies are non-local and provide a finite-time error bound that shows how the convergence rate depends on the depth of the dependencies in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation that apply beyond the setting of RL in networked systems
Distributed Reinforcement Learning in Multi-Agent Networked Systems
We study distributed reinforcement learning (RL) for a network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are local, e.g., between neighbors. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies are non-local and provide a finite-time error bound that shows how the convergence rate depends on the depth of the dependencies in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation that apply beyond the setting of RL in networked systems
Graphical Models and Symmetries : Loopy Belief Propagation Approaches
Whenever a person or an automated system has to reason in uncertain domains, probability theory is necessary. Probabilistic graphical models allow us to build statistical models that capture complex dependencies between random variables. Inference in these models, however, can easily become intractable. Typical ways to address this scaling issue are inference by approximate message-passing, stochastic gradients, and MapReduce, among others. Exploiting the symmetries of graphical models, however, has not yet been considered for scaling statistical machine learning applications. One instance of graphical models that are inherently symmetric are statistical relational models. These have recently gained attraction within the machine learning and AI communities and combine probability theory with first-order logic, thereby allowing for an efficient representation of structured relational domains. The provided formalisms to compactly represent complex real-world domains enable us to effectively describe large problem instances. Inference within and training of graphical models, however, have not been able to keep pace with the increased representational power. This thesis tackles two major aspects of graphical models and shows that both inference and training can indeed benefit from exploiting symmetries. It first deals with efficient inference exploiting symmetries in graphical models for various query types. We introduce lifted loopy belief propagation (lifted LBP), the first lifted parallel inference approach for relational as well as propositional graphical models. Lifted LBP can effectively speed up marginal inference, but cannot straightforwardly be applied to other types of queries. Thus we also demonstrate efficient lifted algorithms for MAP inference and higher order marginals, as well as the efficient handling of multiple inference tasks. Then we turn to the training of graphical models and introduce the first lifted online training for relational models. Our training procedure and the MapReduce lifting for loopy belief propagation combine lifting with the traditional statistical approaches to scaling, thereby bridging the gap between statistical relational learning and traditional statistical machine learning
Recommended from our members
Essays on Dynamic Optimization for Markets and Networks
We study dynamic decision-making problems in networks and markets under uncertainty about future payoffs. This problem is difficult in general since 1) Although the current decision (potentially) affects future decisions, the decision-maker does not have exact information on the future payoffs when he/she commits to the current decision; 2) The decision made at one part of the network usually interacts with the decisions made at the other parts of the network, which makes the computation scales very fast with the network size and brings computational challenges in practice. In this thesis, we propose computationally efficient methods to solve dynamic optimization problems on markets and networks, specify a general set of conditions under which the proposed methods give theoretical guarantees on global near-optimality, and further provide numerical studies to verify the performance empirically. The proposed methods/algorithms have a general theme as âlocal algorithmsâ, meaning that the decision at each node/agent on the network uses only partial information on the network.
In the first part of this thesis, we consider a network model with stochastic uncertainty about future payoffs. The network has a bounded degree, and each node takes a discrete decision at each period, leading to a per-period payoff which is a sum of three parts: node rewards for individual node decisions, temporal interactions between individual node decisions from the current and previous periods, and spatial interactions between decisions from pairs of neighboring nodes. The objective is to maximize the expected total payoffs over a finite horizon. We study a natural decentralized algorithm (whose computational requirement is linear in the network size and planning horizon) and prove that our decentralized algorithm achieves global near-optimality when temporal and spatial interactions are not dominant compared to the randomness in node rewards. Decentralized algorithms are parameterized by the locality parameter L: An L-local algorithm makes its decision at each node v based on current and (simulated) future payoffs only up to L periods ahead, and only in an L-radius neighborhood around v. Given any permitted error Δ > 0, we show that our proposed L-local algorithm with L = O(log(1/Δ)) has an average per-node-per- period optimality gap bounded above by Δ, in networks where temporal and spatial interactions are not dominant. This constitutes the first theoretical result establishing the global near-optimality of a local algorithm for network dynamic optimization.
In the second part of this thesis, we consider the previous three types of payoff functions under adversarial uncertainty about the future. In general, there are no performance guarantees for arbitrary payoff functions. We consider an additional convexity structure in the individual node payoffs and interaction functions, which helps us leverage the tools in the broad Online Convex Optimization literature. In this work, we study the setting where there is a trade-off between developing future predictions for a longer lookahead horizon, denoted as k versus increasing spatial radius for decentralized computation, denoted as r. When deciding individual node decisions at each time, each node has access to predictions of local cost functions for the next k time steps in an r-hop neighborhood. Our work proposes a novel online algorithm, Localized Predictive Control (LPC), which generalizes predictive control to multi-agent systems. We show that LPC achieves a competitive ratio approaching to 1 exponentially fast in ÏT and ÏS in an adversarial setting, where ÏT and ÏS are constants in (0, 1) that increase with the relative strength of temporal and spatial interaction costs, respectively. This is the first competitive ratio bound on decentralized predictive control for networked online convex optimization. Further, we show that the dependence on k and r in our results is near-optimal by lower bounding the competitive ratio of any decentralized online algorithm.
In the third part of this work, we consider a general dynamic matching model for online competitive gaming platforms. Players arrive stochastically with a skill attribute, the Elo rating. The distribution of Elo is known and i.i.d across players. However, the individualâs rating is only observed upon arrival. Matching two players with different skills incurs a match cost. The goal is tominimize a weighted combination of waiting costs and matching costs in the system. We investigate a popular heuristic used in industry to trade-off between these two costs, the Bubble algorithm. The algorithm places arriving players on the Elo line with a growing bubble around them. When two bubbles touch, the two players get matched. We show that, with the optimal bubble expansion rate, the Bubble algorithm achieves a constant factor ratio against the offline optimal cost when the match cost (resp. waiting cost) is a power of Elo difference (resp. waiting time). We use playersâ activity logs data from a gaming start-up to validate our approach and further provide guidance on how to tune the Bubble expansion rate in practice
- âŠ