Resource Allocation Among Agents with MDP-Induced Preferences
Allocating scarce resources among agents to maximize global utility is, in
general, computationally challenging. We focus on problems where resources
enable agents to execute actions in stochastic environments, modeled as Markov
decision processes (MDPs), such that the value of a resource bundle is defined
as the expected value of the optimal MDP policy realizable given these
resources. We present an algorithm that simultaneously solves the
resource-allocation and the policy-optimization problems. This allows us to
avoid explicitly representing utilities over exponentially many resource
bundles, leading to drastic (often exponential) reductions in computational
complexity. We then use this algorithm in the context of self-interested agents
to design a combinatorial auction for allocating resources. We empirically
demonstrate the effectiveness of our approach by showing that it can, in
minutes, optimally solve problems for which a straightforward combinatorial
resource-allocation technique would require the agents to enumerate up to 2^100
resource bundles and the auctioneer to solve an NP-complete problem with an
input of that size.
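As a rough illustration of this utility model (not the authors' joint algorithm, which precisely avoids enumerating bundles), the value of a single resource bundle can be computed by value iteration over the MDP restricted to the actions that bundle enables; the array shapes and names below are assumptions:

```python
import numpy as np

def bundle_value(P, R, enabled_actions, gamma=0.95, tol=1e-8):
    """Expected value of the optimal policy realizable with a bundle.

    P: transition probabilities, shape (A, S, S); R: rewards, shape (A, S);
    enabled_actions: action indices the resource bundle makes executable.
    """
    V = np.zeros(P.shape[1])
    while True:
        # Bellman optimality backup restricted to the enabled actions.
        Q = np.array([R[a] + gamma * P[a] @ V for a in enabled_actions])
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new  # state values; evaluate at the agent's start state
        V = V_new
```

The paper's point is that evaluating this for all 2^n candidate bundles is intractable, which is why the allocation and policy-optimization problems are solved jointly instead.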
An Auction-based Coordination Strategy for Task-Constrained Multi-Agent Stochastic Planning with Submodular Rewards
In many domains, such as transportation and logistics, search and rescue, or cooperative surveillance, tasks must be allocated while accounting for possible execution uncertainties. Existing task-coordination algorithms either ignore the stochastic process or are computationally intensive. Taking advantage of the weakly coupled structure of the problem and the opportunity to coordinate in advance, we propose a decentralized auction-based coordination strategy using a newly formulated score function, generated by formulating the problem as task-constrained Markov decision processes (MDPs). The proposed method guarantees convergence and at least 50% optimality provided the reward function is submodular. Furthermore, for large-scale applications, we also propose an approximate variant of the method, namely Deep Auction, which uses neural networks and sidesteps the need to construct the MDPs explicitly. Inspired by the well-known actor-critic architecture, two Transformers are used to map observations to action probabilities and cumulative rewards, respectively. Finally, we demonstrate the performance of the two proposed approaches in the context of drone deliveries, where the stochastic planning for the drone fleet is cast as a stochastic prize-collecting Vehicle Routing Problem (VRP) with time windows. Simulation results are compared with state-of-the-art methods in terms of solution quality, planning efficiency, and scalability.
Comment: 17 pages, 5 figures
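A minimal sketch of the kind of marginal-score auction such guarantees rest on (a centralized stand-in for the decentralized protocol; agents, tasks, and score are hypothetical placeholders, with score(agent, bundle) assumed to come from the agent's task-constrained MDP):

```python
def greedy_auction(agents, tasks, score):
    """One illustrative auction pass: each agent bids its marginal score
    for each unassigned task, and the highest bid wins the task."""
    assignment = {a: [] for a in agents}
    remaining = set(tasks)
    while remaining:
        # Marginal gain of adding task t to agent a's current bundle.
        bids = {(a, t): score(a, assignment[a] + [t]) - score(a, assignment[a])
                for a in agents for t in remaining}
        (a, t), best = max(bids.items(), key=lambda kv: kv[1])
        if best <= 0:
            break  # no remaining bid improves the global score
        assignment[a].append(t)
        remaining.remove(t)
    return assignment
```

Greedy assignment by marginal gain achieves the classic 1/2 approximation when the score function is submodular, which mirrors the 50% optimality bound cited in the abstract.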
Weakly Coupled Deep Q-Networks
We propose weakly coupled deep Q-networks (WCDQN), a novel deep reinforcement
learning algorithm that enhances performance in a class of structured problems
called weakly coupled Markov decision processes (WCMDPs). WCMDPs consist of
multiple independent subproblems connected by an action space constraint, which
is a structural property that frequently emerges in practice. Despite this
appealing structure, WCMDPs quickly become intractable as the number of
subproblems grows. WCDQN employs a single network to train multiple DQN
"subagents," one for each subproblem, and then combines their solutions to
establish an upper bound on the optimal action value. This guides the main DQN
agent towards optimality. We show that the tabular version, weakly coupled
Q-learning (WCQL), converges almost surely to the optimal action value.
Numerical experiments show faster convergence compared to DQN and related
techniques in settings with as many as 10 subproblems, $3^{10}$ total actions,
and a continuous state space.
Comment: To appear in the proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
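A minimal tabular sketch of the weakly coupled Q-learning (WCQL) idea described above, with all indexing conventions assumed:

```python
import numpy as np

def wcql_step(q_main, q_subs, trans, alpha=0.1, gamma=0.99):
    """One tabular weakly coupled Q-learning step (all names assumed).

    q_main: Q-table over the joint state/action space, shape (S, A).
    q_subs: list of per-subproblem Q-tables, q_subs[i] of shape (S_i, A_i).
    trans: one observed transition with joint indices s, a, s_next, total
        reward r, and per-subproblem projections s_i, a_i, s_next_i, r_i.
    """
    # 1) Each subagent runs ordinary Q-learning on its own subproblem.
    for i, q in enumerate(q_subs):
        si, ai = trans["s_i"][i], trans["a_i"][i]
        td = trans["r_i"][i] + gamma * q[trans["s_next_i"][i]].max() - q[si, ai]
        q[si, ai] += alpha * td

    # 2) Relaxing the linking action constraint decouples the subproblems
    #    and can only increase value, so the sum of subagent estimates
    #    upper-bounds the optimal joint action value.
    upper = sum(q[trans["s_i"][i], trans["a_i"][i]] for i, q in enumerate(q_subs))

    # 3) Ordinary Q-learning on the joint problem, clipped to the bound.
    s, a, s_next, r = trans["s"], trans["a"], trans["s_next"], trans["r"]
    td = r + gamma * q_main[s_next].max() - q_main[s, a]
    q_main[s, a] = min(q_main[s, a] + alpha * td, upper)
```

The final clip is the operative step: the decomposed bound shrinks the main agent's search space toward the optimal action values.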
Budgeted Reinforcement Learning in Continuous State Space
A Budgeted Markov Decision Process (BMDP) is an extension of a Markov
Decision Process to critical applications requiring safety constraints. It
relies on a notion of risk implemented as a cost signal constrained to lie
below an adjustable threshold. So far, BMDPs could only be solved for finite
state spaces with known dynamics. This work extends the state of the art to
continuous-state environments and unknown
dynamics. We show that the solution to a BMDP is a fixed point of a novel
Budgeted Bellman Optimality operator. This observation allows us to introduce
natural extensions of Deep Reinforcement Learning algorithms to address
large-scale BMDPs. We validate our approach on two simulated applications:
spoken dialogue and autonomous driving.
Comment: N. Carrara and E. Leurent contributed equally.
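A simplified sketch of how a budgeted policy can act greedily given separate reward and cost critics (the paper's Budgeted Bellman Optimality operator is richer, e.g. it can mix actions so the expected cost meets the budget exactly; the names below are assumptions):

```python
import numpy as np

def budgeted_greedy(Qr, Qc, s, beta):
    """Greedy step of a budgeted policy (simplified sketch).

    Among actions whose expected discounted cost Qc[s, a] stays within the
    current budget beta, pick the one maximizing the reward value Qr[s, a].
    Qr, Qc: Q-tables of shape (S, A); s: state index; beta: cost budget.
    """
    feasible = np.flatnonzero(Qc[s] <= beta)
    if feasible.size == 0:
        return int(Qc[s].argmin())  # no feasible action: minimize cost
    return int(feasible[Qr[s, feasible].argmax()])
```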
Endogenous Market Incompleteness Without Market Frictions: Dynamic Suboptimality of Competitive Equilibrium in Multiperiod Overlapping Generations Economies
In this paper, we show that within the set of stochastic three-period-lived OLG economies with productive assets (such as land), markets are necessarily sequentially incomplete and agents in the model do not share risk optimally. We start by characterizing perfect risk sharing and find that it requires state-dependent consumption claims that depend only on the exogenous shock realizations. We then show that the recursive competitive equilibrium of any overlapping generations economy with three or more generations is not strongly stationary. This allows us to show directly that short-run Pareto improvements in risk sharing are possible and hence that the recursive competitive equilibrium is not Pareto optimal. We then show that a financial reform which eliminates the equity asset and replaces it with zero-net-supply insurance contracts (Arrow securities) implements the Pareto-optimal stochastic steady state known to exist in the model. Finally, we show via numerical simulations that a system of government taxes and transfers can lead to a Pareto improvement over the competitive equilibrium in the model.
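Schematically, the perfect risk-sharing benchmark referenced above says that consumption claims are measurable with respect to the exogenous shock alone (notation assumed here, not the paper's):

```latex
% Perfect risk sharing: agent i's consumption at date t depends only on the
% exogenous shock realization z_t, not on cohort or idiosyncratic history.
\[
  c_t^i = c^i(z_t) \quad \text{for all agents } i \text{ and dates } t .
\]
```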
Robust Restless Bandits: Tackling Interval Uncertainty with Deep Reinforcement Learning
We introduce Robust Restless Bandits, a challenging generalization of
restless multi-armed bandits (RMABs). RMABs have been widely studied for
intervention planning with limited resources. However, most works make the
unrealistic assumption that the transition dynamics are known perfectly,
restricting the applicability of existing methods to real-world scenarios. To
make RMABs more useful in settings with uncertain dynamics: (i) We introduce
the Robust RMAB problem and develop solutions for a minimax regret objective
when transitions are given by interval uncertainties; (ii) We develop a double
oracle algorithm for solving Robust RMABs and demonstrate its effectiveness on
three experimental domains; (iii) To enable our double oracle approach, we
introduce RMABPPO, a novel deep reinforcement learning algorithm for solving
RMABs. RMABPPO hinges on learning an auxiliary "$\lambda$-network" that allows
each arm's learning to decouple, greatly reducing the sample complexity required
for training; (iv) Under minimax regret, the adversary in the double oracle
approach is notoriously difficult to implement due to non-stationarity. To
address this, we formulate the adversary oracle as a multi-agent reinforcement
learning problem and solve it with a multi-agent extension of RMABPPO, which
may be of independent interest as the first known algorithm for this setting.
Code is available at https://github.com/killian-34/RobustRMAB.
Comment: 18 pages, 3 figures
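A skeleton of the double oracle loop described in (ii), with the two best-response oracles left as placeholders (in the paper these roles are played by RMABPPO and its multi-agent extension); the zero-sum matrix-game subroutine is the standard LP formulation:

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(payoff):
    """Row player's equilibrium mixture and value of a zero-sum matrix game."""
    n_rows, n_cols = payoff.shape
    # Variables: row mixture x (n_rows entries) and game value v; maximize v
    # subject to x^T payoff >= v for every column, x on the simplex.
    c = np.zeros(n_rows + 1); c[-1] = -1.0
    A_ub = np.hstack([-payoff.T, np.ones((n_cols, 1))])
    A_eq = np.ones((1, n_rows + 1)); A_eq[0, -1] = 0.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n_cols), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n_rows + [(None, None)])
    return res.x[:n_rows], -res.fun

def double_oracle(best_response_agent, best_response_adversary, payoff_fn,
                  init_policy, init_env, iters=20):
    """Double oracle over policies (rows) and adversary environments (cols).

    payoff_fn(policy, env) is the agent's expected return; both best-response
    oracles are placeholders for learned responders to the opponent mixture.
    """
    policies, envs = [init_policy], [init_env]
    for _ in range(iters):
        M = np.array([[payoff_fn(p, e) for e in envs] for p in policies])
        x, _ = solve_zero_sum(M)       # agent's equilibrium mixture
        y, _ = solve_zero_sum(-M.T)    # adversary's equilibrium mixture
        policies.append(best_response_agent(envs, y))
        envs.append(best_response_adversary(policies, x))
    return policies, envs
```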
Provably Efficient Generalized Lagrangian Policy Optimization for Safe Multi-Agent Reinforcement Learning
We examine online safe multi-agent reinforcement learning using constrained
Markov games in which agents compete by maximizing their expected total rewards
under a constraint on expected total utilities. Our focus is confined to an
episodic two-player zero-sum constrained Markov game with independent
transition functions that are unknown to agents, adversarial reward functions,
and stochastic utility functions. For such a Markov game, we employ an approach
based on the occupancy measure to formulate it as an online constrained
saddle-point problem with an explicit constraint. We extend the Lagrange
multiplier method in constrained optimization to handle the constraint by
creating a generalized Lagrangian with minimax decision primal variables and a
dual variable. Next, we develop an upper confidence reinforcement learning
algorithm to solve this Lagrangian problem while balancing exploration and
exploitation. Our algorithm updates the minimax decision primal variables via
online mirror descent and the dual variable via a projected gradient step, and we
prove that it enjoys a sublinear rate $\widetilde{O}((|X|+|Y|)L\sqrt{T(|A|+|B|)})$ for
both regret and constraint violation after playing $T$ episodes of the game.
Here, $L$ is the horizon of each episode, and $(|X|,|A|)$ and $(|Y|,|B|)$ are the
state/action space sizes of the min-player and the max-player, respectively. To
the best of our knowledge, we provide the first provably efficient online safe
reinforcement learning algorithm in constrained Markov games.
Comment: 59 pages; a full version of the main paper in the 5th Annual Conference on Learning for Dynamics and Control.
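A schematic single iteration of the primal-dual scheme described above, with exponentiated-gradient mirror descent on simplex-constrained primal variables and a projected gradient step on the dual (the occupancy-measure machinery and optimistic exploration bonuses of the paper are omitted, and all names are assumptions):

```python
import numpy as np

def primal_dual_step(x, y, lam, grad_x, grad_y, violation,
                     eta=0.05, lam_max=10.0):
    """One iteration of a generalized-Lagrangian saddle-point update.

    x, y: probability vectors (primal variables of min- and max-player);
    lam: nonnegative dual variable for the utility constraint;
    grad_x, grad_y: callables returning Lagrangian gradients;
    violation: callable returning (threshold - achieved utility).
    """
    # Min-player: mirror descent step; the exponentiated-gradient form
    # keeps x on the probability simplex without an explicit projection.
    x = x * np.exp(-eta * grad_x(x, y, lam))
    x = x / x.sum()

    # Max-player: mirror ascent step.
    y = y * np.exp(eta * grad_y(x, y, lam))
    y = y / y.sum()

    # Dual: projected gradient ascent on the constraint violation,
    # clipped to the bounded interval [0, lam_max].
    lam = float(np.clip(lam + eta * violation(x, y), 0.0, lam_max))
    return x, y, lam
```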