148 research outputs found
Provably Efficient Model-Free Algorithm for MDPs with Peak Constraints
In the optimization of dynamic systems, the variables typically have
constraints. Such problems can be modeled as a Constrained Markov Decision
Process (CMDP). This paper considers peak constraints, where the agent
chooses a policy that maximizes the long-term average reward while
satisfying the constraints at every time step. We propose a model-free
algorithm that converts the CMDP into an unconstrained problem and applies a
Q-learning based approach. We extend the concept of probably approximately
correct (PAC) analysis to define a criterion of an ε-optimal policy. The
proposed algorithm is proved to achieve an ε-optimal policy with high
probability once the number of episodes is sufficiently large relative to
the numbers of states and actions, the number of steps per episode, and the
number of constraint functions. To our knowledge, this is the first
PAC-style analysis for CMDPs with peak constraints where the transition
probabilities are not known a priori. We demonstrate the proposed algorithm
on an energy harvesting problem, where it outperforms the state of the art
and performs close to the theoretical upper bound of the studied
optimization problem.
Sample-based Search Methods for Bayes-Adaptive Planning
A fundamental issue for control is acting in the face of uncertainty about the environment. Amongst other things, this induces a trade-off between exploration and exploitation. A model-based Bayesian agent optimizes its return by maintaining a posterior distribution over possible environments, and considering all possible future paths. This optimization is equivalent to solving a Markov Decision Process (MDP) whose hyperstate comprises the agent's beliefs about the environment, as well as its current state in that environment. This corresponding process is called a Bayes-Adaptive MDP (BAMDP). Even for MDPs with only a few states, it is generally intractable to solve the corresponding BAMDP exactly. Various heuristics have been devised, but those that are computationally tractable often perform indifferently, whereas those that perform well are typically so expensive as to be applicable only in small domains with limited structure. Here, we develop new tractable methods for planning in BAMDPs based on recent advances in the solution to large MDPs and general partially observable MDPs. Our algorithms are sample-based, plan online in a way that is focused on the current belief, and, critically, avoid expensive belief updates during simulations. In discrete domains, we use Monte-Carlo tree search to search forward in an aggressive manner. The derived algorithm can scale to large MDPs and provably converges to the Bayes-optimal solution asymptotically. We then consider a more general class of simulation-based methods in which approximation methods can be employed to allow value function estimates to generalize between hyperstates during search. This allows us to tackle continuous domains. We validate our approach empirically in standard domains by comparison with existing approximations. Finally, we explore Bayes-adaptive planning in environments that are modelled by rich, non-parametric probabilistic models. 
We demonstrate that a fully Bayesian agent can be advantageous in the exploration of complex and even infinite, structured domains.
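The key trick the abstract mentions, avoiding belief updates during simulations, is root sampling: each simulation draws one environment from the posterior at its root and runs entirely inside that sample. A minimal sketch, assuming a 2-armed Bernoulli bandit with made-up Beta posterior counts; a full planner such as BAMCP would additionally grow a search tree per simulation, which is omitted here:

```python
import random

# Beta(alpha, beta) posterior per arm after some assumed observations.
posterior = {0: (2, 3), 1: (5, 2)}

def root_sampling_values(n_sims=5000):
    value = {0: 0.0, 1: 0.0}
    for _ in range(n_sims):
        # Root sampling: one posterior draw per simulation, reused
        # throughout that simulation, so no belief updates are needed.
        theta = {a: random.betavariate(*posterior[a]) for a in (0, 1)}
        for a in (0, 1):
            value[a] += theta[a] / n_sims
    return value

random.seed(0)
values = root_sampling_values()
best_arm = max(values, key=values.get)
```

Averaging over many root samples makes the value estimates converge to posterior expectations, which is the source of the asymptotic Bayes-optimality argument.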
A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning
We present a tutorial on Bayesian optimization, a method of finding the
maximum of expensive cost functions. Bayesian optimization employs the Bayesian
technique of setting a prior over the objective function and combining it with
evidence to get a posterior function. This permits a utility-based selection of
the next observation to make on the objective function, which must take into
account both exploration (sampling from areas of high uncertainty) and
exploitation (sampling areas likely to offer improvement over the current best
observation). We also present two detailed extensions of Bayesian optimization,
with experiments---active user modelling with preferences, and hierarchical
reinforcement learning---and a discussion of the pros and cons of Bayesian
optimization based on our experiences.
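The loop the abstract describes, a prior over the objective updated into a posterior, with a utility (acquisition) function choosing the next observation, can be sketched with a Gaussian-process surrogate and the expected-improvement acquisition. The objective, kernel length scale, and grid are illustrative assumptions:

```python
import numpy as np
from math import erf, sqrt, pi

def f(x):
    return -(x - 0.3) ** 2  # unknown expensive objective (assumed)

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel on scalars.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Standard GP regression equations with a zero prior mean.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = np.diag(rbf(Xs, Xs) - Ks.T @ Kinv @ Ks)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    # EI trades off exploitation (mu - best) against exploration (sd).
    sd = np.sqrt(var)
    z = (mu - best) / sd
    Phi = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (mu - best) * Phi + sd * phi

X = np.array([0.0, 1.0])
y = f(X)
grid = np.linspace(0.0, 1.0, 101)
for _ in range(10):
    mu, var = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, var, y.max()))]
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))
```

Each iteration spends one expensive evaluation where the acquisition is highest, which is why the method suits objectives that are costly to query.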
Provably Learning Nash Policies in Constrained Markov Potential Games
Multi-agent reinforcement learning (MARL) addresses sequential
decision-making problems with multiple agents, where each agent optimizes its
own objective. In many real-world instances, the agents may not only want to
optimize their objectives, but also ensure safe behavior. For example, in
traffic routing, each car (agent) aims to reach its destination quickly
(objective) while avoiding collisions (safety). Constrained Markov Games (CMGs)
are a natural formalism for safe MARL problems, though generally intractable.
In this work, we introduce and study Constrained Markov Potential Games
(CMPGs), an important class of CMGs. We first show that a Nash policy for CMPGs
can be found via constrained optimization. A tempting approach is to solve it
with Lagrangian-based primal-dual methods. However, as we show, in contrast
to the single-agent setting, CMPGs do not satisfy strong duality, rendering
such approaches inapplicable and potentially unsafe. To solve the CMPG
problem, we propose the algorithm Coordinate-Ascent for CMPGs (CA-CMPG),
which provably converges to a Nash policy in tabular, finite-horizon CMPGs.
Furthermore, we provide the first sample complexity bounds for learning Nash
policies in unknown CMPGs, which, under additional assumptions, guarantee
safe exploration.
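The coordinate-ascent idea can be illustrated on a one-shot constrained potential game rather than a full finite-horizon CMPG: agents take turns switching to a better feasible action while the other agent's action is held fixed, and since every unilateral improvement raises the shared potential, the loop terminates at a constrained Nash equilibrium. The potential and constraint below are made-up toy functions, not the paper's setting:

```python
# Two agents, three actions each; both unilaterally improve the same
# potential, so greedy coordinate ascent over agents must terminate.
actions = [0, 1, 2]

def potential(a1, a2):  # shared potential function (assumed)
    return -(a1 - 1) ** 2 - (a2 - 2) ** 2 + 0.1 * a1 * a2

def feasible(a1, a2):  # joint safety constraint (assumed)
    return a1 + a2 <= 3

profile = [0, 0]
changed = True
while changed:
    changed = False
    for i in (0, 1):  # coordinate ascent: one agent updates at a time
        for a in actions:
            trial = profile.copy()
            trial[i] = a
            # Only deviations that keep the joint profile feasible and
            # strictly improve the potential are accepted.
            if feasible(*trial) and potential(*trial) > potential(*profile) + 1e-9:
                profile = trial
                changed = True
```

In the tabular CMPG setting, each agent's inner step is itself a constrained single-agent MDP solve rather than a scan over a finite action set, but the outer structure is the same.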
Safe Model-Based Multi-Agent Mean-Field Reinforcement Learning
Many applications, e.g., in shared mobility, require coordinating a large
number of agents. Mean-field reinforcement learning addresses the resulting
scalability challenge by optimizing the policy of a representative agent. In
this paper, we address an important generalization where there exist global
constraints on the distribution of agents (e.g., capacity constraints or
minimum coverage requirements). We propose Safe-M³-UCRL, the first
model-based algorithm that attains safe policies even in the case of
unknown transition dynamics. As a key ingredient, it uses epistemic
uncertainty in the transition model within a log-barrier approach to ensure
pessimistic constraint satisfaction with high probability. We showcase
Safe-M³-UCRL on the vehicle repositioning problem faced by many shared
mobility operators and evaluate its performance through simulations built
on Shenzhen taxi trajectory data. Our algorithm effectively meets the
demand in critical areas while ensuring service accessibility in regions
with low demand.
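The combination of epistemic uncertainty and a log barrier can be sketched in one dimension: the learned constraint model returns a mean and an epistemic standard deviation, the pessimistic constraint value inflates the mean by a confidence width, and a log barrier on the pessimistic slack keeps the optimizer strictly inside the safe set. Every function and constant below is an illustrative assumption, not the paper's model:

```python
import numpy as np

beta, eta = 2.0, 0.05  # confidence width and barrier weight (assumed)

def objective(x):  # reward to maximize (assumed)
    return -(x - 2.0) ** 2

def constraint_mean(x):  # learned constraint model, requirement g(x) <= 1
    return 0.5 * x

def constraint_std(x):  # epistemic uncertainty of the model (assumed)
    return 0.1 * (1 + abs(x))

def barrier_objective(x):
    # Pessimism: evaluate the constraint at its upper confidence bound.
    slack = 1.0 - (constraint_mean(x) + beta * constraint_std(x))
    if slack <= 0:
        return -np.inf  # pessimistically infeasible point
    # Log barrier pushes the optimum strictly inside the safe region.
    return objective(x) + eta * np.log(slack)

grid = np.linspace(0.0, 4.0, 401)
vals = [barrier_objective(x) for x in grid]
x_star = grid[int(np.argmax(vals))]
# x_star satisfies the pessimistic constraint by construction.
assert constraint_mean(x_star) + beta * constraint_std(x_star) <= 1.0
```

Because the barrier diverges as the pessimistic slack approaches zero, the chosen point stays safe even though the unconstrained optimum (here x = 2) lies outside the pessimistic feasible set.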
- …