Adversarially Trained Actor Critic for Offline CMDPs
We propose a Safe Adversarially Trained Actor Critic (SATAC) algorithm for
offline reinforcement learning (RL) with general function approximation in the
presence of limited data coverage. SATAC operates as a two-player Stackelberg
game featuring a refined objective function. The actor (leader player)
optimizes the policy against two adversarially trained value critics (follower
players), who focus on scenarios where the actor's performance is inferior to
the behavior policy. Our framework provides both theoretical guarantees and a
robust deep-RL implementation. Theoretically, we demonstrate that when the
actor employs a no-regret optimization oracle, SATAC achieves two guarantees:
(i) For the first time in the offline RL setting, we establish that SATAC can
produce a policy that outperforms the behavior policy while maintaining the
same level of safety, which is critical to designing an algorithm for offline
RL. (ii) We demonstrate that the algorithm guarantees policy improvement across
a broad range of hyperparameters, indicating its practical robustness.
Additionally, we offer a practical version of SATAC and compare it with
existing state-of-the-art offline safe-RL algorithms in continuous control
environments. SATAC outperforms all baselines across a range of tasks, thus
validating the theoretical performance guarantees.
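To make the Stackelberg structure concrete, here is a minimal bandit-style sketch of the leader-follower idea: the two critics are chosen adversarially from box confidence sets to be worst for the actor relative to the behavior policy, and the actor ascends the resulting penalized objective. The dataset, confidence widths, and fixed multiplier below are illustrative assumptions, not the authors' implementation.

```python
# Toy, one-step (bandit-style) sketch of a SATAC-like leader-follower update.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
beta = np.array([0.4, 0.3, 0.2, 0.1])        # behavior policy
true_r = np.array([1.0, 0.5, 1.5, 0.2])      # unknown mean rewards (assumed)
true_c = np.array([0.1, 0.9, 0.8, 0.05])     # unknown mean costs (assumed)

# Offline dataset: actions from the behavior policy with noisy reward/cost.
acts = rng.choice(n_actions, size=5000, p=beta)
r_obs = true_r[acts] + rng.normal(0, 0.3, acts.size)
c_obs = true_c[acts] + rng.normal(0, 0.3, acts.size)

counts = np.bincount(acts, minlength=n_actions)
r_hat = np.bincount(acts, weights=r_obs, minlength=n_actions) / counts
c_hat = np.bincount(acts, weights=c_obs, minlength=n_actions) / counts
width = 1.0 / np.sqrt(counts)                # per-action confidence width

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(n_actions)                  # actor (leader) parameters
lam = 2.0                                    # fixed penalty multiplier
for _ in range(500):
    pi = softmax(theta)
    # Followers: within box confidence sets, pick the critics that are worst
    # for the actor *relative to the behavior policy* (relative pessimism).
    q_r = np.where(pi > beta, r_hat - width, r_hat + width)
    q_c = np.where(pi > beta, c_hat + width, c_hat - width)
    # Leader: one ascent step on the penalized relative objective
    # (pi - beta) @ q_r - lam * max(0, (pi - beta) @ q_c).
    eff = q_r - lam * q_c if (pi - beta) @ q_c > 0 else q_r
    grad = pi * (eff - pi @ eff)             # softmax policy gradient
    theta += 0.5 * grad

print("learned policy:", np.round(softmax(theta), 3))
```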
Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs
This paper considers the best policy identification (BPI) problem in online
Constrained Markov Decision Processes (CMDPs). We are interested in algorithms
that are model-free, have low regret, and identify an optimal policy with a
high probability. Existing model-free algorithms for online CMDPs with
sublinear regret and constraint violation do not provide any convergence
guarantee to an optimal policy and provide only average performance guarantees
when a policy is uniformly sampled at random from all previously used policies.
In this paper, we develop a new algorithm, named
Pruning-Refinement-Identification (PRI), based on a fundamental structural
property of CMDPs proved in Koole (1988) and Ross (1989), which we call limited
stochasticity. The property says that for a CMDP with $N$ constraints, there
exists an optimal policy with at most $N$ stochastic decisions.
The proposed algorithm first identifies at which step and in which state a
stochastic decision has to be taken and then fine-tunes the distributions of
these stochastic decisions. PRI achieves three objectives: (i) PRI is a
model-free algorithm; (ii) it outputs a near-optimal policy with high
probability at the end of learning; and (iii) in the tabular setting, PRI
guarantees $\tilde{\mathcal{O}}(\sqrt{K})$ regret and constraint violation,
which significantly improves the best existing regret bound of
$\tilde{\mathcal{O}}(K^{4/5})$ under a model-free algorithm, where $K$
is the total number of episodes.
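The limited-stochasticity property can already be seen in a one-step CMDP (a constrained bandit): a vertex solution of the occupancy-measure LP with one cost constraint randomizes between at most two actions, i.e., it contains at most one stochastic decision. The sketch below, with made-up problem data, is a minimal illustration of that property rather than of PRI itself.

```python
# Limited stochasticity in miniature: with one cost constraint plus the
# probability-simplex constraint, a basic optimal LP solution has at most
# two nonzero entries, hence at most one stochastic decision.
import numpy as np
from scipy.optimize import linprog

r = np.array([1.0, 0.8, 0.6, 0.3, 0.1])   # per-action rewards (made up)
c = np.array([0.9, 0.5, 0.3, 0.2, 0.1])   # per-action costs (made up)
budget = 0.4                               # single cost constraint

# Variables x[a] = probability of action a. Maximize r @ x subject to
# c @ x <= budget, sum(x) = 1, x >= 0 (linprog minimizes, hence -r).
res = linprog(-r, A_ub=c[None, :], b_ub=[budget],
              A_eq=np.ones((1, r.size)), b_eq=[1.0], method="highs")
x = res.x
support = np.flatnonzero(x > 1e-9)
print("optimal action distribution:", np.round(x, 3))
print("support size:", support.size)  # at most 2 -> <= 1 stochastic decision
```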
A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints
Constrained Markov Decision Processes (CMDPs) formalize sequential
decision-making problems whose objective is to minimize a cost function while
satisfying constraints on various cost functions. In this paper, we consider
the setting of episodic fixed-horizon CMDPs. We propose an online algorithm
which leverages the linear programming formulation of finite-horizon CMDP for
repeated optimistic planning to provide a probably approximately correct (PAC)
guarantee on the number of episodes needed to ensure an $\epsilon$-optimal
policy, i.e., a policy with resulting objective value within $\epsilon$ of the
optimal value and satisfying the constraints within $\epsilon$-tolerance, with
probability at least $1-\delta$. The number of episodes needed is shown to be
of the order
$\tilde{\mathcal{O}}\left(\frac{|S||A|C^{2}H^{2}\log(1/\delta)}{\epsilon^{2}}\right)$,
where $C$ is the upper bound on the number of possible successor states for a
state-action pair. Therefore, if $C \ll |S|$, the number of episodes needed
has a linear dependence on the state and action space sizes $|S|$ and $|A|$,
respectively, and a quadratic dependence on the time horizon $H$.
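As a sketch of the LP formulation such planning rests on, the following builds the occupancy-measure LP for a small random finite-horizon CMDP: variables x[h, s, a] satisfy initial-distribution and flow-conservation constraints, the objective cost is minimized subject to one auxiliary-cost budget, and a policy is read off by normalizing x. The random instance and the budget choice are assumptions for illustration; the paper's algorithm additionally wraps this planning step in optimism.

```python
# Occupancy-measure LP for a small finite-horizon CMDP (illustrative only).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
S, A, H = 3, 2, 4                            # states, actions, horizon
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = next-state distribution
obj = rng.uniform(size=(H, S, A))            # objective cost (to minimize)
aux = rng.uniform(size=(H, S, A))            # constrained auxiliary cost
mu0 = np.full(S, 1.0 / S)                    # initial state distribution

n = H * S * A
idx = lambda h, s, a: (h * S + s) * A + a    # flatten (h, s, a) -> column
A_eq = np.zeros((H * S, n))
b_eq = np.zeros(H * S)
for s in range(S):                           # layer h = 0: mass equals mu0
    b_eq[s] = mu0[s]
    for a in range(A):
        A_eq[s, idx(0, s, a)] = 1.0
for h in range(1, H):                        # flow conservation across layers
    for s2 in range(S):
        row = h * S + s2
        for a in range(A):
            A_eq[row, idx(h, s2, a)] = 1.0
        for s in range(S):
            for a in range(A):
                A_eq[row, idx(h - 1, s, a)] -= P[s, a, s2]

# Pick a budget guaranteed feasible: 10% above the minimum achievable
# auxiliary cost (found by a first LP).
base = linprog(aux.ravel(), A_eq=A_eq, b_eq=b_eq, method="highs")
budget = 1.1 * base.fun
res = linprog(obj.ravel(), A_ub=aux.ravel()[None, :], b_ub=[budget],
              A_eq=A_eq, b_eq=b_eq, method="highs")
x = res.x.reshape(H, S, A)
pi = x / np.maximum(x.sum(axis=2, keepdims=True), 1e-12)  # recover policy
print("objective:", round(res.fun, 3), "aux cost:", round(aux.ravel() @ res.x, 3))
```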
Long-Term Fairness with Unknown Dynamics
While machine learning can myopically reinforce social inequalities, it may
also be used to dynamically seek equitable outcomes. In this paper, we
formalize long-term fairness in the context of online reinforcement learning.
This formulation can accommodate dynamical control objectives, such as driving
equity inherent in the state of a population, that cannot be incorporated into
static formulations of fairness. We demonstrate that this framing allows an
algorithm to adapt to unknown dynamics by sacrificing short-term incentives to
drive a classifier-population system towards more desirable equilibria. For the
proposed setting, we develop an algorithm that adapts recent work in online
learning. We prove that this algorithm achieves simultaneous probabilistic
bounds on cumulative loss and cumulative violations of fairness (as statistical
regularities between demographic groups). We compare our proposed algorithm to
the repeated retraining of myopic classifiers, as a baseline, and to a deep
reinforcement learning algorithm that lacks safety guarantees. Our experiments
model human populations according to evolutionary game theory and integrate
real-world datasets.
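A toy version of the classifier-population feedback loop may help fix ideas: the learner's per-group selection rates feed back into group qualification rates, and we track cumulative loss alongside cumulative violations of a demographic-parity regularity. The dynamics, loss function, and both policies below are invented for illustration and are not the paper's model or algorithm.

```python
# Toy classifier-population loop: myopic vs. long-term selection policies.
import numpy as np

eta, T, tol = 0.05, 300, 0.02            # adaptation rate, horizon, parity slack

def run(policy):
    q = np.array([0.7, 0.4])             # qualification rates of groups A, B
    loss = viol = 0.0
    for _ in range(T):
        p = policy(q)                    # per-group selection rates
        loss += np.sum((p - q) ** 2)     # short-term mismatch loss
        viol += max(abs(p[0] - p[1]) - tol, 0.0)   # demographic-parity gap
        q = (1 - eta) * q + eta * p      # population responds to selection
    return loss, viol, q

myopic = lambda q: q.copy()              # zero immediate loss, gap never closes

def longterm(q):                         # pay short-term loss to pull the
    return q + 0.5 * (q.mean() - q)      # population toward parity

for name, pol in [("myopic", myopic), ("long-term", longterm)]:
    loss, viol, q = run(pol)
    print(f"{name:9s} cum-loss={loss:.3f}  cum-violation={viol:.2f}  "
          f"final rates={np.round(q, 3)}")
```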
Near-Optimal Sample Complexity Bounds for Constrained MDPs
In contrast to the advances in characterizing the sample complexity for
solving Markov decision processes (MDPs), the optimal statistical complexity
for solving constrained MDPs (CMDPs) remains unknown. We resolve this question
by providing minimax upper and lower bounds on the sample complexity for
learning near-optimal policies in a discounted CMDP with access to a generative
model (simulator). In particular, we design a model-based algorithm that
addresses two settings: (i) relaxed feasibility, where small constraint
violations are allowed, and (ii) strict feasibility, where the output policy is
required to satisfy the constraint. For (i), we prove that our algorithm
returns an $\epsilon$-optimal policy with probability $1-\delta$, by making
$\tilde{\mathcal{O}}\left(\frac{SA\log(1/\delta)}{(1-\gamma)^{3}\epsilon^{2}}\right)$
queries to the generative model, thus matching the sample complexity for
unconstrained MDPs. For (ii), we show that the algorithm's sample complexity is
upper-bounded by
$\tilde{\mathcal{O}}\left(\frac{SA\log(1/\delta)}{(1-\gamma)^{5}\epsilon^{2}\zeta^{2}}\right)$,
where $\zeta$ is the problem-dependent Slater
constant that characterizes the size of the feasible region. Finally, we prove
a matching lower-bound for the strict feasibility setting, thus obtaining the
first near minimax optimal bounds for discounted CMDPs. Our results show that
learning CMDPs is as easy as MDPs when small constraint violations are allowed,
but inherently more difficult when we demand zero constraint violation.
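A minimal sketch of the model-based, generative-model recipe: draw N next-state samples per state-action pair, then plan in the resulting empirical CMDP via a discounted occupancy-measure LP, relaxing the cost constraint by a small slack as in the relaxed-feasibility setting. Instance sizes, N, the budget, and the slack are arbitrary choices here, not the paper's tuned quantities.

```python
# Empirical-model planning with a generative model (illustrative only).
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
S, A, gamma, N = 4, 3, 0.9, 200
P = rng.dirichlet(np.ones(S), size=(S, A))   # true model (hidden from learner)
r = rng.uniform(size=(S, A))                 # rewards
c = rng.uniform(size=(S, A))                 # costs
mu0 = np.full(S, 1.0 / S)                    # initial state distribution

# Generative model: N i.i.d. next-state draws for every (s, a).
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        draws = rng.choice(S, size=N, p=P[s, a])
        P_hat[s, a] = np.bincount(draws, minlength=S) / N

# Discounted occupancy-measure LP in the empirical model: for each s',
#   sum_a x(s',a) - gamma * sum_{s,a} P_hat(s'|s,a) x(s,a) = mu0(s').
n = S * A
A_eq = np.zeros((S, n))
for s2 in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[s2, s * A + a] = float(s == s2) - gamma * P_hat[s, a, s2]

base = linprog(c.ravel(), A_eq=A_eq, b_eq=mu0, method="highs")
budget = 1.2 * base.fun                      # a cost budget that is feasible
eps = 0.1                                    # relaxed-feasibility slack
res = linprog(-r.ravel(), A_ub=c.ravel()[None, :], b_ub=[budget + eps],
              A_eq=A_eq, b_eq=mu0, method="highs")
x = res.x.reshape(S, A)
pi = x / x.sum(axis=1, keepdims=True)        # policy from occupancy measure
print("empirical return:", round(-res.fun, 3))
print("policy:\n", np.round(pi, 3))
```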
Provably Efficient Model-Free Algorithms for Non-stationary CMDPs
We study model-free reinforcement learning (RL) algorithms in episodic
non-stationary constrained Markov Decision Processes (CMDPs), in which an agent
aims to maximize the expected cumulative reward subject to a cumulative
constraint on the expected utility (cost). In the non-stationary environment,
reward, utility functions, and transition kernels can vary arbitrarily over
time as long as the cumulative variations do not exceed certain variation
budgets. We propose the first model-free, simulator-free RL algorithms with
sublinear regret and zero constraint violation for non-stationary CMDPs in both
tabular and linear function approximation settings with provable performance
guarantees. Our results on regret bound and constraint violation for the
tabular case match the corresponding best results for stationary CMDPs when the
total variation budget is known. Additionally, we present a general framework for
addressing the well-known challenges associated with analyzing non-stationary
CMDPs, without requiring prior knowledge of the variation budget. We apply this
approach to both the tabular and linear function approximation settings.
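For intuition on how a known variation budget is typically used, here is a skeleton of a restart schedule: a stationary-CMDP learner is re-initialized every W episodes, with W tuned to the budget. The (K/B)^{2/3} tuning is an assumption borrowed from standard restart-based recipes in the non-stationary RL literature, and the learner stub is a placeholder; notably, the paper's framework also handles an unknown budget.

```python
# Restart-schedule skeleton for non-stationary CMDPs (assumed tuning).
import math

def restart_schedule(K, B, c=1.0):
    """Epoch lengths for K total episodes under variation budget B."""
    W = max(1, math.ceil(c * (K / max(B, 1e-8)) ** (2 / 3)))  # assumed tuning
    return [min(W, K - start) for start in range(0, K, W)]

class StationaryCMDPLearner:
    """Placeholder for any stationary model-free CMDP learner."""
    def reset(self):
        pass    # re-initialize value/cost estimates and exploration bonuses
    def run_episode(self):
        pass    # one episode of the stationary algorithm

K, B = 10_000, 25.0
learner = StationaryCMDPLearner()
for epoch_len in restart_schedule(K, B):
    learner.reset()                      # forget stale, pre-drift estimates
    for _ in range(epoch_len):
        learner.run_episode()
```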