Robust and Risk-Sensitive Markov Decision Processes with Applications to Dynamic Optimal Reinsurance
Reliable Off-policy Evaluation for Reinforcement Learning
In a sequential decision-making problem, off-policy evaluation estimates the
expected cumulative reward of a target policy using logged trajectory data
generated from a different behavior policy, without execution of the target
policy. Reinforcement learning in high-stakes environments, such as healthcare
and education, is often limited to off-policy settings due to safety or ethical
concerns, or the infeasibility of exploration. Hence it is imperative to quantify the
uncertainty of the off-policy estimate before deployment of the target policy.
In this paper, we propose a novel framework that provides robust and optimistic
cumulative reward estimates using one or more logged trajectories.
Leveraging methodologies from distributionally robust optimization, we show
that with proper selection of the size of the distributional uncertainty set,
these estimates serve as confidence bounds with non-asymptotic and asymptotic
guarantees under stochastic or adversarial environments. Our results are also
generalized to batch reinforcement learning and are supported by empirical
analysis.
Comment: 39 pages, 4 figures
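The distributionally robust idea behind such confidence bounds can be illustrated with a minimal sketch; this is not the paper's actual estimator, and the KL-ball uncertainty set, the radius `delta`, and all names below are illustrative assumptions. Given per-trajectory importance-weighted returns, a robust lower bound over a KL-ball around the empirical distribution follows from the standard convex dual, optimized here over the single dual variable:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_ope_lower_bound(returns, weights, delta=0.1):
    """KL-distributionally-robust lower bound on the off-policy value.

    returns: per-trajectory cumulative rewards from the logged data.
    weights: per-trajectory importance ratios pi_target / pi_behavior.
    delta:   radius of the KL uncertainty set around the empirical law.
    """
    x = np.asarray(returns) * np.asarray(weights)  # weighted returns
    # Dual of  min_{Q : KL(Q || P_n) <= delta} E_Q[x]:
    #   sup_{a > 0}  -a * log E_{P_n}[exp(-x / a)] - a * delta
    def neg_dual(a):
        z = -x / a
        m = z.max()                                # log-sum-exp stabilization
        lse = m + np.log(np.mean(np.exp(z - m)))
        return -(-a * lse - a * delta)
    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e6), method="bounded")
    return -res.fun

rng = np.random.default_rng(0)
G = rng.normal(1.0, 0.3, size=500)   # toy logged returns
w = np.ones(500)                     # toy case: all importance ratios equal 1
lb = robust_ope_lower_bound(G, w, delta=0.05)
print(lb)                            # robust bound, below the empirical mean
```

Shrinking `delta` toward zero recovers the plain empirical estimate; choosing it from a concentration inequality is what turns the robust value into a valid confidence bound.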
Episodic Bayesian Optimal Control with Unknown Randomness Distributions
Stochastic optimal control with unknown randomness distributions has been
studied for a long time, encompassing robust control, distributionally robust
control, and adaptive control. We propose a new episodic Bayesian approach that
incorporates Bayesian learning with optimal control. In each episode, the
approach learns the randomness distribution with a Bayesian posterior and
subsequently solves the corresponding Bayesian average estimate of the true
problem. The resulting policy is exercised during the episode, while additional
data/observations of the randomness are collected to update the Bayesian
posterior for the next episode. We show that the resulting episodic value
functions and policies converge almost surely to their optimal counterparts of
the true problem if the parametrized model of the randomness distribution is
correctly specified. We further show that the asymptotic convergence rate of
the episodic value functions is of the order . We develop an
efficient computational method based on stochastic dual dynamic programming for
a class of problems that have convex value functions. Our numerical results on
a classical inventory control problem verify the theoretical convergence
results and demonstrate the effectiveness of the proposed computational method.
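The episodic learn-then-solve cycle can be sketched on a toy newsvendor-style inventory problem; this is a hedged illustration, not the paper's SDDP-based method. The Poisson demand model, Gamma prior, and all prices below are assumptions chosen so the Bayesian update is conjugate:

```python
import numpy as np

rng = np.random.default_rng(1)
true_rate = 7.0            # unknown Poisson demand rate
price, cost = 5.0, 2.0     # sell price and unit order cost

# Conjugate Gamma(alpha, beta) prior on the Poisson rate
alpha, beta = 1.0, 1.0

def best_order(alpha, beta, n_mc=2000, grid=np.arange(0, 31)):
    """Order quantity maximizing the Bayesian-average expected profit."""
    rates = rng.gamma(alpha, 1.0 / beta, size=n_mc)  # posterior draws
    demand = rng.poisson(rates)                      # posterior-predictive demand
    profit = price * np.minimum(grid[:, None], demand[None, :]) - cost * grid[:, None]
    return grid[np.argmax(profit.mean(axis=1))]

for episode in range(20):
    q = best_order(alpha, beta)              # solve the Bayesian-average problem
    obs = rng.poisson(true_rate, size=10)    # exercise policy, observe demand
    alpha += obs.sum()                       # conjugate posterior update
    beta += len(obs)

print(q, alpha / beta)  # order quantity and posterior mean of the demand rate
```

As episodes accumulate data, the posterior concentrates on the true rate and the episodic policy approaches the newsvendor solution of the true problem, mirroring the almost-sure convergence the abstract describes.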
Probabilistic Guarantees for Safe Deep Reinforcement Learning
Deep reinforcement learning has been successfully applied to many control
tasks, but the application of such agents in safety-critical scenarios has been
limited due to safety concerns. Rigorous testing of these controllers is
challenging, particularly when they operate in probabilistic environments due
to, for example, hardware faults or noisy sensors. We propose MOSAIC, an
algorithm for measuring the safety of deep reinforcement learning agents in
stochastic settings. Our approach is based on the iterative construction of a
formal abstraction of a controller's execution in an environment, and leverages
probabilistic model checking of Markov decision processes to produce
probabilistic guarantees on safe behaviour over a finite time horizon. It
produces bounds on the probability of safe operation of the controller for
different initial configurations and identifies regions where correct behaviour
can be guaranteed. We implement and evaluate our approach on agents trained for
several benchmark control problems.
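The core computation behind such finite-horizon guarantees, checking a probabilistic safety property on an abstraction, can be sketched by backward induction on a tiny hand-built Markov chain. The four-state abstraction and its transition matrix below are made up for illustration and are not MOSAIC itself:

```python
import numpy as np

# Toy abstraction of a controller's execution: 4 abstract states, state 3 is
# "unsafe" and absorbing; the controller's action choice is already folded into
# the transition matrix P, so the abstraction is a Markov chain.
P = np.array([
    [0.90, 0.08, 0.00, 0.02],
    [0.10, 0.80, 0.05, 0.05],
    [0.00, 0.10, 0.89, 0.01],
    [0.00, 0.00, 0.00, 1.00],
])
unsafe = 3
H = 20  # finite time horizon

# Backward induction: safe[i] = P(avoid the unsafe state for the remaining
# steps | current abstract state i), initialized at the end of the horizon.
safe = np.ones(4)
safe[unsafe] = 0.0
for _ in range(H):
    safe = P @ safe
    safe[unsafe] = 0.0

print(safe)  # per-initial-state probability of safe operation over H steps
```

Reading off `safe` per initial state is exactly the kind of result described above: bounds on the probability of safe operation for different initial configurations, with states where `safe` meets a threshold forming the region where correct behaviour can be guaranteed.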
Abstractions in Reasoning for Long-Term Autonomy
The path to building adaptive, robust, intelligent agents has led researchers to develop a suite of powerful models and algorithms for agents with a single objective. However, in recent years, attempts to use this monolithic approach to solve an ever-expanding set of complex real-world problems, which increasingly include long-term autonomous deployments, have illuminated challenges in its ability to scale. Consequently, a fragmented collection of hierarchical and multi-objective models was developed. This trend continues in the algorithms, each of which approximates an optimal solution in a different manner for scalability. These models and algorithms represent an attempt to solve pieces of an overarching problem: how can an agent explicitly model and integrate the necessary aspects of reasoning required to achieve long-term autonomy?
This thesis presents a general hierarchical and multi-objective model called a policy network that unifies prior fragmented solutions into a single graphical decision-making structure. Policy networks are broadly useful for solving numerous real-world problems. This thesis focuses on autonomous vehicle (AV) problems: (1) route-planning with multiple objectives; (2) semi-autonomy with proactive transfer of control; and (3) intersection decision-making for reasoning online about any number of other vehicles and pedestrians. Formal models are presented for each of the distinct problems. Solutions are evaluated using real-world map data in simulation and demonstrated on a fully operational AV prototype driving on real public roads. Policy networks serve as a shared underlying framework for all three, enabling their seamless integration as parts of an overall solution for rich, real-world, scalable decision-making in agents with long-term autonomy.
From Infinite to Finite Programs: Explicit Error Bounds with Applications to Approximate Dynamic Programming
We consider linear programming (LP) problems in infinite-dimensional spaces
that are in general computationally intractable. Under suitable assumptions, we
develop an approximation bridge from the infinite-dimensional LP to tractable
finite convex programs in which the performance of the approximation is
quantified explicitly. To this end, we adopt the recent developments in two
areas of randomized optimization and first-order methods, leading to a priori
as well as a posteriori performance guarantees. We illustrate the generality and
implications of our theoretical results in the special case of the long-run
average cost and discounted cost optimal control problems for Markov decision
processes on Borel spaces. The applicability of the theoretical results is
demonstrated through a constrained linear quadratic optimal control problem and
a fisheries management problem.
Comment: 30 pages, 5 figures
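For intuition about the finite end of such an approximation, the classical finite LP formulation of a small discounted MDP can be sketched and cross-checked against value iteration. The two-state kernel and rewards below are illustrative toys, not the paper's examples, and this sketch omits the Borel-space machinery and error bounds that are the paper's contribution:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-state, 2-action discounted MDP
gamma = 0.9
P = [np.array([[0.8, 0.2], [0.3, 0.7]]),    # P[a][s, s'] transition kernels
     np.array([[0.5, 0.5], [0.9, 0.1]])]
R = [np.array([1.0, 0.0]),                  # R[a][s] rewards
     np.array([0.0, 2.0])]

# Exact finite LP for the optimal value function:
#   min  sum_s v(s)   s.t.  v(s) >= R[a][s] + gamma * (P[a] v)(s)  for all s, a
# rewritten as  (gamma * P[a] - I) v <= -R[a].
A_ub = np.vstack([gamma * P[a] - np.eye(2) for a in range(2)])
b_ub = np.concatenate([-R[a] for a in range(2)])
res = linprog(c=np.ones(2), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2)
v = res.x

# Cross-check against value iteration on the same MDP
u = np.zeros(2)
for _ in range(2000):
    u = np.max([R[a] + gamma * P[a] @ u for a in range(2)], axis=0)
print(v, u)  # the LP solution and value iteration agree
```

In the infinite-dimensional setting of the abstract, the state space is a Borel space and the LP has infinitely many variables and constraints; the paper's contribution is quantifying explicitly how much a tractable finite program like the one above loses relative to that exact LP.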