Reinforcement Learning with Convex Constraints
In standard reinforcement learning (RL), a learning agent seeks to optimize
the overall reward. However, many key aspects of a desired behavior are more
naturally expressed as constraints. For instance, the designer may want to
limit the use of unsafe actions, increase the diversity of trajectories to
enable exploration, or approximate expert trajectories when rewards are sparse.
In this paper, we propose an algorithmic scheme that can handle a wide class of
constraints in RL tasks: specifically, any constraints that require expected
values of some vector measurements (such as the use of an action) to lie in a
convex set. This captures previously studied constraints (such as safety and
proximity to an expert), but also enables new classes of constraints (such as
diversity). Our approach comes with rigorous theoretical guarantees and only
relies on the ability to approximately solve standard RL tasks. As a result, it
can be easily adapted to work with any model-free or model-based RL algorithm.
In our experiments, we show that it matches previous algorithms that enforce
safety via constraints, but can also enforce new properties that these
algorithms do not incorporate, such as diversity.
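Concretely, such a scheme can be sketched as a repeated game between a policy player, served by the RL oracle, and a constraint player that penalizes the direction in which the average measurements violate the convex set. The following is a minimal numpy sketch under that reading; `solve_rl` is a hypothetical stand-in for the RL oracle, and the box set, step size, and response model are illustrative assumptions, not the paper's construction.
```python
import numpy as np

# Minimal sketch of constrained RL with a convex measurement set, assuming
# access to a standard RL oracle. `solve_rl` is a hypothetical stand-in that,
# given a penalty vector lam over the d-dimensional measurements, returns the
# expected measurements of an (approximately) optimal policy. Mocked here.
def solve_rl(lam, rng):
    # Stand-in: pretend the oracle shifts measurements away from the penalty.
    return np.clip(0.5 - 0.3 * lam + 0.05 * rng.standard_normal(lam.shape), 0, 1)

def project_to_box(z, lo=0.2, hi=0.4):
    # Projection onto the target convex set C (a box, for simplicity).
    return np.clip(z, lo, hi)

rng = np.random.default_rng(0)
d = 3
lam = np.zeros(d)      # "constraint player" penalty vector
avg_z = np.zeros(d)    # running average of measurements
eta = 0.5              # penalty step size
for t in range(1, 201):
    z = solve_rl(lam, rng)        # best response of the policy player
    avg_z += (z - avg_z) / t      # average measurement so far
    # Direction of violation: from the projection onto C toward avg_z.
    lam = eta * (avg_z - project_to_box(avg_z))

print("final average measurements:", np.round(avg_z, 3))
print("distance to constraint set:",
      np.linalg.norm(avg_z - project_to_box(avg_z)))
```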
Safe Reinforcement Learning of Control-Affine Systems with Vertex Networks
This paper focuses on finding reinforcement learning policies for control
systems with hard state and action constraints. Despite its success in many
domains, reinforcement learning is challenging to apply to problems with hard
constraints, especially if both the state variables and actions are
constrained. Previous works seeking to ensure constraint satisfaction, or
safety, have focused on adding a projection step to a learned policy. Yet, this
approach requires solving an optimization problem at every policy execution
step, which can lead to significant computational costs.
To tackle this problem, this paper proposes a new approach, termed Vertex
Networks (VNs), which guarantees safety both during exploration and for the
learned control policies by incorporating the safety constraints into the
policy network architecture. Leveraging the geometric property that all points within
a convex set can be represented as the convex combination of its vertices, the
proposed algorithm first learns the convex combination weights and then uses
these weights along with the pre-calculated vertices to output an action. The
output action is guaranteed to be safe by construction. Numerical examples
illustrate that the proposed VN algorithm outperforms vanilla reinforcement
learning in a variety of benchmark control tasks.
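The safe-by-construction step is simple to illustrate. Below is a minimal PyTorch sketch of the vertex-combination idea; the network class, layer sizes, and triangle vertices are illustrative assumptions, not the authors' exact architecture. The network outputs softmax weights over precomputed vertices, so every action it produces is a convex combination of them.
```python
import torch
import torch.nn as nn

# Sketch of a vertex-combination policy: the vertices of the safe polytope
# are assumed precomputed; the network only outputs convex-combination
# weights, so any action lies inside the polytope by construction.
class VertexPolicy(nn.Module):
    def __init__(self, state_dim, vertices):
        super().__init__()
        self.vertices = vertices                  # (num_vertices, act_dim)
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, vertices.shape[0]),
        )

    def forward(self, state):
        logits = self.net(state)
        weights = torch.softmax(logits, dim=-1)   # nonnegative, sums to 1
        return weights @ self.vertices            # convex combination

# Usage: a 2-D action constrained to the triangle with these vertices.
verts = torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
policy = VertexPolicy(state_dim=4, vertices=verts)
action = policy(torch.randn(4))
print(action)  # always inside the triangle
```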
Convergent Policy Optimization for Safe Reinforcement Learning
We study the safe reinforcement learning problem with nonlinear function
approximation, where policy optimization is formulated as a constrained
optimization problem with both the objective and the constraint being nonconvex
functions. For such a problem, we construct a sequence of surrogate convex
constrained optimization problems by replacing the nonconvex functions locally
with convex quadratic functions obtained from policy gradient estimators. We
prove that the solutions to these surrogate problems converge to a stationary
point of the original nonconvex problem. Furthermore, to extend our theoretical
results, we apply our algorithm to examples of optimal control and multi-agent
reinforcement learning with safety constraints.
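The surrogate construction admits a compact illustration. The following numpy sketch applies the same idea to toy deterministic functions standing in for policy-gradient estimates; the specific f, g, and the single linear constraint are assumptions made for brevity, not the paper's setting.
```python
import numpy as np

# At the current iterate, replace the nonconvex objective f and constraint g
# with convex quadratic models built from their gradients, then solve the
# surrogate exactly. f and g here are toy stand-ins.
def f(x):  return np.sin(x[0]) + 0.5 * x[1] ** 2       # objective (minimize)
def df(x): return np.array([np.cos(x[0]), x[1]])
def g(x):  return x[0] + x[1] - 1.0                    # constraint g(x) <= 0
def dg(x): return np.array([1.0, 1.0])

def surrogate_step(x, rho=1.0):
    # Solve: min_d df(x)^T d + (rho/2)||d||^2  s.t.  g(x) + dg(x)^T d <= 0.
    # A single linear constraint admits a closed-form KKT solution.
    grad, h, c = df(x), dg(x), g(x)
    d = -grad / rho                    # unconstrained minimizer
    viol = c + h @ d
    if viol > 0:                       # constraint active: add multiplier
        lam = rho * viol / (h @ h)
        d = -(grad + lam * h) / rho
    return x + d

x = np.array([2.0, 2.0])
for _ in range(50):
    x = surrogate_step(x)
print("iterate:", np.round(x, 4), "g(x) =", round(g(x), 6))
```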
Constrained episodic reinforcement learning in concave-convex and knapsack settings
We propose an algorithm for tabular episodic reinforcement learning with
constraints. We provide a modular analysis with strong theoretical guarantees
for settings with concave rewards and convex constraints, and for settings with
hard constraints (knapsacks). Most of the previous work in constrained
reinforcement learning is limited to linear constraints, and the remaining work
focuses on either the feasibility question or settings with a single episode.
Our experiments demonstrate that the proposed algorithm significantly
outperforms these approaches in existing constrained episodic environments.
Comment: The NeurIPS 2020 version of this paper includes a small bug, leading
to an incorrect dependence on H in Theorem 3.4. This version fixes it by
adjusting Eq. (9), Theorem 3.4 and the relevant proofs. Changes in the main
text are noted in red. Changes in the appendix are limited to Appendices B.1,
B.5, and B.6 and the statement of Lemma F.
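For intuition only, here is a schematic of what makes the knapsack (hard-constraint) setting different from expected-value constraints: a shared resource budget is consumed across episodes and may never be overshot. The per-episode rewards and costs below are synthetic placeholders, not the paper's algorithm.
```python
import numpy as np

# Schematic of the hard (knapsack) setting: episodes consume a shared
# resource budget, and interaction must stop once the budget would be spent.
rng = np.random.default_rng(1)
budget = 100.0
total_reward, episode = 0.0, 0
while budget > 0:
    # Hypothetical per-episode outcome; a real agent would pick a policy here.
    reward = rng.uniform(0, 1)
    cost = rng.uniform(0.5, 1.5)   # resource consumed by the episode
    if cost > budget:              # would overshoot: stop, the budget is hard
        break
    budget -= cost
    total_reward += reward
    episode += 1
print(f"episodes run: {episode}, reward: {total_reward:.1f}, "
      f"leftover budget: {budget:.2f}")
```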
Opinion shaping in social networks using reinforcement learning
In this paper, we study how to shape opinions in social networks when the
matrix of interactions is unknown. We consider classical opinion dynamics with
some stubborn agents and the possibility of continuously influencing the
opinions of a few selected agents, albeit under resource constraints. We map
the opinion dynamics to a value iteration scheme for policy evaluation for a
specific stochastic shortest path problem. This leads to a representation of
the opinion vector as an approximate value function for a stochastic shortest
path problem with some non-classical constraints. We suggest two possible ways
of influencing agents. One leads to a convex optimization problem and the other
to a non-convex one. Firstly, for both problems, we propose two different
online two-time scale reinforcement learning schemes that converge to the
optimal solution of each problem. Secondly, we suggest stochastic gradient
descent schemes and compare these classes of algorithms with the two-time scale
reinforcement learning schemes. Thirdly, we also derive another algorithm
designed to tackle the curse of dimensionality one faces when all agents are
observed. Numerical studies are provided to illustrate the convergence and
efficiency of our algorithms.
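For readers unfamiliar with two-time scale schemes, the following toy numpy sketch shows the generic mechanism: a fast iterate tracks a quantity that depends on a slow iterate, which in turn descends a noisy objective evaluated at the tracked value. The specific functions and step-size exponents are illustrative assumptions, not the opinion-dynamics setup of the paper.
```python
import numpy as np

# Generic two-timescale stochastic approximation on a toy problem: y is the
# fast variable tracking y*(x) = 2x; x is the slow variable minimizing
# J(x) = (y*(x) - 1)^2 using the tracked y in place of y*(x).
rng = np.random.default_rng(2)
x, y = 1.0, 0.0
for n in range(1, 20001):
    a_n = 1.0 / n ** 0.6   # fast step size (estimation timescale)
    b_n = 1.0 / n          # slow step size (control timescale), b_n << a_n
    # Fast update: noisy fixed-point iteration toward y*(x) = 2x.
    y += a_n * (2.0 * x - y + 0.1 * rng.standard_normal())
    # Slow update: gradient step on J using the tracked y.
    x -= b_n * (4.0 * (y - 1.0) + 0.1 * rng.standard_normal())
print(f"x ~ {x:.3f} (target 0.5), y ~ {y:.3f} (target 1.0)")
```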
Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces
In this paper, we set forth a new vision of reinforcement learning developed
by us over the past few years, one that yields mathematically rigorous
solutions to longstanding important questions that have remained unresolved:
(i) how to design reliable, convergent, and robust reinforcement learning
algorithms; (ii) how to guarantee that reinforcement learning satisfies
pre-specified "safety" guarantees, and remains in a stable region of the
parameter space; (iii) how to design "off-policy" temporal difference learning
algorithms in a reliable and stable manner; and finally (iv) how to integrate
the study of reinforcement learning into the rich theory of stochastic
optimization. In this paper, we provide detailed answers to all these questions
using the powerful framework of proximal operators.
The key idea that emerges is the use of primal-dual spaces connected through
a Legendre transform. This allows temporal difference updates to occur in dual
spaces, which brings a variety of important technical advantages. The
Legendre transform elegantly generalizes past algorithms for solving
reinforcement learning problems, such as natural gradient methods, which we
show relate closely to the previously unconnected framework of mirror descent
methods. Equally importantly, proximal operator theory enables the systematic
development of operator splitting methods that show how to safely and reliably
decompose complex products of gradients that occur in recent variants of
gradient-based temporal difference learning. This key technical innovation
makes it possible to finally design "true" stochastic gradient methods for
reinforcement learning. Finally, Legendre transforms enable a variety of other
benefits, including modeling sparsity and domain geometry. Our work builds
extensively on recent work on the convergence of saddle-point algorithms, and
on the theory of monotone operators.
Comment: 121 pages.
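The Legendre-transform mechanism can be illustrated in a few lines. The sketch below shows mirror descent on the probability simplex with the negative-entropy mirror map, whose dual-space gradient step becomes a multiplicative (exponentiated-gradient) update; the linear loss is an illustrative assumption, and this is the bare mechanism rather than the paper's full proximal TD machinery.
```python
import numpy as np

# Mirror descent on the simplex with the negative-entropy mirror map: map the
# iterate to the dual space, take the gradient step there, map back, and
# normalize. The resulting update is multiplicative (exponentiated gradient).
def mirror_descent_step(p, grad, eta=0.1):
    q = p * np.exp(-eta * grad)
    return q / q.sum()

# Usage: minimize a linear loss <c, p> over the probability simplex.
c = np.array([0.3, 0.1, 0.6])
p = np.ones(3) / 3
for _ in range(200):
    p = mirror_descent_step(p, c)
print(np.round(p, 3))   # mass concentrates on the smallest-cost coordinate
```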
Optimal Control Via Neural Networks: A Convex Approach
Control of complex systems involves both system identification and controller
design. Deep neural networks have proven to be successful in many
identification tasks; however, from a model-based control perspective, these
networks are difficult to work with because they are typically nonlinear and
nonconvex. Therefore, many systems are still identified and controlled based on
simple linear models despite their poor representation capability. In this
paper we bridge the gap between model accuracy and control tractability faced
by neural networks, by explicitly constructing networks that are convex with
respect to their inputs. We show that these input convex networks can be
trained to obtain accurate models of complex physical systems. In particular,
we design input convex recurrent neural networks to capture temporal behavior
of dynamical systems. Then optimal controllers can be achieved via solving a
convex model predictive control problem. Experimental results demonstrate the
potential of the proposed input convex neural network based approach in a
variety of control applications. In particular, we show that on the MuJoCo
locomotion tasks, we could achieve over 10% higher performance using 5x less
time compared with a state-of-the-art model-based reinforcement learning
method; and in the building HVAC control example, our method achieved up to
20% energy reduction compared with classic linear models.
Comment: Published as a conference paper at ICLR 2019:
https://openreview.net/forum?id=H1MW72AcK
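A minimal sketch of the input-convexity construction, under the usual ICNN conditions (nonnegative hidden-to-hidden weights, convex nondecreasing activations); the layer sizes and the omission of passthrough and recurrent connections are simplifications, not the paper's exact architecture.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Input convex neural network sketch: the z-path weights are kept
# nonnegative and ReLU is convex and nondecreasing, so the scalar output
# is convex in the input x.
class ICNN(nn.Module):
    def __init__(self, in_dim, hidden=32):
        super().__init__()
        self.Wx0 = nn.Linear(in_dim, hidden)
        self.Wz1 = nn.Linear(hidden, hidden, bias=False)   # must stay >= 0
        self.Wx1 = nn.Linear(in_dim, hidden)
        self.Wz2 = nn.Linear(hidden, 1, bias=False)        # must stay >= 0
        self.Wx2 = nn.Linear(in_dim, 1)

    def forward(self, x):
        # Clamp the z-path weights to preserve convexity in x.
        for layer in (self.Wz1, self.Wz2):
            layer.weight.data.clamp_(min=0.0)
        z = F.relu(self.Wx0(x))
        z = F.relu(self.Wz1(z) + self.Wx1(x))
        return self.Wz2(z) + self.Wx2(x)

f = ICNN(in_dim=2)
# Convexity check along a segment: f(mid) <= (f(a) + f(b)) / 2.
a, b = torch.randn(2), torch.randn(2)
print(f((a + b) / 2) <= (f(a) + f(b)) / 2)
```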
Differentiable MPC for End-to-end Planning and Control
We present foundations for using Model Predictive Control (MPC) as a
differentiable policy class for reinforcement learning in continuous state and
action spaces. This provides one way of leveraging and combining the advantages
of model-free and model-based approaches. Specifically, we differentiate
through MPC by using the KKT conditions of the convex approximation at a fixed
point of the controller. Using this strategy, we are able to learn the cost and
dynamics of a controller via end-to-end learning. Our experiments focus on
imitation learning in the pendulum and cartpole domains, where we learn the
cost and dynamics terms of an MPC policy class. We show that our MPC policies
are significantly more data-efficient than a generic neural network and that
our method is superior to traditional system identification in a setting where
the expert is unrealizable.
Comment: NeurIPS 2018.
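The differentiate-through-the-controller idea can be illustrated without a full MPC solver. The PyTorch sketch below replaces MPC with a one-step quadratic control problem whose argmin is available in closed form, so an imitation loss backpropagates into the cost parameters; the diagonal cost parameterization and synthetic expert action are assumptions made for brevity, not the paper's derivation.
```python
import torch

# Differentiating through a controller: the one-step quadratic problem
# min_u 0.5 u^T Q u + c^T u has the closed-form solution u* = -Q^{-1} c,
# so gradients of an imitation loss flow through the solve into Q.
q_log = torch.zeros(2, requires_grad=True)   # Q = diag(exp(q_log))
opt = torch.optim.Adam([q_log], lr=0.1)
expert_u = torch.tensor([0.4, -0.2])         # synthetic expert action
c = torch.tensor([-1.0, 0.5])                # linear cost term

for step in range(200):
    Q = torch.diag(torch.exp(q_log))
    u = -torch.linalg.solve(Q, c)            # controller's argmin
    loss = ((u - expert_u) ** 2).sum()       # imitation loss
    opt.zero_grad()
    loss.backward()                          # gradient flows through the solve
    opt.step()
print("learned action:", u.detach().numpy(), "target:", expert_u.numpy())
```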
A unified view of entropy-regularized Markov decision processes
We propose a general framework for entropy-regularized average-reward
reinforcement learning in Markov decision processes (MDPs). Our approach is
based on extending the linear-programming formulation of policy optimization in
MDPs to accommodate convex regularization functions. Our key result is showing
that using the conditional entropy of the joint state-action distributions as
regularization yields a dual optimization problem closely resembling the
Bellman optimality equations. This result enables us to formalize a number of
state-of-the-art entropy-regularized reinforcement learning algorithms as
approximate variants of Mirror Descent or Dual Averaging, and thus to argue
about the convergence properties of these methods. In particular, we show that
the exact version of the TRPO algorithm of Schulman et al. (2015) actually
converges to the optimal policy, while the entropy-regularized policy gradient
methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally,
we illustrate empirically the effects of using various regularization
techniques on learning performance in a simple reinforcement learning setup.
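The Bellman-like structure of the dual can be seen in a small example. The numpy sketch below runs soft value iteration on a random MDP (discounted rather than average-reward, for brevity): the hard max over actions becomes a log-sum-exp at temperature tau, and the induced policy is a softmax of the advantages. The MDP itself is randomly generated for illustration.
```python
import numpy as np

# Soft (entropy-regularized) value iteration on a random discounted MDP:
# the max over actions in the Bellman backup is replaced by a log-sum-exp.
rng = np.random.default_rng(3)
nS, nA, gamma, tau = 4, 3, 0.9, 0.5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a next-state dist
R = rng.uniform(size=(nS, nA))

V = np.zeros(nS)
for _ in range(500):
    Q = R + gamma * P @ V                           # Q[s, a]
    V = tau * np.log(np.exp(Q / tau).sum(axis=1))   # soft max over actions
pi = np.exp((Q - V[:, None]) / tau)                 # induced softmax policy
print("soft values:", np.round(V, 3))
print("policy rows sum to 1:", np.allclose(pi.sum(axis=1), 1.0))
```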
Generative Adversarial Imitation Learning
Consider learning a policy from example expert behavior, without interaction
with the expert or access to reinforcement signal. One approach is to recover
the expert's cost function with inverse reinforcement learning, then extract a
policy from that cost function with reinforcement learning. This approach is
indirect and can be slow. We propose a new general framework for directly
extracting a policy from data, as if it were obtained by reinforcement learning
following inverse reinforcement learning. We show that a certain instantiation
of our framework draws an analogy between imitation learning and generative
adversarial networks, from which we derive a model-free imitation learning
algorithm that obtains significant performance gains over existing model-free
methods in imitating complex behaviors in large, high-dimensional environments.
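The adversarial core of such an approach is compact. The PyTorch sketch below trains a discriminator to separate expert from policy state-action pairs and turns its output into a surrogate reward under one common sign convention; the network sizes, synthetic batches, and the omitted policy-improvement step (TRPO in the paper) are simplifications, not the full algorithm.
```python
import torch
import torch.nn as nn

# Adversarial imitation core: a discriminator D separates expert (s, a) pairs
# from the policy's, and -log(1 - D) serves as a surrogate reward.
D = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

expert_sa = torch.randn(64, 4) + 1.0   # stand-in expert (s, a) batch
policy_sa = torch.randn(64, 4)         # stand-in policy (s, a) batch

for _ in range(100):
    logits_e, logits_p = D(expert_sa), D(policy_sa)
    # Discriminator update: label expert pairs 1, policy pairs 0.
    loss = (bce(logits_e, torch.ones_like(logits_e))
            + bce(logits_p, torch.zeros_like(logits_p)))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Surrogate reward for the (omitted) policy update: encourage fooling D.
reward = -torch.log(1 - torch.sigmoid(D(policy_sa)) + 1e-8)
print("mean surrogate reward:", reward.mean().item())
```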