Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces
In this paper, we set forth a new vision of reinforcement learning developed
by us over the past few years, one that yields mathematically rigorous
solutions to longstanding open questions: (i) how to design reliable,
convergent, and robust reinforcement learning algorithms; (ii) how to
guarantee that reinforcement learning satisfies pre-specified "safety"
guarantees and remains in a stable region of the parameter space; (iii) how
to design "off-policy" temporal difference learning algorithms in a reliable
and stable manner; and finally (iv) how to integrate the study of
reinforcement learning into the rich theory of stochastic optimization. We
provide detailed answers to all these questions
using the powerful framework of proximal operators.
The key idea that emerges is the use of primal-dual spaces connected through
a Legendre transform. This allows temporal difference updates to occur in
dual spaces, which confers a variety of important technical advantages. The
Legendre transform elegantly generalizes past algorithms for solving
reinforcement learning problems, such as natural gradient methods, which we
show relate closely to the previously unconnected framework of mirror descent
methods. Equally importantly, proximal operator theory enables the systematic
development of operator splitting methods that show how to safely and reliably
decompose complex products of gradients that occur in recent variants of
gradient-based temporal difference learning. This key technical innovation
makes it possible to finally design "true" stochastic gradient methods for
reinforcement learning. Finally, Legendre transforms enable a variety of other
benefits, including modeling sparsity and domain geometry. Our approach builds
extensively on recent work on the convergence of saddle-point algorithms, and
on the theory of monotone operators.
Comment: 121 pages
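To make the primal-dual mechanics concrete, the sketch below shows a mirror-descent TD(0) step with the p-norm Legendre link function. It is a minimal illustration under our own assumptions (linear value features, a fixed step size, a hand-picked p), not the paper's exact algorithm.

```python
import numpy as np

# Minimal sketch of mirror-descent TD(0) in primal-dual spaces, assuming a
# linear value function V(s) = phi(s) @ theta and the p-norm potential
# f(x) = 0.5 * ||x||_p^2, whose gradient acts as the Legendre link function.

def grad_pnorm(x, p):
    """Gradient of 0.5 * ||x||_p^2; maps between primal and dual spaces."""
    norm = np.linalg.norm(x, ord=p)
    if norm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (p - 1) / norm ** (p - 2)

def mirror_td0_step(theta, phi_s, phi_next, reward, alpha=0.1, gamma=0.99, p=2.0):
    q = p / (p - 1.0)                                 # conjugate exponent
    delta = reward + gamma * phi_next @ theta - phi_s @ theta   # TD error
    y = grad_pnorm(theta, p)                          # map weights to dual space
    y = y + alpha * delta * phi_s                     # TD update in dual space
    return grad_pnorm(y, q)                           # Legendre map back to primal
```

With p = 2 the link function is the identity and the step reduces to ordinary TD(0); other choices of p reshape the geometry of the updates, one of the benefits the abstract attributes to the Legendre transform.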
Boosting the Actor with Dual Critic
This paper proposes a new actor-critic-style algorithm called Dual
Actor-Critic, or Dual-AC. It is derived in a principled way from the
Lagrangian dual form of the Bellman optimality equation, which can be viewed
as a two-player game between the actor and a critic-like function that we
name the dual critic. Compared to its actor-critic relatives, Dual-AC has the
desired
property that the actor and dual critic are updated cooperatively to optimize
the same objective function, providing a more transparent way for learning the
critic that is directly related to the objective function of the actor. We then
provide a concrete algorithm that can effectively solve the minimax
optimization problem, using multi-step bootstrapping, path regularization,
and a stochastic dual ascent algorithm. We demonstrate that the proposed
algorithm achieves state-of-the-art performance across several benchmarks.
Comment: 21 pages, 9 figures
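For intuition, the two-player game can be written schematically as the Lagrangian of the linear-programming form of the Bellman optimality equation, with μ the initial-state distribution and the dual variable ρ playing the role of the dual critic (a schematic form only; the paper's actual objective adds path regularization and parameterizations):

```latex
\min_{V}\ \max_{\rho \ge 0}\;
(1-\gamma)\,\mathbb{E}_{s \sim \mu}\!\big[V(s)\big]
\;+\; \sum_{s,a} \rho(s,a)\,\Big( r(s,a) + \gamma\,\mathbb{E}_{s' \mid s,a}\!\big[V(s')\big] - V(s) \Big)
```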
An Actor-Critic Contextual Bandit Algorithm for Personalized Mobile Health Interventions
Increasing technological sophistication and widespread use of smartphones and
wearable devices provide opportunities for innovative and highly personalized
health interventions. A Just-In-Time Adaptive Intervention (JITAI) uses
real-time data collection and communication capabilities of modern mobile
devices to deliver interventions in real-time that are adapted to the
in-the-moment needs of the user. The lack of methodological guidance in
constructing data-based JITAIs remains a hurdle in advancing JITAI research
despite the increasing popularity of JITAIs among clinical scientists. In this
article, we make a first attempt to bridge this methodological gap by
formulating the task of tailoring interventions in real-time as a contextual
bandit problem. Interpretability requirements in the domain of mobile health
lead us to formulate the problem differently from existing formulations
intended for web applications such as ad or news article placement. Under
the assumption of a linear reward function, we choose the reward function
(the "critic") parameterization separately from a lower-dimensional
parameterization of the stochastic policy (the "actor"). We provide an online
actor-critic
algorithm that guides the construction and refinement of a JITAI. Asymptotic
properties of the actor-critic algorithm are established and supported by
numerical experiments. Additional numerical experiments are conducted to test
the robustness of the algorithm when the idealized assumptions used in the
analysis of the contextual bandit algorithm are violated.
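As a rough sketch of the actor-critic split described above — a linear reward critic parameterized separately from a lower-dimensional stochastic actor — consider the toy two-action version below. The logistic policy, LMS critic step, and step sizes are our illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

class BanditActorCritic:
    """Toy online actor-critic for a two-action contextual bandit."""

    def __init__(self, dim, lr_critic=0.1, lr_actor=0.05):
        self.W = np.zeros((2, dim))   # critic: one linear reward model per action
        self.theta = np.zeros(dim)    # actor: logistic policy parameters
        self.lr_c, self.lr_a = lr_critic, lr_actor

    def act(self, x, rng=np.random):
        p1 = 1.0 / (1.0 + np.exp(-x @ self.theta))   # P(a = 1 | context x)
        a = int(rng.random() < p1)
        return a, p1

    def update(self, x, a, r, p1):
        # Critic: least-mean-squares step toward the observed reward.
        self.W[a] += self.lr_c * (r - x @ self.W[a]) * x
        # Actor: policy-gradient step using the critic's estimate for the
        # action actually taken.
        q_a = x @ self.W[a]
        grad_logpi = (a - p1) * x     # score function of the logistic policy
        self.theta += self.lr_a * q_a * grad_logpi
```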
Bridging the Gap Between Value and Policy Based Reinforcement Learning
We establish a new connection between value and policy based reinforcement
learning (RL) based on a relationship between softmax temporal value
consistency and policy optimality under entropy regularization. Specifically,
we show that softmax consistent action values correspond to optimal entropy
regularized policy probabilities along any action sequence, regardless of
provenance. From this observation, we develop a new RL algorithm, Path
Consistency Learning (PCL), that minimizes a notion of soft consistency error
along multi-step action sequences extracted from both on- and off-policy
traces. We examine the behavior of PCL in different scenarios and show that PCL
can be interpreted as generalizing both actor-critic and Q-learning algorithms.
We subsequently deepen the relationship by showing how a single model can be
used to represent both a policy and the corresponding softmax state values,
eliminating the need for a separate critic. The experimental evaluation
demonstrates that PCL significantly outperforms strong actor-critic and
Q-learning baselines across several benchmarks.
Comment: NIPS 2017
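A minimal sketch of the soft consistency error that PCL minimizes along a d-step sub-trajectory, in our notation (v holds the d + 1 state values along the path, logpi the log probabilities of the actions taken, and tau is the entropy temperature):

```python
import numpy as np

def path_consistency_error(v, rewards, logpi, gamma=0.99, tau=0.1):
    """Soft consistency error along a d-step path; PCL minimizes its square.

    v:       state values [V(s_t), ..., V(s_{t+d})]  (length d + 1)
    rewards: observed rewards along the path          (length d)
    logpi:   log pi(a_i | s_i) for the actions taken  (length d)
    """
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    soft_return = np.sum(discounts * (np.asarray(rewards) - tau * np.asarray(logpi)))
    return -v[0] + gamma ** d * v[-1] + soft_return
```

When the error is zero along every path, the values are softmax-consistent and the policy is the optimal entropy-regularized policy, which is the correspondence the abstract describes.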
Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework
We approach continuous-time mean-variance (MV) portfolio selection with
reinforcement learning (RL). The problem is to achieve the best tradeoff
between exploration and exploitation, and is formulated as an
entropy-regularized, relaxed stochastic control problem. We prove that the
optimal feedback policy for this problem must be Gaussian, with time-decaying
variance. We then establish connections between the entropy-regularized MV and
the classical MV, including the solvability equivalence and the convergence
as the exploration weighting parameter decays to zero. Finally, we prove a
policy improvement theorem, based on which we devise an implementable RL
algorithm. We find that our algorithm outperforms both an adaptive-control-based
method and a deep-neural-network-based algorithm by a large margin in our
simulations.
Comment: 39 pages, 5 figures
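Schematically, the optimal feedback policy has the Gaussian form below, where the constants depend on the market parameters and the exploration weight, and the variance decays as t approaches the horizon T (we indicate only the shape; the exact coefficients are derived in the paper):

```latex
\pi^*(u \mid t, x) \;=\; \mathcal{N}\!\big(u \;\big|\; c_1\,(x - c_2),\; c_3\, e^{\,c_4 (T - t)}\big),
\qquad c_3,\, c_4 > 0
```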
Two-stage Deep Reinforcement Learning for Inverter-based Volt-VAR Control in Active Distribution Networks
Model-based Volt/VAR optimization is widely used to eliminate voltage
violations and reduce network losses. However, the parameters of active
distribution networks (ADNs) are often not identified on site, so significant
errors may enter the model and render the model-based method infeasible. To
cope with this critical issue, we propose a novel two-stage deep reinforcement
learning (DRL) method to improve the voltage profile by regulating
inverter-based energy resources; the method consists of an offline stage and
an online stage. In the offline stage, a highly efficient adversarial
reinforcement learning algorithm is developed to train an offline agent that
is robust to model mismatch. In the subsequent online stage, the offline
agent is safely transferred to serve as the online agent, which continues to
learn and control online with significantly improved safety and efficiency.
Numerical simulations on IEEE
test cases not only demonstrate that the proposed adversarial reinforcement
learning algorithm outperforms the state-of-the-art algorithm, but also show
that our proposed two-stage method achieves much better performance than
existing DRL-based methods in the online application.
Comment: 8 pages
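To illustrate the offline adversarial idea in miniature: an adversary perturbs the uncertain model parameters, within a budget, in the direction that degrades the agent's decisions, so an agent trained against it becomes robust to model mismatch. Everything below (the toy cost, names, budgets) is our own construction, not the authors' environment or algorithm.

```python
import numpy as np

class PerturbedModel:
    """Toy stand-in for an inaccurately parameterized network model."""

    def __init__(self, nominal_params):
        self.nominal = np.asarray(nominal_params, dtype=float)
        self.delta = np.zeros_like(self.nominal)   # adversary's perturbation

    def cost(self, action):
        # Toy stand-in for a voltage-deviation cost under perturbed parameters.
        params = self.nominal + self.delta
        return float(np.sum((params * action - 1.0) ** 2))

def adversarial_step(model, action, adv_lr=0.01, eps=0.1):
    """Adversary nudges parameters (within budget eps) to raise the cost."""
    # Gradient of the toy cost with respect to the model parameters.
    grad = 2.0 * (model.nominal + model.delta) * action ** 2 - 2.0 * action
    model.delta = np.clip(model.delta + adv_lr * grad, -eps, eps)
    return model.cost(action)
```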
Reinforcement Learning with Deep Energy-Based Policies
We propose a method for learning expressive energy-based policies for
continuous states and actions, which has previously been feasible only in
tabular domains. We apply our method to learning maximum entropy policies,
resulting in a new algorithm, called soft Q-learning, that expresses the
optimal policy
via a Boltzmann distribution. We use the recently proposed amortized Stein
variational gradient descent to learn a stochastic sampling network that
approximates samples from this distribution. The benefits of the proposed
algorithm include improved exploration and compositionality that allows
transferring skills between tasks, which we confirm in simulated experiments
with swimming and walking robots. We also draw a connection to actor-critic
methods, which can be viewed as performing approximate inference on the
corresponding energy-based model.
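For intuition, here is the Boltzmann form of the optimal maximum-entropy policy for a discrete action set (the paper targets continuous actions, where sampling requires the amortized Stein variational network; this discrete sketch with temperature alpha is ours):

```python
import numpy as np

def soft_value(q_values, alpha=1.0):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    z = np.asarray(q_values) / alpha
    m = np.max(z)
    return alpha * (m + np.log(np.sum(np.exp(z - m))))

def boltzmann_policy(q_values, alpha=1.0):
    """pi(a | s) = exp((Q(s, a) - V(s)) / alpha): the energy-based policy."""
    return np.exp((np.asarray(q_values) - soft_value(q_values, alpha)) / alpha)
```

As alpha shrinks toward zero the distribution concentrates on the greedy action and ordinary Q-learning is recovered; larger alpha keeps the multi-modal exploration behavior the abstract highlights.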
Relative Entropy Regularized Policy Iteration
We present an off-policy actor-critic algorithm for Reinforcement Learning
(RL) that combines ideas from gradient-free optimization via stochastic
search with a learned action-value function. The result is a simple procedure
consisting
of three steps: i) policy evaluation by estimating a parametric action-value
function; ii) policy improvement via the estimation of a local non-parametric
policy; and iii) generalization by fitting a parametric policy. Each step can
be implemented in different ways, giving rise to several algorithm variants.
Our algorithm draws on connections to existing literature on black-box
optimization and 'RL as inference', and it can be seen either as an extension
of the Maximum a Posteriori Policy Optimisation algorithm (MPO) [Abdolmaleki et
al., 2018a], or as an extension of Trust Region Covariance Matrix Adaptation
Evolutionary Strategy (CMA-ES) [Abdolmaleki et al., 2017b; Hansen et al., 1997]
to a policy iteration scheme. Our comparison on 31 continuous control tasks
from the parkour suite [Heess et al., 2017], the DeepMind control suite
[Tassa et al., 2018], and OpenAI Gym [Brockman et al., 2016], with diverse
properties, a limited amount of compute, and a single set of hyperparameters,
demonstrates the effectiveness of our method and its state-of-the-art
results. Videos summarizing the results can be found at goo.gl/HtvJKR.
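A toy discrete-action sketch of steps ii) and iii): re-weighting the current policy by exponentiated action values (the closed-form solution of a KL-regularized improvement step) and then fitting the parametric policy by weighted maximum likelihood. The temperature eta, step size, and softmax parameterization are our assumptions, not the paper's continuous-control updates.

```python
import numpy as np

def local_nonparametric_policy(q_values, prior_probs, eta=1.0):
    """Step ii): closed-form KL-regularized improvement of the current policy."""
    w = np.asarray(prior_probs) * np.exp(np.asarray(q_values) / eta)
    return w / np.sum(w)

def fit_parametric_policy(logits, target_probs, lr=0.1):
    """Step iii): one cross-entropy gradient step toward the local target."""
    logits = np.asarray(logits, dtype=float)
    p = np.exp(logits - np.max(logits))
    p /= p.sum()                                  # current softmax policy
    return logits + lr * (np.asarray(target_probs) - p)
```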
Merging Deterministic Policy Gradient Estimations with Varied Bias-Variance Tradeoff for Effective Deep Reinforcement Learning
Deep reinforcement learning (DRL) on Markov decision processes (MDPs) with
continuous action spaces is often approached by directly training parametric
policies along the direction of estimated policy gradients (PGs). Previous
research revealed that the performance of these PG algorithms depends heavily
on the bias-variance tradeoffs involved in estimating and using PGs. A notable
approach towards balancing this tradeoff is to merge both on-policy and
off-policy gradient estimations. However, existing PG merging methods can be
sample-inefficient and are not suitable for training deterministic policies
directly. To address these issues, this paper introduces elite PGs and
strengthens their variance reduction effect by adopting elitism and policy
consolidation techniques to regularize policy training based on policy
behavioral knowledge extracted from elite trajectories. Meanwhile, we propose a
two-step method to merge elite PGs and conventional PGs as a new extension of
the conventional interpolation merging method. At both the theoretical and
experimental levels, we show that both two-step merging and interpolation
merging can induce varied bias-variance tradeoffs during policy training. They
enable us to effectively use elite PGs and mitigate their performance impact on
trained policies. Our experiments also show that two-step merging can
outperform interpolation merging and several state-of-the-art algorithms on six
benchmark control tasks.
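The conventional interpolation merging that the two-step method extends is simply a convex combination of the on-policy and off-policy gradient estimates; in a one-function sketch (the coefficient name nu is ours):

```python
import numpy as np

def interpolation_merge(g_on, g_off, nu=0.5):
    """Convex combination of on-/off-policy PG estimates (0 <= nu <= 1).

    nu trades the lower bias of g_on against the lower variance of g_off.
    """
    return nu * np.asarray(g_on) + (1.0 - nu) * np.asarray(g_off)
```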
A Convergence Result for Regularized Actor-Critic Methods
In this paper, we present a probability-one convergence proof, under suitable
conditions, of a certain class of actor-critic algorithms for finding
approximate solutions to entropy-regularized MDPs using the machinery of
stochastic approximation. To obtain this overall result, we prove the
convergence of policy evaluation with general regularizers when using linear
approximation architectures and show convergence of entropy-regularized policy
improvement.
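As standard background for the entropy-regularized setting (not a result specific to this paper): with temperature τ, the regularized Bellman optimality operator admits the log-sum-exp closed form

```latex
(\mathcal{T}_{\tau} V)(s)
= \max_{\pi(\cdot \mid s)} \sum_{a} \pi(a \mid s)\Big( r(s,a) + \gamma\,\mathbb{E}_{s'}\big[V(s')\big] \Big)
  + \tau\,\mathcal{H}\big(\pi(\cdot \mid s)\big)
= \tau \log \sum_{a} \exp\!\Big( \tfrac{1}{\tau}\big( r(s,a) + \gamma\,\mathbb{E}_{s'}[V(s')] \big) \Big)
```

and the maximizing policy is the corresponding Boltzmann distribution, which is the target of the entropy-regularized policy improvement step whose convergence the paper establishes.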