Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms
The actor-critic (AC) algorithm is a popular method to find an optimal policy
in reinforcement learning. In the infinite horizon scenario, the finite-sample
convergence rate for the AC and natural actor-critic (NAC) algorithms has been
established recently, but under independent and identically distributed
(i.i.d.) sampling and a single-sample update at each iteration. In contrast, this
paper characterizes the convergence rate and sample complexity of AC and NAC
under Markovian sampling, with mini-batch data for each iteration, and with
the actor using a general policy class approximation. We show that the overall
sample complexity for a mini-batch AC to attain an ε-accurate
stationary point improves upon the best known sample complexity of AC, and the
overall sample complexity for a mini-batch NAC to attain an ε-accurate
globally optimal point improves upon the existing sample complexity of NAC.
Moreover, the sample complexity of AC and NAC characterized in this work
outperforms that of policy gradient (PG) and natural policy gradient (NPG).
This is the first theoretical study establishing that AC and NAC attain
orderwise performance improvement over PG and NPG in the infinite-horizon
setting due to the incorporation of the critic.
Comment: Accepted by NeurIPS 2020
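As a concrete illustration of the mini-batch actor-critic template this abstract studies, the sketch below runs a softmax actor and a TD(0) critic on a small random tabular MDP, updating both from mini-batches of Markovian samples. All sizes, stepsizes, and names are hypothetical and not taken from the paper.

```python
import numpy as np

# Minimal mini-batch actor-critic sketch on a random tabular MDP.
# Everything here (sizes, stepsizes) is illustrative, not the paper's setup.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.uniform(size=(S, A))                 # deterministic rewards r(s, a)
theta = np.zeros((S, A))                     # softmax actor parameters
V = np.zeros(S)                              # tabular critic

def policy(s):
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

s = 0
for _ in range(2000):
    batch = []
    for _ in range(16):                      # one mini-batch of Markovian samples
        a = rng.choice(A, p=policy(s))
        s_next = rng.choice(S, p=P[s, a])
        batch.append((s, a, R[s, a], s_next))
        s = s_next
    for (si, ai, r, sn) in batch:            # critic: TD(0) updates
        V[si] += 0.05 * (r + gamma * V[sn] - V[si])
    for (si, ai, r, sn) in batch:            # actor: advantage-weighted score ascent
        adv = r + gamma * V[sn] - V[si]
        score = -policy(si); score[ai] += 1.0  # grad of log softmax at (si, ai)
        theta[si] += 0.01 * adv * score
```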
Softmax Policy Gradient Methods Can Take Exponential Time to Converge
The softmax policy gradient (PG) method, which performs gradient ascent under
softmax policy parameterization, is arguably one of the de facto
implementations of policy optimization in modern reinforcement learning. For
γ-discounted infinite-horizon tabular Markov decision processes (MDPs),
remarkable progress has recently been achieved towards establishing global
convergence of softmax PG methods in finding a near-optimal policy. However,
prior results fall short of delineating clear dependencies of convergence rates
on salient parameters such as the cardinality of the state space |S|
and the effective horizon 1/(1-γ), both of which could be
excessively large. In this paper, we deliver a pessimistic message regarding
the iteration complexity of softmax PG methods, despite assuming access to
exact gradient computation. Specifically, we demonstrate that the softmax PG
method with a fixed stepsize can take a number of iterations exponential in
the effective horizon to converge, even in the presence of a benign policy
initialization and an initial state distribution amenable to exploration (so
that the distribution mismatch coefficient is not exceedingly large). This is
accomplished by characterizing the algorithmic dynamics over a
carefully-constructed MDP containing only three actions. Our exponential lower
bound hints at the necessity of carefully adjusting update rules or enforcing
proper regularization in accelerating PG methods.
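For reference, here is a minimal sketch of the exact-gradient softmax PG iteration the abstract analyzes, run on an arbitrary small tabular MDP rather than the paper's hard three-action construction; the update is the standard policy gradient theorem under softmax parameterization.

```python
import numpy as np

# Exact softmax policy gradient ascent on a small random tabular MDP.
# Generic update only; this is not the paper's lower-bound instance.
rng = np.random.default_rng(1)
S, A, gamma, eta = 4, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))
rho = np.ones(S) / S                         # initial state distribution
theta = np.zeros((S, A))

for _ in range(500):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = z / z.sum(axis=1, keepdims=True)
    P_pi = np.einsum("sa,sat->st", pi, P)    # state transition matrix under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = R + gamma * P @ V
    adv = Q - V[:, None]
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)
    theta += eta * d[:, None] * pi * adv / (1 - gamma)  # policy gradient theorem
```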
Scalable Bilinear π Learning Using State and Action Features
Approximate linear programming (ALP) represents one of the major algorithmic
families to solve large-scale Markov decision processes (MDP). In this work, we
study a primal-dual formulation of the ALP, and develop a scalable, model-free
algorithm called bilinear π learning for reinforcement learning when a
sampling oracle is provided. This algorithm enjoys a number of advantages.
First, it adopts (bi)linear models to represent the high-dimensional value
function and state-action distributions, using given state and action features.
Its run-time complexity depends on the number of features, not the size of the
underlying MDPs. Second, it operates in a fully online fashion without having
to store any sample, thus having minimal memory footprint. Third, we prove that
it is sample-efficient, solving for the optimal policy to high precision with a
sample complexity linear in the dimension of the parameter space.
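To make the representational idea concrete, the sketch below shows the kind of (bi)linear parameterization the abstract describes: a value function linear in given state features and a state-action object bilinear in state and action features. The feature maps are stand-ins, and the paper's actual update rules are not reproduced.

```python
import numpy as np

# (Bi)linear parameterization with given state/action features (illustrative).
d_s, d_a = 8, 4                                    # feature dimensions
phi = lambda s: np.cos(np.arange(1, d_s + 1) * s)  # stand-in state feature map
psi = lambda a: np.sin(np.arange(1, d_a + 1) * a)  # stand-in action feature map

w = np.zeros(d_s)                       # value weights: V(s) ~ phi(s) . w
M = np.ones((d_s, d_a)) / (d_s * d_a)   # bilinear weights for state-action terms

def value(s):
    return phi(s) @ w

def state_action_score(s, a):
    return phi(s) @ M @ psi(a)          # bilinear form phi(s)^T M psi(a)

# Per-sample cost depends only on d_s and d_a, never on the size of the
# underlying MDP, which is what makes the approach scalable and fully online.
```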
A Tour of Reinforcement Learning: The View from Continuous Control
This manuscript surveys reinforcement learning from the perspective of
optimization and control with a focus on continuous control applications. It
surveys the general formulation, terminology, and typical experimental
implementations of reinforcement learning and reviews competing solution
paradigms. In order to compare the relative merits of various techniques, this
survey presents a case study of the Linear Quadratic Regulator (LQR) with
unknown dynamics, perhaps the simplest and best-studied problem in optimal
control. The manuscript describes how merging techniques from learning theory
and control can provide non-asymptotic characterizations of LQR performance and
shows that these characterizations tend to match experimental behavior. In
turn, when revisiting more complex applications, many of the observed phenomena
in LQR persist. In particular, theory and experiment demonstrate the role and
importance of models and the cost of generality in reinforcement learning
algorithms. This survey concludes with a discussion of some of the challenges
in designing learning systems that safely and reliably interact with complex
and uncertain environments and how tools from reinforcement learning and
control might be combined to approach these challenges.
Comment: minor revision with a few clarifying passages and corrected typos
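Since LQR is the survey's running example, a minimal sketch of the classical baseline may help: computing the optimal state-feedback gain for a known discrete-time system by iterating the Riccati recursion. This is textbook material, not code from the survey, and the system below is an illustrative double integrator.

```python
import numpy as np

# Infinite-horizon discrete-time LQR: minimize sum_t x_t'Q x_t + u_t'R u_t
# subject to x_{t+1} = A x_t + B u_t, for known (A, B).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])     # illustrative double-integrator dynamics
B = np.array([[0.0],
              [0.1]])
Q, R = np.eye(2), np.array([[1.0]])

P = Q.copy()
for _ in range(500):           # Riccati fixed-point iteration
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)

control = lambda x: -K @ x     # optimal policy u = -K x
```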
Batch Policy Learning under Constraints
When learning policies for real-world domains, two important questions arise:
(i) how to efficiently use pre-collected off-policy, non-optimal behavior data;
and (ii) how to mediate among different competing objectives and constraints.
We thus study the problem of batch policy learning under multiple constraints,
and offer a systematic solution. We first propose a flexible meta-algorithm
that admits any batch reinforcement learning and online learning procedure as
subroutines. We then present a specific algorithmic instantiation and provide
performance guarantees for the main objective and all constraints. To certify
constraint satisfaction, we propose a new and simple method for off-policy
policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves
strong empirical results in different domains, including in a challenging
problem of simulated car driving subject to multiple constraints such as lane
keeping and smooth driving. We also show experimentally that our OPE method
outperforms other popular OPE techniques on a standalone basis, especially in a
high-dimensional setting.
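As context for the OPE comparison, here is a minimal sketch of per-trajectory importance-sampling OPE, one of the standard baselines such a method is compared against. It is not the paper's proposed estimator, and the interfaces are hypothetical.

```python
import numpy as np

# Per-trajectory importance-sampling OPE (a standard baseline, not the
# paper's method). pi_e / pi_b return action probabilities under the
# evaluation / behavior policy, which are assumed to be known.
def is_ope(trajectories, pi_e, pi_b, gamma=0.99):
    estimates = []
    for traj in trajectories:            # traj: list of (state, action, reward)
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(s, a) / pi_b(s, a)  # cumulative likelihood ratio
            ret += gamma ** t * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))     # unbiased estimate of the target return
```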
Learning Efficient Representations for Reinforcement Learning
Markov decision processes (MDPs) are a well studied framework for solving
sequential decision making problems under uncertainty. Exact methods for
solving MDPs based on dynamic programming such as policy iteration and value
iteration are effective on small problems. In problems with a large discrete
state space or with continuous state spaces, a compact representation is
essential for providing efficient approximate solutions to MDPs. Commonly
used approximation algorithms involve constructing basis functions for
projecting the value function onto a low-dimensional subspace, and building a
factored or hierarchical graphical model to decompose the transition and reward
functions. However, hand-coding a good compact representation for a given
reinforcement learning (RL) task can be quite difficult and time consuming.
Recent approaches have attempted to automatically discover efficient
representations for RL.
In this thesis proposal, we discuss the problem of automatically
constructing a structured kernel for kernel-based RL, a popular approach to
learning non-parametric approximations of the value function. We explore a space
of kernel structures which are built compositionally from base kernels using a
context-free grammar. We examine a greedy algorithm for searching over the
structure space. To demonstrate how the learned structure can represent and
approximate the original RL problem in terms of compactness and efficiency, we
plan to evaluate our method on a synthetic problem and compare it to other RL
baselines.
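The compositional search space is easy to make concrete: base kernels combined by sum and product, as generated by a small grammar. The sketch below is illustrative only; the thesis's base kernels and greedy search are richer.

```python
import numpy as np

# Base kernels and composition rules of a context-free kernel grammar
# (illustrative; kernel choices and search are simplified).
rbf = lambda ls: (lambda x, y: np.exp(-(x - y) ** 2 / (2 * ls ** 2)))
lin = lambda: (lambda x, y: x * y)
add = lambda k1, k2: (lambda x, y: k1(x, y) + k2(x, y))
mul = lambda k1, k2: (lambda x, y: k1(x, y) * k2(x, y))

# One structure a greedy search over the grammar might visit: (RBF + Lin) * RBF.
k = mul(add(rbf(1.0), lin()), rbf(5.0))
print(k(0.3, 0.7))   # kernel evaluation between two scalar states
```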
Stochastic Policy Gradient Ascent in Reproducing Kernel Hilbert Spaces
Reinforcement learning consists of finding policies that maximize an expected
cumulative long-term reward in a Markov decision process with unknown
transition probabilities and instantaneous rewards. In this paper, we consider
the problem of finding such optimal policies while assuming they are continuous
functions belonging to a reproducing kernel Hilbert space (RKHS). To learn the
optimal policy we introduce a stochastic policy gradient ascent algorithm with
three novel features: (i) The stochastic estimates of policy gradients
are unbiased. (ii) The variance of stochastic gradients is reduced by drawing
on ideas from numerical differentiation. (iii) Policy complexity is controlled
using sparse RKHS representations. Novel feature (i) is instrumental in proving
convergence to a stationary point of the expected cumulative reward. Novel
feature (ii) facilitates reasonable convergence times. Novel feature (iii) is a
necessity in practical implementations which we show can be done in a way that
does not eliminate convergence guarantees. Numerical examples in standard
problems illustrate successful learning of policies with low-complexity
representations that are close to stationary points of the expected cumulative
reward.
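A minimal sketch of the general idea, assuming a Gaussian policy whose mean is an RKHS function h(s) = sum_i alpha_i k(c_i, s): each stochastic gradient step appends a kernel center, and the dictionary is crudely capped. This omits the paper's unbiased estimator construction, variance reduction, and principled sparsification.

```python
import numpy as np

# Gaussian policy with RKHS mean h(s) = sum_i alpha_i * k(c_i, s).
# Illustrative only: naive dictionary cap instead of the paper's sparse
# RKHS representation, and a generic score-function update.
k = lambda x, y: np.exp(-(x - y) ** 2 / 2.0)
centers, alphas = [], []
sigma, lr = 0.5, 0.05

def mean(s):
    return sum(a * k(c, s) for c, a in zip(centers, alphas))

def step(s, advantage, rng):
    a = rng.normal(mean(s), sigma)            # sample an action
    score = (a - mean(s)) / sigma ** 2        # d log N(a; h(s), sigma^2) / d h(s)
    centers.append(s)                         # functional gradient adds k(s, .)
    alphas.append(lr * advantage * score)
    if len(centers) > 50:                     # crude cap on policy complexity
        centers.pop(0); alphas.pop(0)
    return a
```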
Simultaneous Perturbation Algorithms for Batch Off-Policy Search
We propose novel policy search algorithms in the context of off-policy, batch
mode reinforcement learning (RL) with continuous state and action spaces. Given
a batch collection of trajectories, we perform off-line policy evaluation using
an algorithm similar to that by [Fonteneau et al., 2010]. Using this
Monte-Carlo like policy evaluator, we perform policy search in a class of
parameterized policies. We propose both first order policy gradient and second
order policy Newton algorithms. All our algorithms incorporate simultaneous
perturbation estimates for the gradient as well as the Hessian of the
cost-to-go vector, since the latter is unknown and only biased estimates are
available. We demonstrate their practicality on a simple 1-dimensional
continuous state-space problem.
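The simultaneous perturbation estimator at the core of such algorithms is compact enough to sketch: two evaluations of the (noisy) cost-to-go yield a gradient estimate in any dimension. This is the generic SPSA form, not the paper's exact estimator.

```python
import numpy as np

# Generic SPSA gradient estimate: ghat_i = (J(th + c*D) - J(th - c*D)) / (2c*D_i),
# where D has i.i.d. Rademacher (+/-1) entries. Two evaluations of the noisy
# objective J suffice regardless of the parameter dimension.
def spsa_gradient(J, theta, c=0.1, rng=np.random.default_rng()):
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    return (J(theta + c * delta) - J(theta - c * delta)) / (2 * c * delta)

# A policy search step would then be theta -= lr * spsa_gradient(J, theta),
# with J(theta) a Monte-Carlo estimate of the cost-to-go from the batch data.
```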
The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint
The effectiveness of model-based versus model-free methods is a long-standing
question in reinforcement learning (RL). Motivated by recent empirical success
of RL on continuous control tasks, we study the sample complexity of popular
model-based and model-free algorithms on the Linear Quadratic Regulator (LQR).
We show that for policy evaluation, a simple model-based plugin method requires
asymptotically fewer samples than the classical least-squares temporal
difference (LSTD) estimator to reach the same quality of solution; the sample
complexity gap between the two methods can be at least a factor of state
dimension. For policy optimization, we study a simple family of problem instances
and show that nominal (certainty equivalence principle) control also requires
several factors of state and input dimension fewer samples than the policy
gradient method to reach the same level of control performance on these
instances. Furthermore, the gap persists even when employing commonly used
baselines. To the best of our knowledge, this is the first theoretical result
which demonstrates a separation in the sample complexity between model-based
and model-free methods on a continuous control task.
Comment: Improved the main result regarding policy optimization
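For reference, here is a minimal sketch of the LSTD estimator used as the model-free baseline: from on-policy transitions it solves a d-by-d linear system for the value weights. This is the standard LSTD(0) form with an illustrative ridge term, not the paper's code.

```python
import numpy as np

# LSTD(0): given transitions (s, r, s') under a fixed policy and a feature
# map phi, solve A w = b with A = sum phi(s)(phi(s) - gamma*phi(s'))^T and
# b = sum r * phi(s), so that V(s) ~ phi(s) . w. The small ridge term is
# illustrative, for numerical safety only.
def lstd(transitions, phi, gamma=0.99, ridge=1e-6):
    d = len(phi(transitions[0][0]))
    A, b = ridge * np.eye(d), np.zeros(d)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```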
Riemannian Proximal Policy Optimization
In this paper, we propose a general Riemannian proximal optimization
algorithm with guaranteed convergence to solve Markov decision process (MDP)
problems. To model policy functions in MDPs, we employ a Gaussian mixture model
(GMM) and formulate it as a nonconvex optimization problem in the Riemannian
space of positive semidefinite matrices. For two given policy functions, we
also provide a lower bound on policy improvement, using bounds derived from
the Wasserstein distance between GMMs. Preliminary experiments show the efficacy of
our proposed Riemannian proximal policy optimization algorithm.
Comment: 12 pages, 1 figure
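To make the geometric ingredient concrete: optimizing GMM covariance parameters over positive (semi)definite matrices typically uses a Riemannian exponential map or retraction so that iterates stay on the manifold. Below is a minimal sketch of one such step under the affine-invariant metric on positive definite matrices; this is a standard construction assumed for illustration, not the paper's algorithm.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

# One Riemannian gradient step on the positive definite matrices under the
# affine-invariant metric: Exp_Sigma(xi) = S expm(S^-1 xi S^-1) S, S = Sigma^{1/2}.
# Staying on the manifold is the point; the step direction xi is any symmetric
# gradient supplied by the surrounding optimizer.
def spd_step(Sigma, xi, lr=0.1):
    xi = 0.5 * (xi + xi.T)              # project to symmetric tangent vectors
    S = np.real(sqrtm(Sigma))
    S_inv = np.linalg.inv(S)
    return S @ expm(lr * S_inv @ xi @ S_inv) @ S   # result remains positive definite
```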