Accelerated Policy Gradient: On the Nesterov Momentum for Reinforcement Learning
Policy gradient methods have recently been shown to enjoy global convergence
at an $O(1/t)$ rate in the non-regularized tabular softmax setting.
Accordingly, one important research question is whether this convergence rate
can be further improved, with only first-order updates. In this paper, we
answer the above question from the perspective of momentum by adapting the
celebrated Nesterov's accelerated gradient (NAG) method to reinforcement
learning (RL), termed \textit{Accelerated Policy Gradient} (APG). To
demonstrate the potential of APG in achieving faster global convergence, we
formally show that with the true gradient, APG with softmax policy
parametrization converges to an optimal policy at an $\tilde{O}(1/t^2)$ rate. To
the best of our knowledge, this is the first characterization of the global
convergence rate of NAG in the context of RL. Notably, our analysis relies on
one interesting finding: Regardless of the initialization, APG could enter a
locally nearly-concave regime within finitely many iterations, where it could
benefit significantly from the momentum. By means of numerical validation,
we confirm that APG exhibits an $\tilde{O}(1/t^2)$ rate and show that APG
could significantly improve the convergence behavior over the standard policy
gradient.
Comment: 51 pages, 8 figures
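To make the acceleration mechanism concrete, here is a minimal sketch of Nesterov-style momentum applied to the exact softmax policy gradient on a toy one-state MDP; the reward vector, step size, and momentum schedule are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch (assumed toy rewards, step size, and (t-1)/(t+2) momentum
# schedule) of Nesterov-accelerated policy gradient with the true gradient
# under a tabular softmax parametrization on a one-state MDP.
import numpy as np

rewards = np.array([1.0, 0.8, 0.2])    # assumed toy rewards for a single state
eta = 0.4                              # assumed step size

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_J(theta):
    # Exact gradient of J(theta) = sum_a pi_theta(a) * r(a) for softmax policies.
    pi = softmax(theta)
    J = pi @ rewards
    return pi * (rewards - J)

theta_prev = np.zeros_like(rewards)
theta = np.zeros_like(rewards)
for t in range(1, 2001):
    # Nesterov look-ahead point, then a gradient-ascent step taken from it.
    phi = theta + (t - 1) / (t + 2) * (theta - theta_prev)
    theta_prev, theta = theta, phi + eta * grad_J(phi)

pi = softmax(theta)
print("final policy:", np.round(pi, 4),
      "expected reward:", round(float(pi @ rewards), 4))
```

Setting the momentum coefficient to zero recovers the standard softmax policy-gradient baseline the abstract compares against.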
Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees
We revisit the domain of off-policy policy optimization in RL from the
perspective of coordinate ascent. One commonly-used approach is to leverage the
off-policy policy gradient to optimize a surrogate objective -- the expected
total discounted return of the target policy with respect to the state
distribution of the behavior policy. However, this approach has been shown to
suffer from the distribution mismatch issue, and therefore significant efforts
are needed for correcting this mismatch either via state distribution
correction or a counterfactual method. In this paper, we rethink off-policy
learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy
actor-critic algorithm that decouples policy improvement from the state
distribution of the behavior policy without using the policy gradient. This
design obviates the need for distribution correction or importance sampling in
the policy improvement step of off-policy policy gradient. We establish the
global convergence of CAPO with general coordinate selection and then further
quantify the convergence rates of several instances of CAPO with popular
coordinate selection rules, including the cyclic and the randomized variants of
CAPO. We then extend CAPO to neural policies for a more practical
implementation. Through experiments, we demonstrate that CAPO provides a
competitive approach to RL in practice.
Comment: 47 pages, 4 figures
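As a rough sketch of the coordinate-ascent viewpoint (not the paper's exact CAPO update rule), the snippet below updates a single logit of a tabular softmax policy per step, using coordinates drawn from an arbitrary behavior policy and an assumed critic, with no importance weights.

```python
# Rough sketch (not the paper's exact CAPO rule): a tabular softmax policy whose
# logits are updated one coordinate at a time, using (state, action) pairs drawn
# from an arbitrary behavior policy and a stand-in critic, with no importance
# sampling. The critic Q, step size, and schedule are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))       # policy logits
Q = rng.normal(size=(n_states, n_actions))    # assumed critic estimates

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

alpha = 0.5
for _ in range(500):
    s = rng.integers(n_states)                # coordinates from a uniform
    a = rng.integers(n_actions)               # behavior policy (off-policy data)
    advantage = Q[s, a] - softmax(theta[s]) @ Q[s]
    # Coordinate-ascent step: move only this logit, in the advantage direction.
    theta[s, a] += alpha * np.sign(advantage)

print("learned policy argmax:", softmax(theta).argmax(axis=1))
print("greedy actions of Q:  ", Q.argmax(axis=1))
```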
Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits
Modifying the reward-biased maximum likelihood method originally proposed in
the adaptive control literature, we propose novel learning algorithms to handle
the explore-exploit trade-off in linear bandit problems as well as generalized
linear bandit problems. We develop novel index policies that we prove to be
order-optimal, and show that their empirical performance is competitive with
state-of-the-art benchmark methods in extensive experiments. The new
policies achieve this with low computation time per pull for linear bandits,
thereby delivering both favorable regret and computational efficiency.
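A minimal sketch of a reward-biased index policy for a linear bandit follows; the concrete index form and the bias schedule alpha_t are placeholder assumptions, not the order-optimal index derived in the paper.

```python
# Minimal sketch of a biased index policy for a linear stochastic bandit: a
# ridge estimate of the unknown parameter plus a bias term added to the
# estimated reward. The index form and the schedule alpha_t are placeholder
# assumptions, not the order-optimal index derived in the paper.
import numpy as np

rng = np.random.default_rng(2)
d, n_arms, T = 3, 10, 2000
theta_star = rng.normal(size=d) / np.sqrt(d)   # unknown true parameter
arms = rng.normal(size=(n_arms, d))            # fixed arm feature vectors

V = np.eye(d)              # ridge-regularized Gram matrix
b = np.zeros(d)            # running sum of feature * observed reward
total = 0.0
for t in range(1, T + 1):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                      # ridge estimate
    alpha_t = np.sqrt(np.log(t + 1))           # assumed bias schedule
    # Index = estimated mean reward + bias favoring poorly explored directions.
    index = arms @ theta_hat + alpha_t * np.einsum("ij,jk,ik->i", arms, V_inv, arms)
    x = arms[index.argmax()]                   # pull the arm with the largest index
    reward = x @ theta_star + 0.1 * rng.normal()
    V += np.outer(x, x)
    b += reward * x
    total += reward

print("average reward:", round(total / T, 4),
      "best arm mean:", round(float((arms @ theta_star).max()), 4))
```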
Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement Learning in Discounted Linear MDPs
We consider infinite-horizon linear Markov Decision Processes (MDPs),
where the transition probabilities of the dynamic model can be linearly
parameterized with the help of a predefined low-dimensional feature mapping.
While the existing regression-based approaches have been theoretically shown to
achieve nearly-optimal regret, they are computationally rather inefficient due
to the need for a large number of optimization runs in each time step,
especially when the state and action spaces are large. To address this issue,
we propose to solve linear MDPs through the lens of Value-Biased Maximum
Likelihood Estimation (VBMLE), which is a classic model-based exploration
principle in the adaptive control literature for resolving the well-known
closed-loop identification problem of Maximum Likelihood Estimation. We
formally show that (i) VBMLE enjoys $\tilde{O}(d\sqrt{T})$ regret, where $T$ is
the time horizon and $d$ is the dimension of the model parameter, and
(ii) VBMLE is computationally more efficient as it only requires solving one
optimization problem in each time step. In our regret analysis, we offer a
generic convergence result of MLE in linear MDPs through a novel
supermartingale construct and uncover an interesting connection between linear
MDPs and online learning, which could be of independent interest. Finally, the
simulation results show that VBMLE significantly outperforms the benchmark
method in terms of both empirical regret and computation time.
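To illustrate the value-biasing principle in a simplified setting, the sketch below works with a finite candidate set of tabular models rather than the paper's linear parameterization; the bias weight alpha_t and the toy MDP are assumptions.

```python
# Simplified sketch of the value-biasing principle: with a finite candidate set
# of tabular transition models, pick the model maximizing
#   log-likelihood + alpha_t * (its optimal value),
# then act greedily under it. The paper instead treats linearly parameterized
# MDPs with a specific bias weight; the candidates, alpha_t, and toy MDP below
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 3, 2, 0.9
R = rng.uniform(size=(nS, nA))                 # known reward table (assumed)

def random_model():
    P = rng.uniform(size=(nS, nA, nS))
    return P / P.sum(-1, keepdims=True)

true_P = random_model()
candidates = [random_model() for _ in range(9)] + [true_P]

def plan(P):
    # Value iteration: mean optimal value and greedy policy under model P.
    V = np.zeros(nS)
    for _ in range(200):
        Q = R + gamma * P @ V
        V = Q.max(axis=1)
    return V.mean(), Q.argmax(axis=1)

values, policies = map(np.array, zip(*(plan(P) for P in candidates)))

loglik, s = np.zeros(len(candidates)), 0
for t in range(1, 501):
    alpha_t = np.log(t + 1)                    # assumed bias weight schedule
    i_hat = int(np.argmax(loglik + alpha_t * values))   # value-biased model pick
    a = policies[i_hat][s]                     # act greedily under that model
    s_next = rng.choice(nS, p=true_P[s, a])
    loglik += np.log([P[s, a, s_next] for P in candidates])
    s = s_next

print("MLE leader is the true model:", int(np.argmax(loglik)) == len(candidates) - 1)
```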
Image Deraining via Self-supervised Reinforcement Learning
The quality of images captured outdoors is often affected by the weather. One
factor that interferes with visibility is rain, which can obstruct the view of
human observers and degrade computer vision applications that rely on those
images. This work aims to recover rain-degraded images by removing rain streaks
via Self-supervised
Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain
streak pixels from the input rain image via dictionary learning and use
pixel-wise RL agents to take multiple inpainting actions to remove rain
progressively. To our knowledge, this work is the first attempt where
self-supervised RL is applied to image deraining. Experimental results on
several benchmark image-deraining datasets show that the proposed SRL-Derain
performs favorably against state-of-the-art few-shot and self-supervised
deraining and denoising methods.
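The toy sketch below mirrors only the overall pipeline shape (flag candidate rain pixels, then progressively apply per-pixel inpainting actions); the threshold detector and the greedy action choice are simplified stand-ins for the paper's dictionary-learning detection and learned pixel-wise RL agents.

```python
# Toy illustration of the pipeline shape only: flag candidate rain pixels, then
# progressively apply one of several per-pixel inpainting actions. The threshold
# detector and greedy action choice are simplified stand-ins for the paper's
# dictionary-learning detection and learned pixel-wise RL agents.
import numpy as np

rng = np.random.default_rng(4)
clean = np.tile(np.linspace(0.2, 0.8, 32), (32, 1))   # toy background "scene"
rainy = clean.copy()
streaks = rng.random(clean.shape) < 0.05              # synthetic rain streaks
rainy[streaks] = 1.0

mask = rainy > rainy.mean() + 2 * rainy.std()          # stand-in rain detector

def patch(img, i, j):
    return img[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]

derained = rainy.copy()
for _ in range(3):                                      # progressive refinement
    for i, j in zip(*np.where(mask)):
        p = patch(derained, i, j)
        known = p[~patch(mask, i, j)]                   # non-rain neighbours
        ref = np.median(known) if known.size else p.mean()
        # Per-pixel action set: keep value, box-filter value, median-filter value.
        actions = [derained[i, j], p.mean(), np.median(p)]
        # Greedy stand-in for the learned policy: action closest to the
        # self-supervised reference built from non-rain neighbours.
        derained[i, j] = min(actions, key=lambda v: abs(v - ref))

print("mean abs error vs clean, before:",
      round(float(np.abs(rainy - clean).mean()), 4),
      "after:", round(float(np.abs(derained - clean).mean()), 4))
```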