Accelerated Policy Gradient: On the Nesterov Momentum for Reinforcement Learning
Policy gradient methods have recently been shown to enjoy global convergence
at an $O(1/t)$ rate in the non-regularized tabular softmax setting.
Accordingly, one important research question is whether this convergence rate
can be further improved, with only first-order updates. In this paper, we
answer the above question from the perspective of momentum by adapting the
celebrated Nesterov's accelerated gradient (NAG) method to reinforcement
learning (RL), termed \textit{Accelerated Policy Gradient} (APG). To
demonstrate the potential of APG in achieving faster global convergence, we
formally show that with the true gradient, APG with softmax policy
parametrization converges to an optimal policy at an $\tilde{O}(1/t^2)$ rate. To
the best of our knowledge, this is the first characterization of the global
convergence rate of NAG in the context of RL. Notably, our analysis relies on
one interesting finding: Regardless of the initialization, APG could enter a
locally nearly-concave regime within finitely many iterations, where it could
benefit significantly from the momentum. By means of numerical validation,
we confirm that APG exhibits an $\tilde{O}(1/t^2)$ rate and show that APG
could significantly improve the convergence behavior over the standard policy
gradient.
Comment: 51 pages, 8 figures
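To make the acceleration mechanism concrete, here is a minimal sketch of Nesterov-style momentum applied to the exact softmax policy gradient on a toy one-state MDP; the reward vector, step size, and momentum schedule are illustrative assumptions rather than the paper's settings.

```python
# Minimal sketch (assumed toy rewards, step size, and (t-1)/(t+2) momentum
# schedule) of Nesterov-accelerated policy gradient with the true gradient
# under a tabular softmax parametrization on a one-state MDP.
import numpy as np

rewards = np.array([1.0, 0.8, 0.2])    # assumed toy rewards for a single state
eta = 0.4                              # assumed step size

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_J(theta):
    # Exact gradient of J(theta) = sum_a pi_theta(a) * r(a) for softmax policies.
    pi = softmax(theta)
    J = pi @ rewards
    return pi * (rewards - J)

theta_prev = np.zeros_like(rewards)
theta = np.zeros_like(rewards)
for t in range(1, 2001):
    # Nesterov look-ahead point, then a gradient-ascent step taken from it.
    phi = theta + (t - 1) / (t + 2) * (theta - theta_prev)
    theta_prev, theta = theta, phi + eta * grad_J(phi)

pi = softmax(theta)
print("final policy:", np.round(pi, 4),
      "expected reward:", round(float(pi @ rewards), 4))
```

Setting the momentum coefficient to zero recovers the standard softmax policy-gradient baseline the abstract compares against.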
Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees
We revisit the domain of off-policy policy optimization in RL from the
perspective of coordinate ascent. One commonly-used approach is to leverage the
off-policy policy gradient to optimize a surrogate objective -- the expected
total discounted return of the target policy with respect to the state
distribution of the behavior policy. However, this approach has been shown to
suffer from the distribution mismatch issue, and therefore significant efforts
are needed for correcting this mismatch either via state distribution
correction or a counterfactual method. In this paper, we rethink off-policy
learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy
actor-critic algorithm that decouples policy improvement from the state
distribution of the behavior policy without using the policy gradient. This
design obviates the need for distribution correction or importance sampling in
the policy improvement step of off-policy policy gradient. We establish the
global convergence of CAPO with general coordinate selection and then further
quantify the convergence rates of several instances of CAPO with popular
coordinate selection rules, including the cyclic and the randomized variants of
CAPO. We then extend CAPO to neural policies for a more practical
implementation. Through experiments, we demonstrate that CAPO provides a
competitive approach to RL in practice.
Comment: 47 pages, 4 figures
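As a rough sketch of the coordinate-ascent viewpoint (not the paper's exact CAPO update rule), the snippet below updates a single logit of a tabular softmax policy per step, using coordinates drawn from an arbitrary behavior policy and an assumed critic, with no importance weights.

```python
# Rough sketch (not the paper's exact CAPO rule): a tabular softmax policy whose
# logits are updated one coordinate at a time, using (state, action) pairs drawn
# from an arbitrary behavior policy and a stand-in critic, with no importance
# sampling. The critic Q, step size, and schedule are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))       # policy logits
Q = rng.normal(size=(n_states, n_actions))    # assumed critic estimates

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

alpha = 0.5
for _ in range(500):
    s = rng.integers(n_states)                # coordinates from a uniform
    a = rng.integers(n_actions)               # behavior policy (off-policy data)
    advantage = Q[s, a] - softmax(theta[s]) @ Q[s]
    # Coordinate-ascent step: move only this logit, in the advantage direction.
    theta[s, a] += alpha * np.sign(advantage)

print("learned policy argmax:", softmax(theta).argmax(axis=1))
print("greedy actions of Q:  ", Q.argmax(axis=1))
```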
Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits
Modifying the reward-biased maximum likelihood method originally proposed in
the adaptive control literature, we propose novel learning algorithms to handle
the explore-exploit trade-off in linear bandit problems as well as generalized
linear bandit problems. We develop novel index policies that we prove to be
order-optimal, and show that their empirical performance is competitive with
state-of-the-art benchmark methods in extensive experiments. The new
policies achieve this with low computation time per pull for linear bandits,
thereby delivering both favorable regret and computational efficiency.
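A minimal sketch of a reward-biased index policy for a linear bandit follows; the concrete index form and the bias schedule alpha_t are placeholder assumptions, not the order-optimal index derived in the paper.

```python
# Minimal sketch of a biased index policy for a linear stochastic bandit: a
# ridge estimate of the unknown parameter plus a bias term added to the
# estimated reward. The index form and the schedule alpha_t are placeholder
# assumptions, not the order-optimal index derived in the paper.
import numpy as np

rng = np.random.default_rng(2)
d, n_arms, T = 3, 10, 2000
theta_star = rng.normal(size=d) / np.sqrt(d)   # unknown true parameter
arms = rng.normal(size=(n_arms, d))            # fixed arm feature vectors

V = np.eye(d)              # ridge-regularized Gram matrix
b = np.zeros(d)            # running sum of feature * observed reward
total = 0.0
for t in range(1, T + 1):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                      # ridge estimate
    alpha_t = np.sqrt(np.log(t + 1))           # assumed bias schedule
    # Index = estimated mean reward + bias favoring poorly explored directions.
    index = arms @ theta_hat + alpha_t * np.einsum("ij,jk,ik->i", arms, V_inv, arms)
    x = arms[index.argmax()]                   # pull the arm with the largest index
    reward = x @ theta_star + 0.1 * rng.normal()
    V += np.outer(x, x)
    b += reward * x
    total += reward

print("average reward:", round(total / T, 4),
      "best arm mean:", round(float((arms @ theta_star).max()), 4))
```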
Value-Biased Maximum Likelihood Estimation for Model-based Reinforcement Learning in Discounted Linear MDPs
We consider infinite-horizon linear Markov Decision Processes (MDPs),
where the transition probabilities of the dynamic model can be linearly
parameterized with the help of a predefined low-dimensional feature mapping.
While the existing regression-based approaches have been theoretically shown to
achieve nearly-optimal regret, they are computationally rather inefficient due
to the need for a large number of optimization runs in each time step,
especially when the state and action spaces are large. To address this issue,
we propose to solve linear MDPs through the lens of Value-Biased Maximum
Likelihood Estimation (VBMLE), which is a classic model-based exploration
principle in the adaptive control literature for resolving the well-known
closed-loop identification problem of Maximum Likelihood Estimation. We
formally show that (i) VBMLE enjoys $\tilde{O}(d\sqrt{T})$ regret, where $T$ is
the time horizon and $d$ is the dimension of the model parameter, and
(ii) VBMLE is computationally more efficient as it only requires solving one
optimization problem in each time step. In our regret analysis, we offer a
generic convergence result of MLE in linear MDPs through a novel
supermartingale construct and uncover an interesting connection between linear
MDPs and online learning, which could be of independent interest. Finally, the
simulation results show that VBMLE significantly outperforms the benchmark
method in terms of both empirical regret and computation time.
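To illustrate the value-biasing principle in a simplified setting, the sketch below works with a finite candidate set of tabular models rather than the paper's linear parameterization; the bias weight alpha_t and the toy MDP are assumptions.

```python
# Simplified sketch of the value-biasing principle: with a finite candidate set
# of tabular transition models, pick the model maximizing
#   log-likelihood + alpha_t * (its optimal value),
# then act greedily under it. The paper instead treats linearly parameterized
# MDPs with a specific bias weight; the candidates, alpha_t, and toy MDP below
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 3, 2, 0.9
R = rng.uniform(size=(nS, nA))                 # known reward table (assumed)

def random_model():
    P = rng.uniform(size=(nS, nA, nS))
    return P / P.sum(-1, keepdims=True)

true_P = random_model()
candidates = [random_model() for _ in range(9)] + [true_P]

def plan(P):
    # Value iteration: mean optimal value and greedy policy under model P.
    V = np.zeros(nS)
    for _ in range(200):
        Q = R + gamma * P @ V
        V = Q.max(axis=1)
    return V.mean(), Q.argmax(axis=1)

values, policies = map(np.array, zip(*(plan(P) for P in candidates)))

loglik, s = np.zeros(len(candidates)), 0
for t in range(1, 501):
    alpha_t = np.log(t + 1)                    # assumed bias weight schedule
    i_hat = int(np.argmax(loglik + alpha_t * values))   # value-biased model pick
    a = policies[i_hat][s]                     # act greedily under that model
    s_next = rng.choice(nS, p=true_P[s, a])
    loglik += np.log([P[s, a, s_next] for P in candidates])
    s = s_next

print("MLE leader is the true model:", int(np.argmax(loglik)) == len(candidates) - 1)
```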
Image Deraining via Self-supervised Reinforcement Learning
The quality of images captured outdoors is often affected by the weather. One
factor that interferes with visibility is rain, which can obstruct the view of
human observers and degrade computer vision applications that rely on those
images. This work aims to recover rain-degraded images by removing rain streaks
via Self-supervised
Reinforcement Learning (RL) for image deraining (SRL-Derain). We locate rain
streak pixels from the input rain image via dictionary learning and use
pixel-wise RL agents to take multiple inpainting actions to remove rain
progressively. To our knowledge, this work is the first attempt where
self-supervised RL is applied to image deraining. Experimental results on
several benchmark image-deraining datasets show that the proposed SRL-Derain
performs favorably against state-of-the-art few-shot and self-supervised
deraining and denoising methods.
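The toy sketch below mirrors only the overall pipeline shape (flag candidate rain pixels, then progressively apply per-pixel inpainting actions); the threshold detector and the greedy action choice are simplified stand-ins for the paper's dictionary-learning detection and learned pixel-wise RL agents.

```python
# Toy illustration of the pipeline shape only: flag candidate rain pixels, then
# progressively apply one of several per-pixel inpainting actions. The threshold
# detector and greedy action choice are simplified stand-ins for the paper's
# dictionary-learning detection and learned pixel-wise RL agents.
import numpy as np

rng = np.random.default_rng(4)
clean = np.tile(np.linspace(0.2, 0.8, 32), (32, 1))   # toy background "scene"
rainy = clean.copy()
streaks = rng.random(clean.shape) < 0.05              # synthetic rain streaks
rainy[streaks] = 1.0

mask = rainy > rainy.mean() + 2 * rainy.std()          # stand-in rain detector

def patch(img, i, j):
    return img[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]

derained = rainy.copy()
for _ in range(3):                                      # progressive refinement
    for i, j in zip(*np.where(mask)):
        p = patch(derained, i, j)
        known = p[~patch(mask, i, j)]                   # non-rain neighbours
        ref = np.median(known) if known.size else p.mean()
        # Per-pixel action set: keep value, box-filter value, median-filter value.
        actions = [derained[i, j], p.mean(), np.median(p)]
        # Greedy stand-in for the learned policy: action closest to the
        # self-supervised reference built from non-rain neighbours.
        derained[i, j] = min(actions, key=lambda v: abs(v - ref))

print("mean abs error vs clean, before:",
      round(float(np.abs(rainy - clean).mean()), 4),
      "after:", round(float(np.abs(derained - clean).mean()), 4))
```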