
    Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space

    We consider the reinforcement learning (RL) problem with general utilities, which consists of maximizing a function of the state-action occupancy measure. Beyond the standard cumulative-reward RL setting, this problem includes as particular cases constrained RL, pure exploration, and learning from demonstrations, among others. For this problem, we propose a simpler single-loop parameter-free normalized policy gradient algorithm. Implementing a recursive momentum variance reduction mechanism, our algorithm achieves $\tilde{\mathcal{O}}(\epsilon^{-3})$ and $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexities for $\epsilon$-first-order stationarity and $\epsilon$-global optimality, respectively, under adequate assumptions. We further address the setting of large finite state-action spaces via linear function approximation of the occupancy measure and show an $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity for a simple policy gradient method with a linear regression subroutine. Comment: 48 pages, 2 figures, ICML 2023; this paper was initially submitted on January 26th, 202
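    To make the "recursive momentum variance reduction" with a normalized step concrete, the snippet below is a minimal, generic sketch of a STORM-style recursive-momentum estimator combined with a normalized (scale-free) ascent step. The function name, argument names, and the way the two gradient estimates are obtained are assumptions of this illustration, not the paper's exact estimator for general utilities.

```python
import numpy as np

def normalized_momentum_pg_step(theta, d_prev, grad_new, grad_prev_reeval, eta, beta):
    """One illustrative normalized policy-gradient step with recursive momentum
    (STORM-style) variance reduction:

        d_t     = g_t(theta_t) + (1 - beta) * (d_{t-1} - g_t(theta_{t-1}))
        theta' = theta + eta * d_t / ||d_t||

    `grad_new` and `grad_prev_reeval` are stochastic gradient estimates at the
    current and previous parameters computed on the same fresh trajectories;
    all names here are assumptions of this sketch, not the paper's notation.
    """
    d = grad_new + (1.0 - beta) * (d_prev - grad_prev_reeval)
    step = eta * d / (np.linalg.norm(d) + 1e-12)  # normalization makes the step insensitive to gradient scale
    return theta + step, d                        # ascent step: the general-utility objective is maximized
```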

    Momentum Provably Improves Error Feedback!

    Due to the high communication overhead when training machine learning models in a distributed environment, modern algorithms invariably rely on lossy communication compression. However, when left untreated, the errors caused by compression propagate and can lead to severely unstable behavior, including exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an error feedback (EF) mechanism, which we refer to as EF14, as an immensely effective heuristic for mitigating this issue. However, despite steady algorithmic and theoretical advances in the EF field over the last decade, our understanding is far from complete. In this work we address one of the most pressing issues. In particular, in the canonical nonconvex setting, all known variants of EF rely on very large batch sizes to converge, which can be prohibitive in practice. We propose a surprisingly simple fix which removes this issue both in theory and in practice: the application of Polyak's momentum to the latest incarnation of EF due to Richtárik et al. [2021], known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves the communication and sample complexities of previous error feedback algorithms under standard smoothness and bounded variance assumptions, and does not require any further strong assumptions such as bounded gradient dissimilarity. Moreover, we propose a double momentum version of our method that improves the complexities even further. Our proof seems to be novel even when compression is removed from the method, and as such, our proof technique is of independent interest in the study of nonconvex stochastic optimization enriched with Polyak's momentum.
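    To illustrate the mechanism, here is a minimal single-worker sketch of error feedback combined with Polyak momentum in the spirit of EF21-SGDM: the worker keeps a momentum estimate of its stochastic gradient and transmits only a compressed correction toward that estimate. The Top-k compressor, the single-node setting, and all names are illustrative assumptions, not the paper's distributed implementation.

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsifier: a standard contractive compressor often paired with EF21."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def ef21_sgdm_sketch(grad_oracle, x0, steps, lr=0.01, beta=0.9, k=10):
    """Single-worker schematic of error feedback with Polyak momentum
    (EF21-SGDM-style); hypothetical oracle, names, and constants."""
    x = x0.copy()
    v = np.zeros_like(x0)   # Polyak momentum of the stochastic gradients
    g = np.zeros_like(x0)   # compressed state tracked identically by worker and server
    for _ in range(steps):
        sg = grad_oracle(x)                 # small-batch stochastic gradient
        v = (1.0 - beta) * sg + beta * v    # momentum smooths the gradient noise
        g = g + top_k(v - g, k)             # send only the compressed correction toward v
        x = x - lr * g                      # server applies the maintained estimate
    return x
```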

    Sharp Analysis of Stochastic Optimization under Global Kurdyka-Łojasiewicz Inequality

    We study the complexity of finding the global solution to stochastic nonconvex optimization when the objective function satisfies the global Kurdyka-Łojasiewicz (KL) inequality and the queries from stochastic gradient oracles satisfy a mild expected smoothness assumption. We first introduce a general framework to analyze Stochastic Gradient Descent (SGD) and its associated nonlinear dynamics in this setting. As a byproduct of our analysis, we obtain a sample complexity of $\mathcal{O}(\epsilon^{-(4-\alpha)/\alpha})$ for SGD when the objective satisfies the so-called $\alpha$-PL condition, where $\alpha$ is the degree of gradient domination. Furthermore, we show that a modified SGD with variance reduction and restarting (PAGER) achieves an improved sample complexity of $\mathcal{O}(\epsilon^{-2/\alpha})$ when the objective satisfies the average smoothness assumption. This leads to the first optimal algorithm for the important case $\alpha = 1$, which appears in applications such as policy optimization in reinforcement learning. Comment: The work was submitted for review in May, 2022 and was accepted to NeurIPS 2022 in Sep, 202
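    The abstract describes PAGER only as "SGD with variance reduction and restarting"; the sketch below shows one plausible shape of such a method, using a PAGE-style probabilistic recursive gradient estimator inside each stage and a full refresh at every restart. The oracle signatures, schedule, and constants are assumptions of this sketch rather than the algorithm analyzed in the paper.

```python
import numpy as np

def restarted_vr_sgd_sketch(full_grad, batch_grad, x0, stages=5, inner_steps=100, lr=0.01, p=0.1):
    """Restarted variance-reduced SGD in the spirit of PAGER (illustrative only).
    Assumed oracles: full_grad(x) returns a full/large-batch gradient;
    batch_grad(x, seed) returns a minibatch gradient drawn from `seed`."""
    x = x0.copy()
    for _ in range(stages):                        # restarting: refresh the estimator each stage
        g = full_grad(x)
        for _ in range(inner_steps):
            x_prev, x = x, x - lr * g
            if np.random.rand() < p:               # occasional full/large-batch refresh (PAGE switch)
                g = full_grad(x)
            else:                                  # cheap recursive correction on a shared minibatch
                seed = np.random.randint(1 << 30)  # the two evaluations must use the same minibatch
                g = g + batch_grad(x, seed) - batch_grad(x_prev, seed)
    return x
```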

    Learning Zero-Sum Linear Quadratic Games with Improved Sample Complexity

    Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control and can be used (i) as a dynamic game formulation for risk-sensitive or robust control, or (ii) as a benchmark setting for multi-agent reinforcement learning with two competing agents in continuous state-control spaces. In contrast to the well-studied single-agent linear quadratic regulator problem, zero-sum LQ games entail solving a challenging nonconvex-nonconcave min-max problem with an objective function that lacks coercivity. Recently, Zhang et al. discovered an implicit regularization property of natural policy gradient methods, which is crucial for safety-critical control systems since it preserves the robustness of the controller during learning. Moreover, in the model-free setting where knowledge of the model parameters is not available, Zhang et al. proposed the first polynomial sample complexity algorithm to reach an $\epsilon$-neighborhood of the Nash equilibrium while maintaining the desirable implicit regularization property. In this work, we propose a simpler nested Zeroth-Order (ZO) algorithm improving the sample complexity by several orders of magnitude. Our main result guarantees an $\widetilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity under the same assumptions using a single-point ZO estimator. Furthermore, when the estimator is replaced by a two-point estimator, our method enjoys a better $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity. Our key improvements rely on a more sample-efficient nested algorithm design and finer control of the ZO natural gradient estimation error.
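    The single-point versus two-point distinction is what drives the improvement from $\widetilde{\mathcal{O}}(\epsilon^{-3})$ to $\widetilde{\mathcal{O}}(\epsilon^{-2})$. Below is a generic two-point zeroth-order gradient estimator of the kind referred to, applied to a cost evaluated at a gain matrix; it is a hedged sketch of the estimator alone, not the paper's nested natural-gradient algorithm, and the smoothing radius and sample count are illustrative.

```python
import numpy as np

def zo_two_point_grad(f, K, radius=0.05, n_samples=20):
    """Two-point zeroth-order gradient estimate of a scalar cost f at a gain
    matrix K; f, radius, and n_samples are assumptions of this sketch."""
    d = K.size
    g = np.zeros_like(K)
    for _ in range(n_samples):
        U = np.random.randn(*K.shape)
        U /= np.linalg.norm(U)                     # perturbation direction drawn uniformly on the sphere
        g += (f(K + radius * U) - f(K - radius * U)) / (2.0 * radius) * U
    return (d / n_samples) * g                     # a single-point variant would instead use d * f(K + radius*U) * U / radius
```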