Taming Nonconvex Stochastic Mirror Descent with General Bregman Divergence
This paper revisits the convergence of Stochastic Mirror Descent (SMD) in the
contemporary nonconvex optimization setting. Existing results for batch-free
nonconvex SMD restrict the choice of the distance generating function (DGF) to
be differentiable with Lipschitz continuous gradients, thereby excluding
important setups such as Shannon entropy. In this work, we present a new
convergence analysis of nonconvex SMD supporting general DGFs that overcomes
the above limitations and relies solely on the standard assumptions. Moreover,
our convergence is established with respect to the Bregman Forward-Backward
envelope, which is a stronger measure than the commonly used squared norm of
gradient mapping. We further extend our results to guarantee high probability
convergence under sub-Gaussian noise and global convergence under the
generalized Bregman Proximal Polyak-{\L}ojasiewicz condition. Additionally, we
illustrate the advantages of our improved SMD theory in various nonconvex
machine learning tasks by harnessing nonsmooth DGFs. Notably, in the context of
nonconvex differentially private (DP) learning, our theory yields a simple
algorithm with a (nearly) dimension-independent utility bound. For the problem
of training linear neural networks, we develop provably convergent stochastic
algorithms.
Comment: Accepted for publication at AISTATS 2024.
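As a concrete illustration of an SMD step with a nonsmooth DGF, here is a minimal sketch (not the paper's code) of stochastic mirror descent on the probability simplex with the Shannon-entropy DGF, where the mirror update reduces to the familiar exponentiated-gradient rule; the objective, step size, and gradient oracle below are placeholder assumptions.

    import numpy as np

    def smd_entropy_step(x, g, step):
        # Mirror step with the Shannon entropy DGF: the update is
        # multiplicative, followed by a Bregman projection onto the simplex.
        y = x * np.exp(-step * g)
        return y / y.sum()

    rng = np.random.default_rng(0)
    x = np.full(5, 0.2)                 # start from the uniform distribution
    for _ in range(100):
        g = rng.normal(size=5)          # placeholder stochastic gradient
        x = smd_entropy_step(x, g, step=0.1)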
Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space
We consider the reinforcement learning (RL) problem with general utilities
which consists in maximizing a function of the state-action occupancy measure.
Beyond the standard cumulative-reward RL setting, this problem includes constrained RL, pure exploration, and learning from demonstrations as particular cases. For this problem, we propose a simpler single-loop
parameter-free normalized policy gradient algorithm. Implementing a recursive
momentum variance reduction mechanism, our algorithm achieves $\tilde{\mathcal{O}}(\epsilon^{-3})$ and $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexities for $\epsilon$-first-order stationarity and $\epsilon$-global optimality, respectively, under adequate assumptions. We
further address the setting of large finite state action spaces via linear
function approximation of the occupancy measure and show a $\tilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity for a simple policy gradient method with a linear regression subroutine.
Comment: 48 pages, 2 figures, ICML 2023; this paper was initially submitted on January 26th, 2023.
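To make the recursive momentum mechanism concrete, the following is a minimal generic sketch (not the paper's implementation) of a normalized update with STORM-style variance reduction; the gradient oracle and step sizes are placeholder assumptions, and in the paper the update acts on policy parameters with gradient estimates of the general utility.

    import numpy as np

    rng = np.random.default_rng(0)

    def stoch_grad(theta, xi):
        # Placeholder stochastic gradient oracle; xi is the shared random
        # sample (stands in for a policy-gradient estimate of the utility).
        return 2 * theta + 0.1 * xi

    theta = np.ones(4)
    xi = rng.normal(size=4)
    d = stoch_grad(theta, xi)           # initial gradient estimate
    eta, gamma = 0.3, 0.05              # momentum weight and step size
    for _ in range(200):
        theta_prev = theta
        theta = theta - gamma * d / np.linalg.norm(d)   # normalized step
        xi = rng.normal(size=4)         # one fresh sample per iteration
        # Recursive momentum: the same sample is evaluated at both iterates.
        d = stoch_grad(theta, xi) + (1 - eta) * (d - stoch_grad(theta_prev, xi))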
Stochastic Optimization under Hidden Convexity
In this work, we consider constrained stochastic optimization problems under
hidden convexity, i.e., those that admit a convex reformulation via a non-linear (but invertible) map. A number of non-convex problems ranging from
optimal control, revenue and inventory management, to convex reinforcement
learning all admit such a hidden convex structure. Unfortunately, in the
majority of applications considered, the map is unavailable or
implicit; therefore, directly solving the convex reformulation is not possible.
On the other hand, the stochastic gradients with respect to the original
variable are often easy to obtain. Motivated by these observations, we examine
the basic projected stochastic (sub-)gradient methods for solving such
problems under hidden convexity. We provide the first sample complexity
guarantees for global convergence in smooth and non-smooth settings.
Additionally, in the smooth setting, we strengthen our results to last-iterate convergence in terms of the function value gap, using a momentum variant of projected stochastic gradient descent.
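A minimal sketch of a projected stochastic gradient method with momentum in the spirit of the last result (not the paper's exact scheme); the objective, the convex constraint set (a Euclidean ball here), and the step sizes are placeholder assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def project_ball(x, radius=1.0):
        # Euclidean projection onto the ball {x : ||x|| <= radius}.
        n = np.linalg.norm(x)
        return x if n <= radius else x * (radius / n)

    def stoch_grad(x):
        # Placeholder stochastic gradient of the (hiddenly convex) objective.
        return 4 * x**3 - 2 * x + 0.1 * rng.normal(size=x.shape)

    x = np.full(3, 0.5)
    v = np.zeros(3)
    beta, gamma = 0.9, 0.01             # momentum and step size
    for _ in range(500):
        v = beta * v + (1 - beta) * stoch_grad(x)   # momentum buffer
        x = project_ball(x - gamma * v)             # projected step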
Momentum Provably Improves Error Feedback!
Due to the high communication overhead when training machine learning models
in a distributed environment, modern algorithms invariably rely on lossy
communication compression. However, when untreated, the errors caused by
compression propagate, and can lead to severely unstable behavior, including
exponential divergence. Almost a decade ago, Seide et al. [2014] proposed an
error feedback (EF) mechanism, which we refer to as EF14, as an immensely
effective heuristic for mitigating this issue. However, despite steady
algorithmic and theoretical advances in the EF field in the last decade, our
understanding is far from complete. In this work we address one of the most
pressing issues. In particular, in the canonical nonconvex setting, all known
variants of EF rely on very large batch sizes to converge, which can be
prohibitive in practice. We propose a surprisingly simple fix which removes
this issue both theoretically and in practice: the application of Polyak's
momentum to the latest incarnation of EF due to Richt\'{a}rik et al. [2021]
known as EF21. Our algorithm, for which we coin the name EF21-SGDM, improves
the communication and sample complexities of previous error feedback algorithms
under standard smoothness and bounded variance assumptions, and does not
require any further strong assumptions such as bounded gradient dissimilarity.
Moreover, we propose a double momentum version of our method that improves the
complexities even further. Our proof seems to be novel even when compression is
removed from the method, and as such, our proof technique is of independent
interest in the study of nonconvex stochastic optimization enriched with
Polyak's momentum.
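A minimal sketch of error feedback with momentum in the spirit of EF21-SGDM (a paraphrase, not the authors' code): each worker keeps a momentum estimate of its stochastic gradient and an error-feedback state, communicates only a compressed correction (Top-K here), and the server steps along the aggregated states. The local losses, compressor, and step sizes are placeholder assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def topk(v, k=2):
        # Top-K compressor: keep the k largest-magnitude coordinates.
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    def stoch_grad(x, i):
        # Placeholder stochastic gradient of worker i's local loss.
        return 2 * (x - i) + 0.1 * rng.normal(size=x.shape)

    n, d = 4, 10
    x = np.zeros(d)
    v = np.stack([stoch_grad(x, i) for i in range(n)])  # momentum buffers
    g = v.copy()                                        # per-worker EF states
    eta, gamma = 0.1, 0.05                              # momentum and step size
    for _ in range(300):
        for i in range(n):
            v[i] = (1 - eta) * v[i] + eta * stoch_grad(x, i)  # Polyak momentum
            g[i] += topk(v[i] - g[i])   # only the compressed correction is sent
        x = x - gamma * g.mean(axis=0)  # server step along aggregated states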
Sharp Analysis of Stochastic Optimization under Global Kurdyka-{\L}ojasiewicz Inequality
We study the complexity of finding the global solution to stochastic
nonconvex optimization when the objective function satisfies global
Kurdyka-{\L}ojasiewicz (KL) inequality and the queries from stochastic gradient oracles satisfy a mild expected smoothness assumption. We first introduce a
general framework to analyze Stochastic Gradient Descent (SGD) and its
associated nonlinear dynamics in this setting. As a byproduct of our analysis, we obtain a sample complexity of $\mathcal{O}(\epsilon^{-(4-\alpha)/\alpha})$ for SGD when the objective satisfies the so-called $\alpha$-PL condition, where $\alpha$ is the degree of
gradient domination. Furthermore, we show that a modified SGD with variance
reduction and restarting (PAGER) achieves an improved sample complexity of $\mathcal{O}(\epsilon^{-2/\alpha})$ when the objective satisfies the average
smoothness assumption. This leads to the first optimal algorithm for the
important case of $\alpha = 1$, which appears in applications such as policy optimization in reinforcement learning.
Comment: The work was submitted for review in May 2022 and was accepted to NeurIPS 2022 in September 2022.
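For concreteness, one standard way to write the $\alpha$-PL (gradient domination) condition referenced above; this is a common parameterization, and the paper's exact constants may differ. For some $\mu > 0$ and $\alpha \in [1, 2]$,

    \[ \|\nabla f(x)\|^{\alpha} \;\ge\; 2\mu \left( f(x) - f^{*} \right) \quad \text{for all } x. \]

Here $\alpha = 2$ recovers the classical Polyak-{\L}ojasiewicz condition, while $\alpha = 1$ corresponds to the weak gradient domination arising in policy optimization.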
Learning Zero-Sum Linear Quadratic Games with Improved Sample Complexity
Zero-sum Linear Quadratic (LQ) games are fundamental in optimal control and
can be used (i) as a dynamic game formulation for risk-sensitive or robust
control, or (ii) as a benchmark setting for multi-agent reinforcement learning
with two competing agents in continuous state-control spaces. In contrast to
the well-studied single-agent linear quadratic regulator problem, zero-sum LQ
games entail solving a challenging nonconvex-nonconcave min-max problem with an
objective function that lacks coercivity. Recently, Zhang et al. discovered an
implicit regularization property of natural policy gradient methods which is
crucial for safety-critical control systems since it preserves the robustness
of the controller during learning. Moreover, in the model-free setting where
the knowledge of model parameters is not available, Zhang et al. proposed the
first polynomial sample complexity algorithm to reach an $\epsilon$-neighborhood of the Nash equilibrium while maintaining the desirable
implicit regularization property. In this work, we propose a simpler nested
Zeroth-Order (ZO) algorithm improving sample complexity by several orders of
magnitude. Our main result guarantees a $\tilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity under the same
assumptions using a single-point ZO estimator. Furthermore, when the estimator
is replaced by a two-point estimator, our method enjoys a better $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity. Our key
improvements rely on a more sample-efficient nested algorithm design and finer
control of the ZO natural gradient estimation error.
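To illustrate the distinction the two results hinge on, here is a minimal sketch of generic single-point and two-point zeroth-order gradient estimators (standard constructions, not the paper's exact estimators, which act on policy matrices); the objective and smoothing radius are placeholder assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sphere_sample(d):
        # Uniform sample from the unit sphere in R^d.
        u = rng.normal(size=d)
        return u / np.linalg.norm(u)

    def zo_grad_1pt(f, x, r=0.1):
        # Single-point estimator: one (possibly noisy) function query.
        u = sphere_sample(x.size)
        return (x.size / r) * f(x + r * u) * u

    def zo_grad_2pt(f, x, r=0.1):
        # Two-point estimator: two queries at mirrored perturbations,
        # which cancels the zeroth-order term and lowers the variance.
        u = sphere_sample(x.size)
        return (x.size / (2 * r)) * (f(x + r * u) - f(x - r * u)) * u

    f = lambda x: 0.5 * float(np.dot(x, x))   # placeholder objective
    x = np.array([1.0, -2.0, 0.5])
    print(zo_grad_1pt(f, x))
    print(zo_grad_2pt(f, x))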