Convex and Non-Convex Optimization under Generalized Smoothness
Classical analysis of convex and non-convex optimization methods often
requires the Lipschitzness of the gradient, which limits the analysis to
functions bounded by quadratics. Recent work relaxed this requirement to a
non-uniform smoothness condition with the Hessian norm bounded by an affine
function of the gradient norm, and proved convergence in the non-convex setting
via gradient clipping, assuming bounded noise. In this paper, we further
generalize this non-uniform smoothness condition and develop a simple, yet
powerful analysis technique that bounds the gradients along the trajectory,
thereby leading to stronger results for both convex and non-convex optimization
problems. In particular, we obtain the classical convergence rates for
(stochastic) gradient descent and Nesterov's accelerated gradient method in the
convex and/or non-convex setting under this general smoothness condition. The
new analysis approach does not require gradient clipping and allows
heavy-tailed noise with bounded variance in the stochastic setting.
Comment: 39 pages
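A minimal numerical sketch of the idea in the abstract above (illustrative only, not code from the paper), assuming the generalized smoothness condition takes the common affine form $\|\nabla^2 f(x)\| \le L_0 + L_1\|\nabla f(x)\|$: the toy function $f(x)=x^4$ has no globally Lipschitz gradient, yet satisfies this bound with the illustrative constants $L_0 = 12$, $L_1 = 3$, and plain gradient descent without clipping still makes steady progress from a bounded starting point.

```python
# Minimal sketch, assuming the generalized smoothness condition is of the form
#   ||f''(x)|| <= L0 + L1 * ||f'(x)||.
# f(x) = x**4 has no globally Lipschitz gradient, yet satisfies the bound
# (checked on a grid below with the illustrative constants L0 = 12, L1 = 3),
# and plain gradient descent with a small constant stepsize -- no clipping --
# steadily shrinks both the objective and the gradient from a bounded start.
import numpy as np

f = lambda x: x**4
grad = lambda x: 4 * x**3
hess = lambda x: 12 * x**2

L0, L1 = 12.0, 3.0
xs = np.linspace(-5, 5, 1001)
assert np.all(hess(xs) <= L0 + L1 * np.abs(grad(xs)))  # affine Hessian/gradient bound on the grid

x, lr = 2.0, 1e-2
for _ in range(200):
    x -= lr * grad(x)
print(f"objective {f(x):.3e}, gradient norm {abs(grad(x)):.3e}")
```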
Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise
We consider the stochastic optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the stochastic gradient's noise is assumed to have a bounded $p$-th moment ($p \in (1, 2]$). Zhang et al. (2020) were the first to prove the $\Omega\big(T^{\frac{1-p}{3p-2}}\big)$ lower bound for convergence (in expectation) and provided a simple clipping algorithm that matches this optimal rate. Cutkosky and Mehta (2021) propose another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O\big(T^{\frac{1-p}{3p-2}}\log(T/\delta)\big)$, where $\delta$ is the probability of failure. However, this desirable guarantee is only established under the additional assumption that the stochastic gradient itself is bounded in $p$-th moment, which fails to hold even for quadratic objectives and centered Gaussian noise.
In this work, we first improve the analysis of the algorithm in Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O\big(T^{\frac{1-p}{3p-2}}\log(T/\delta)\big)$ without the above-mentioned restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the $\Omega\big(T^{\frac{1-p}{3p-2}}\big)$ lower bound with only a tiny bit of structure, i.e., when the objective function is assumed to be of the form $F(x) = \mathbb{E}_{\Xi \sim \mathcal{D}}[f(x, \Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O\big(T^{\frac{1-p}{2p-1}}\log(T/\delta)\big)$ under a mild condition, which is faster than the lower bound. Notably, even when specialized to the finite-variance case ($p = 2$), our result yields the (near-)optimal high-probability rate $O\big(T^{-1/3}\log(T/\delta)\big)$.
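As a concrete, hedged illustration of the gradient-clipping technique this line of work analyzes (not the specific algorithms of the paper above), the sketch below runs clipped SGD on a toy smooth non-convex objective with Student-t(1.5) gradient noise, whose $p$-th moment is finite only for $p < 1.5$; the stepsize schedule, clipping radius, and objective are arbitrary demo choices.

```python
# Illustrative sketch of gradient clipping under heavy-tailed noise, NOT the
# paper's algorithms: clipped SGD on f(x) = sum(log(1 + x_i^2)) with
# Student-t(1.5) gradient noise (finite p-th moment only for p < 1.5).
import numpy as np

rng = np.random.default_rng(0)
d = 10
grad = lambda x: 2 * x / (1 + x**2)            # gradient of f(x) = sum(log(1 + x_i^2))

def clip(g, tau):
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

x, tau = 3 * rng.normal(size=d), 2.0
print("initial gradient norm:", np.linalg.norm(grad(x)))
for t in range(3000):
    noisy_g = grad(x) + rng.standard_t(df=1.5, size=d)   # heavy-tailed stochastic gradient
    x -= (0.1 / np.sqrt(t + 1)) * clip(noisy_g, tau)
print("final gradient norm:  ", np.linalg.norm(grad(x)))
```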
Accelerated Zeroth-order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance
In this paper, we consider non-smooth stochastic convex optimization with two
function evaluations per round under infinite noise variance. In the classical
setting when noise has finite variance, an optimal algorithm, built upon the
batched accelerated gradient method, was proposed in (Gasnikov et al., 2022).
This optimality is defined in terms of iteration and oracle complexity, as well
as the maximal admissible level of adversarial noise. However, the assumption
of finite variance is burdensome and it might not hold in many practical
scenarios. To address this, we demonstrate how to adapt a refined clipped
version of the accelerated gradient (Stochastic Similar Triangles) method from
(Sadiev et al., 2023) for a two-point zero-order oracle. This adaptation
entails extending the batching technique to accommodate infinite variance -- a
non-trivial task that stands as a distinct contribution of this paper.
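For readers unfamiliar with the oracle model used above, the sketch below implements the standard two-point zeroth-order gradient estimator with directions sampled on the unit sphere; the clipping, batching for infinite variance, and Similar-Triangles acceleration of the paper are deliberately omitted, and the smoothing radius and batch size are placeholder values.

```python
# Sketch of the standard two-point zeroth-order gradient estimator only; the
# paper's clipping/batching/acceleration machinery is omitted.
import numpy as np

rng = np.random.default_rng(1)

def two_point_grad(f, x, tau=1e-3, batch=16):
    """Average over directions e of  d * (f(x + tau*e) - f(x - tau*e)) / (2*tau) * e."""
    d, g = x.size, np.zeros_like(x)
    for _ in range(batch):
        e = rng.normal(size=d)
        e /= np.linalg.norm(e)                  # uniform direction on the unit sphere
        g += d * (f(x + tau * e) - f(x - tau * e)) / (2 * tau) * e
    return g / batch

f = lambda x: np.abs(x).sum()                   # non-smooth convex test function
print(two_point_grad(f, np.array([1.0, -2.0, 0.5])))   # roughly sign(x) = [1, -1, 1]
```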
Smoothed Gradient Clipping and Error Feedback for Distributed Optimization under Heavy-Tailed Noise
Motivated by understanding and analysis of large-scale machine learning under
heavy-tailed gradient noise, we study distributed optimization with smoothed
gradient clipping, i.e., in which certain smoothed clipping operators are
applied to the gradients or gradient estimates computed from local clients
prior to further processing. While vanilla gradient clipping has proven
effective in mitigating the impact of heavy-tailed gradient noises in
non-distributed setups, it incurs bias that causes convergence issues in
heterogeneous distributed settings. To address the inherent bias introduced by
gradient clipping, we develop a smoothed clipping operator, and propose a
distributed gradient method equipped with an error feedback mechanism, i.e.,
the clipping operator is applied on the difference between some local gradient
estimator and local stochastic gradient. We establish that, for the first time
in the strongly convex setting with heavy-tailed gradient noises that may not
have finite moments of order greater than one, the proposed distributed
gradient method's mean square error (MSE) converges to zero at a sublinear rate $O(1/t^{\zeta})$, where the exponent $\zeta > 0$ stays bounded away from zero as a function of the problem condition number and the first
absolute moment of the noise. Numerical experiments validate our theoretical
findings.
Comment: 25 pages, 2 figures
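The following is a heavily hedged sketch of one plausible instantiation of "smoothed clipping with error feedback", not the operator or method actually analyzed in the paper above: each client keeps a gradient estimator $h_i$, only the difference between a fresh heavy-tailed stochastic gradient and $h_i$ is passed through a smooth clipping map $\phi(v) = v/(1+\|v\|/\lambda)$, and the server averages the estimators. The least-squares data, constant stepsize, and $\lambda$ are illustrative assumptions; the paper's analysis uses decaying stepsizes to drive the MSE to zero.

```python
# Heavily hedged sketch of a "smoothed clipping + error feedback" update, NOT
# the exact operator or method of the paper. phi(v) = v / (1 + ||v||/lam)
# leaves small vectors almost untouched and saturates at norm lam.
import numpy as np

rng = np.random.default_rng(2)
d, n_clients, m, lam, lr = 5, 4, 20, 1.0, 0.05
x_true = rng.normal(size=d)
A = [rng.normal(size=(m, d)) for _ in range(n_clients)]
b = [A[i] @ x_true + 0.1 * rng.normal(size=m) for i in range(n_clients)]

def local_grad(i, x):
    g = A[i].T @ (A[i] @ x - b[i]) / m
    return g + rng.standard_t(df=1.2, size=d)        # Student-t(1.2): finite moments only below order 1.2

def smooth_clip(v, lam):
    return v / (1.0 + np.linalg.norm(v) / lam)

x_star = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
x = np.zeros(d)
h = [np.zeros(d) for _ in range(n_clients)]          # per-client error-feedback state
print("initial distance to x*:", np.linalg.norm(x - x_star))
for _ in range(2000):
    for i in range(n_clients):
        h[i] = h[i] + smooth_clip(local_grad(i, x) - h[i], lam)   # clip the difference, not the gradient
    x -= lr * np.mean(h, axis=0)
print("final distance to x*:  ", np.linalg.norm(x - x_star))
```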
Near-Optimal High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise
Thanks to their practical efficiency and the random nature of the data,
stochastic first-order methods are standard for training large-scale machine
learning models. Random behavior may cause a particular run of an algorithm to
result in a highly suboptimal objective value, whereas theoretical guarantees
are usually proved for the expectation of the objective value. Thus, it is
essential to theoretically guarantee that algorithms provide small objective
residual with high probability. Existing methods for non-smooth stochastic convex optimization have complexity bounds whose dependence on the confidence level is either negative-power, or logarithmic but only under an additional assumption of a sub-Gaussian (light-tailed) noise distribution that may not hold in practice, e.g., in several NLP tasks. In our paper, we resolve
this issue and derive the first high-probability convergence results with
logarithmic dependence on the confidence level for non-smooth convex stochastic
optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our
results, we propose novel stepsize rules for two stochastic methods with
gradient clipping. Moreover, our analysis works for generalized smooth
objectives with H\"older-continuous gradients, and for both methods, we provide
an extension for strongly convex problems. Finally, our results imply that the
first (accelerated) method we consider also has optimal iteration and oracle
complexity in all the regimes, and the second one is optimal in the non-smooth
setting.
Comment: 53 pages, 5 figures. arXiv admin note: text overlap with arXiv:2005.1078
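To make the distinction between in-expectation and high-probability guarantees concrete, the sketch below runs a generic clipped stochastic subgradient method (not the paper's clipped methods or stepsize rules) on $f(x)=\|x\|_1$ under heavy-tailed noise over many seeds and reports both the mean and the 95th percentile of the objective residual; a bound that holds only in expectation controls the former but not the latter.

```python
# Illustration of the high-probability performance metric, not of the paper's
# algorithms: many independent runs of a clipped stochastic subgradient method.
import numpy as np

def run(seed, T=2000, tau=1.0, d=20):
    rng = np.random.default_rng(seed)
    x = np.ones(d)
    for t in range(T):
        g = np.sign(x) + rng.standard_t(df=1.5, size=d)   # noisy subgradient of ||x||_1
        norm = np.linalg.norm(g)
        if norm > tau:
            g = g * (tau / norm)                          # gradient clipping
        x -= (0.5 / np.sqrt(t + 1)) * g
    return np.abs(x).sum()                                # objective residual, since min ||x||_1 = 0

residuals = np.array([run(s) for s in range(200)])
print("mean residual:           ", residuals.mean())
print("95th-percentile residual:", np.quantile(residuals, 0.95))
```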
Private Stochastic Optimization With Large Worst-Case Lipschitz Parameter: Optimal Rates for (Non-Smooth) Convex Losses and Extension to Non-Convex Losses
We study differentially private (DP) stochastic optimization (SO) with loss
functions whose worst-case Lipschitz parameter over all data points may be
extremely large. To date, the vast majority of work on DP SO assumes that the
loss is uniformly Lipschitz continuous over data (i.e. stochastic gradients are
uniformly bounded over all data points). While this assumption is convenient,
it often leads to pessimistic excess risk bounds. In many practical problems,
the worst-case (uniform) Lipschitz parameter of the loss over all data points
may be extremely large due to outliers. In such cases, the error bounds for DP
SO, which scale with the worst-case Lipschitz parameter of the loss, are
vacuous. To address these limitations, this work provides near-optimal excess
risk bounds that do not depend on the uniform Lipschitz parameter of the loss.
Building on a recent line of work (Wang et al., 2020; Kamath et al., 2022), we
assume that stochastic gradients have bounded $k$-th order moments for some $k \ge 2$. Compared with works on uniformly Lipschitz DP SO, our excess risk scales with the $k$-th moment bound instead of the uniform Lipschitz parameter
of the loss, allowing for significantly faster rates in the presence of
outliers and/or heavy-tailed data. For convex and strongly convex loss
functions, we provide the first asymptotically optimal excess risk bounds (up
to a logarithmic factor). In contrast to (Wang et al., 2020; Kamath et al.,
2022), our bounds do not require the loss function to be differentiable/smooth.
We also devise a linear-time algorithm for smooth losses that has excess risk
that is tight in certain practical parameter regimes. Additionally, our work is
the first to address non-convex non-uniformly Lipschitz loss functions
satisfying the Proximal-PL inequality; this covers some practical machine
learning models. Our Proximal-PL algorithm has near-optimal excess risk.
Comment: Appeared in the International Conference on Algorithmic Learning Theory (ALT) 2023. This version improves the runtime bound in Theorem
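The sketch below shows a generic DP-SGD-style update (not the algorithms or rates of the paper above): per-example gradients are clipped to a norm bound $C$, averaged, and perturbed with Gaussian noise scaled to the clipping bound, so that a few outlier examples with huge gradients cannot dominate the update. The linear-regression data, $C$, noise multiplier, and stepsize are all illustrative assumptions.

```python
# Generic DP-SGD-style sketch, NOT the paper's algorithms: heavy-tailed
# features mimic a large worst-case Lipschitz parameter.
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 5
X = rng.standard_t(df=2.1, size=(n, d))          # heavy-tailed features
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def private_step(w, idx, C=1.0, sigma=1.0, lr=0.1):
    grads = [(x_i @ w - y_i) * x_i for x_i, y_i in zip(X[idx], y[idx])]   # per-example grads of 0.5*(x.w - y)^2
    clipped = [g * min(1.0, C / (np.linalg.norm(g) + 1e-12)) for g in grads]
    noise = sigma * C * rng.normal(size=d)       # Gaussian noise calibrated to the clipping bound C
    return w - lr * (np.sum(clipped, axis=0) + noise) / len(idx)

w = np.zeros(d)
for _ in range(300):
    w = private_step(w, rng.choice(n, size=50, replace=False))
print("||w - w_true|| =", np.linalg.norm(w - w_true), "   ||w_true|| =", np.linalg.norm(w_true))
```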
Efficient Private SCO for Heavy-Tailed Data via Clipping
We consider stochastic convex optimization for heavy-tailed data with the
guarantee of being differentially private (DP). Prior work on this problem is
restricted to the gradient descent (GD) method, which is inefficient for
large-scale problems. In this paper, we resolve this issue and derive the first
high-probability bounds for the private stochastic method with clipping. For
general convex problems, we derive excess population risks $\tilde{O}\left(\frac{d^{1/7}\sqrt{\ln\frac{(n\epsilon)^2}{\beta d}}}{(n\epsilon)^{2/7}}\right)$ and $\tilde{O}\left(\frac{d^{1/7}\ln\frac{(n\epsilon)^2}{\beta d}}{(n\epsilon)^{2/7}}\right)$ under the bounded or unbounded domain assumption, respectively (here $n$ is the sample size, $d$ is the dimension of the data, $\beta$ is the confidence level, and $\epsilon$ is the privacy level). Then, we extend our analysis to the strongly convex case and the non-smooth case (which works for generalized smooth objectives with H\"older-continuous
gradients). We establish new excess risk bounds without bounded domain
assumption. The results above achieve lower excess risks and gradient
complexities than existing methods in their corresponding cases. Numerical
experiments are conducted to justify the theoretical improvement.
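As a hedged illustration of the basic primitive behind private estimation with clipping (not the paper's algorithm or its risk bounds): clipping each heavy-tailed sample to norm $C$ bounds the $\ell_2$-sensitivity of the sample mean by $2C/n$, after which the standard Gaussian mechanism yields an $(\epsilon,\delta)$-DP estimate. All constants below are demo values.

```python
# Hedged illustration of the core clipping primitive for private estimation on
# heavy-tailed data; not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(4)
n, d, C, eps, delta = 2000, 3, 5.0, 1.0, 1e-5
data = rng.standard_t(df=1.8, size=(n, d))                 # heavy-tailed samples with true mean 0

clipped = data * np.minimum(1.0, C / np.linalg.norm(data, axis=1, keepdims=True))
sensitivity = 2 * C / n                                    # replace-one-sample L2 sensitivity of the clipped mean
sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps   # classical Gaussian-mechanism calibration
private_mean = clipped.mean(axis=0) + sigma * rng.normal(size=d)

print("clipped (non-private) mean:", clipped.mean(axis=0))
print("private mean estimate:     ", private_mean)
```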
Stochastic Nonsmooth Convex Optimization with Heavy-Tailed Noises
Recently, several studies have considered the stochastic optimization problem in a heavy-tailed noise regime, i.e., the difference between the stochastic gradient and the true gradient is assumed to have a finite $p$-th moment (say, upper bounded by $\sigma^{p}$ for some $\sigma \ge 0$), where $p \in (1, 2]$. This assumption not only generalizes the traditional finite-variance assumption ($p = 2$) but has also been observed to hold in practice for several different tasks. Under this challenging assumption, much progress has been made for both convex and nonconvex problems; however, most existing results consider only smooth objectives. In contrast, the problem is not yet fully explored or well understood when the functions are nonsmooth. This paper aims to fill this crucial gap by providing
a comprehensive analysis of stochastic nonsmooth convex optimization with
heavy-tailed noises. We revisit a simple clipping-based algorithm which, so far, has only been proved to converge in expectation, and only under an additional strong
convexity assumption. Under appropriate choices of parameters, for both convex
and strongly convex functions, we not only establish the first high-probability
rates but also give refined in-expectation bounds compared with existing works.
Remarkably, all of our results are optimal (or nearly optimal up to logarithmic
factors) with respect to the time horizon $T$ even when $T$ is unknown in
advance. Additionally, we show how to make the algorithm parameter-free with
respect to $\sigma$; in other words, the algorithm can still guarantee convergence without any prior knowledge of $\sigma$.
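A quick numerical illustration of the heavy-tailed moment assumption itself (not of the paper's algorithm): for Student-t noise with 1.5 degrees of freedom, the $p$-th moment is finite only for $p<1.5$, so an empirical estimate of $\mathbb{E}|\xi|^{1.2}$ stabilizes as the sample grows, while the empirical second moment, the classical finite-variance quantity, keeps drifting upward (up to random fluctuation).

```python
# Empirical moments of Student-t(1.5) noise on growing prefixes of one sample:
# the 1.2-th moment estimate settles down, the second-moment estimate does not.
import numpy as np

rng = np.random.default_rng(5)
noise = np.abs(rng.standard_t(df=1.5, size=1_000_000))
for m in (10_000, 100_000, 1_000_000):
    chunk = noise[:m]
    print(f"m={m:>9}:  E|xi|^1.2 ~ {np.mean(chunk ** 1.2):8.2f}   E|xi|^2 ~ {np.mean(chunk ** 2):12.2f}")
```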
- …