
    Convex and Non-Convex Optimization under Generalized Smoothness

    Classical analysis of convex and non-convex optimization methods often requires Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
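
    For reference, the affine Hessian-bound condition this abstract builds on (often called $(L_0, L_1)$-smoothness in the gradient-clipping literature) can be sketched as follows; the notation is standard rather than quoted from the paper:

```latex
% (L_0, L_1)-smoothness: the Hessian norm is bounded by an affine
% function of the gradient norm,
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x;
% setting L_1 = 0 recovers classical L-smoothness (gradient Lipschitzness).
```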

    Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise

    We consider the stochastic optimization problem with smooth but not necessarily convex objectives in the heavy-tailed noise regime, where the stochastic gradient's noise is assumed to have bounded $p$-th moment ($p\in(1,2]$). Zhang et al. (2020) were the first to prove the $\Omega(T^{\frac{1-p}{3p-2}})$ lower bound for convergence (in expectation) and provide a simple clipping algorithm that matches this optimal rate. Cutkosky and Mehta (2021) propose another algorithm, which is shown to achieve the nearly optimal high-probability convergence guarantee $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$, where $\delta$ is the probability of failure. However, this desirable guarantee is only established under the additional assumption that the stochastic gradient itself is bounded in $p$-th moment, which fails to hold even for quadratic objectives and centered Gaussian noise. In this work, we first improve the analysis of the algorithm in Cutkosky and Mehta (2021) to obtain the same nearly optimal high-probability convergence rate $O(\log(T/\delta)T^{\frac{1-p}{3p-2}})$ without the above-mentioned restrictive assumption. Next, and curiously, we show that one can achieve a faster rate than that dictated by the lower bound $\Omega(T^{\frac{1-p}{3p-2}})$ with only a tiny bit of structure, i.e., when the objective function $F(x)$ is assumed to be of the form $\mathbb{E}_{\Xi\sim\mathcal{D}}[f(x,\Xi)]$, arguably the most widely applicable class of stochastic optimization problems. For this class of problems, we propose the first variance-reduced accelerated algorithm and establish that it guarantees a high-probability convergence rate of $O(\log(T/\delta)T^{\frac{1-p}{2p-1}})$ under a mild condition, which is faster than $\Omega(T^{\frac{1-p}{3p-2}})$. Notably, even when specialized to the finite-variance case, our result yields the (near-)optimal high-probability rate $O(\log(T/\delta)T^{-1/3})$.
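
    As a concrete reference point, the clipping mechanism behind the matching upper bounds above is, at its core, SGD applied to a norm-clipped stochastic gradient. A minimal sketch, illustrative only (the threshold `clip_level` and the fixed stepsize are hypothetical choices, not the papers' tuned schedules):

```python
import numpy as np

def clip(g, clip_level):
    """Scale g down so its norm is at most clip_level."""
    norm = np.linalg.norm(g)
    return g if norm <= clip_level else (clip_level / norm) * g

def clipped_sgd(x0, stochastic_grad, steps, lr=0.01, clip_level=1.0):
    """SGD with norm clipping: robust to heavy-tailed gradient noise,
    since a single extreme noise draw moves the iterate by at most
    lr * clip_level."""
    x = x0.copy()
    for _ in range(steps):
        g = stochastic_grad(x)  # heavy-tailed estimate of the true gradient
        x = x - lr * clip(g, clip_level)
    return x
```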

    Accelerated Zeroth-order Method for Non-Smooth Stochastic Convex Optimization Problem with Infinite Variance

    In this paper, we consider non-smooth stochastic convex optimization with two function evaluations per round under infinite noise variance. In the classical setting, when the noise has finite variance, an optimal algorithm built upon the batched accelerated gradient method was proposed in (Gasnikov et al., 2022). This optimality is defined in terms of iteration and oracle complexity, as well as the maximal admissible level of adversarial noise. However, the assumption of finite variance is burdensome and might not hold in many practical scenarios. To address this, we demonstrate how to adapt a refined clipped version of the accelerated gradient (Stochastic Similar Triangles) method from (Sadiev et al., 2023) to a two-point zeroth-order oracle. This adaptation entails extending the batching technique to accommodate infinite variance -- a non-trivial task that stands as a distinct contribution of this paper.
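
    For orientation, a two-point zeroth-order oracle estimates the gradient from two function evaluations along a random direction; the clipped accelerated method above is built on estimators of roughly this shape. A minimal sketch under generic assumptions (the smoothing radius `tau` and the spherical directions are illustrative choices, not the paper's exact scheme):

```python
import numpy as np

def two_point_grad_estimate(f, x, tau=1e-4, rng=None):
    """Estimate grad f(x) from two evaluations f(x + tau*e) and
    f(x - tau*e) along a random unit direction e (finite differences).
    In expectation over e, this recovers (a smoothed) gradient of f."""
    rng = rng or np.random.default_rng()
    e = rng.standard_normal(x.shape)
    e /= np.linalg.norm(e)                    # random unit direction
    directional = (f(x + tau * e) - f(x - tau * e)) / (2 * tau)
    return x.size * directional * e           # d * <grad, e> * e on average
```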

    Smoothed Gradient Clipping and Error Feedback for Distributed Optimization under Heavy-Tailed Noise

    Motivated by the understanding and analysis of large-scale machine learning under heavy-tailed gradient noise, we study distributed optimization with smoothed gradient clipping, i.e., in which certain smoothed clipping operators are applied to the gradients or gradient estimates computed from local clients prior to further processing. While vanilla gradient clipping has proven effective in mitigating the impact of heavy-tailed gradient noise in non-distributed setups, it incurs bias that causes convergence issues in heterogeneous distributed settings. To address the inherent bias introduced by gradient clipping, we develop a smoothed clipping operator and propose a distributed gradient method equipped with an error-feedback mechanism, i.e., the clipping operator is applied to the difference between some local gradient estimator and the local stochastic gradient. We establish that, for the first time in the strongly convex setting with heavy-tailed gradient noise that may not have finite moments of order greater than one, the proposed distributed gradient method's mean square error (MSE) converges to zero at a rate $O(1/t^\iota)$, $\iota \in (0, 0.4)$, where the exponent $\iota$ stays bounded away from zero as a function of the problem condition number and the first absolute moment of the noise. Numerical experiments validate our theoretical findings.
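
    A rough sketch of the error-feedback idea described above: each client clips the difference between its running gradient estimator and its fresh stochastic gradient, rather than the gradient itself. The smoothed clipping operator below ($g \mapsto g/(1+\|g\|/\lambda)$) is an illustrative stand-in, not necessarily the paper's operator:

```python
import numpy as np

def smooth_clip(g, lam):
    """A smooth surrogate for hard norm clipping (illustrative choice):
    near-identity for small g, norm capped near lam for large g."""
    return g / (1.0 + np.linalg.norm(g) / lam)

def client_update(h, stochastic_grad, x, lam):
    """Error-feedback step: clip the *innovation* (fresh stochastic
    gradient minus the running estimator h), not the raw gradient.
    This avoids the persistent bias that vanilla clipping incurs in
    heterogeneous distributed settings."""
    g = stochastic_grad(x)
    h = h + smooth_clip(g - h, lam)   # h tracks the local gradient
    return h
```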

    Near-Optimal High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise

    Thanks to their practical efficiency and the random nature of the data, stochastic first-order methods are standard for training large-scale machine learning models. Random behavior may cause a particular run of an algorithm to result in a highly suboptimal objective value, whereas theoretical guarantees are usually proved for the expectation of the objective value. Thus, it is essential to theoretically guarantee that algorithms provide a small objective residual with high probability. Existing methods for non-smooth stochastic convex optimization have complexity bounds whose dependence on the confidence level is either negative-power, or logarithmic but only under an additional assumption of a sub-Gaussian (light-tailed) noise distribution that may not hold in practice, e.g., in several NLP tasks. In our paper, we resolve this issue and derive the first high-probability convergence results with logarithmic dependence on the confidence level for non-smooth convex stochastic optimization problems with non-sub-Gaussian (heavy-tailed) noise. To derive our results, we propose novel stepsize rules for two stochastic methods with gradient clipping. Moreover, our analysis works for generalized smooth objectives with Hölder-continuous gradients, and for both methods we provide an extension for strongly convex problems. Finally, our results imply that the first (accelerated) method we consider also has optimal iteration and oracle complexity in all regimes, and the second one is optimal in the non-smooth setting.
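
    For reference, the Hölder-continuity condition on the gradient that this analysis accommodates can be stated as follows (standard notation, not quoted from the paper):

```latex
% Hölder-continuous gradient with exponent \nu \in [0, 1]:
\|\nabla f(x) - \nabla f(y)\| \le L_\nu \|x - y\|^{\nu} \quad \text{for all } x, y;
% \nu = 1 recovers L-smoothness, while \nu = 0 covers non-smooth
% objectives with bounded (sub)gradient variation.
```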

    Private Stochastic Optimization With Large Worst-Case Lipschitz Parameter: Optimal Rates for (Non-Smooth) Convex Losses and Extension to Non-Convex Losses

    We study differentially private (DP) stochastic optimization (SO) with loss functions whose worst-case Lipschitz parameter over all data points may be extremely large. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz continuous over data (i.e., stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it often leads to pessimistic excess risk bounds. In many practical problems, the worst-case (uniform) Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work provides near-optimal excess risk bounds that do not depend on the uniform Lipschitz parameter of the loss. Building on a recent line of work (Wang et al., 2020; Kamath et al., 2022), we assume that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on uniformly Lipschitz DP SO, our excess risk scales with the $k$-th moment bound instead of the uniform Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers and/or heavy-tailed data. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to (Wang et al., 2020; Kamath et al., 2022), our bounds do not require the loss function to be differentiable/smooth. We also devise a linear-time algorithm for smooth losses that has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex non-uniformly Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk.
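
    Schematically, the heavy-tailed assumption in question replaces uniform Lipschitzness with a moment bound on the stochastic gradients (notation ours, not the paper's):

```latex
% Bounded k-th moment of stochastic gradients (k >= 2), in place of a
% uniform bound sup_{x,z} \|\nabla f(x, z)\| \le L:
\mathbb{E}_{z \sim \mathcal{D}}\, \|\nabla f(x, z)\|^{k} \le \gamma^{k}
\quad \text{for all } x,
% so excess-risk bounds scale with \gamma rather than the worst-case
% Lipschitz constant over all data points.
```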

    Efficient Private SCO for Heavy-Tailed Data via Clipping

    We consider stochastic convex optimization for heavy-tailed data with the guarantee of being differentially private (DP). Prior work on this problem is restricted to the gradient descent (GD) method, which is inefficient for large-scale problems. In this paper, we resolve this issue and derive the first high-probability bounds for the private stochastic method with clipping. For general convex problems, we derive excess population risks $\tilde{O}\left(\frac{d^{1/7}\sqrt{\ln\frac{(n\epsilon)^2}{\beta d}}}{(n\epsilon)^{2/7}}\right)$ and $\tilde{O}\left(\frac{d^{1/7}\ln\frac{(n\epsilon)^2}{\beta d}}{(n\epsilon)^{2/7}}\right)$ under the bounded or unbounded domain assumption, respectively (here $n$ is the sample size, $d$ is the dimension of the data, $\beta$ is the confidence level, and $\epsilon$ is the privacy level). Then, we extend our analysis to the strongly convex case and the non-smooth case (which works for generalized smooth objectives with Hölder-continuous gradients). We establish new excess risk bounds without the bounded domain assumption. The results above achieve lower excess risks and gradient complexities than existing methods in their corresponding cases. Numerical experiments are conducted to justify the theoretical improvement.
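
    Clipping serves double duty in this setting: it tames heavy-tailed gradients and bounds the sensitivity needed for the Gaussian mechanism. A minimal sketch of a DP clipped-gradient step, under stated assumptions (the noise multiplier `sigma_dp` is a hypothetical placeholder; a real deployment must calibrate it with a privacy accountant):

```python
import numpy as np

def dp_clipped_step(x, per_sample_grads, lr, clip_level, sigma_dp, rng):
    """One DP-SGD-style step: clip each per-sample gradient to norm
    clip_level (bounding sensitivity), average, then add Gaussian noise
    scaled to clip_level. sigma_dp must come from a privacy accountant."""
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g if norm <= clip_level else (clip_level / norm) * g)
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma_dp * clip_level / len(clipped), size=x.shape)
    return x - lr * (avg + noise)
```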

    Stochastic Nonsmooth Convex Optimization with Heavy-Tailed Noises

    Recently, several studies have considered the stochastic optimization problem in a heavy-tailed noise regime, i.e., the difference between the stochastic gradient and the true gradient is assumed to have a finite $p$-th moment (say, upper bounded by $\sigma^{p}$ for some $\sigma\geq 0$), where $p\in(1,2]$, which not only generalizes the traditional finite-variance assumption ($p=2$) but has also been observed in practice for several different tasks. Under this challenging assumption, much new progress has been made for both convex and nonconvex problems; however, most of it considers only smooth objectives. In contrast, the problem has not been fully explored or well understood when functions are nonsmooth. This paper aims to fill this crucial gap by providing a comprehensive analysis of stochastic nonsmooth convex optimization with heavy-tailed noises. We revisit a simple clipping-based algorithm, which so far has only been proved to converge in expectation, and only under an additional strong convexity assumption. Under appropriate choices of parameters, for both convex and strongly convex functions, we not only establish the first high-probability rates but also give refined in-expectation bounds compared with existing works. Remarkably, all of our results are optimal (or nearly optimal up to logarithmic factors) with respect to the time horizon $T$, even when $T$ is unknown in advance. Additionally, we show how to make the algorithm parameter-free with respect to $\sigma$; in other words, the algorithm can still guarantee convergence without any prior knowledge of $\sigma$.
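
    For concreteness, the heavy-tailed noise assumption stated in this abstract, together with the standard norm-clipping operator such clipping-based algorithms apply, can be written as:

```latex
% Heavy-tailed noise: bounded p-th central moment of the stochastic
% gradient g(x), for p in (1, 2]:
\mathbb{E}\,\|g(x) - \nabla f(x)\|^{p} \le \sigma^{p}, \qquad p \in (1, 2],
% and the norm-clipping operator with threshold \lambda > 0:
\mathrm{clip}(g, \lambda) = \min\!\left\{1, \frac{\lambda}{\|g\|}\right\} g.
```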