289 research outputs found

    Accelerate Stochastic Subgradient Method by Leveraging Local Growth Condition

    In this paper, a new theory is developed for first-order stochastic convex optimization, showing that the global convergence rate is sufficiently quantified by a local growth rate of the objective function in a neighborhood of the optimal solutions. In particular, if the objective function $F(\mathbf w)$ in the $\epsilon$-sublevel set grows as fast as $\|\mathbf w - \mathbf w_*\|_2^{1/\theta}$, where $\mathbf w_*$ represents the closest optimal solution to $\mathbf w$ and $\theta\in(0,1]$ quantifies the local growth rate, the iteration complexity of first-order stochastic optimization for achieving an $\epsilon$-optimal solution can be $\widetilde O(1/\epsilon^{2(1-\theta)})$, which is optimal at most up to a logarithmic factor. To achieve the faster global convergence, we develop two different accelerated stochastic subgradient methods by iteratively solving the original problem approximately in a local region around a historical solution with the size of the local region gradually decreasing as the solution approaches the optimal set. Besides the theoretical improvements, this work also includes new contributions towards making the proposed algorithms practical: (i) we present practical variants of accelerated stochastic subgradient methods that can run without the knowledge of multiplicative growth constant and even the growth rate $\theta$; (ii) we consider a broad family of problems in machine learning to demonstrate that the proposed algorithms enjoy faster convergence than traditional stochastic subgradient method. We also characterize the complexity of the proposed algorithms for ensuring the gradient is small without the smoothness assumption.
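
To make the stated rate concrete, a growth condition of this kind is commonly written as below. The constant $c > 0$ and the symbol $F_*$ for the optimal value are notation assumed here for illustration; the abstract itself only gives the exponent $1/\theta$ and the rate $\widetilde O(1/\epsilon^{2(1-\theta)})$. The three cases simply substitute values of $\theta$ into that rate.

```latex
\[
  F(\mathbf w) - F_* \;\ge\; c\,\|\mathbf w - \mathbf w_*\|_2^{1/\theta},
  \qquad \theta \in (0, 1].
\]
% Substituting into the iteration complexity \widetilde O(1/\epsilon^{2(1-\theta)}):
\[
  \theta = 1 \;\Rightarrow\; \widetilde O(1), \qquad
  \theta = \tfrac12 \;\Rightarrow\; \widetilde O(1/\epsilon), \qquad
  \theta \to 0 \;\Rightarrow\; \text{the exponent } 2(1-\theta) \to 2 .
\]
```

Under this reading, $\theta = 1$ corresponds to sharp (linear) growth and $\theta = 1/2$ to quadratic growth, while small $\theta$ approaches the classical $1/\epsilon^2$ complexity of the stochastic subgradient method.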

    Fast Rates of ERM and Stochastic Approximation: Adaptive to Error Bound Conditions

    Error bound conditions (EBC) are properties that characterize the growth of an objective function when a point is moved away from the optimal set. They have recently received increasing attention in the field of optimization for developing optimization algorithms with fast convergence. However, the studies of EBC in statistical learning are hitherto still limited. The main contributions of this paper are two-fold. First, we develop fast and intermediate rates of empirical risk minimization (ERM) under EBC for risk minimization with Lipschitz continuous, and smooth convex random functions. Second, we establish fast and intermediate rates of an efficient stochastic approximation (SA) algorithm for risk minimization with Lipschitz continuous random functions, which requires only one pass of $n$ samples and adapts to EBC. For both approaches, the convergence rates span a full spectrum between $\widetilde O(1/\sqrt{n})$ and $\widetilde O(1/n)$ depending on the power constant in EBC, and could be even faster than $O(1/n)$ in special cases for ERM. Moreover, these convergence rates are automatically adaptive without using any knowledge of EBC. Overall, this work not only strengthens the understanding of ERM for statistical learning but also brings new fast stochastic algorithms for solving a broad range of statistical learning problems.

    Faster Subgradient Methods for Functions with Hölderian Growth

    The purpose of this manuscript is to derive new convergence results for several subgradient methods applied to minimizing nonsmooth convex functions with Hölderian growth. The growth condition is satisfied in many applications and includes functions with quadratic growth and weakly sharp minima as special cases. To this end there are three main contributions. First, for a constant and sufficiently small stepsize, we show that the subgradient method achieves linear convergence up to a certain region including the optimal set, with error of the order of the stepsize. Second, if appropriate problem parameters are known, we derive a decaying stepsize which obtains a much faster convergence rate than is suggested by the classical $O(1/\sqrt{k})$ result for the subgradient method. Third, we develop a novel "descending stairs" stepsize which obtains this faster convergence rate and also obtains linear convergence for the special case of weakly sharp functions. We also develop an adaptive variant of the "descending stairs" stepsize which achieves the same convergence rate without requiring an error bound constant, which is difficult to estimate in practice. Comment: 50 pages. First revised version (under submission to Math Programming)

    Learn-and-Adapt Stochastic Dual Gradients for Network Resource Allocation

    Network resource allocation shows revived popularity in the era of data deluge and information explosion. Existing stochastic optimization approaches fall short in attaining a desirable cost-delay tradeoff. Recognizing the central role of Lagrange multipliers in network resource allocation, a novel learn-and-adapt stochastic dual gradient (LA-SDG) method is developed in this paper to learn the sample-optimal Lagrange multiplier from historical data, and accordingly adapt the upcoming resource allocation strategy. Remarkably, LA-SDG requires just one extra sample (gradient) evaluation relative to the celebrated stochastic dual gradient (SDG) method. LA-SDG can be interpreted as a foresighted learning scheme with an eye on the future, or as a modified heavy-ball iteration from an optimization viewpoint. It is established, both theoretically and empirically, that LA-SDG markedly improves the cost-delay tradeoff over state-of-the-art allocation schemes.

    Stochastic algorithms with geometric step decay converge linearly on sharp functions

    Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called geometric step decay and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of sharp convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp nonconvex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks, phase retrieval and blind deconvolution, and match the best known guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.
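
The schedule described above (halve the step size after a fixed block of iterations) can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the paper's algorithm or setting: the objective is the convex sharp function f(x) = ||x||_1 rather than a nonconvex recovery problem, and the names `noisy_subgradient` and `geometric_step_decay`, the noise level, the block length, and the halving factor are all hypothetical choices made for illustration.

```python
import random


def noisy_subgradient(x, rng, noise=0.1):
    """Subgradient of f(x) = sum(|x_i|) plus zero-mean Gaussian noise (assumed noise model)."""
    return [
        (1.0 if xi > 0 else -1.0 if xi < 0 else 0.0) + rng.gauss(0.0, noise)
        for xi in x
    ]


def geometric_step_decay(x0, step0=1.0, epochs=20, iters_per_epoch=200, seed=0):
    """Stochastic subgradient method whose step size is halved after every epoch."""
    rng = random.Random(seed)
    x = list(x0)
    step = step0
    for _ in range(epochs):
        for _ in range(iters_per_epoch):
            g = noisy_subgradient(x, rng)
            x = [xi - step * gi for xi, gi in zip(x, g)]
        step *= 0.5  # geometric decay: constant within an epoch, halved between epochs
    return x


def l1_norm(x):
    return sum(abs(xi) for xi in x)


x0 = [5.0, -3.0, 2.0]
x_final = geometric_step_decay(x0)
print("initial objective:", l1_norm(x0))
print("final objective:  ", l1_norm(x_final))
```

With a constant step the iterates would only reach a neighborhood of the minimizer whose radius is proportional to the step; shrinking the step geometrically shrinks that neighborhood at the same rate, which is the intuition behind the linear convergence claimed for sharp functions.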

    SUCAG: Stochastic Unbiased Curvature-aided Gradient Method for Distributed Optimization

    We propose and analyze a new stochastic gradient method, which we call Stochastic Unbiased Curvature-aided Gradient (SUCAG), for finite sum optimization problems. SUCAG constitutes an unbiased total gradient tracking technique that uses Hessian information to accelerate convergence. We analyze our method under the general asynchronous model of computation, in which each function is selected infinitely often with possibly unbounded (but sublinear) delay. For strongly convex problems, we establish linear convergence for the SUCAG method. When the initialization point is sufficiently close to the optimal solution, the established convergence rate is only dependent on the condition number of the problem, making it strictly faster than the known rate for the SAGA method. Furthermore, we describe a Markov-driven approach of implementing the SUCAG method in a distributed asynchronous multi-agent setting, via gossiping along a random walk on an undirected communication graph. We show that our analysis applies as long as the graph is connected and, notably, establishes an asymptotic linear convergence rate that is robust to the graph topology. Numerical results demonstrate the merits of our algorithm over existing methods. Comment: to appear in CDC 2018, 17 pages, 2 figures

    Convergence Rate of Distributed Optimization Algorithms Based on Gradient Tracking

    We study distributed, strongly convex and nonconvex, multiagent optimization over (directed, time-varying) graphs. We consider the minimization of the sum of a smooth (possibly nonconvex) function, the agent's sum-utility, plus a nonsmooth convex one, subject to convex constraints. In a companion paper, we introduced SONATA, the first algorithmic framework applicable to such a general class of composite minimization, and we studied its convergence when the smooth part of the objective function is nonconvex. The algorithm combines successive convex approximation techniques with a perturbed push-sum consensus mechanism that aims to track locally the gradient of the (smooth part of the) sum-utility. This paper studies the convergence rate of SONATA. When the smooth part of the objective function is strongly convex, SONATA is proved to converge at a linear rate, whereas a sublinear rate is proved when the objective function is nonconvex. To our knowledge, this is the first work proving a convergence rate (in particular, linear rate) for distributed algorithms applicable to such a general class of composite, constrained optimization problems over graphs.

    Distributed Stochastic Multi-Task Learning with Graph Regularization

    We propose methods for distributed graph-based multi-task learning that are based on weighted averaging of messages from other machines. Uniform averaging or diminishing stepsize in these methods would yield consensus (single task) learning. We show how simply skewing the averaging weights or controlling the stepsize allows learning different, but related, tasks on the different machines.

    Understanding the Learned Iterative Soft Thresholding Algorithm with matrix factorization

    Sparse coding is a core building block in many data analysis and machine learning pipelines. Typically it is solved by relying on generic optimization techniques, such as the Iterative Soft Thresholding Algorithm and its accelerated version (ISTA, FISTA). These methods are optimal in the class of first-order methods for non-smooth, convex functions. However, they do not exploit the particular structure of the problem at hand nor the input data distribution. An acceleration using neural networks, coined LISTA, was proposed in Gregor and Le Cun (2010), which showed empirically that one could achieve high quality estimates with few iterations by modifying the parameters of the proximal splitting appropriately. In this paper we study the reasons for such acceleration. Our mathematical analysis reveals that it is related to a specific matrix factorization of the Gram kernel of the dictionary, which attempts to nearly diagonalise the kernel with a basis that produces a small perturbation of the $\ell_1$ ball. When this factorization succeeds, we prove that the resulting splitting algorithm enjoys an improved convergence bound with respect to the non-adaptive version. Moreover, our analysis also shows that conditions for acceleration occur mostly at the beginning of the iterative process, consistent with numerical experiments. We further validate our analysis by showing that on dictionaries where this factorization does not exist, adaptive acceleration fails. Comment: Ongoing work - This document is not complete and might contain errors. arXiv admin note: text overlap with arXiv:1609.0028
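
For readers unfamiliar with the baseline that LISTA modifies, the following is a minimal sketch of plain (non-learned) ISTA for the problem 0.5 * ||D z - y||^2 + lam * ||z||_1. It is the standard textbook iteration, not the factorization-based variant analysed in the paper; the function names, the list-of-rows representation of the dictionary `D`, and the tiny identity-dictionary example are assumptions made for illustration. The step size is assumed to be at most 1/L, where L is the largest eigenvalue of D^T D.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink each entry towards zero by tau."""
    return [max(abs(vi) - tau, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]


def ista(D, y, lam, step, iters=100):
    """Plain ISTA for 0.5 * ||D z - y||^2 + lam * ||z||_1, with D given as a list of rows."""
    m, n = len(D), len(D[0])
    z = [0.0] * n
    for _ in range(iters):
        # Residual r = D z - y
        r = [sum(D[i][j] * z[j] for j in range(n)) - y[i] for i in range(m)]
        # Gradient of the smooth part: D^T r
        grad = [sum(D[i][j] * r[i] for i in range(m)) for j in range(n)]
        # Gradient step followed by soft thresholding (the proximal step)
        z = soft_threshold([z[j] - step * grad[j] for j in range(n)], step * lam)
    return z


# Identity dictionary: the minimizer is soft_threshold(y, lam), which makes the result easy to check.
D = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
z_hat = ista(D, y, lam=1.0, step=1.0)
print(z_hat)
```

LISTA, as described in the abstract, replaces the fixed matrices and threshold of this iteration with learned parameters; the sketch only shows the non-adaptive version that the improved bound is compared against.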

    The Step Decay Schedule: A Near Optimal, Geometrically Decaying Learning Rate Procedure For Least Squares

    Minimax optimal convergence rates for classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, SGD's final iterate behavior has received much less attention despite its widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon $T$ (i.e. the number of iterations SGD is run for) is known in advance, SGD's final iterate behavior with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly convex case and a factor of $\sqrt{T}$ in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically), offer significant improvements over any polynomially decaying step sizes. In particular, the final iterate behavior with a step decay schedule is off the minimax rate by only log log factors (in the condition number for strongly convex case, and in $T$ for the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the stepsizes employed. These results demonstrate the subtlety in establishing optimal learning rate schemes (for the final iterate) for stochastic gradient procedures in fixed time horizon settings. Comment: Appears in the proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2019. 28 pages, 4 tables, 1 Algorithm, 7 figures
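
As a rough illustration of the two families of schedules contrasted above, the sketch below builds a step decay schedule for a known horizon `T` next to a polynomially decaying one. The specific choices are assumptions made for illustration and are not taken from the abstract: the horizon is split into about log2(T) equal phases, the learning rate is halved between phases, and the polynomial schedule uses lr0 / sqrt(1 + t). The function names are hypothetical.

```python
import math


def step_decay_schedule(T, lr0=0.1, decay_factor=0.5):
    """Learning rates for t = 0..T-1: constant within a phase, cut by decay_factor between phases.

    Assumption: the known horizon T is split into about log2(T) equal-length phases.
    """
    num_phases = max(1, int(math.log2(T)))
    phase_len = math.ceil(T / num_phases)
    return [lr0 * decay_factor ** (t // phase_len) for t in range(T)]


def polynomial_decay_schedule(T, lr0=0.1, power=0.5):
    """Learning rates lr0 / (1 + t)^power, a common polynomially decaying choice."""
    return [lr0 / (1 + t) ** power for t in range(T)]


T = 1024
step_lrs = step_decay_schedule(T)
poly_lrs = polynomial_decay_schedule(T)
print("distinct step decay values:", len(set(step_lrs)))
print("final learning rates:", step_lrs[-1], poly_lrs[-1])
```

Both lists can be fed to any SGD loop that reads a per-iteration learning rate; the point of the comparison is that the step decay list stays large for the early phases and becomes much smaller than the polynomial list by the final phase.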