Revisiting SGD with Increasingly Weighted Averaging: Optimization and Generalization Perspectives
Stochastic gradient descent (SGD) has been widely studied in the literature
from different angles, and is commonly employed for solving many big data
machine learning problems. However, the averaging technique, which combines all
iterative solutions into a single solution, is still under-explored. While some
increasingly weighted averaging schemes have been considered in the literature,
existing works are mostly restricted to strongly convex objective functions and
the convergence of optimization error. It remains unclear how these averaging
schemes affect the convergence of both optimization error and
generalization error (two equally important components of testing error) for
non-strongly convex objectives, including non-convex problems. In this
paper, we fill the gap by comprehensively analyzing the increasingly
weighted averaging on convex, strongly convex and non-convex objective
functions in terms of both optimization error and generalization error. In
particular, we analyze a family of increasingly weighted averaging schemes, where
the weight for the solution at iteration t is proportional to t^alpha (alpha >= 0).
We show how alpha affects the optimization error and the generalization error, and
exhibit the trade-off caused by alpha. Experiments
have demonstrated this trade-off and the effectiveness of polynomially
increased weighted averaging compared with other averaging schemes for a wide
range of problems, including deep learning.
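The scheme above admits a compact implementation. The following is a minimal sketch on a toy least-squares objective, assuming an illustrative exponent alpha and step size that are not taken from the paper; the running weighted average is maintained incrementally so that the weight of the iterate at step t is proportional to t^alpha.

```python
# Minimal sketch of SGD with polynomially increased weighted averaging.
# The objective, step size eta, and exponent alpha are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)                      # toy least-squares data

def stoch_grad(w):
    i = rng.integers(len(b))                      # sample one example
    return A[i] * (A[i] @ w - b[i])

alpha, eta = 2.0, 0.01
w = np.zeros(10)
w_bar, weight_sum = np.zeros(10), 0.0
for t in range(1, 10_001):
    w = w - eta * stoch_grad(w)
    weight = t ** alpha                           # weight of iterate t is proportional to t^alpha
    weight_sum += weight
    w_bar += (weight / weight_sum) * (w - w_bar)  # incremental weighted average

print("last-iterate loss     :", 0.5 * np.mean((A @ w - b) ** 2))
print("weighted-average loss :", 0.5 * np.mean((A @ w_bar - b) ** 2))
```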
DINO: Distributed Newton-Type Optimization Method
We present a novel communication-efficient Newton-type algorithm for
finite-sum optimization over a distributed computing environment. Our method,
named DINO, overcomes both theoretical and practical shortcomings of similar
existing methods. Under minimal assumptions, we guarantee global sub-linear
convergence of DINO to a first-order stationary point for general non-convex
functions and arbitrary data distribution over the network. Furthermore, for
functions satisfying the Polyak-Lojasiewicz (PL) inequality, we show that DINO
enjoys a linear convergence rate. Our proposed algorithm is practically
parameter-free, in that it will converge regardless of the selected
hyper-parameters, which are easy to tune. Additionally, its sub-problems are
simple linear least-squares, for which efficient solvers exist. Numerical
simulations demonstrate the efficiency of DINO as compared with similar
alternatives. Comment: 16 pages.
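For intuition only, the sketch below shows a distributed step whose local sub-problems are plain linear least-squares solved with an off-the-shelf solver and then averaged. This is not DINO's actual update rule; the local curvature matrices and the gradient are synthetic stand-ins.

```python
# Generic illustration (not DINO's update rule): each worker solves a linear
# least-squares sub-problem built from its local curvature matrix and the current
# gradient, and the server averages the resulting directions.
import numpy as np

rng = np.random.default_rng(1)
d, n_workers = 5, 4
H_local = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n_workers)]
g = rng.standard_normal(d)                        # current gradient (synthetic)

directions = []
for H_i in H_local:
    p_i, *_ = np.linalg.lstsq(H_i, g, rcond=None) # local linear least-squares sub-problem
    directions.append(p_i)

p = np.mean(directions, axis=0)                   # communication round: average local directions
print("aggregated Newton-type direction:", p)
```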
Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
Modern machine learning focuses on highly expressive models that are able to
fit or interpolate the data completely, resulting in zero training loss. For
such models, we show that the stochastic gradients of common loss functions
satisfy a strong growth condition. Under this condition, we prove that constant
step-size stochastic gradient descent (SGD) with Nesterov acceleration matches
the convergence rate of the deterministic accelerated method for both convex
and strongly-convex functions. We also show that this condition implies that
SGD can find a first-order stationary point as efficiently as full gradient
descent in non-convex settings. Under interpolation, we further show that all
smooth loss functions with a finite-sum structure satisfy a weaker growth
condition. Given this weaker condition, we prove that SGD with a constant
step-size attains the deterministic convergence rate in both the
strongly-convex and convex settings. Under additional assumptions, the above
results enable us to prove an O(1/k^2) mistake bound for k iterations of a
stochastic perceptron algorithm using the squared-hinge loss. Finally, we
validate our theoretical findings with experiments on synthetic and real
datasets. Comment: AISTATS 2019.
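A minimal sketch of the constant step-size accelerated SGD setting studied above, on a toy over-parameterized (interpolating) least-squares problem; the momentum and step-size values are illustrative assumptions rather than constants from the paper.

```python
# Constant step-size SGD with Nesterov-style acceleration under interpolation.
# Data, step size, momentum, and iteration count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                                    # d > n: an interpolating solution exists
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # consistent system, zero training loss attainable

eta = 1.0 / (np.linalg.norm(A, 2) ** 2)           # constant step size
momentum = 0.9
w = w_prev = np.zeros(d)
for t in range(5000):
    y = w + momentum * (w - w_prev)               # look-ahead point
    i = rng.integers(n)
    grad = A[i] * (A[i] @ y - b[i])               # stochastic gradient at the look-ahead point
    w_prev, w = w, y - eta * grad

print("training loss:", 0.5 * np.mean((A @ w - b) ** 2))
```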
Distributed Optimization for Over-Parameterized Learning
Distributed optimization often consists of two updating phases: local
optimization and inter-node communication. Conventional approaches require
working nodes to communicate with the server every one or few iterations to
guarantee convergence. In this paper, we establish a completely different
conclusion that each node can perform an arbitrary number of local optimization
steps before communication. Moreover, we show that more local updating can
reduce the overall communication, even with an infinite number of steps, where
each node is free to update its local model to near-optimality before
exchanging information. The extra assumption we make is that the optimal sets
of local loss functions have a non-empty intersection, which is inspired by the
over-parameterization phenomenon in large-scale optimization and deep learning.
Our theoretical findings are confirmed by both distributed convex optimization
and deep learning experiments.
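The local-update pattern described above can be sketched as follows, assuming toy local least-squares losses whose optimal sets intersect (the over-parameterization assumption); step sizes, local step counts, and round counts are illustrative.

```python
# Local updates with infrequent communication: each node runs many local gradient
# steps on its own loss, then the models are averaged. The shared solution w_star
# makes the local optimal sets intersect, mimicking over-parameterization.
import numpy as np

rng = np.random.default_rng(0)
d, n_nodes, local_steps = 20, 4, 100
w_star = rng.standard_normal(d)
A_loc = [rng.standard_normal((5, d)) for _ in range(n_nodes)]   # 5 equations << d per node
b_loc = [A @ w_star for A in A_loc]                             # all optimal sets contain w_star

w = np.zeros(d)
for round_ in range(50):                          # communication rounds
    models = []
    for A, b in zip(A_loc, b_loc):
        w_i = w.copy()
        for _ in range(local_steps):              # many local steps before communicating
            w_i -= 0.01 * A.T @ (A @ w_i - b)
        models.append(w_i)
    w = np.mean(models, axis=0)                   # inter-node communication: model averaging

print("max local loss:", max(0.5 * np.mean((A @ w - b) ** 2) for A, b in zip(A_loc, b_loc)))
```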
Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
Recent works have shown that stochastic gradient descent (SGD) achieves the
fast convergence rates of full-batch gradient descent for over-parameterized
models satisfying certain interpolation conditions. However, the step-size used
in these works depends on unknown quantities and SGD's practical performance
heavily relies on the choice of this step-size. We propose to use line-search
techniques to automatically set the step-size when training models that can
interpolate the data. In the interpolation setting, we prove that SGD with a
stochastic variant of the classic Armijo line-search attains the deterministic
convergence rates for both convex and strongly-convex functions. Under
additional assumptions, SGD with Armijo line-search is shown to achieve fast
convergence for non-convex functions. Furthermore, we show that stochastic
extra-gradient with a Lipschitz line-search attains linear convergence for an
important class of non-convex functions and saddle-point problems satisfying
interpolation. To improve the proposed methods' practical performance, we give
heuristics to use larger step-sizes and acceleration. We compare the proposed
algorithms against numerous optimization methods on standard classification
tasks using both kernel methods and deep networks. The proposed methods result
in competitive performance across all models and datasets, while being robust
to the precise choices of hyper-parameters. For multi-class classification
using deep networks, SGD with Armijo line-search results in both faster
convergence and better generalization. Comment: Added a citation to the related
work of Paul Tseng, and citations to methods that had previously explored
line-searches for deep learning empirically.
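A minimal sketch of SGD with a stochastic Armijo line-search on an interpolating least-squares problem; the Armijo constant, backtracking factor, maximum step size, and optimistic reset are illustrative choices rather than the paper's exact hyper-parameters.

```python
# SGD where the step size is backtracked until the Armijo condition holds on the
# sampled example's loss. Constants c, beta, and eta_max are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # interpolating least-squares problem

def loss_grad(w, i):
    r = A[i] @ w - b[i]
    return 0.5 * r ** 2, A[i] * r                 # per-example loss and gradient

c, beta, eta_max = 0.5, 0.7, 1.0
w, eta = np.zeros(d), eta_max
for t in range(2000):
    i = rng.integers(n)
    f_i, g_i = loss_grad(w, i)
    eta = min(eta / beta, eta_max)                # optimistic reset (a common heuristic)
    # Backtrack until the stochastic Armijo condition holds on the same example.
    while loss_grad(w - eta * g_i, i)[0] > f_i - c * eta * (g_i @ g_i):
        eta *= beta
    w -= eta * g_i

print("training loss:", 0.5 * np.mean((A @ w - b) ** 2))
```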
Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors
The following questions are fundamental to understanding the properties of
over-parameterization in modern machine learning: (1) Under what conditions and
at what rate does training converge to a global minimum? (2) What form of
implicit regularization occurs through training? While significant progress has
been made in answering both of these questions for gradient descent, they have
yet to be answered more completely for general optimization methods. In this
work, we establish sufficient conditions for linear convergence and obtain
approximate implicit regularization results for generalized mirror descent
(GMD), a generalization of mirror descent with a possibly time-dependent
mirror. GMD subsumes popular first order optimization methods including
gradient descent, mirror descent, and preconditioned gradient descent methods
such as Adagrad. By using the Polyak-Lojasiewicz inequality, we first present a
simple analysis under which non-stochastic GMD converges linearly to a global
minimum. We then present a novel, Taylor-series based analysis to establish
sufficient conditions for linear convergence of stochastic GMD. As a corollary,
our result establishes sufficient conditions and provides learning rates for
linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain
approximate implicit regularization results for GMD by proving that GMD
converges to an interpolating solution that is approximately the closest
interpolating solution to the initialization in l2-norm in the dual space,
thereby generalizing the result of Azizan, Lale, and Hassibi (2019) in the full
batch setting.
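Since GMD subsumes preconditioned methods such as Adagrad, the sketch below shows the diagonal Adagrad update as one concrete GMD instance, whose time-dependent mirror map is the quadratic induced by the accumulated-gradient preconditioner; the toy interpolating problem and constants are illustrative assumptions.

```python
# Diagonal Adagrad as an instance of generalized mirror descent with a
# time-dependent mirror: the preconditioner diag(sqrt(G_t)) defines the mirror map
# 0.5 * w^T diag(sqrt(G_t)) w. Problem data and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 100
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # interpolating problem

w, G = np.zeros(d), np.zeros(d)
eta, eps = 0.5, 1e-8
for t in range(3000):
    i = rng.integers(n)
    g = A[i] * (A[i] @ w - b[i])                  # stochastic gradient
    G += g ** 2                                   # accumulated squared gradients
    w -= eta * g / (np.sqrt(G) + eps)             # preconditioned (mirror) step

print("training loss             :", 0.5 * np.mean((A @ w - b) ** 2))
print("distance to initialization:", np.linalg.norm(w))
```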
Identity Crisis: Memorization and Generalization under Extreme Overparameterization
We study the interplay between memorization and generalization of
overparameterized networks in the extreme case of a single training example and
an identity-mapping task. We examine fully-connected and convolutional networks
(FCN and CNN), both linear and nonlinear, initialized randomly and then trained
to minimize the reconstruction error. The trained networks stereotypically take
one of two forms: the constant function (memorization) and the identity
function (generalization). We formally characterize generalization in
single-layer FCNs and CNNs. We show empirically that different architectures
exhibit strikingly different inductive biases. For example, CNNs of up to 10
layers are able to generalize from a single example, whereas FCNs cannot learn
the identity function reliably from 60k examples. Deeper CNNs often fail, but
nonetheless do astonishing work to memorize the training output: because CNN
biases are location invariant, the model must progressively grow an output
pattern from the image boundaries via the coordination of many layers. Our work
helps to quantify and visualize the sensitivity of inductive biases to
architectural choices such as depth, kernel width, and number of channels. Comment: ICLR 2020.
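The single-example identity-mapping setup can be sketched in a few lines for the simplest case, a one-layer linear fully-connected map trained by gradient descent; dimensions, initialization scale, and step counts are illustrative assumptions, and the sketch only mirrors the FCN side of the story (the one example is memorized while the identity is not learned).

```python
# One-layer linear map trained to reconstruct a single example, then probed on a
# fresh input. Dimensions, initialization, and step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal(d)                        # the single training example
W = 0.01 * rng.standard_normal((d, d))            # small random initialization

for _ in range(5000):
    r = W @ x - x                                 # reconstruction residual on the one example
    W -= 0.01 * np.outer(r, x)                    # gradient step on 0.5 * ||W x - x||^2

z = rng.standard_normal(d)                        # a fresh test input
print("train reconstruction error:", np.linalg.norm(W @ x - x))   # near zero (memorized)
print("test reconstruction error :", np.linalg.norm(W @ z - z))   # large (identity not learned)
```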
Fast Dimension Independent Private AdaGrad on Publicly Estimated Subspaces
We revisit the problem of empirical risk minimization (ERM) with differential
privacy. We show that noisy AdaGrad, given appropriate knowledge and conditions
on the subspace from which gradients can be drawn, achieves a regret comparable
to traditional AdaGrad plus a well-controlled term due to noise. We show a
convergence rate of , where captures the geometry of
the gradient subspace. Since we can obtain faster
rates for convex and Lipschitz functions, compared to the rate
achieved by known versions of noisy (stochastic) gradient descent with
comparable noise variance. In particular, we show that if the gradients lie in
a known constant rank subspace, and assuming algorithmic access to an envelope
which bounds decaying sensitivity, one can achieve faster convergence to an
excess empirical risk of , where is the
privacy budget and the number of samples. Letting be the problem
dimension, this result implies that, by running noisy AdaGrad, we can bypass
the DP-SGD bound in iterations, where is a parameter
controlling gradient norm decay, instead of the rate achieved by SGD of
. Our results operate with general convex functions in both
constrained and unconstrained minimization.
Along the way, we do a perturbation analysis of noisy AdaGrad of independent
interest. Our utility guarantee for the private ERM problem follows as a
corollary to the regret guarantee of noisy AdaGrad.
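For orientation, here is a generic noisy AdaGrad sketch in the differentially private style the abstract builds on: clip a per-example gradient, add Gaussian noise, and feed the noisy gradient to a diagonal AdaGrad update. It omits the paper's publicly estimated gradient subspace, and the clipping norm and noise scale are illustrative assumptions rather than calibrated privacy parameters.

```python
# Generic noisy AdaGrad: clip, add Gaussian noise, take a diagonal AdaGrad step.
# Clipping norm and noise scale are illustrative; no formal privacy accounting is
# done, and the paper's subspace estimation step is omitted.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # toy ERM problem

clip, sigma, eta, eps = 1.0, 0.5, 0.5, 1e-8
w, G = np.zeros(d), np.zeros(d)
for t in range(3000):
    i = rng.integers(n)
    g = A[i] * (A[i] @ w - b[i])
    g = g / max(1.0, np.linalg.norm(g) / clip)    # clip to bound per-example sensitivity
    g = g + sigma * clip * rng.standard_normal(d) # Gaussian noise
    G += g ** 2
    w -= eta * g / (np.sqrt(G) + eps)             # AdaGrad step on the noisy gradient

print("empirical risk:", 0.5 * np.mean((A @ w - b) ** 2))
```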
Stopping Criteria for, and Strong Convergence of, Stochastic Gradient Descent on Bottou-Curtis-Nocedal Functions
Stopping criteria for Stochastic Gradient Descent (SGD) methods play
important roles from enabling adaptive step size schemes to providing rigor for
downstream analyses such as asymptotic inference. Unfortunately, current
stopping criteria for SGD methods are often heuristics that rely on asymptotic
normality results or convergence to stationary distributions, which may fail to
exist for nonconvex functions and, thereby, limit the applicability of such
stopping criteria. To address this issue, in this work, we rigorously develop
two stopping criteria for SGD that can be applied to a broad class of nonconvex
functions, which we term Bottou-Curtis-Nocedal functions. Moreover, as a
prerequisite for developing these stopping criteria, we prove that the gradient
function evaluated at SGD's iterates converges strongly to zero for
Bottou-Curtis-Nocedal functions, which addresses an open question in the SGD
literature. As a result of our work, our rigorously developed stopping criteria
can be used to develop new adaptive step size schemes or bolster other
downstream analyses for nonconvex functions.
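As a rough illustration of where such a rule plugs into the algorithm (not the paper's specific Bottou-Curtis-Nocedal criteria), the sketch below stops SGD once an exponential moving average of squared mini-batch gradient norms falls below a tolerance; the objective, step size, smoothing factor, and tolerance are illustrative assumptions.

```python
# Generic gradient-norm-based stopping rule for SGD (illustrative only; not the
# criteria developed in the paper). Stops when a smoothed estimate of the squared
# stochastic gradient norm drops below a tolerance.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # consistent toy problem

w = np.zeros(d)
running_sq_norm = None
tol, beta = 1e-4, 0.99
for t in range(1, 100_001):
    i = rng.integers(n)
    g = A[i] * (A[i] @ w - b[i])
    w -= 0.01 * g
    sq = float(g @ g)
    running_sq_norm = sq if running_sq_norm is None else beta * running_sq_norm + (1 - beta) * sq
    if running_sq_norm < tol:                     # stopping criterion
        print(f"stopped at iteration {t}")
        break
```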
On Linear Stability of SGD and Input-Smoothness of Neural Networks
The multiplicative structure of parameters and input data in the first layer
of neural networks is explored to build a connection between the landscape of the
loss function with respect to parameters and the landscape of the model
function with respect to input data. By this connection, it is shown that flat
minima regularize the gradient of the model function, which explains the good
generalization performance of flat minima. Then, we go beyond the flatness and
consider high-order moments of the gradient noise, and show that Stochastic
Gradient Descent (SGD) tends to impose constraints on these moments by a linear
stability analysis of SGD around global minima. Together with the
multiplicative structure, we identify the Sobolev regularization effect of SGD,
i.e., SGD regularizes the Sobolev seminorms of the model function with respect
to the input data. Finally, bounds for generalization error and adversarial
robustness are provided for solutions found by SGD under assumptions of the
data distribution.
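The quantity tied to flatness above, the gradient of the model function with respect to its input (a first-order Sobolev seminorm surrogate), can be estimated directly. Below is a minimal sketch for a tiny two-layer tanh network whose first layer multiplies parameters and inputs; the architecture, weights, and data are illustrative assumptions.

```python
# Estimate the mean squared input-gradient norm of a small model, a surrogate for
# the Sobolev seminorm of the model function with respect to the input data.
# Architecture, weights, and inputs are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, h = 10, 32
W1 = rng.standard_normal((h, d)) / np.sqrt(d)     # first layer: multiplicative in x
w2 = rng.standard_normal(h) / np.sqrt(h)

def model(x):
    return w2 @ np.tanh(W1 @ x)                   # f(x) = w2^T tanh(W1 x)

def input_gradient(x):
    a = W1 @ x
    return W1.T @ (w2 * (1.0 - np.tanh(a) ** 2))  # df/dx via the chain rule

xs = rng.standard_normal((100, d))
sobolev_surrogate = np.mean([np.sum(input_gradient(x) ** 2) for x in xs])
print("model output at one input       :", model(xs[0]))
print("mean squared input-gradient norm:", sobolev_surrogate)
```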