Revisiting SGD with Increasingly Weighted Averaging: Optimization and Generalization Perspectives
Stochastic gradient descent (SGD) has been widely studied in the literature
from different angles, and is commonly employed for solving many big data
machine learning problems. However, the averaging technique, which combines all
iterative solutions into a single solution, is still under-explored. While some
increasingly weighted averaging schemes have been considered in the literature,
existing works are mostly restricted to strongly convex objective functions and
the convergence of optimization error. It remains unclear how these averaging
schemes affect the convergence of both optimization error and
generalization error (two equally important components of testing error) for
non-strongly convex objectives, including non-convex problems. In this
paper, we fill the gap by comprehensively analyzing the increasingly
weighted averaging on convex, strongly convex and non-convex objective
functions in terms of both optimization error and generalization error. In
particular, we analyze a family of increasingly weighted averaging schemes, where
the weight for the solution at iteration t is proportional to t^alpha (alpha >= 0).
We show how alpha affects the optimization error and the generalization error, and
exhibit the trade-off caused by alpha. Experiments
have demonstrated this trade-off and the effectiveness of polynomially
increased weighted averaging compared with other averaging schemes for a wide
range of problems, including deep learning.
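The scheme above admits a compact implementation. The following is a minimal sketch on a toy least-squares objective, assuming an illustrative exponent alpha and step size that are not taken from the paper; the running weighted average is maintained incrementally so that the weight of the iterate at step t is proportional to t^alpha.

```python
# Minimal sketch of SGD with polynomially increased weighted averaging.
# The objective, step size eta, and exponent alpha are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)                      # toy least-squares data

def stoch_grad(w):
    i = rng.integers(len(b))                      # sample one example
    return A[i] * (A[i] @ w - b[i])

alpha, eta = 2.0, 0.01
w = np.zeros(10)
w_bar, weight_sum = np.zeros(10), 0.0
for t in range(1, 10_001):
    w = w - eta * stoch_grad(w)
    weight = t ** alpha                           # weight of iterate t is proportional to t^alpha
    weight_sum += weight
    w_bar += (weight / weight_sum) * (w - w_bar)  # incremental weighted average

print("last-iterate loss     :", 0.5 * np.mean((A @ w - b) ** 2))
print("weighted-average loss :", 0.5 * np.mean((A @ w_bar - b) ** 2))
```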
DINO: Distributed Newton-Type Optimization Method
We present a novel communication-efficient Newton-type algorithm for
finite-sum optimization over a distributed computing environment. Our method,
named DINO, overcomes both theoretical and practical shortcomings of similar
existing methods. Under minimal assumptions, we guarantee global sub-linear
convergence of DINO to a first-order stationary point for general non-convex
functions and arbitrary data distribution over the network. Furthermore, for
functions satisfying the Polyak-Lojasiewicz (PL) inequality, we show that DINO
enjoys a linear convergence rate. Our proposed algorithm is practically
parameter-free, in that it will converge regardless of the selected
hyper-parameters, which are easy to tune. Additionally, its sub-problems are
simple linear least-squares, for which efficient solvers exist. Numerical
simulations demonstrate the efficiency of DINO as compared with similar
alternatives. Comment: 16 pages.
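For intuition only, the sketch below shows a distributed step whose local sub-problems are plain linear least-squares solved with an off-the-shelf solver and then averaged. This is not DINO's actual update rule; the local curvature matrices and the gradient are synthetic stand-ins.

```python
# Generic illustration (not DINO's update rule): each worker solves a linear
# least-squares sub-problem built from its local curvature matrix and the current
# gradient, and the server averages the resulting directions.
import numpy as np

rng = np.random.default_rng(1)
d, n_workers = 5, 4
H_local = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(n_workers)]
g = rng.standard_normal(d)                        # current gradient (synthetic)

directions = []
for H_i in H_local:
    p_i, *_ = np.linalg.lstsq(H_i, g, rcond=None) # local linear least-squares sub-problem
    directions.append(p_i)

p = np.mean(directions, axis=0)                   # communication round: average local directions
print("aggregated Newton-type direction:", p)
```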
Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
Modern machine learning focuses on highly expressive models that are able to
fit or interpolate the data completely, resulting in zero training loss. For
such models, we show that the stochastic gradients of common loss functions
satisfy a strong growth condition. Under this condition, we prove that constant
step-size stochastic gradient descent (SGD) with Nesterov acceleration matches
the convergence rate of the deterministic accelerated method for both convex
and strongly-convex functions. We also show that this condition implies that
SGD can find a first-order stationary point as efficiently as full gradient
descent in non-convex settings. Under interpolation, we further show that all
smooth loss functions with a finite-sum structure satisfy a weaker growth
condition. Given this weaker condition, we prove that SGD with a constant
step-size attains the deterministic convergence rate in both the
strongly-convex and convex settings. Under additional assumptions, the above
results enable us to prove an O(1/k^2) mistake bound for k iterations of a
stochastic perceptron algorithm using the squared-hinge loss. Finally, we
validate our theoretical findings with experiments on synthetic and real
datasets. Comment: AISTATS 2019.
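A minimal sketch of the constant step-size accelerated SGD setting studied above, on a toy over-parameterized (interpolating) least-squares problem; the momentum and step-size values are illustrative assumptions rather than constants from the paper.

```python
# Constant step-size SGD with Nesterov-style acceleration under interpolation.
# Data, step size, momentum, and iteration count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                                    # d > n: an interpolating solution exists
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # consistent system, zero training loss attainable

eta = 1.0 / (np.linalg.norm(A, 2) ** 2)           # constant step size
momentum = 0.9
w = w_prev = np.zeros(d)
for t in range(5000):
    y = w + momentum * (w - w_prev)               # look-ahead point
    i = rng.integers(n)
    grad = A[i] * (A[i] @ y - b[i])               # stochastic gradient at the look-ahead point
    w_prev, w = w, y - eta * grad

print("training loss:", 0.5 * np.mean((A @ w - b) ** 2))
```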
Distributed Optimization for Over-Parameterized Learning
Distributed optimization often consists of two updating phases: local
optimization and inter-node communication. Conventional approaches require
working nodes to communicate with the server every one or few iterations to
guarantee convergence. In this paper, we establish a completely different
conclusion that each node can perform an arbitrary number of local optimization
steps before communication. Moreover, we show that more local updating can
reduce the overall communication, even with an infinite number of steps, where
each node is free to update its local model to near-optimality before
exchanging information. The extra assumption we make is that the optimal sets
of local loss functions have a non-empty intersection, which is inspired by the
over-parameterization phenomenon in large-scale optimization and deep learning.
Our theoretical findings are confirmed by both distributed convex optimization
and deep learning experiments.
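The local-update pattern described above can be sketched as follows, assuming toy local least-squares losses whose optimal sets intersect (the over-parameterization assumption); step sizes, local step counts, and round counts are illustrative.

```python
# Local updates with infrequent communication: each node runs many local gradient
# steps on its own loss, then the models are averaged. The shared solution w_star
# makes the local optimal sets intersect, mimicking over-parameterization.
import numpy as np

rng = np.random.default_rng(0)
d, n_nodes, local_steps = 20, 4, 100
w_star = rng.standard_normal(d)
A_loc = [rng.standard_normal((5, d)) for _ in range(n_nodes)]   # 5 equations << d per node
b_loc = [A @ w_star for A in A_loc]                             # all optimal sets contain w_star

w = np.zeros(d)
for round_ in range(50):                          # communication rounds
    models = []
    for A, b in zip(A_loc, b_loc):
        w_i = w.copy()
        for _ in range(local_steps):              # many local steps before communicating
            w_i -= 0.01 * A.T @ (A @ w_i - b)
        models.append(w_i)
    w = np.mean(models, axis=0)                   # inter-node communication: model averaging

print("max local loss:", max(0.5 * np.mean((A @ w - b) ** 2) for A, b in zip(A_loc, b_loc)))
```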
Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
Recent works have shown that stochastic gradient descent (SGD) achieves the
fast convergence rates of full-batch gradient descent for over-parameterized
models satisfying certain interpolation conditions. However, the step-size used
in these works depends on unknown quantities and SGD's practical performance
heavily relies on the choice of this step-size. We propose to use line-search
techniques to automatically set the step-size when training models that can
interpolate the data. In the interpolation setting, we prove that SGD with a
stochastic variant of the classic Armijo line-search attains the deterministic
convergence rates for both convex and strongly-convex functions. Under
additional assumptions, SGD with Armijo line-search is shown to achieve fast
convergence for non-convex functions. Furthermore, we show that stochastic
extra-gradient with a Lipschitz line-search attains linear convergence for an
important class of non-convex functions and saddle-point problems satisfying
interpolation. To improve the proposed methods' practical performance, we give
heuristics to use larger step-sizes and acceleration. We compare the proposed
algorithms against numerous optimization methods on standard classification
tasks using both kernel methods and deep networks. The proposed methods result
in competitive performance across all models and datasets, while being robust
to the precise choices of hyper-parameters. For multi-class classification
using deep networks, SGD with Armijo line-search results in both faster
convergence and better generalization. Comment: Added a citation to the related
work of Paul Tseng, and citations to methods that had previously explored
line-searches for deep learning empirically.
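A minimal sketch of SGD with a stochastic Armijo line-search on an interpolating least-squares problem; the Armijo constant, backtracking factor, maximum step size, and optimistic reset are illustrative choices rather than the paper's exact hyper-parameters.

```python
# SGD where the step size is backtracked until the Armijo condition holds on the
# sampled example's loss. Constants c, beta, and eta_max are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # interpolating least-squares problem

def loss_grad(w, i):
    r = A[i] @ w - b[i]
    return 0.5 * r ** 2, A[i] * r                 # per-example loss and gradient

c, beta, eta_max = 0.5, 0.7, 1.0
w, eta = np.zeros(d), eta_max
for t in range(2000):
    i = rng.integers(n)
    f_i, g_i = loss_grad(w, i)
    eta = min(eta / beta, eta_max)                # optimistic reset (a common heuristic)
    # Backtrack until the stochastic Armijo condition holds on the same example.
    while loss_grad(w - eta * g_i, i)[0] > f_i - c * eta * (g_i @ g_i):
        eta *= beta
    w -= eta * g_i

print("training loss:", 0.5 * np.mean((A @ w - b) ** 2))
```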
Linear Convergence and Implicit Regularization of Generalized Mirror Descent with Time-Dependent Mirrors
The following questions are fundamental to understanding the properties of
over-parameterization in modern machine learning: (1) Under what conditions and
at what rate does training converge to a global minimum? (2) What form of
implicit regularization occurs through training? While significant progress has
been made in answering both of these questions for gradient descent, they have
yet to be answered more completely for general optimization methods. In this
work, we establish sufficient conditions for linear convergence and obtain
approximate implicit regularization results for generalized mirror descent
(GMD), a generalization of mirror descent with a possibly time-dependent
mirror. GMD subsumes popular first order optimization methods including
gradient descent, mirror descent, and preconditioned gradient descent methods
such as Adagrad. By using the Polyak-Lojasiewicz inequality, we first present a
simple analysis under which non-stochastic GMD converges linearly to a global
minimum. We then present a novel, Taylor-series based analysis to establish
sufficient conditions for linear convergence of stochastic GMD. As a corollary,
our result establishes sufficient conditions and provides learning rates for
linear convergence of stochastic mirror descent and Adagrad. Lastly, we obtain
approximate implicit regularization results for GMD by proving that GMD
converges to an interpolating solution that is approximately the closest
interpolating solution to the initialization in l2-norm in the dual space,
thereby generalizing the result of Azizan, Lale, and Hassibi (2019) in the full
batch setting.
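Since GMD subsumes preconditioned methods such as Adagrad, the sketch below shows the diagonal Adagrad update as one concrete GMD instance, whose time-dependent mirror map is the quadratic induced by the accumulated-gradient preconditioner; the toy interpolating problem and constants are illustrative assumptions.

```python
# Diagonal Adagrad as an instance of generalized mirror descent with a
# time-dependent mirror: the preconditioner diag(sqrt(G_t)) defines the mirror map
# 0.5 * w^T diag(sqrt(G_t)) w. Problem data and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 100
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # interpolating problem

w, G = np.zeros(d), np.zeros(d)
eta, eps = 0.5, 1e-8
for t in range(3000):
    i = rng.integers(n)
    g = A[i] * (A[i] @ w - b[i])                  # stochastic gradient
    G += g ** 2                                   # accumulated squared gradients
    w -= eta * g / (np.sqrt(G) + eps)             # preconditioned (mirror) step

print("training loss             :", 0.5 * np.mean((A @ w - b) ** 2))
print("distance to initialization:", np.linalg.norm(w))
```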
Identity Crisis: Memorization and Generalization under Extreme Overparameterization
We study the interplay between memorization and generalization of
overparameterized networks in the extreme case of a single training example and
an identity-mapping task. We examine fully-connected and convolutional networks
(FCN and CNN), both linear and nonlinear, initialized randomly and then trained
to minimize the reconstruction error. The trained networks stereotypically take
one of two forms: the constant function (memorization) and the identity
function (generalization). We formally characterize generalization in
single-layer FCNs and CNNs. We show empirically that different architectures
exhibit strikingly different inductive biases. For example, CNNs of up to 10
layers are able to generalize from a single example, whereas FCNs cannot learn
the identity function reliably from 60k examples. Deeper CNNs often fail, but
nonetheless do astonishing work to memorize the training output: because CNN
biases are location invariant, the model must progressively grow an output
pattern from the image boundaries via the coordination of many layers. Our work
helps to quantify and visualize the sensitivity of inductive biases to
architectural choices such as depth, kernel width, and number of channels. Comment: ICLR 2020.
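The single-example identity-mapping setup can be sketched in a few lines for the simplest case, a one-layer linear fully-connected map trained by gradient descent; dimensions, initialization scale, and step counts are illustrative assumptions, and the sketch only mirrors the FCN side of the story (the one example is memorized while the identity is not learned).

```python
# One-layer linear map trained to reconstruct a single example, then probed on a
# fresh input. Dimensions, initialization, and step size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal(d)                        # the single training example
W = 0.01 * rng.standard_normal((d, d))            # small random initialization

for _ in range(5000):
    r = W @ x - x                                 # reconstruction residual on the one example
    W -= 0.01 * np.outer(r, x)                    # gradient step on 0.5 * ||W x - x||^2

z = rng.standard_normal(d)                        # a fresh test input
print("train reconstruction error:", np.linalg.norm(W @ x - x))   # near zero (memorized)
print("test reconstruction error :", np.linalg.norm(W @ z - z))   # large (identity not learned)
```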
Fast Dimension Independent Private AdaGrad on Publicly Estimated Subspaces
We revisit the problem of empirical risk minimization (ERM) with differential
privacy. We show that noisy AdaGrad, given appropriate knowledge and conditions
on the subspace from which gradients can be drawn, achieves a regret comparable
to traditional AdaGrad plus a well-controlled term due to noise. We show a
convergence rate of , where captures the geometry of
the gradient subspace. Since we can obtain faster
rates for convex and Lipschitz functions, compared to the rate
achieved by known versions of noisy (stochastic) gradient descent with
comparable noise variance. In particular, we show that if the gradients lie in
a known constant rank subspace, and assuming algorithmic access to an envelope
which bounds decaying sensitivity, one can achieve faster convergence to an
excess empirical risk of , where is the
privacy budget and the number of samples. Letting be the problem
dimension, this result implies that, by running noisy AdaGrad, we can bypass
the DP-SGD bound in iterations, where is a parameter
controlling gradient norm decay, instead of the rate achieved by SGD of
. Our results operate with general convex functions in both
constrained and unconstrained minimization.
Along the way, we do a perturbation analysis of noisy AdaGrad of independent
interest. Our utility guarantee for the private ERM problem follows as a
corollary to the regret guarantee of noisy AdaGrad.
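For orientation, here is a generic noisy AdaGrad sketch in the differentially private style the abstract builds on: clip a per-example gradient, add Gaussian noise, and feed the noisy gradient to a diagonal AdaGrad update. It omits the paper's publicly estimated gradient subspace, and the clipping norm and noise scale are illustrative assumptions rather than calibrated privacy parameters.

```python
# Generic noisy AdaGrad: clip, add Gaussian noise, take a diagonal AdaGrad step.
# Clipping norm and noise scale are illustrative; no formal privacy accounting is
# done, and the paper's subspace estimation step is omitted.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # toy ERM problem

clip, sigma, eta, eps = 1.0, 0.5, 0.5, 1e-8
w, G = np.zeros(d), np.zeros(d)
for t in range(3000):
    i = rng.integers(n)
    g = A[i] * (A[i] @ w - b[i])
    g = g / max(1.0, np.linalg.norm(g) / clip)    # clip to bound per-example sensitivity
    g = g + sigma * clip * rng.standard_normal(d) # Gaussian noise
    G += g ** 2
    w -= eta * g / (np.sqrt(G) + eps)             # AdaGrad step on the noisy gradient

print("empirical risk:", 0.5 * np.mean((A @ w - b) ** 2))
```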
Stopping Criteria for, and Strong Convergence of, Stochastic Gradient Descent on Bottou-Curtis-Nocedal Functions
Stopping criteria for Stochastic Gradient Descent (SGD) methods play
important roles from enabling adaptive step size schemes to providing rigor for
downstream analyses such as asymptotic inference. Unfortunately, current
stopping criteria for SGD methods are often heuristics that rely on asymptotic
normality results or convergence to stationary distributions, which may fail to
exist for nonconvex functions and, thereby, limit the applicability of such
stopping criteria. To address this issue, in this work, we rigorously develop
two stopping criteria for SGD that can be applied to a broad class of nonconvex
functions, which we term Bottou-Curtis-Nocedal functions. Moreover, as a
prerequisite for developing these stopping criteria, we prove that the gradient
function evaluated at SGD's iterates converges strongly to zero for
Bottou-Curtis-Nocedal functions, which addresses an open question in the SGD
literature. As a result of our work, our rigorously developed stopping criteria
can be used to develop new adaptive step size schemes or bolster other
downstream analyses for nonconvex functions.
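As a rough illustration of where such a rule plugs into the algorithm (not the paper's specific Bottou-Curtis-Nocedal criteria), the sketch below stops SGD once an exponential moving average of squared mini-batch gradient norms falls below a tolerance; the objective, step size, smoothing factor, and tolerance are illustrative assumptions.

```python
# Generic gradient-norm-based stopping rule for SGD (illustrative only; not the
# criteria developed in the paper). Stops when a smoothed estimate of the squared
# stochastic gradient norm drops below a tolerance.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)                    # consistent toy problem

w = np.zeros(d)
running_sq_norm = None
tol, beta = 1e-4, 0.99
for t in range(1, 100_001):
    i = rng.integers(n)
    g = A[i] * (A[i] @ w - b[i])
    w -= 0.01 * g
    sq = float(g @ g)
    running_sq_norm = sq if running_sq_norm is None else beta * running_sq_norm + (1 - beta) * sq
    if running_sq_norm < tol:                     # stopping criterion
        print(f"stopped at iteration {t}")
        break
```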
On Linear Stability of SGD and Input-Smoothness of Neural Networks
The multiplicative structure of parameters and input data in the first layer
of neural networks is explored to build a connection between the landscape of the
loss function with respect to parameters and the landscape of the model
function with respect to input data. By this connection, it is shown that flat
minima regularize the gradient of the model function, which explains the good
generalization performance of flat minima. Then, we go beyond the flatness and
consider high-order moments of the gradient noise, and show that Stochastic
Gradient Descent (SGD) tends to impose constraints on these moments by a linear
stability analysis of SGD around global minima. Together with the
multiplicative structure, we identify the Sobolev regularization effect of SGD,
i.e., SGD regularizes the Sobolev seminorms of the model function with respect
to the input data. Finally, bounds for generalization error and adversarial
robustness are provided for solutions found by SGD under assumptions of the
data distribution.
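The quantity tied to flatness above, the gradient of the model function with respect to its input (a first-order Sobolev seminorm surrogate), can be estimated directly. Below is a minimal sketch for a tiny two-layer tanh network whose first layer multiplies parameters and inputs; the architecture, weights, and data are illustrative assumptions.

```python
# Estimate the mean squared input-gradient norm of a small model, a surrogate for
# the Sobolev seminorm of the model function with respect to the input data.
# Architecture, weights, and inputs are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, h = 10, 32
W1 = rng.standard_normal((h, d)) / np.sqrt(d)     # first layer: multiplicative in x
w2 = rng.standard_normal(h) / np.sqrt(h)

def model(x):
    return w2 @ np.tanh(W1 @ x)                   # f(x) = w2^T tanh(W1 x)

def input_gradient(x):
    a = W1 @ x
    return W1.T @ (w2 * (1.0 - np.tanh(a) ** 2))  # df/dx via the chain rule

xs = rng.standard_normal((100, d))
sobolev_surrogate = np.mean([np.sum(input_gradient(x) ** 2) for x in xs])
print("model output at one input       :", model(xs[0]))
print("mean squared input-gradient norm:", sobolev_surrogate)
```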