Implicit Gradient Regularization
Gradient descent can be surprisingly good at optimizing deep neural networks
without overfitting and without explicit regularization. We find that the
discrete steps of gradient descent implicitly regularize models by penalizing
gradient descent trajectories that have large loss gradients. We call this
Implicit Gradient Regularization (IGR) and we use backward error analysis to
calculate the size of this regularization. We confirm empirically that implicit
gradient regularization biases gradient descent toward flat minima, where test
errors are small and solutions are robust to noisy parameter perturbations.
Furthermore, we demonstrate that the implicit gradient regularization term can
be used as an explicit regularizer, allowing us to control this gradient
regularization directly. More broadly, our work indicates that backward error
analysis is a useful theoretical approach to the perennial question of how
learning rate, model size, and parameter regularization interact to determine
the properties of overparameterized models optimized with gradient descent.
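The backward error analysis claim in this abstract can be checked by hand on a one-dimensional quadratic. The sketch below is a minimal illustration, not the paper's experiments. It assumes the loss E(theta) = 0.5 * a * theta**2 and the first-order modified loss E + (h / 4) * (dE/dtheta)**2, where h is the learning rate; the function names and constants are illustrative choices.

```python
import math

def gd_step(theta, a, h):
    """One gradient descent step on E(theta) = 0.5 * a * theta**2."""
    return theta - h * a * theta

def flow_original(theta, a, h):
    """Exact gradient flow on E, run for time h."""
    return theta * math.exp(-a * h)

def flow_modified(theta, a, h):
    """Exact gradient flow, run for time h, on the modified loss
    E + (h / 4) * (dE/dtheta)**2 = 0.5 * (a + 0.5 * h * a**2) * theta**2."""
    return theta * math.exp(-(a + 0.5 * h * a ** 2) * h)

# Illustrative constants (assumptions, not values from the paper).
theta0, a, h = 1.0, 1.0, 0.1

# Distance between the discrete step and each continuous flow.
err_original = abs(gd_step(theta0, a, h) - flow_original(theta0, a, h))
err_modified = abs(gd_step(theta0, a, h) - flow_modified(theta0, a, h))
print(err_original, err_modified)
```

For small h, the discrete step lands closer to the flow of the modified loss than to the flow of E itself, which is the sense in which the step size acts like a penalty on the gradient norm.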
The Statistical Complexity of Early-Stopped Mirror Descent
Recently there has been a surge of interest in understanding implicit
regularization properties of iterative gradient-based optimization algorithms.
In this paper, we study the statistical guarantees on the excess risk achieved
by early-stopped unconstrained mirror descent algorithms applied to the
unregularized empirical risk with the squared loss for linear models and kernel
methods. By completing an inequality that characterizes convexity for the
squared loss, we identify an intrinsic link between offset Rademacher
complexities and potential-based convergence analysis of mirror descent
methods. Our observation immediately yields excess risk guarantees for the path
traced by the iterates of mirror descent in terms of offset complexities of
certain function classes depending only on the choice of the mirror map,
initialization point, step-size, and the number of iterations. We apply our
theory to recover, in a clean and elegant manner via rather short proofs, some
of the recent results in the implicit regularization literature, while also
showing how to improve upon them in some settings.
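The setting described in this abstract, unconstrained mirror descent on the unregularized squared loss of a linear model, can be sketched in a few lines. This is a hedged illustration under stated assumptions: the mirror map is taken to act coordinate-wise, the two example maps and the toy data are invented for the sketch, and the held-out stopping rule in `early_stop` is a hypothetical choice, not the stopping time analyzed in the paper.

```python
import math

def squared_loss(X, y, w):
    """Empirical risk (1 / 2n) * sum_i (<x_i, w> - y_i)^2 for a linear model."""
    n = len(y)
    return sum((sum(a * b for a, b in zip(x, w)) - t) ** 2
               for x, t in zip(X, y)) / (2 * n)

def mirror_descent_path(X, y, grad_psi, grad_psi_inv, w0, step, n_iters):
    """Return every iterate so a stopping time can be chosen afterwards."""
    n, d = len(y), len(w0)
    w = list(w0)
    path = [list(w)]
    for _ in range(n_iters):
        residuals = [sum(a * b for a, b in zip(x, w)) - t for x, t in zip(X, y)]
        grad = [sum(r * x[j] for r, x in zip(residuals, X)) / n for j in range(d)]
        # Mirror step: update in dual coordinates, then map back.
        w = [grad_psi_inv(grad_psi(w[j]) - step * grad[j]) for j in range(d)]
        path.append(list(w))
    return path

def early_stop(path, X_val, y_val):
    """Hypothetical rule: pick the iterate with the lowest held-out loss."""
    losses = [squared_loss(X_val, y_val, w) for w in path]
    t = min(range(len(path)), key=losses.__getitem__)
    return t, path[t]

# Coordinate-wise mirror maps, given as (gradient, inverse gradient).
euclidean = (lambda w: w, lambda z: z)   # psi(w) = w^2 / 2, i.e. gradient descent
hypentropy = (math.asinh, math.sinh)     # psi(w) = w * asinh(w) - sqrt(1 + w^2)

# Toy data (an assumption for illustration), generated by w = [1, 0].
X_train, y_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 0.0, 1.0]
X_val, y_val = [[1.0, -1.0], [2.0, 1.0]], [1.0, 2.0]

path_gd = mirror_descent_path(X_train, y_train, *euclidean,
                              w0=[0.0, 0.0], step=0.1, n_iters=200)
path_he = mirror_descent_path(X_train, y_train, *hypentropy,
                              w0=[0.0, 0.0], step=0.1, n_iters=200)
t_gd, w_gd = early_stop(path_gd, X_val, y_val)
```

The only things that vary between runs are the mirror map, the initialization `w0`, the step size, and the number of iterations, which are the quantities the abstract says the guarantees depend on.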
On Implicit Bias in Overparameterized Bilevel Optimization
Many problems in machine learning involve bilevel optimization (BLO),
including hyperparameter optimization, meta-learning, and dataset distillation.
Bilevel problems consist of two nested sub-problems, called the outer and inner
problems, respectively. In practice, often at least one of these sub-problems
is overparameterized. In this case, there are many ways to choose among optima
that achieve equivalent objective values. Inspired by recent studies of the
implicit bias induced by optimization algorithms in single-level optimization,
we investigate the implicit bias of gradient-based algorithms for bilevel
optimization. We delineate two standard BLO methods -- cold-start and
warm-start -- and show that the converged solution or long-run behavior depends
to a large degree on these and other algorithmic choices, such as the
hypergradient approximation. We also show that the inner solutions obtained by
warm-start BLO can encode a surprising amount of information about the outer
objective, even when the outer parameters are low-dimensional. We believe that
implicit bias deserves as central a role in the study of bilevel optimization
as it has attained in the study of single-level neural net optimization.
Comment: ICML 202
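The difference between the cold-start and warm-start loops named in this abstract can be sketched on a toy problem. Everything in the block is an assumption made for illustration: the inner problem fits a two-dimensional weight `w` to one synthetic point on the unit circle whose angle `phi` is the single outer parameter, the outer loss is a squared distance to a fixed target, and a central finite difference through the inner updates stands in for the hypergradient approximations the abstract refers to.

```python
import math

def inner_grad(w, phi):
    """Gradient in w of the inner loss 0.5 * (w . x(phi) - 1)^2,
    with x(phi) = (cos phi, sin phi)."""
    x = (math.cos(phi), math.sin(phi))
    r = w[0] * x[0] + w[1] * x[1] - 1.0
    return (r * x[0], r * x[1])

def inner_solve(w, phi, steps, lr=0.5):
    """Run `steps` gradient descent updates on the inner loss."""
    for _ in range(steps):
        g = inner_grad(w, phi)
        w = (w[0] - lr * g[0], w[1] - lr * g[1])
    return w

def outer_loss(w, target=(2.0, 1.0)):
    """Outer objective: squared distance of w to a fixed target."""
    return 0.5 * ((w[0] - target[0]) ** 2 + (w[1] - target[1]) ** 2)

def hypergrad(w_start, phi, steps, eps=1e-5):
    """Central finite difference of the outer loss through the inner updates."""
    up = outer_loss(inner_solve(w_start, phi + eps, steps))
    down = outer_loss(inner_solve(w_start, phi - eps, steps))
    return (up - down) / (2 * eps)

def cold_start(phi, outer_steps, inner_steps=100, outer_lr=0.1):
    """Re-initialize w and solve the inner problem before each outer step."""
    w_init = (0.0, 0.0)
    for _ in range(outer_steps):
        phi -= outer_lr * hypergrad(w_init, phi, inner_steps)
    return phi, inner_solve(w_init, phi, inner_steps)

def warm_start(phi, outer_steps, inner_steps=1, outer_lr=0.1):
    """Carry w across outer steps, taking only a few inner updates each time."""
    w = (0.0, 0.0)
    for _ in range(outer_steps):
        g = hypergrad(w, phi, inner_steps)
        w = inner_solve(w, phi, inner_steps)
        phi -= outer_lr * g
    return phi, w

phi_cold, w_cold = cold_start(0.0, outer_steps=200)
phi_warm, w_warm = warm_start(0.0, outer_steps=200)
print(outer_loss(w_cold), outer_loss(w_warm))
```

In this toy, every cold-start inner solution is the point x(phi) itself, so it is confined to the unit circle. The warm-start iterate accumulates updates taken at different values of `phi` and need not stay on the circle, which is one simple way the two procedures can end at different solutions.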