The Break-Even Point on Optimization Trajectories of Deep Neural Networks
The early phase of training of deep neural networks is critical for their
final performance. In this work, we study how the hyperparameters of stochastic
gradient descent (SGD) used in the early phase of training affect the rest of
the optimization trajectory. We argue for the existence of the "break-even"
point on this trajectory, beyond which the curvature of the loss surface and
noise in the gradient are implicitly regularized by SGD. In particular, we
demonstrate on multiple classification tasks that using a large learning rate
in the initial phase of training reduces the variance of the gradient, and
improves the conditioning of the covariance of gradients. These effects are
beneficial from the optimization perspective and become visible after the
break-even point. Complementing prior work, we also show that using a low
learning rate results in bad conditioning of the loss surface even for a neural
network with batch normalization layers. In short, our work shows that key
properties of the loss surface are strongly influenced by SGD in the early
phase of training. We argue that studying the impact of the identified effects
on generalization is a promising future direction.
Comment: Accepted as a spotlight at ICLR 2020. The last two authors contributed equally.
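The two quantities this abstract tracks, the variance of the mini-batch gradient and the conditioning of the gradient covariance, can be estimated from a handful of mini-batch gradients early in training. A minimal PyTorch sketch (not the authors' code; model, loss_fn, and the batch iterator are assumed to be ordinary PyTorch objects):

```python
import torch

def gradient_stats(model, loss_fn, batches, num_batches=32):
    """Estimate gradient variance and the conditioning of the gradient covariance."""
    grads = []
    for i, (x, y) in enumerate(batches):
        if i >= num_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
        grads.append(g.detach().clone())
    G = torch.stack(grads)                         # (num_batches, num_params)
    mean_g = G.mean(dim=0)
    centered = G - mean_g
    variance = (centered ** 2).sum(dim=1).mean()   # average squared deviation from the mean gradient
    # The nonzero spectrum of the gradient covariance equals that of this small Gram
    # matrix, which keeps the computation tractable for large models.
    gram = centered @ centered.t() / (G.shape[0] - 1)
    eig = torch.linalg.eigvalsh(gram)
    eig = eig[eig > 1e-12]
    condition_number = eig.max() / eig.min()
    return variance.item(), condition_number.item()
```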
Rotational Optimizers: Simple & Robust DNN Training
The training dynamics of modern deep neural networks depend on complex
interactions between the learning rate, weight decay, initialization, and other
hyperparameters. These interactions can give rise to Spherical Motion Dynamics
in scale-invariant layers (e.g., normalized layers), which converge to an
equilibrium state, where the weight norm and the expected rotational update
size are fixed. Our analysis of this equilibrium in AdamW, SGD with momentum,
and Lion provides new insights into the effects of different hyperparameters
and their interactions on the training process. We propose rotational variants
(RVs) of these optimizers that force the expected angular update size to match
the equilibrium value throughout training. This simplifies the training
dynamics by removing the transient phase corresponding to the convergence to an
equilibrium. Our rotational optimizers can match the performance of the
original variants, often with minimal or no tuning of the baseline
hyperparameters, showing that these transient phases are not needed.
Furthermore, we find that the rotational optimizers have a reduced need for
learning rate warmup and improve the optimization of poorly normalized
networks.
Comment: 23 pages, 9 figures.
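As an illustration of the idea (a minimal sketch, not the optimizers proposed in the paper), a rotational SGD-style step for a single scale-invariant weight tensor can remove the radial component of the gradient, rescale the remaining tangential part to a fixed angular step size, and keep the weight norm constant; the fixed angle plays the role of the equilibrium rotational update size:

```python
import torch

@torch.no_grad()
def rotational_sgd_step(weight, grad, target_angle=0.01):
    """One update of fixed angular size on the sphere of constant weight norm."""
    w = weight.flatten()
    g = grad.flatten()
    norm_w = w.norm()
    # Remove the radial (norm-changing) component of the gradient.
    tangential = g - (g @ w) / (norm_w ** 2) * w
    t_norm = tangential.norm()
    if t_norm > 0:
        # Rescale so the step rotates the weight by roughly target_angle radians.
        w = w - tangential / t_norm * (norm_w * target_angle)
        # Project back onto the sphere so the weight norm stays fixed.
        w = w / w.norm() * norm_w
        weight.copy_(w.view_as(weight))
```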
Inefficiency of K-FAC for Large Batch Size Training
In stochastic optimization, using large batch sizes during training can
leverage parallel resources to produce faster wall-clock training times per
training epoch. However, for both training loss and testing error, recent
results analyzing large batch Stochastic Gradient Descent (SGD) have found
sharp diminishing returns, beyond a certain critical batch size. In the hopes
of addressing this, it has been suggested that the Kronecker-Factored
Approximate Curvature (K-FAC) method allows for greater scalability to large
batch sizes for non-convex machine learning problems such as neural network
optimization, as well as greater robustness to variation in model
hyperparameters. Here, we perform a detailed empirical analysis of large batch
size training for both K-FAC and SGD, evaluating performance in terms of both
wall-clock time and aggregate computational cost. Our main results are twofold:
first, we find that neither K-FAC nor SGD has ideal scalability behavior beyond
a certain batch size, and that K-FAC does not exhibit improved large-batch
scalability compared to SGD; and second, we find that K-FAC, in addition to
requiring more hyperparameters to tune, suffers from hyperparameter sensitivity
similar to that of SGD. We discuss extensive results using ResNet and AlexNet
on CIFAR-10 and SVHN, respectively, as well as more general implications of our
findings.
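The kind of scaling measurement described here can be sketched as a sweep over batch sizes that records both steps to a target loss and wall-clock time; train_to_target below is a hypothetical helper standing in for the actual training loop, not code from the paper:

```python
import time

def scaling_sweep(train_to_target, batch_sizes, target_loss):
    """Record steps, wall-clock time, and aggregate cost for each batch size."""
    results = []
    for bs in batch_sizes:
        start = time.time()
        steps = train_to_target(batch_size=bs, target_loss=target_loss)
        results.append({
            "batch_size": bs,
            "steps": steps,
            "wall_clock_s": time.time() - start,
            "examples_processed": steps * bs,   # proxy for aggregate computational cost
        })
    return results

# Ideal scaling halves `steps` each time the batch size doubles; beyond the critical
# batch size, steps plateau and the aggregate cost grows roughly linearly with batch size.
```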
DeepOBS: A Deep Learning Optimizer Benchmark Suite
Because the choice and tuning of the optimizer affect the speed, and
ultimately the performance, of deep learning, there is significant past and
recent research in this area. Yet, perhaps surprisingly, there is no generally
agreed-upon protocol for the quantitative and reproducible evaluation of
optimization strategies for deep learning. We suggest routines and benchmarks
for stochastic optimization, with special focus on the unique aspects of deep
learning, such as stochasticity, tunability and generalization. As the primary
contribution, we present DeepOBS, a Python package of deep learning
optimization benchmarks. The package addresses key challenges in the
quantitative assessment of stochastic optimizers, and automates most steps of
benchmarking. The library includes a wide and extensible set of ready-to-use
realistic optimization problems, such as training Residual Networks for image
classification on ImageNet or character-level language prediction models, as
well as popular classics like MNIST and CIFAR-10. The package also provides
realistic baseline results for the most popular optimizers on these test
problems, ensuring a fair comparison to the competition when benchmarking new
optimizers without having to run costly experiments. It comes with output
back-ends that directly produce LaTeX code for inclusion in academic
publications. It supports TensorFlow and is available open source.
Comment: Accepted at ICLR 2019. 9 pages, 3 figures, 2 tables.
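For orientation only (this is not the DeepOBS API; see the package documentation for its actual interface), the protocol such a suite automates can be sketched as running an optimizer over several seeds on a fixed test problem with a fixed budget and aggregating the results:

```python
import statistics

def benchmark_optimizer(build_and_train, seeds, num_epochs):
    """build_and_train(seed, num_epochs) -> final test accuracy; assumed user-supplied."""
    scores = [build_and_train(seed=s, num_epochs=num_epochs) for s in seeds]
    return {
        "mean_test_accuracy": statistics.mean(scores),
        "std_test_accuracy": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "num_seeds": len(seeds),
    }
```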