Iterative Averaging in the Quest for Best Test Error
We analyse and explain the increased generalisation performance of iterate
averaging using a Gaussian process perturbation model between the true and
batch risk surface on the high dimensional quadratic. We derive three phenomena
from our theoretical results: (1) The importance of combining
iterate averaging (IA) with large learning rates and regularisation for
improved regularisation. (2) Justification for less frequent averaging. (3)
That we expect adaptive gradient methods to work equally well, or better, with
iterate averaging than their non-adaptive counterparts. Inspired by these
results, together with empirical investigations of the importance
of appropriate regularisation for the solution diversity of the iterates, we
propose two adaptive algorithms with iterate averaging. These give
significantly better results compared to stochastic gradient descent (SGD),
require less tuning and do not require early stopping or validation set
monitoring. We showcase the efficacy of our approach on the CIFAR-10/100,
ImageNet and Penn Treebank datasets on a variety of modern and classical
network architectures.
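As a rough illustration of the iterate averaging idea described in this abstract, the following minimal sketch runs plain SGD on a toy quadratic with noisy gradients and keeps a running mean of the late iterates. It is not the adaptive algorithms proposed in the paper, and it does not use the paper's Gaussian process perturbation model: the i.i.d. Gaussian gradient noise, the function name `sgd_with_iterate_averaging`, and every parameter value are assumptions chosen for illustration only.

```python
import random


def sgd_with_iterate_averaging(dim=10, steps=2000, lr=0.1, noise=1.0,
                               avg_start=1000, seed=0):
    """Run SGD on the quadratic 0.5 * ||w||^2 with noisy gradients.

    Returns (final_iterate, averaged_iterate). The true minimum is w = 0.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    rng = random.Random(seed)
    w = [1.0] * dim      # current iterate
    avg = [0.0] * dim    # running mean of the iterates seen since avg_start
    n_avg = 0
    for t in range(steps):
        # Noisy gradient: the true gradient (w itself) plus an i.i.d.
        # Gaussian perturbation standing in for mini-batch noise (assumption).
        grad = [wi + noise * rng.gauss(0.0, 1.0) for wi in w]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        if t >= avg_start:
            # Incremental update of the mean of the averaged iterates.
            n_avg += 1
            avg = [ai + (wi - ai) / n_avg for ai, wi in zip(avg, w)]
    return w, avg


def sq_norm(v):
    """Squared Euclidean distance from the minimum at the origin."""
    return sum(x * x for x in v)


final_w, avg_w = sgd_with_iterate_averaging()
print("final iterate    ||w||^2 =", sq_norm(final_w))
print("averaged iterate ||w||^2 =", sq_norm(avg_w))
```

In this toy setting the final iterate keeps fluctuating around the minimum at a scale set by the learning rate, while the averaged iterate lands much closer to it, since averaging cancels much of the gradient noise.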
Random matrix theory and the loss surfaces of neural networks
Neural network models are one of the most successful approaches to machine
learning, enjoying an enormous amount of development and research over recent
years and finding concrete real-world applications in almost any conceivable
area of science, engineering and modern life in general. The theoretical
understanding of neural networks trails significantly behind their practical
success and the engineering heuristics that have grown up around them. Random
matrix theory provides a rich framework of tools with which aspects of neural
network phenomenology can be explored theoretically. In this thesis, we
establish significant extensions of prior work using random matrix theory to
understand and describe the loss surfaces of large neural networks,
particularly generalising to different architectures. Informed by the
historical applications of random matrix theory in physics and elsewhere, we
establish the presence of local random matrix universality in real neural
networks and then utilise this as a modeling assumption to derive powerful and
novel results about the Hessians of neural network loss surfaces and their
spectra. In addition to these major contributions, we make use of random matrix
models for neural network loss surfaces to shed light on modern neural network
training approaches and even to derive a novel and effective variant of a
popular optimisation algorithm.
Overall, this thesis provides important contributions to cement the place of
random matrix theory in the theoretical study of modern neural networks,
reveals some of the limits of existing approaches and begins the study of an
entirely new role for random matrix theory in the theory of deep learning with
important experimental discoveries and novel theoretical results based on local
random matrix universality.
Comment: 320 pages, PhD thesis