An information and field theoretic approach to the grand canonical ensemble
We present a novel derivation of the constraints required to obtain the
underlying principles of statistical mechanics using a maximum entropy
framework. We derive the mean value constraints by use of the central limit
theorem and the scaling properties of Lagrange multipliers. We then arrive at
the same result using a quantum free field theory and the Ward identities. The
work provides a principled footing for maximum entropy methods in statistical
physics, adding to the body of work aligned with Jaynes's vision of statistical
mechanics as a form of inference rather than a physical theory dependent on
ergodicity, metric transitivity and equal a priori probabilities. We show that
statistical independence, in the macroscopic limit, is the unifying concept
that leads to all these derivations.
Comment: 7 pages, 3 pages of Appendix
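For orientation, a textbook maximum-entropy sketch (not the paper's route via the central limit theorem or Ward identities) shows how mean-value constraints on energy and particle number already produce the grand canonical form; identifying the multipliers \(\beta\) and \(\mu\) with inverse temperature and chemical potential is the standard interpretation, assumed here:
\[
\max_{p}\; S = -\sum_i p_i \ln p_i
\quad \text{s.t.} \quad \sum_i p_i = 1,\;\;
\sum_i p_i E_i = \langle E\rangle,\;\;
\sum_i p_i N_i = \langle N\rangle,
\]
and setting the variation with respect to each \(p_i\) to zero with Lagrange multipliers \(\beta\) and \(\beta\mu\) gives
\[
p_i = \frac{1}{\Xi}\, e^{-\beta (E_i - \mu N_i)}, \qquad
\Xi = \sum_i e^{-\beta (E_i - \mu N_i)} .
\]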
Explaining the Adaptive Generalisation Gap
We conjecture that the inherent difference in generalisation between adaptive
and non-adaptive gradient methods stems from the increased estimation noise in
the flattest directions of the true loss surface. We demonstrate that typical
schedules used for adaptive methods (with low numerical stability or damping
constants) serve to bias movement towards flat directions relative to sharp
ones, effectively amplifying the noise-to-signal ratio and harming
generalisation. We further demonstrate that the numerical stability/damping
constant used in these methods can be decomposed into a learning rate reduction
and linear shrinkage of the estimated curvature matrix. We then demonstrate
significant generalisation improvements by increasing the shrinkage
coefficient, closing the generalisation gap entirely in both Logistic
Regression and Deep Neural Network experiments. Finally, we show that other
popular modifications to adaptive methods, such as decoupled weight decay and
partial adaptivity, can be seen to calibrate parameter updates to make better
use of sharper, more reliable directions.
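A minimal numpy sketch of one way to read the damping decomposition described above: a damped curvature preconditioner (H + delta*I)^{-1} can be rewritten as a reduced learning rate applied to the inverse of a linearly shrunk curvature estimate. The identity shrinkage target and the exact factors below are illustrative assumptions, not taken verbatim from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
H = A @ A.T / 50            # stand-in for an estimated curvature (Hessian) matrix
delta = 0.1                 # damping / numerical-stability constant
rho = delta / (1 + delta)   # equivalent linear-shrinkage coefficient

# Damped preconditioner, as typically used in adaptive updates.
damped = np.linalg.inv(H + delta * np.eye(50))

# Same operator expressed as: learning-rate reduction 1/(1+delta) applied to
# the inverse of the shrunk curvature (1-rho)*H + rho*I.
shrunk = np.linalg.inv((1 - rho) * H + rho * np.eye(50)) / (1 + delta)

print(np.allclose(damped, shrunk))  # True: damping = lr reduction + shrinkage
```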
Appearance of Random Matrix Theory in Deep Learning
We investigate the local spectral statistics of the loss surface Hessians of
artificial neural networks, where we discover excellent agreement with Gaussian
Orthogonal Ensemble statistics across several network architectures and
datasets. These results shed new light on the applicability of Random Matrix
Theory to modelling neural networks and suggest a previously unrecognised role
for it in the study of loss surfaces in deep learning. Inspired by these
observations, we propose a novel model for the true loss surfaces of neural
networks, consistent with our observations, which allows for Hessian spectral
densities with rank degeneracy and outliers, extensively observed in practice,
and predicts a growing independence of loss gradients as a function of distance
in weight-space. We further investigate the importance of the true loss surface
in neural networks and find, in contrast to previous work, that the exponential
hardness of locating the global minimum has practical consequences for
achieving state-of-the-art performance.
Comment: 33 pages, 14 figures
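As a sketch of the kind of local spectral statistic such comparisons rest on, the snippet below computes the nearest-neighbour spacing ratio of a sampled Gaussian Orthogonal Ensemble matrix, which stands in here for a network Hessian; the matrix size and reference values are illustrative, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
goe = (A + A.T) / np.sqrt(2 * n)     # GOE-distributed symmetric matrix

eigs = np.sort(np.linalg.eigvalsh(goe))
s = np.diff(eigs)                    # consecutive level spacings
# Spacing ratio avoids the need to unfold the spectrum.
r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])

print(f"mean spacing ratio <r> = {r.mean():.3f}")   # ~0.536 for GOE
print("Poisson (independent levels) reference: ~0.386")
```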