A Mean Field View of the Landscape of Two-Layers Neural Networks
Multi-layer neural networks are among the most powerful models in machine
learning, yet the fundamental reasons for this success defy mathematical
understanding. Learning a neural network requires optimizing a non-convex, high-dimensional objective (the risk function), a problem that is usually attacked
using stochastic gradient descent (SGD). Does SGD converge to a global optimum
of the risk or only to a local optimum? In the first case, does this happen
because local minima are absent, or because SGD somehow avoids them? In the
second, why do local minima reached by SGD have good generalization properties?
In this paper we consider a simple case, namely two-layer neural networks, and prove that, in a suitable scaling limit, the SGD dynamics is captured by a certain non-linear partial differential equation (PDE) that we call
distributional dynamics (DD). We then consider several specific examples, and
show how DD can be used to prove convergence of SGD to networks with nearly
ideal generalization error. This description allows one to 'average out' some of the complexities of the landscape of neural networks, and can be used to prove a general convergence result for noisy SGD.
Comment: 103 pages
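For orientation, the 'distributional dynamics' referred to above is a nonlinear PDE for the distribution of a single hidden unit's parameters in the large-width limit. A schematic form, in our own notation (paraphrasing rather than quoting the paper), is
\[
\partial_t \rho_t = 2\xi(t)\,\nabla_\theta\!\cdot\!\big(\rho_t\,\nabla_\theta \Psi(\theta;\rho_t)\big),
\qquad
\Psi(\theta;\rho) = V(\theta) + \int U(\theta,\tilde\theta)\,\rho(\mathrm{d}\tilde\theta),
\]
where \rho_t is the limiting parameter distribution after t units of rescaled SGD time, \xi(t) is an effective step-size schedule, and V and U collect the single-neuron and pairwise terms of the risk; noisy SGD contributes an additional diffusion (Laplacian) term on the right-hand side.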
Phase Transitions of Neural Networks
The cooperative behaviour of interacting neurons and synapses is studied
using models and methods from statistical physics. The competition between
training error and entropy may lead to discontinuous properties of the neural
network. This is demonstrated for a few examples: the perceptron, associative memory, learning from examples, generalization, multilayer networks, structure recognition, Bayesian estimation, on-line training, noise estimation, and time series generation.
Comment: Plenary talk for the MINERVA workshop on mesoscopics, fractals and neural networks, Eilat, March 1997
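The 'competition between training error and entropy' mentioned above is the usual free-energy balance of the statistical mechanics of learning (a standard fact of this framework, not specific to the talk): with a Gibbs measure over weights at inverse temperature \beta,
\[
Z(\beta) = \int \mathrm{d}\mu(\mathbf{w})\, e^{-\beta E_{\mathrm{train}}(\mathbf{w})},
\qquad
\beta F = -\ln Z = \beta\,\langle E_{\mathrm{train}}\rangle - S,
\]
a discontinuous (first-order) transition occurs when two branches of the free energy cross, so that order parameters such as the teacher-student overlap, and with them the generalization error, jump as the temperature or the number of training examples is varied.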
A jamming transition from under- to over-parametrization affects loss landscape and generalization
We argue that in fully-connected networks a phase transition delimits the
over- and under-parametrized regimes where fitting can or cannot be achieved.
Under some general conditions, we show that this transition is sharp for the
hinge loss. In the whole over-parametrized regime, poor minima of the loss are
not encountered during training since the number of constraints to satisfy is
too small to hamper minimization. Our findings support a link between this
transition and the generalization properties of the network: as we increase the
number of parameters of a given model, starting from an under-parametrized
network, we observe that the generalization error displays three phases: (i) an initial decay, (ii) an increase up to the transition point, where it displays a cusp, and (iii) a slow decay toward a constant throughout the rest of the over-parametrized regime. We thereby identify the region where the classical phenomenon of over-fitting takes place and the region where the model keeps improving, in line with previous empirical observations for modern neural networks.
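A minimal sketch of the kind of width sweep described above, written here purely for illustration (PyTorch, with a random linear teacher as an assumed data source; not the authors' code or experimental setup): one-hidden-layer networks of increasing width are fit with the hinge loss, and the training and test errors are recorded.

import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_train, n_test = 20, 500, 2000
teacher = torch.randn(d)               # hypothetical linear teacher generating the labels

def make_data(n):
    x = torch.randn(n, d)
    return x, torch.sign(x @ teacher)  # labels in {-1, +1}

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

def train_width(h, steps=2000, lr=0.05):
    """Fit a one-hidden-layer net of width h by (full-batch) gradient descent on the hinge loss."""
    model = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        margin = y_tr * model(x_tr).squeeze(1)
        torch.clamp(1.0 - margin, min=0.0).mean().backward()  # hinge loss on +/-1 labels
        opt.step()
    with torch.no_grad():
        err = lambda x, y: (torch.sign(model(x).squeeze(1)) != y).float().mean().item()
        return err(x_tr, y_tr), err(x_te, y_te)

for h in (2, 5, 10, 20, 50, 100, 200):
    tr, te = train_width(h)
    print(f"width={h:4d}  train_err={tr:.3f}  test_err={te:.3f}")

In the abstract's terms, the smallest width at which the training error reaches zero plays the role of the transition point, and one looks for a bump or cusp in the test error around it, followed by a slow decrease at larger widths.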