3,729 research outputs found
A jamming transition from under- to over-parametrization affects loss landscape and generalization
We argue that in fully-connected networks a phase transition delimits the
over- and under-parametrized regimes where fitting can or cannot be achieved.
Under some general conditions, we show that this transition is sharp for the
hinge loss. In the whole over-parametrized regime, poor minima of the loss are
not encountered during training since the number of constraints to satisfy is
too small to hamper minimization. Our findings support a link between this
transition and the generalization properties of the network: as we increase the
number of parameters of a given model, starting from an under-parametrized
network, we observe that the generalization error goes through three phases: (i)
an initial decay, (ii) an increase up to the transition point, where it exhibits
a cusp, and (iii) a slow decay toward a constant for the rest of the
over-parametrized regime. Thereby we identify the region where the classical
phenomenon of over-fitting takes place, and the region where the model keeps
improving, in line with previous empirical observations for modern neural
networks. Comment: arXiv admin note: text overlap with arXiv:1809.0934
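The experiment behind these observations can be pictured with a small numerical sketch. The code below is my own illustration, not the authors' code: it sweeps the width of a fully-connected network trained with the hinge loss and records train and test error, the quantities in which the fitting transition and the cusp would show up. The data model (a random linear teacher), network sizes, and optimizer settings are assumptions made purely for illustration.

```python
# Illustrative sketch only: sweep network width, train with the hinge loss,
# and record train/test error across the under-/over-parametrized regimes.
# Data model, widths, and learning rate are assumptions.
import torch
import torch.nn as nn

def make_net(width, dim, depth=3):
    layers, d = [], dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

def hinge_loss(out, y):
    # Labels y are +/-1; the loss vanishes once every margin exceeds 1 (perfect fitting).
    return torch.clamp(1.0 - y * out.squeeze(-1), min=0.0).mean()

def train_and_evaluate(width, n_train=300, n_test=2000, dim=30, steps=2000, lr=0.1, seed=0):
    torch.manual_seed(seed)
    teacher = torch.randn(dim)
    x_tr, x_te = torch.randn(n_train, dim), torch.randn(n_test, dim)
    y_tr, y_te = torch.sign(x_tr @ teacher), torch.sign(x_te @ teacher)

    net = make_net(width, dim)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        hinge_loss(net(x_tr), y_tr).backward()
        opt.step()

    with torch.no_grad():
        train_err = (torch.sign(net(x_tr).squeeze(-1)) != y_tr).float().mean().item()
        test_err = (torch.sign(net(x_te).squeeze(-1)) != y_te).float().mean().item()
    return train_err, test_err

for width in [2, 4, 8, 16, 32, 64, 128]:
    tr, te = train_and_evaluate(width)
    print(f"width {width:4d}: train error {tr:.3f}, test error {te:.3f}")
```

Plotting the test error against the width (or the parameter count) for such a sweep is how the initial decay, cusp, and slow final decay would be traced in practice.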
Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
Empirical studies show that gradient-based methods can learn deep neural
networks (DNNs) with very good generalization performance in the
over-parameterization regime, where DNNs can easily fit a random labeling of
the training data. Very recently, a line of work has shown theoretically that,
with over-parameterization and proper random initialization, gradient-based
methods can find the global minima of the training loss for DNNs. However, existing
generalization error bounds are unable to explain the good generalization
performance of over-parameterized DNNs. The major limitation of most existing
generalization bounds is that they are based on uniform convergence and are
independent of the training algorithm. In this work, we derive an
algorithm-dependent generalization error bound for deep ReLU networks, and show
that under certain assumptions on the data distribution, gradient descent (GD)
with proper random initialization is able to train a sufficiently
over-parameterized DNN to achieve arbitrarily small generalization error. Our
work sheds light on explaining the good generalization performance of
over-parameterized deep neural networks. Comment: 27 pages. This version simplifies the proof and improves the presentation in Version 3. In AAAI 202
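As a small, hedged illustration of the setting this bound concerns (not of the bound or its proof), the sketch below trains a wide, randomly initialized deep ReLU network with full-batch gradient descent on randomly labeled data, the regime in which over-parameterized networks interpolate the training set. The width, depth, learning rate, and the logistic loss are my assumptions.

```python
# Illustrative sketch: full-batch gradient descent on a wide deep ReLU network
# with Gaussian-style random initialization, fitting random +/-1 labels.
# Width, depth, learning rate, and the logistic loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def wide_relu_net(dim, width=512, depth=4):
    layers, d = [], dim
    for _ in range(depth - 1):
        lin = nn.Linear(d, width)
        nn.init.kaiming_normal_(lin.weight)   # "proper" random initialization
        nn.init.zeros_(lin.bias)
        layers += [lin, nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, 1))
    return nn.Sequential(*layers)

torch.manual_seed(0)
n, dim = 200, 20
x = torch.randn(n, dim)
y = torch.randint(0, 2, (n,)).float() * 2 - 1   # random +/-1 labels

net = wide_relu_net(dim)
opt = torch.optim.SGD(net.parameters(), lr=0.05)  # plain (full-batch) gradient descent
for step in range(2001):
    opt.zero_grad()
    margins = y * net(x).squeeze(-1)
    loss = F.softplus(-margins).mean()            # logistic loss for +/-1 labels
    loss.backward()
    opt.step()
    if step % 500 == 0:
        acc = (margins > 0).float().mean().item()
        print(f"step {step:5d}: loss {loss.item():.4f}, train accuracy {acc:.2f}")
```

The algorithm-dependent analysis described above concerns exactly this kind of trajectory: what gradient descent from a proper random initialization does, rather than a uniform bound over all networks of the same size.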
Techniques of replica symmetry breaking and the storage problem of the McCulloch-Pitts neuron
In this article the framework for Parisi's spontaneous replica symmetry
breaking is reviewed, and subsequently applied to the example of the
statistical mechanical description of the storage properties of a
McCulloch-Pitts neuron. The technical details are reviewed extensively, with
regard to the wide range of systems where the method may be applied. Parisi's
partial differential equation and related differential equations are discussed,
and a Green function technique is introduced for the calculation of replica
averages, the key to determining the averages of physical quantities. The
ensuing graph rules involve only tree graphs, as appropriate for a
mean-field-like model. The lowest order Ward-Takahashi identity is recovered
analytically and is shown to lead to the Goldstone modes in continuous replica
symmetry breaking phases. The need for a replica symmetry breaking theory in
the storage problem of the neuron has arisen from the thermodynamic
instability of previously proposed solutions. Variational forms for the neuron's
free energy are derived in terms of the order parameter function x(q), for
different prior distributions of the synapses. Analytically in the high-temperature
limit and numerically in generic cases various phases are identified, among
them one similar to the Parisi phase in the Sherrington-Kirkpatrick model.
Extensive quantities like the error per pattern change slightly with respect to
the known unstable solutions, but there is a significant difference in the
distribution of non-extensive quantities like the synaptic overlaps and the
pattern storage stability parameter. A simulation result is also reviewed and
compared to the prediction of the theory. Comment: 103 LaTeX pages (with REVTeX 3.0), including 15 figures (ps, epsi, eepic), accepted for Physics Report
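For a concrete handle on the storage problem itself (as opposed to the replica calculation), the following small simulation, which is my own sketch and not taken from the article, tries to find couplings that store P = alpha*N random patterns with positive stability using a perceptron-style update on the least stable pattern. For unbiased patterns at zero margin, Cover/Gardner theory places the capacity near alpha_c = 2. System size, iteration budget, and the update rule are assumptions.

```python
# Sketch of the storage problem (not the replica calculation): for several
# loads alpha = P/N, attempt to store P random +/-1 patterns in a single
# McCulloch-Pitts neuron by repeatedly reinforcing the least stable pattern.
import numpy as np

def try_to_store(N, alpha, max_updates=10000, seed=0):
    rng = np.random.default_rng(seed)
    P = int(alpha * N)
    xi = rng.choice([-1.0, 1.0], size=(P, N))   # random input patterns
    sigma = rng.choice([-1.0, 1.0], size=P)     # random target outputs
    J = rng.normal(size=N)                      # initial couplings
    for _ in range(max_updates):
        stabilities = sigma * (xi @ J) / np.linalg.norm(J)
        worst = np.argmin(stabilities)
        if stabilities[worst] > 0.0:            # every pattern has positive stability: stored
            return True
        J += sigma[worst] * xi[worst] / np.sqrt(N)   # perceptron-style update on worst pattern
    return False

N = 100
for alpha in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]:
    successes = [try_to_store(N, alpha, seed=s) for s in range(5)]
    print(f"alpha = {alpha:.1f}: stored in {sum(successes)}/5 runs")
```

The replica-symmetry-breaking analysis reviewed in the article addresses what happens beyond this capacity, where such couplings no longer exist and the distribution of stabilities and synaptic overlaps becomes the interesting object.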
Input and Weight Space Smoothing for Semi-supervised Learning
We propose regularizing the empirical loss for semi-supervised learning by
acting on both the input (data) space, and the weight (parameter) space. We
show that the two are not equivalent but complementary: one affects the
minimality of the resulting representation, the other its insensitivity to
nuisance variability. We propose a method to perform such
smoothing, which combines known input-space smoothing with a novel weight-space
smoothing, based on a min-max (adversarial) optimization. The resulting
Adversarial Block Coordinate Descent (ABCD) algorithm performs gradient ascent
with a small learning rate for a random subset of the weights, and standard
gradient descent on the remaining weights in the same mini-batch. It achieves
performance comparable to the state of the art without resorting to heavy data
augmentation, using a relatively simple architecture.
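Based only on the description in this abstract, one ABCD update might look like the sketch below: a random subset of the weights takes a small gradient-ascent step while the remaining weights take a standard descent step, all on the same mini-batch. The subset fraction and both learning rates are assumptions; this is not the authors' reference implementation.

```python
# Minimal sketch of one ABCD-style update, following the abstract's description.
# The ascent fraction and learning rates are assumptions for illustration.
import torch

def abcd_step(model, loss_fn, x, y, lr_descent=0.1, lr_ascent=0.01, ascent_frac=0.1):
    model.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            # Randomly mark a fraction of this tensor's entries as the "ascent" block.
            ascend = torch.rand_like(p) < ascent_frac
            update = torch.where(ascend,
                                 lr_ascent * p.grad,    # adversarial ascent coordinates
                                 -lr_descent * p.grad)  # standard descent coordinates
            p.add_(update)
    return loss.item()
```

In a training loop, a call to abcd_step per mini-batch would simply replace the usual optimizer step, which is what makes the weight-space smoothing cheap to combine with existing input-space smoothing.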