3,729 research outputs found

    A jamming transition from under- to over-parametrization affects loss landscape and generalization

    Full text link
    We argue that in fully-connected networks a phase transition delimits the over- and under-parametrized regimes where fitting can or cannot be achieved. Under some general conditions, we show that this transition is sharp for the hinge loss. In the whole over-parametrized regime, poor minima of the loss are not encountered during training since the number of constraints to satisfy is too small to hamper minimization. Our findings support a link between this transition and the generalization properties of the network: as we increase the number of parameters of a given model, starting from an under-parametrized network, we observe that the generalization error displays three phases: (i) initial decay, (ii) increase until the transition point --- where it displays a cusp --- and (iii) slow decay toward a constant for the rest of the over-parametrized regime. Thereby we identify the region where the classical phenomenon of over-fitting takes place, and the region where the model keeps improving, in line with previous empirical observations for modern neural networks.Comment: arXiv admin note: text overlap with arXiv:1809.0934

    Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks

    Full text link
    Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. Very recently, a line of work explains in theory that with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs. However, existing generalization error bounds are unable to explain the good generalization performance of over-parameterized DNNs. The major limitation of most existing generalization bounds is that they are based on uniform convergence and are independent of the training algorithm. In this work, we derive an algorithm-dependent generalization error bound for deep ReLU networks, and show that under certain assumptions on the data distribution, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small generalization error. Our work sheds light on explaining the good generalization performance of over-parameterized deep neural networks.Comment: 27 pages. This version simplifies the proof and improves the presentation in Version 3. In AAAI 202

    Techniques of replica symmetry breaking and the storage problem of the McCulloch-Pitts neuron

    Full text link
    In this article the framework for Parisi's spontaneous replica symmetry breaking is reviewed, and subsequently applied to the example of the statistical mechanical description of the storage properties of a McCulloch-Pitts neuron. The technical details are reviewed extensively, with regard to the wide range of systems where the method may be applied. Parisi's partial differential equation and related differential equations are discussed, and a Green function technique introduced for the calculation of replica averages, the key to determining the averages of physical quantities. The ensuing graph rules involve only tree graphs, as appropriate for a mean-field-like model. The lowest order Ward-Takahashi identity is recovered analytically and is shown to lead to the Goldstone modes in continuous replica symmetry breaking phases. The need for a replica symmetry breaking theory in the storage problem of the neuron has arisen due to the thermodynamical instability of formerly given solutions. Variational forms for the neuron's free energy are derived in terms of the order parameter function x(q), for different prior distribution of synapses. Analytically in the high temperature limit and numerically in generic cases various phases are identified, among them one similar to the Parisi phase in the Sherrington-Kirkpatrick model. Extensive quantities like the error per pattern change slightly with respect to the known unstable solutions, but there is a significant difference in the distribution of non-extensive quantities like the synaptic overlaps and the pattern storage stability parameter. A simulation result is also reviewed and compared to the prediction of the theory.Comment: 103 Latex pages (with REVTeX 3.0), including 15 figures (ps, epsi, eepic), accepted for Physics Report

    Techniques of replica symmetry breaking and the storage problem of the McCulloch-Pitts neuron

    Full text link
    In this article the framework for Parisi's spontaneous replica symmetry breaking is reviewed, and subsequently applied to the example of the statistical mechanical description of the storage properties of a McCulloch-Pitts neuron. The technical details are reviewed extensively, with regard to the wide range of systems where the method may be applied. Parisi's partial differential equation and related differential equations are discussed, and a Green function technique introduced for the calculation of replica averages, the key to determining the averages of physical quantities. The ensuing graph rules involve only tree graphs, as appropriate for a mean-field-like model. The lowest order Ward-Takahashi identity is recovered analytically and is shown to lead to the Goldstone modes in continuous replica symmetry breaking phases. The need for a replica symmetry breaking theory in the storage problem of the neuron has arisen due to the thermodynamical instability of formerly given solutions. Variational forms for the neuron's free energy are derived in terms of the order parameter function x(q), for different prior distribution of synapses. Analytically in the high temperature limit and numerically in generic cases various phases are identified, among them one similar to the Parisi phase in the Sherrington-Kirkpatrick model. Extensive quantities like the error per pattern change slightly with respect to the known unstable solutions, but there is a significant difference in the distribution of non-extensive quantities like the synaptic overlaps and the pattern storage stability parameter. A simulation result is also reviewed and compared to the prediction of the theory.Comment: 103 Latex pages (with REVTeX 3.0), including 15 figures (ps, epsi, eepic), accepted for Physics Report

    Input and Weight Space Smoothing for Semi-supervised Learning

    Full text link
    We propose regularizing the empirical loss for semi-supervised learning by acting on both the input (data) space, and the weight (parameter) space. We show that the two are not equivalent, and in fact are complementary, one affecting the minimality of the resulting representation, the other insensitivity to nuisance variability. We propose a method to perform such smoothing, which combines known input-space smoothing with a novel weight-space smoothing, based on a min-max (adversarial) optimization. The resulting Adversarial Block Coordinate Descent (ABCD) algorithm performs gradient ascent with a small learning rate for a random subset of the weights, and standard gradient descent on the remaining weights in the same mini-batch. It achieves comparable performance to the state-of-the-art without resorting to heavy data augmentation, using a relatively simple architecture
    corecore