31 research outputs found

    Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

    In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle points or local maxima of the train loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapses to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.
    Comment: 37 pages, 12 figures, NeurIPS 2023.
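
    The sufficient condition described above pits curvature against gradient noise. The following is a minimal, hypothetical 1-D sketch in Python (not the paper's construction or code): the per-step "curvature" h is random with a slightly negative mean, so w = 0 is a local maximum of the expected loss 0.5*E[h]*w^2 and plain gradient descent is repelled from it, yet sufficiently large gradient noise still attracts |w| towards the invariant set {w = 0}.

    import numpy as np

    rng = np.random.default_rng(0)
    eta = 0.1        # learning rate
    steps = 5000

    def run(noise_std):
        w = 1.0
        for _ in range(steps):
            h = rng.normal(-0.1, noise_std)   # stochastic per-sample curvature
            w = w - eta * h * w               # SGD step; the set {w = 0} stays invariant
        return abs(w)

    print("no noise  :", run(0.0))   # deterministic GD: |w| grows, repelled from 0
    print("high noise:", run(3.0))   # noisy SGD: |w| collapses towards 0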

    When Are Solutions Connected in Deep Networks?

    The question of how and why the phenomenon of mode connectivity occurs in training deep neural networks has gained remarkable attention in the research community. From a theoretical perspective, two possible explanations have been proposed: (i) the loss function has connected sublevel sets, and (ii) the solutions found by stochastic gradient descent are dropout stable. While these explanations provide insights into the phenomenon, their assumptions are not always satisfied in practice. In particular, the first approach requires the network to have one layer with order of N neurons (N being the number of training samples), while the second one requires the loss to be almost invariant after removing half of the neurons at each layer (up to some rescaling of the remaining ones). In this work, we improve both conditions by exploiting the quality of the features at every intermediate layer together with a milder over-parameterization condition. More specifically, we show that: (i) under generic assumptions on the features of intermediate layers, it suffices that the last two hidden layers have order of √N neurons, and (ii) if subsets of features at each layer are linearly separable, then no over-parameterization is needed to show the connectivity. Our experiments confirm that the proposed condition ensures the connectivity of solutions found by stochastic gradient descent, even in settings where the previous requirements do not hold.
    Comment: Accepted at NeurIPS 2021.
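
    As a hedged illustration of the empirical check in the last sentence (not code from the paper), the standard probe for mode connectivity is to evaluate the training loss along a path between two solutions found by SGD; a low, flat profile indicates they lie in a connected low-loss region. The sketch below uses a simple linear interpolation between flattened parameter vectors.

    import numpy as np

    def loss_along_path(loss_fn, theta_a, theta_b, n_points=21):
        """loss_fn maps a flat parameter vector to a scalar training loss."""
        ts = np.linspace(0.0, 1.0, n_points)
        return ts, np.array([loss_fn((1 - t) * theta_a + t * theta_b) for t in ts])

    # Toy usage with a quadratic stand-in for a network's training loss:
    ts, losses = loss_along_path(lambda w: float(np.sum(w ** 2)),
                                 np.array([1.0, -1.0]), np.array([-1.0, 1.0]))
    print("barrier height along the path:", losses.max())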

    Equivariant Architectures for Learning in Deep Weight Spaces

    Designing machine learning architectures for processing neural networks in their raw weight matrix form is a newly introduced research direction. Unfortunately, the unique symmetry structure of deep weight spaces makes this design very challenging. If successful, such architectures would be capable of performing a wide range of intriguing tasks, from adapting a pre-trained network to a new domain to editing objects represented as functions (INRs or NeRFs). As a first step towards this goal, we present here a novel network architecture for learning in deep weight spaces. It takes as input a concatenation of weights and biases of a pre-trained MLP and processes it using a composition of layers that are equivariant to the natural permutation symmetry of the MLP's weights: changing the order of neurons in intermediate layers of the MLP does not affect the function it represents. We provide a full characterization of all affine equivariant and invariant layers for these symmetries and show how these layers can be implemented using three basic operations: pooling, broadcasting, and fully connected layers applied to the input in an appropriate manner. We demonstrate the effectiveness of our architecture and its advantages over natural baselines in a variety of learning tasks.
    Comment: ICML 2023.
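
    The permutation symmetry invoked above can be checked directly. Below is a small, self-contained illustration (not the paper's architecture): permuting the hidden neurons of a ReLU MLP, i.e. the rows of W1 and b1 together with the matching columns of W2, leaves the computed function unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, d_hidden, d_out = 3, 5, 2
    W1, b1 = rng.normal(size=(d_hidden, d_in)), rng.normal(size=d_hidden)
    W2, b2 = rng.normal(size=(d_out, d_hidden)), rng.normal(size=d_out)

    def mlp(x, W1, b1, W2, b2):
        return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2   # two-layer ReLU MLP

    perm = rng.permutation(d_hidden)                     # reorder hidden neurons
    x = rng.normal(size=d_in)
    print(np.allclose(mlp(x, W1, b1, W2, b2),
                      mlp(x, W1[perm], b1[perm], W2[:, perm], b2)))   # True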

    Population Descent: A Natural-Selection Based Hyper-Parameter Tuning Framework

    First-order gradient descent has been the basis of the most successful optimization algorithms ever implemented. On supervised learning problems with very high dimensionality, such as neural network optimization, it is almost always the algorithm of choice, mainly due to its memory and computational efficiency. However, it is a classical result in optimization that gradient descent converges to local minima on non-convex functions. Even more importantly, in certain high-dimensional cases, escaping the plateaus of large saddle points becomes intractable. On the other hand, black-box optimization methods are not sensitive to the local structure of a loss function's landscape but suffer from the curse of dimensionality. Memetic algorithms aim to combine the benefits of both. Inspired by this, we present Population Descent, a memetic algorithm focused on hyperparameter optimization. We show that an adaptive m-elitist selection approach combined with a normalized-fitness-based randomization scheme outperforms more complex state-of-the-art algorithms by up to 13% on common benchmark tasks.
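
    To make the population-based idea concrete, here is a schematic sketch (an assumption-laden toy, not the authors' Population Descent implementation): the m fittest members survive each generation unchanged, while the rest are perturbed with a strength scaled by normalized fitness, so weaker members explore more aggressively.

    import numpy as np

    rng = np.random.default_rng(0)

    def objective(lr_log10):                  # toy stand-in for "train a model,
        return -(lr_log10 + 2.0) ** 2         #   return the validation score"

    pop = rng.uniform(-5.0, 0.0, size=10)     # population of log10(learning rate)
    m = 2                                     # number of elites kept unchanged
    for generation in range(30):
        fitness = np.array([objective(p) for p in pop])
        order = np.argsort(fitness)[::-1]     # best first
        norm = (fitness - fitness.min()) / (fitness.max() - fitness.min() + 1e-12)
        for i in order[m:]:                   # non-elites: perturb, more if unfit
            pop[i] += rng.normal(0.0, 0.5 * (1.0 - norm[i]))

    print("best log10(lr) found:", pop[np.argmax([objective(p) for p in pop])])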

    Challenges in Markov chain Monte Carlo for Bayesian neural networks

    Markov chain Monte Carlo (MCMC) methods have not been broadly adopted in Bayesian neural networks (BNNs). This paper initially reviews the main challenges in sampling from the parameter posterior of a neural network via MCMC. Such challenges culminate in a lack of convergence to the parameter posterior. Nevertheless, this paper shows that a non-converged Markov chain, generated via MCMC sampling from the parameter space of a neural network, can yield via Bayesian marginalization a valuable predictive posterior of the output of the neural network. Classification examples based on multilayer perceptrons showcase highly accurate predictive posteriors. The postulate of limited scope for MCMC developments in BNNs is partially valid; an asymptotically exact parameter posterior seems less plausible, yet an accurate predictive posterior is a tenable research avenue.
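
    The marginalization step the abstract relies on can be written as p(y | x, D) ≈ (1/T) Σ_t p(y | x, θ_t), averaging the network's predictive distribution over the T parameter draws of the (possibly non-converged) chain. A minimal sketch, with a hypothetical prob_fn standing in for a trained-network likelihood:

    import numpy as np

    def predictive_posterior(prob_fn, samples, x):
        """prob_fn(theta, x) returns the class-probability vector of the network
        with parameters theta at input x; samples is a list of MCMC draws."""
        return np.mean([prob_fn(theta, x) for theta in samples], axis=0)

    # Toy usage with a one-parameter "network" over two classes:
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    prob_fn = lambda theta, x: np.array([1.0 - sigmoid(theta * x), sigmoid(theta * x)])
    print(predictive_posterior(prob_fn, samples=[-0.5, 0.1, 0.8], x=2.0))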

    Fitness Landscape Analysis of Feed-Forward Neural Networks

    Neural network training is a highly non-convex optimisation problem with poorly understood properties. Due to the inherent high dimensionality, neural network search spaces cannot be intuitively visualised; thus, other means to establish search space properties have to be employed. Fitness landscape analysis encompasses a selection of techniques designed to estimate the properties of a search landscape associated with an optimisation problem. Applied to neural network training, fitness landscape analysis can be used to establish a link between the properties of the error landscape and various neural network hyperparameters. This study applies fitness landscape analysis to investigate the influence of the search space boundaries, regularisation parameters, loss functions, activation functions, and feed-forward neural network architectures on the properties of the resulting error landscape. A novel gradient-based sampling technique is proposed, together with a novel method to quantify and visualise stationary points and the associated basins of attraction in neural network error landscapes.
    Thesis (PhD), University of Pretoria, 2019.
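
    As a hedged illustration of one standard fitness-landscape-analysis measure (not the thesis's gradient-based sampler), ruggedness can be estimated from the lag-1 autocorrelation of the error observed along a bounded random walk in weight space: values near 1 indicate a smooth landscape, values near 0 a rugged one.

    import numpy as np

    rng = np.random.default_rng(0)

    def error(w):                               # stand-in for a network's error
        return float(np.sum(w ** 2) + 0.5 * np.sum(np.sin(5.0 * w)))

    def random_walk_autocorrelation(dim=10, steps=2000, step_size=0.05, bound=1.0):
        w = rng.uniform(-bound, bound, size=dim)
        errs = []
        for _ in range(steps):
            w = np.clip(w + rng.normal(0.0, step_size, size=dim), -bound, bound)
            errs.append(error(w))
        e = np.array(errs) - np.mean(errs)
        return float(np.dot(e[:-1], e[1:]) / np.dot(e, e))   # lag-1 autocorrelation

    print("estimated landscape correlation:", random_walk_autocorrelation())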