31 research outputs found

    Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions

    Full text link
    We show that the representation cost of fully connected neural networks with homogeneous nonlinearities - which describes the implicit bias in function space of networks with $L_2$-regularization or with losses such as the cross-entropy - converges, as the depth of the network goes to infinity, to a notion of rank over nonlinear functions. We then inquire under which conditions the global minima of the loss recover the 'true' rank of the data: we show that for too large depths the global minimum will be approximately rank 1 (underestimating the rank); we then argue that there is a range of depths, which grows with the number of datapoints, where the true rank is recovered. Finally, we discuss the effect of the rank of a classifier on the topology of the resulting class boundaries and show that autoencoders with optimal nonlinear rank are naturally denoising.
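    A compact way to write the two objects in play, as a sketch under assumed conventions (the minimization over parameters $\theta$ of a depth-$L$ network realizing $f$, and the normalization of the limit by the depth $L$, are assumptions not spelled out above):

        R_L(f) \;=\; \min_{\theta \,:\, f_\theta = f} \|\theta\|_2^2,
        \qquad
        \lim_{L \to \infty} \frac{R_L(f)}{L} \;=\; \operatorname{Rank}(f),

    so that, for very deep networks, minimizing the $L_2$-regularized loss amounts to approximately minimizing this nonlinear notion of rank.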

    Bottleneck Structure in Learned Features: Low-Dimension vs Regularity Tradeoff

    Full text link
    Previous work has shown that DNNs with large depth $L$ and $L_2$-regularization are biased towards learning low-dimensional representations of the inputs, which can be interpreted as minimizing a notion of rank $R^{(0)}(f)$ of the learned function $f$, conjectured to be the Bottleneck rank. We compute finite-depth corrections to this result, revealing a measure $R^{(1)}$ of regularity which bounds the pseudo-determinant of the Jacobian $\left|Jf(x)\right|_{+}$ and is subadditive under composition and addition. This formalizes a balance between learning low-dimensional representations and minimizing complexity/irregularity in the feature maps, allowing the network to learn the 'right' inner dimension. We also show how large learning rates control the regularity of the learned function. Finally, we use these theoretical tools to prove the conjectured bottleneck structure in the learned features as $L\to\infty$: for large depths, almost all hidden representations concentrate around $R^{(0)}(f)$-dimensional representations. These limiting low-dimensional representations can be described using the second correction $R^{(2)}$.
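    One way to state the finite-depth correction described above is as an expansion of the depth-normalized representation cost; the form of this expansion is an assumption consistent with the abstract, while the subadditivity of $R^{(1)}$ under composition and addition is stated in it:

        \frac{R_L(f)}{L} \;\approx\; R^{(0)}(f) \;+\; \frac{1}{L}\,R^{(1)}(f) \;+\; \frac{1}{L^{2}}\,R^{(2)}(f) \;+\; \dots,
        \qquad
        R^{(1)}(f \circ g) \,\le\, R^{(1)}(f) + R^{(1)}(g),
        \quad
        R^{(1)}(f + g) \,\le\, R^{(1)}(f) + R^{(1)}(g).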

    Implicit bias of SGD in L2L_{2}-regularized linear DNNs: One-way jumps from high to low rank

    Full text link
    The $L_2$-regularized loss of Deep Linear Networks (DLNs) with more than one hidden layer has multiple local minima, corresponding to matrices of different ranks. In tasks such as matrix completion, the goal is to converge to the local minimum with the smallest rank that still fits the training data. While rank-underestimating minima can be avoided, since they do not fit the data, GD might get stuck at rank-overestimating minima. We show that with SGD there is always a probability of jumping from a higher-rank minimum to a lower-rank one, but the probability of jumping back is zero. More precisely, we define a sequence of sets $B_{1}\subset B_{2}\subset\cdots\subset B_{R}$ so that $B_{r}$ contains all minima of rank $r$ or less (and not more) and is absorbing for small enough ridge parameters $\lambda$ and learning rates $\eta$: SGD has probability 0 of leaving $B_{r}$, and from any starting point there is a non-zero probability that SGD enters $B_{r}$.
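    A minimal numerical sketch of this setting (not the authors' code; the matrix size, depth, width, batch size, $\lambda$ and $\eta$ below are illustrative, untuned choices): an $L_2$-regularized deep linear network trained by SGD on randomly observed entries of a rank-2 matrix, reporting the effective rank of the end-to-end product at the end of training.

        # Sketch: SGD on an L2-regularized deep linear network for matrix completion.
        # The quantity of interest is the rank of the end-to-end product matrix.
        import numpy as np

        rng = np.random.default_rng(0)
        n, depth, width = 20, 3, 20           # matrix size, number of factors, hidden width
        lam, eta, steps = 1e-4, 0.1, 5000     # ridge parameter lambda, learning rate eta, SGD steps

        # Rank-2 ground truth with unit operator norm, observed on a random mask.
        M = rng.standard_normal((n, 2)) @ rng.standard_normal((2, n))
        M /= np.linalg.norm(M, 2)
        obs = np.argwhere(rng.random((n, n)) < 0.5)

        # Deep linear network: the end-to-end map is the product W[depth-1] @ ... @ W[0].
        Ws = [0.1 * rng.standard_normal((width if i < depth - 1 else n,
                                         width if i > 0 else n)) for i in range(depth)]

        def product(factors):
            P = factors[0]
            for W in factors[1:]:
                P = W @ P
            return P

        for _ in range(steps):
            batch = obs[rng.choice(len(obs), size=32)]   # minibatch of observed entries
            P = product(Ws)
            R = np.zeros_like(P)                         # residual, nonzero on the sampled entries only
            R[batch[:, 0], batch[:, 1]] = P[batch[:, 0], batch[:, 1]] - M[batch[:, 0], batch[:, 1]]
            # Gradient of 0.5*||R||_F^2 + 0.5*lam*sum_i ||W_i||_F^2 with respect to each factor.
            grads = []
            for i in range(depth):
                left = product(Ws[i + 1:]) if i < depth - 1 else np.eye(n)
                right = product(Ws[:i]) if i > 0 else np.eye(n)
                grads.append(left.T @ R @ right.T + lam * Ws[i])
            for W, g in zip(Ws, grads):
                W -= eta * g

        # Effective rank: number of singular values of the learned matrix above a threshold.
        print(np.sum(np.linalg.svd(product(Ws), compute_uv=False) > 0.05))

    Under the result described above, a run that reaches a rank-overestimating minimum can still jump down to a lower-rank one thanks to the SGD noise, while the reverse jump has probability zero.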

    Disentangling feature and lazy training in deep neural networks

    Full text link
    Two distinct limits for deep learning have been derived as the network width $h\rightarrow \infty$, depending on how the weights of the last layer scale with $h$. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel $\Theta$. By contrast, in the Mean-Field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as $\alpha h^{-1/2}$ at initialization. By varying $\alpha$ and $h$, we probe the crossover between the two limits. We observe the previously identified regimes of lazy training and feature training. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time and learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 with various architectures. We find that (i) the two regimes are separated by an $\alpha^*$ that scales as $h^{-1/2}$; (ii) network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks generally perform better in the lazy-training regime, unlike convolutional networks; (iii) in both regimes, the fluctuations $\delta F$ induced on the learned function by the initial conditions decay as $\delta F\sim 1/\sqrt{h}$, leading to a performance that increases with $h$, and the same improvement can also be obtained at intermediate width by ensemble-averaging several networks; (iv) in the feature-training regime we identify a time scale $t_1\sim\sqrt{h}\,\alpha$ such that for $t\ll t_1$ the dynamics is linear.
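    A minimal sketch (illustrative, not the paper's experiments) of the scaling in question: a one-hidden-layer network whose readout is scaled by $\alpha h^{-1/2}$, trained by full-batch gradient descent on a toy regression task, reporting the relative movement of the first-layer weights as a simple proxy for how much the features change; near-frozen weights correspond to the lazy regime, large movement to feature training. The learning-rate rescaling by $1/\alpha^{2}$ is a choice made here so that runs with different $\alpha$ train at comparable speed in function space.

        # Sketch: probe lazy vs. feature training by varying the last-layer scale alpha.
        import numpy as np

        def relative_feature_movement(alpha, h=512, n=64, d=10, steps=2000, seed=0):
            rng = np.random.default_rng(seed)
            X = rng.standard_normal((n, d))
            y = np.sin(X[:, 0])                          # toy scalar target
            W = rng.standard_normal((h, d)) / np.sqrt(d) # first layer
            a = rng.standard_normal(h)                   # readout weights
            W0 = W.copy()
            lr = 0.05 / alpha**2                         # keep the function-space speed comparable across alpha
            for _ in range(steps):
                Z = X @ W.T                              # pre-activations, shape (n, h)
                phi = np.maximum(Z, 0.0)                 # ReLU features
                f = alpha / np.sqrt(h) * phi @ a         # output with the alpha * h^{-1/2} readout scale
                r = f - y                                # residual of the squared loss
                grad_a = alpha / np.sqrt(h) * phi.T @ r / n
                grad_W = alpha / np.sqrt(h) * ((np.outer(r, a) * (Z > 0)).T @ X) / n
                a -= lr * grad_a
                W -= lr * grad_W
            # Relative movement of the first-layer weights: close to 0 when training is lazy.
            return np.linalg.norm(W - W0) / np.linalg.norm(W0)

        for alpha in (0.1, 10.0):
            print(f"alpha={alpha}: relative feature movement {relative_feature_movement(alpha):.3f}")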

    Chinese bees

    Get PDF
    3 p. : ill. ; 23 cm