31 research outputs found
Implicit Bias of Large Depth Networks: a Notion of Rank for Nonlinear Functions
We show that the representation cost of fully connected neural networks with
homogeneous nonlinearities - which describes the implicit bias in function
space of networks with $L_2$-regularization or with losses such as the
cross-entropy - converges as the depth of the network goes to infinity to a
notion of rank over nonlinear functions. We then inquire under which conditions
the global minima of the loss recover the 'true' rank of the data: we show that
for too large depths the global minimum will be approximately rank 1
(underestimating the rank); we then argue that there is a range of depths which
grows with the number of datapoints where the true rank is recovered. Finally,
we discuss the effect of the rank of a classifier on the topology of the
resulting class boundaries and show that autoencoders with optimal nonlinear
rank are naturally denoising.
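For orientation, the display below restates the central object schematically: the representation cost of a function realizable by a depth-$L$ network, and its depth-normalized infinite-depth limit. The notation $R(f;\Omega,\sigma,L)$, the parameter vector $\theta$, and $\mathrm{Rank}(f;\Omega)$ are illustrative shorthand consistent with the abstract, not a verbatim statement from the paper.

```latex
% Schematic statement (illustrative notation, not quoted from the paper):
% representation cost of a function f realizable on a domain Omega by a
% depth-L network with homogeneous nonlinearity sigma, and its
% infinite-depth limit as a notion of rank over nonlinear functions.
\[
  R(f;\Omega,\sigma,L) \;=\; \min_{\theta \,:\, f_\theta = f \text{ on } \Omega} \|\theta\|_2^2,
  \qquad
  \lim_{L \to \infty} \frac{R(f;\Omega,\sigma,L)}{L} \;=\; \mathrm{Rank}(f;\Omega).
\]
% Minimizing an L2-regularized loss at large depth therefore implicitly
% minimizes this nonlinear notion of rank of the learned function.
```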
Bottleneck Structure in Learned Features: Low-Dimension vs Regularity Tradeoff
Previous work has shown that DNNs with large depth $L$ and
$L_2$-regularization are biased towards learning low-dimensional
representations of the inputs, which can be interpreted as minimizing a notion
of rank $R^{(0)}(f)$ of the learned function $f$, conjectured to be the
Bottleneck rank. We compute finite depth corrections to this result, revealing
a measure of regularity $R^{(1)}$ which bounds the pseudo-determinant of the
Jacobian $\left|Jf(x)\right|_{+}$ and is subadditive under composition and
addition. This formalizes a balance between learning low-dimensional
representations and minimizing complexity/irregularity in the feature maps,
allowing the network to learn the 'right' inner dimension. We also show how
large learning rates control the regularity of the learned function.
Finally, we use these theoretical tools to prove the conjectured bottleneck
structure in the learned features as $L \to \infty$: for large depths, almost all
hidden representations concentrate around $R^{(0)}(f)$-dimensional
representations. These limiting low-dimensional representations can be described
using the second correction $R^{(2)}$.
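As a concrete way to see what the bottleneck structure would look like in practice, the sketch below inspects a network's hidden representations and its Jacobian. It is a minimal PyTorch diagnostic using an untrained stand-in network and arbitrary sizes and thresholds, not code from the paper: fast decay of the singular values of a hidden representation indicates a low inner dimension, and the log pseudo-determinant of the Jacobian is the quantity the regularity measure $R^{(1)}$ is said to bound.

```python
# Minimal diagnostic sketch (illustrative, not the paper's code): inspect
# (a) the singular-value spectrum of each hidden representation and
# (b) the log pseudo-determinant of the Jacobian of the network.
import torch

torch.manual_seed(0)

depth, width, d_in, d_out = 8, 64, 10, 3
layers = []
for l in range(depth):
    layers += [torch.nn.Linear(d_in if l == 0 else width,
                               d_out if l == depth - 1 else width),
               torch.nn.Identity() if l == depth - 1 else torch.nn.ReLU()]
net = torch.nn.Sequential(*layers)  # stands in for a trained network

x = torch.randn(256, d_in)          # stands in for test inputs

# (a) singular values of each (centered) hidden representation
h = x
for i, layer in enumerate(net):
    h = layer(h)
    if isinstance(layer, torch.nn.ReLU):
        s = torch.linalg.svdvals(h - h.mean(0))
        print(f"layer {i}: top-5 singular values {s[:5].tolist()}")

# (b) log pseudo-determinant of the Jacobian at one input
J = torch.autograd.functional.jacobian(net, x[0])   # shape (d_out, d_in)
s = torch.linalg.svdvals(J)
log_pdet = torch.log(s[s > 1e-8]).sum()
print("log pseudo-determinant of J f(x):", log_pdet.item())
```

On a network trained with weight decay at large depth, one would expect only about $R^{(0)}(f)$ of the singular values in (a) to remain large.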
Implicit bias of SGD in $L_2$-regularized linear DNNs: One-way jumps from high to low rank
The $L_2$-regularized loss of Deep Linear Networks (DLNs) with more than
one hidden layer has multiple local minima, corresponding to matrices with
different ranks. In tasks such as matrix completion, the goal is to converge to
the local minimum with the smallest rank that still fits the training data.
While rank-underestimating minima can be avoided since they do not fit the
data, GD might get stuck at rank-overestimating minima. We show that with SGD,
there is always a probability to jump from a higher rank minimum to a lower
rank one, but the probability of jumping back is zero. More precisely, we
define a sequence of sets $B_1 \subset B_2 \subset \cdots \subset B_R$ so
that $B_r$ contains all minima of rank $r$ or less (and not more) that are
absorbing for small enough ridge parameters $\lambda$ and learning rates
$\eta$: SGD has probability 0 of leaving $B_r$, and from any starting point there
is a non-zero probability for SGD to enter $B_r$.
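The following toy sketch illustrates the setting: SGD on an $L_2$-regularized deep linear network fit to partially observed matrix entries, while tracking the effective rank of the end-to-end matrix $W_L \cdots W_1$. The data, the ridge $\lambda$, the learning rate $\eta$, and the rank threshold are arbitrary choices for illustration and are not taken from the paper; the helper `end_to_end` is our own.

```python
# Illustrative sketch (our own toy setup, not the paper's experiments):
# SGD on an L2-regularized deep linear network for matrix completion,
# monitoring how many singular values of the end-to-end matrix stay large.
# Under small lambda and eta, the effective rank is expected to only drop.
import numpy as np

rng = np.random.default_rng(0)
n, true_rank, depth = 10, 2, 3
target = rng.normal(size=(n, true_rank)) @ rng.normal(size=(true_rank, n))
mask = rng.random((n, n)) < 0.5                     # observed entries
idx = np.argwhere(mask)

Ws = [np.eye(n) + 0.1 * rng.normal(size=(n, n)) for _ in range(depth)]
lam, eta, batch = 1e-3, 5e-2, 20

def end_to_end(Ws):
    # product of the weight matrices, applied in order Ws[0], Ws[1], ...
    P = np.eye(n)
    for W in Ws:
        P = W @ P
    return P

for step in range(5000):
    sel = idx[rng.choice(len(idx), size=batch)]     # mini-batch of entries
    P = end_to_end(Ws)
    R = np.zeros_like(P)
    R[sel[:, 0], sel[:, 1]] = (P - target)[sel[:, 0], sel[:, 1]] / batch
    # gradient of the fit term through the product, plus the ridge term
    for i in range(depth):
        left = end_to_end(Ws[i + 1:])               # layers after W_i
        right = end_to_end(Ws[:i])                  # layers before W_i
        Ws[i] -= eta * (left.T @ R @ right.T + lam * Ws[i])
    if step % 1000 == 0:
        svals = np.linalg.svd(end_to_end(Ws), compute_uv=False)
        print(step, "effective rank:", int((svals > 1e-2).sum()))
```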
Disentangling feature and lazy training in deep neural networks
Two distinct limits for deep learning have been derived as the network width
$h \to \infty$, depending on how the weights of the last layer scale
with $h$. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear
in the weights and is described by a frozen kernel $\Theta$. By contrast, in
the Mean-Field limit, the dynamics can be expressed in terms of the
distribution of the parameters associated with a neuron, that follows a partial
differential equation. In this work we consider deep networks where the weights
in the last layer scale as $\alpha h^{-1/2}$ at initialization. By varying $\alpha$
and $h$, we probe the crossover between the two limits. We observe the
previously identified regimes of lazy training and feature training. In the
lazy-training regime, the dynamics is almost linear and the NTK barely changes
after initialization. The feature-training regime includes the mean-field
formulation as a limiting case and is characterized by a kernel that evolves in
time, and learns some features. We perform numerical experiments on MNIST,
Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find
that (i) The two regimes are separated by an $\alpha^*$ that scales as
$h^{-1/2}$. (ii) Network architecture and data structure play an important role
in determining which regime is better: in our tests, fully-connected networks
perform generally better in the lazy-training regime, unlike convolutional
networks. (iii) In both regimes, the fluctuations $\delta F$ induced on the
learned function by initial conditions decay as $\delta F \sim 1/\sqrt{h}$,
leading to a performance that increases with $h$. The same improvement can also
be obtained at an intermediate width by ensemble-averaging several networks.
(iv) In the feature-training regime we identify a time scale
$t_1 \sim \sqrt{h}\,\alpha$, such that for $t \ll t_1$ the dynamics is linear.
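A minimal way to probe the lazy-to-feature crossover numerically is to scale the output of a wide two-layer network by $\alpha h^{-1/2}$ and measure how much the empirical tangent kernel moves during training: a nearly frozen kernel indicates the lazy regime, a moving kernel indicates feature training. The sketch below is a hedged toy setup in PyTorch; the data, width $h$, value of $\alpha$, training budget, and the helper `ntk_gram` are our own choices, not the paper's experimental protocol.

```python
# Illustrative sketch (our own toy setup, not the paper's experiments):
# output scaled by alpha / sqrt(h); compare the empirical NTK Gram matrix
# on a small probe set before and after training.
import torch

torch.manual_seed(0)
h, d, alpha = 512, 5, 0.1           # width, input dim, last-layer scale
W = torch.randn(h, d, requires_grad=True)
a = torch.randn(h, requires_grad=True)

def f(x):
    # network output, scaled as (alpha / sqrt(h)) * a^T relu(W x)
    return (alpha / h ** 0.5) * torch.relu(x @ W.T) @ a

def ntk_gram(xs):
    # empirical tangent-kernel Gram matrix from per-example gradients
    grads = []
    for x in xs:
        g = torch.autograd.grad(f(x.unsqueeze(0)).squeeze(), (W, a))
        grads.append(torch.cat([t.flatten() for t in g]))
    G = torch.stack(grads)
    return G @ G.T

xs = torch.randn(8, d)
ys = torch.sign(xs[:, 0])            # toy labels
K0 = ntk_gram(xs).detach()

opt = torch.optim.SGD([W, a], lr=0.5)
for _ in range(500):
    opt.zero_grad()
    loss = ((f(xs) - ys) ** 2).mean()
    loss.backward()
    opt.step()

K1 = ntk_gram(xs).detach()
print("relative NTK change:", (torch.norm(K1 - K0) / torch.norm(K0)).item())
```

Rerunning the same script while sweeping $\alpha$ (and the width $h$) is the kind of comparison that would expose the crossover scale $\alpha^* \sim h^{-1/2}$ described above.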