A General Theory of Equivariant CNNs on Homogeneous Spaces
We present a general theory of Group equivariant Convolutional Neural
Networks (G-CNNs) on homogeneous spaces such as Euclidean space and the sphere.
Feature maps in these networks represent fields on a homogeneous base space,
and layers are equivariant maps between spaces of fields. The theory enables a
systematic classification of all existing G-CNNs in terms of their symmetry
group, base space, and field type. We also consider a fundamental question:
what is the most general kind of equivariant linear map between feature spaces
(fields) of given types? Following Mackey, we show that such maps correspond
one-to-one with convolutions using equivariant kernels, and characterize the
space of such kernels.
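As a toy illustration of this correspondence, the sketch below (our own NumPy example, not code from the paper; the cyclic group C_n acting on itself and scalar fields are assumed for simplicity) checks that a circular convolution, the prototypical shift-equivariant linear map between scalar fields on C_n, commutes with the group action. For richer field types the kernel must additionally satisfy an equivariance constraint, which is what the paper characterizes.

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)

def shift(f, g):
    """Act with the group element g (a cyclic shift) on a field f over C_n."""
    return np.roll(f, g)

def gconv(f, kernel):
    """Circular cross-correlation of the field f with a kernel on C_n."""
    return np.array([sum(kernel[k] * f[(x + k) % n] for k in range(n))
                     for x in range(n)])

f = rng.normal(size=n)       # a scalar field on the base space C_n
kernel = rng.normal(size=n)  # scalar fields need no extra kernel constraint

for g in range(n):           # equivariance: gconv(shift(f)) == shift(gconv(f))
    assert np.allclose(gconv(shift(f, g), kernel), shift(gconv(f, kernel), g))
print("shift equivariance holds for every group element")
```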
Asymptotic learning curves of kernel methods: empirical data vs. Teacher-Student paradigm
How much training data is needed to learn a supervised task? It is often observed that the generalization error decreases as $N^{-\beta}$, where $N$ is the number of training examples and $\beta$ an exponent that depends on both the data and the algorithm. In this work we measure $\beta$ when applying kernel methods to real datasets, for both regression and classification tasks on MNIST and CIFAR10, and for Gaussian or Laplace kernels. To rationalize the existence of non-trivial exponents that can be independent of the specific kernel used, we study the Teacher-Student framework for kernels. In this scheme, a Teacher generates data according to a Gaussian random field, and a Student learns them via kernel regression. With a simplifying assumption, namely that the data are sampled from a regular lattice, we derive $\beta$ analytically for translation-invariant kernels, using previous results from the kriging literature. Provided that the Student is not too sensitive to high frequencies, $\beta$ depends only on the smoothness and dimension of the training data. We confirm numerically that these predictions hold when the training points are sampled at random on a hypersphere. Overall, the test error is found to be controlled by the magnitude of the projection of the true function on the kernel eigenvectors whose rank is larger than $N$. Using this idea we relate the exponent $\beta$ to an exponent describing how the coefficients of the true function in the eigenbasis of the kernel decay with rank. We extract this decay from real data by performing kernel PCA, leading to predictions of $\beta$ for MNIST and CIFAR10 in good agreement with observations. We argue that these rather large exponents are possible due to the small effective dimension of the data.
Comment: We added (i) the prediction of the exponent $\beta$ for real data using kernel PCA; (ii) the generalization of our results to non-Gaussian data from reference [11] (Bordelon et al., "Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks").
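A hedged numerical sketch of this Teacher-Student protocol is given below (the Laplace kernel, the dimension, the sample sizes and the ridge parameter are illustrative assumptions, not the paper's settings): a Teacher Gaussian random field generates data on a hypersphere, a Student fits them by kernel regression, and the decay of the test error with $N$ is fitted to a power law to estimate $\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_test, ridge = 5, 200, 1e-8

def sphere(n):
    """n points sampled uniformly on the unit sphere in d dimensions."""
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def laplace(a, b, scale=1.0):
    """Laplace (exponential) kernel matrix between two point sets."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-dist / scale)

sizes, errors = [64, 128, 256, 512, 1024], []
for n_train in sizes:
    x = sphere(n_train + n_test)
    cov = laplace(x, x) + 1e-10 * np.eye(len(x))              # Teacher covariance (Gaussian field)
    y = np.linalg.cholesky(cov) @ rng.normal(size=len(x))     # one draw of the Teacher function
    xtr, xte, ytr, yte = x[:n_train], x[n_train:], y[:n_train], y[n_train:]
    coef = np.linalg.solve(laplace(xtr, xtr) + ridge * np.eye(n_train), ytr)
    pred = laplace(xte, xtr) @ coef                           # Student kernel regression
    errors.append(np.mean((pred - yte) ** 2))

beta = -np.polyfit(np.log(sizes), np.log(errors), 1)[0]       # slope of the learning curve
print(f"estimated learning-curve exponent beta ~ {beta:.2f}")
```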
e3nn: Euclidean Neural Networks
We present e3nn, a generalized framework for creating E(3) equivariant
trainable functions, also known as Euclidean neural networks. e3nn naturally
operates on geometry and geometric tensors that describe systems in 3D and
transform predictably under a change of coordinate system. At the core of e3nn are equivariant operations, such as the TensorProduct class and the spherical harmonics functions, that can be composed to create more complex modules such as convolutions and attention mechanisms. These core operations of e3nn can be used to efficiently articulate Tensor Field Networks, 3D Steerable CNNs, Clebsch-Gordan Networks, SE(3) Transformers and other E(3) equivariant networks.
Comment: draft
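The snippet below is a minimal NumPy illustration (it does not use the e3nn API) of the equivariance property that such operations are built to satisfy: here the cross product serves as a stand-in for a simple $\ell=1 \otimes \ell=1 \to \ell=1$ tensor product, and rotating the inputs before applying the operation gives the same result as rotating its output.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation():
    """A random proper rotation in SO(3) obtained from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # fix column signs so the factorization is unique
    if np.linalg.det(q) < 0:          # force det = +1 (a proper rotation)
        q[:, 0] *= -1
    return q

a, b = rng.normal(size=3), rng.normal(size=3)   # two l=1 features (ordinary vectors)
R = random_rotation()

lhs = np.cross(R @ a, R @ b)          # rotate the inputs, then apply the equivariant map
rhs = R @ np.cross(a, b)              # apply the map, then rotate the output
assert np.allclose(lhs, rhs)
print("rotation equivariance of the l=1 tensor product verified")
```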
A jamming transition from under- to over-parametrization affects loss landscape and generalization
We argue that in fully-connected networks a phase transition delimits the
over- and under-parametrized regimes where fitting can or cannot be achieved.
Under some general conditions, we show that this transition is sharp for the
hinge loss. In the whole over-parametrized regime, poor minima of the loss are
not encountered during training since the number of constraints to satisfy is
too small to hamper minimization. Our findings support a link between this
transition and the generalization properties of the network: as we increase the
number of parameters of a given model, starting from an under-parametrized
network, we observe that the generalization error displays three phases: (i) an initial decay, (ii) an increase up to the transition point, where it displays a cusp, and (iii) a slow decay toward a constant for the rest of the over-parametrized regime. We thereby identify the region where the classical phenomenon of over-fitting takes place, and the region where the model keeps improving, in line with previous empirical observations for modern neural networks.
Comment: arXiv admin note: text overlap with arXiv:1809.0934
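A toy sketch of this under- to over-parametrized crossover is given below (the random data, the Adam optimizer and the specific widths are our own choices, not the paper's protocol): for fixed data, narrow fully-connected nets leave some hinge-loss constraints unsatisfied, while sufficiently wide ones satisfy them all.

```python
import torch

torch.manual_seed(0)
P, d = 200, 10                                    # number of samples and input dimension
x = torch.randn(P, d)
y = torch.sign(torch.randn(P))                    # random binary labels (hard to fit)

def unsatisfied_constraints(width, steps=5000, lr=0.01):
    net = torch.nn.Sequential(torch.nn.Linear(d, width), torch.nn.ReLU(),
                              torch.nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        margins = y * net(x).squeeze(-1)
        loss = torch.relu(1.0 - margins).mean()   # hinge loss on the margins
        opt.zero_grad(); loss.backward(); opt.step()
    return int((y * net(x).squeeze(-1) < 1.0).sum())  # constraints still violated

for width in (2, 8, 32, 128):
    print(f"width={width:4d}  unsatisfied constraints: {unsatisfied_constraints(width)}")
```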
Jadovno i Šaranova jama: Kontroverze i manipulacije [The Jadovno Concentration Camp and the Šaran Pit: Controversies and Manipulations] (Zagreb: Hrvatski institut za povijest, 2017)
Summary of the book in English
Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning
Understanding when the noise in stochastic gradient descent (SGD) affects
generalization of deep neural networks remains a challenge, complicated by the
fact that networks can operate in distinct training regimes. Here we study how
the magnitude $T$ of this noise affects performance as the size $P$ of the training set and the scale $\alpha$ of initialization are varied. For gradient descent, $\alpha$ is a key parameter that controls whether the network is `lazy' (large $\alpha$) or instead learns features (small $\alpha$). For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the $(\alpha, T)$ plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. Moreover, although increasing $T$ or decreasing $\alpha$ both allow the net to escape the lazy regime, these changes can have opposite effects on performance. (ii) Most importantly, we find that the characteristic temperature at which the noise of SGD starts affecting the trained model (and eventually performance) is a power law of $P$. We relate this finding to the observation that key dynamical quantities, such as the total variation of weights during training, depend on both $T$ and $P$ as power laws. These results indicate that a key effect of SGD noise occurs late in training, by affecting the stopping process whereby all data are fitted. Indeed, we argue that due to SGD noise, nets must develop a stronger `signal', i.e. larger informative weights, to fit the data, leading to a longer training time. A stronger signal and a longer training time are also required when the size $P$ of the training set increases. We confirm these views in the perceptron model, where signal and noise can be precisely measured. Interestingly, the exponents characterizing the effect of SGD depend on the density of data near the decision boundary, as we explain.
Comment: 25 pages, 21 figures, added analysis in the feature-learning regime
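The perceptron-style sketch below illustrates this picture under assumptions of our own (separable synthetic data, a hinge-margin stopping condition, and the common proxy T = learning rate / batch size for the SGD temperature; none of these details are taken from the paper): it records how many steps are needed before every datum is fitted and how large the weights have grown at that point, as the noise level is varied.

```python
import numpy as np

rng = np.random.default_rng(0)
P, d = 500, 50
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
y = np.sign(rng.normal(size=P))
x = rng.normal(size=(P, d)) / np.sqrt(d) + y[:, None] * w_star   # separable with an O(1) margin

def train_sgd(lr, batch, max_steps=200_000):
    """Minimize the hinge loss with mini-batch SGD; stop once all margins exceed 1."""
    w = np.zeros(d)
    for step in range(max_steps):
        if step % 100 == 0 and np.all(y * (x @ w) >= 1.0):
            return step, np.linalg.norm(w)
        idx = rng.integers(0, P, size=batch)
        active = (y[idx] * (x[idx] @ w)) < 1.0             # samples still violating the margin
        w += lr * ((y[idx] * active) @ x[idx]) / batch     # hinge-loss SGD step
    return max_steps, np.linalg.norm(w)

lr = 0.1
for batch in (100, 10, 1):                                 # temperature T = lr / batch increases
    steps, norm = train_sgd(lr, batch)
    print(f"T = {lr / batch:6.3f}   steps to fit all data: {steps:7d}   |w| at stopping: {norm:.2f}")
```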
Relative stability toward diffeomorphisms indicates performance in deep nets
Understanding why deep nets can classify data in large dimensions remains a
challenge. It has been proposed that they do so by becoming stable to
diffeomorphisms, yet existing empirical measurements suggest that this is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, which allows us to study typical diffeomorphisms of a given norm. We confirm that stability toward diffeomorphisms does not strongly correlate with performance on benchmark data sets of images. By contrast, we find that the stability toward diffeomorphisms relative to that of generic transformations correlates remarkably with the test error. This relative stability is of order unity at initialization but decreases by several decades during training for state-of-the-art architectures. For CIFAR10 and 15 known architectures, we find a tight relation between the test error and the relative stability, suggesting that obtaining a small relative stability is important to achieve good performance. We study how the relative stability depends on the size of the training set and compare it to a simple model of invariant learning.
Comment: NeurIPS 2021 Conference
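A hedged sketch of how such a measurement can be set up is given below (the low-frequency random displacement field, the untrained toy CNN and the norm-matching are our own modeling choices, not the paper's maximum-entropy ensemble or code): the network's response to a smooth deformation of the image is compared with its response to a generic pixel-wise perturbation of the same norm, and the ratio plays the role of the relative stability.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
H = 32
img = torch.rand(1, 3, H, H)                               # a stand-in input image
net = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
                          torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
                          torch.nn.Linear(16, 10))

def smooth_displacement(scale=0.05):
    """A smooth random displacement field, built by upsampling coarse noise."""
    coarse = torch.randn(1, 2, 4, 4) * scale
    return F.interpolate(coarse, size=(H, H), mode="bilinear", align_corners=True)

# identity sampling grid in [-1, 1]^2, shape (1, H, W, 2)
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, H), indexing="ij")
grid = torch.stack((xs, ys), dim=-1).unsqueeze(0)

disp = smooth_displacement()
warped = F.grid_sample(img, grid + disp.permute(0, 2, 3, 1), align_corners=True)

# generic perturbation with the same pixel-space norm as the smooth deformation
delta = warped - img
noise = torch.randn_like(img)
noise = noise * delta.norm() / noise.norm()

with torch.no_grad():
    f0, f_diffeo, f_noise = net(img), net(warped), net(img + noise)
    relative_stability = ((f_diffeo - f0).norm() / (f_noise - f0).norm()).item() ** 2
print(f"relative stability (diffeo vs generic perturbation): {relative_stability:.3f}")
```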
Disentangling feature and lazy training in deep neural networks
Two distinct limits for deep learning have been derived as the network width $h \to \infty$, depending on how the weights of the last layer scale with $h$. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel. By contrast, in the Mean-Field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as $\alpha h^{-1/2}$ at initialization. By varying $\alpha$ and $h$, we probe the crossover between the two limits. We observe the previously identified regimes of lazy training and feature training. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time and learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that (i) the two regimes are separated by a characteristic scale $\alpha^*$ that scales as a power of the width $h$; (ii) network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks perform generally better in the lazy-training regime, unlike convolutional networks; (iii) in both regimes, the fluctuations induced on the learned function by the initial conditions decay as $1/\sqrt{h}$, leading to a performance that increases with $h$; the same improvement can also be obtained at an intermediate width by ensemble-averaging several networks; (iv) in the feature-training regime we identify a time scale $t_1$ such that for $t \ll t_1$ the dynamics is linear.
Comment: minor revision
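The sketch below probes this crossover in a minimal way (centering the output at initialization, rescaling the loss by $\alpha$, and the toy data are our own assumptions in the spirit of the lazy-training literature, not the paper's exact parametrization): the effective model output is $\alpha$ times its change from initialization, and we record how far the hidden-layer weights travel while fitting the data; small motion corresponds to the lazy regime, large motion to feature learning.

```python
import torch

torch.manual_seed(0)
n, d, width = 32, 5, 64
x = torch.randn(n, d)
y = torch.randn(n)                                           # toy regression targets

for alpha in (0.1, 1.0, 10.0):
    net = torch.nn.Sequential(torch.nn.Linear(d, width), torch.nn.ReLU(),
                              torch.nn.Linear(width, 1))
    f0 = net(x).detach()                                     # output at initialization
    w_init = net[0].weight.detach().clone()
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    for _ in range(5000):
        delta = (net(x) - f0).squeeze(-1)                    # effective model output is alpha * delta
        loss = ((delta - y / alpha) ** 2).mean()             # = MSE(alpha * delta, y) / alpha^2
        opt.zero_grad(); loss.backward(); opt.step()
    move = (net[0].weight - w_init).norm() / w_init.norm()   # how far the hidden features moved
    print(f"alpha={alpha:5.1f}   relative change of hidden weights: {move.item():.3f}")
```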