154 research outputs found

    A General Theory of Equivariant CNNs on Homogeneous Spaces

    We present a general theory of Group equivariant Convolutional Neural Networks (G-CNNs) on homogeneous spaces such as Euclidean space and the sphere. Feature maps in these networks represent fields on a homogeneous base space, and layers are equivariant maps between spaces of fields. The theory enables a systematic classification of all existing G-CNNs in terms of their symmetry group, base space, and field type. We also consider a fundamental question: what is the most general kind of equivariant linear map between feature spaces (fields) of given types? Following Mackey, we show that such maps correspond one-to-one with convolutions using equivariant kernels, and characterize the space of such kernels.
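    The correspondence can be stated schematically as follows; the notation ($H_1$, $H_2$, $\rho_1$, $\rho_2$, $V_1$, $V_2$) only approximates the paper's and is not a verbatim quotation.

```latex
% Schematic statement (notation approximates the paper's): an equivariant linear map
% between fields of types (\rho_1, H_1) and (\rho_2, H_2) is a cross-correlation
% with a bi-equivariant kernel \kappa : G \to \mathrm{Hom}(V_1, V_2):
\[
  [\kappa \star f](g) = \int_G \kappa(g^{-1} g')\, f(g')\, \mathrm{d}g',
  \qquad
  \kappa(h_2\, g\, h_1) = \rho_2(h_2)\, \kappa(g)\, \rho_1(h_1)
  \quad \text{for all } h_1 \in H_1,\ h_2 \in H_2 .
\]
```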

    Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm

    How many training data are needed to learn a supervised task? It is often observed that the generalization error decreases as $n^{-\beta}$, where $n$ is the number of training examples and $\beta$ is an exponent that depends on both data and algorithm. In this work we measure $\beta$ when applying kernel methods to real datasets. For MNIST we find $\beta\approx 0.4$ and for CIFAR10 $\beta\approx 0.1$, for both regression and classification tasks, and for Gaussian or Laplace kernels. To rationalize the existence of non-trivial exponents that can be independent of the specific kernel used, we study the Teacher-Student framework for kernels. In this scheme, a Teacher generates data according to a Gaussian random field, and a Student learns them via kernel regression. With a simplifying assumption -- namely that the data are sampled from a regular lattice -- we derive $\beta$ analytically for translation-invariant kernels, using previous results from the kriging literature. Provided that the Student is not too sensitive to high frequencies, $\beta$ depends only on the smoothness and dimension of the training data. We confirm numerically that these predictions hold when the training points are sampled at random on a hypersphere. Overall, the test error is found to be controlled by the magnitude of the projection of the true function on the kernel eigenvectors whose rank is larger than $n$. Using this idea, we relate the exponent $\beta$ to an exponent $a$ describing how the coefficients of the true function in the eigenbasis of the kernel decay with rank. We extract $a$ from real data by performing kernel PCA, leading to $\beta\approx 0.36$ for MNIST and $\beta\approx 0.07$ for CIFAR10, in good agreement with observations. We argue that these rather large exponents are possible due to the small effective dimension of the data.
    Comment: We added (i) the prediction of the exponent $\beta$ for real data using kernel PCA; (ii) the generalization of our results to non-Gaussian data from reference [11] (Bordelon et al., "Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks").
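    As an illustration of the measurement, the sketch below estimates the learning-curve exponent $\beta$ from the slope of a log-log fit of test error against $n$, using kernel ridge regression with a Laplace kernel. The synthetic teacher function, kernel bandwidth, and ridge value are illustrative placeholders, not the paper's setup (which uses MNIST/CIFAR10 and a Gaussian-random-field Teacher).

```python
# Minimal sketch: fit test error ~ n^{-beta} for kernel regression on a toy teacher.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
d = 8

def teacher(X):
    # Smooth target function; stands in for the Teacher's Gaussian random field.
    return np.sin(X.sum(axis=1))

X_test = rng.standard_normal((2000, d))
y_test = teacher(X_test)

ns, errs = [256, 512, 1024, 2048, 4096], []
for n in ns:
    X = rng.standard_normal((n, d))
    model = KernelRidge(alpha=1e-6, kernel="laplacian", gamma=1.0 / d)
    model.fit(X, teacher(X))
    errs.append(np.mean((model.predict(X_test) - y_test) ** 2))

beta = -np.polyfit(np.log(ns), np.log(errs), 1)[0]  # slope of the log-log learning curve
print(f"estimated beta ~ {beta:.2f}")
```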

    e3nn: Euclidean Neural Networks

    We present e3nn, a generalized framework for creating E(3)-equivariant trainable functions, also known as Euclidean neural networks. e3nn naturally operates on geometry and geometric tensors that describe systems in 3D and transform predictably under a change of coordinate system. At the core of e3nn are equivariant operations, such as the TensorProduct class or the spherical harmonics functions, that can be composed to create more complex modules such as convolutions and attention mechanisms. These core operations of e3nn can be used to efficiently articulate Tensor Field Networks, 3D Steerable CNNs, Clebsch-Gordan Networks, SE(3) Transformers and other E(3)-equivariant networks.
    Comment: draft
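    A minimal sketch of how these core operations compose, assuming the e3nn PyTorch API (o3.Irreps, o3.spherical_harmonics, o3.FullyConnectedTensorProduct); exact signatures and defaults may differ between e3nn versions.

```python
import torch
from e3nn import o3

irreps_in = o3.Irreps("1x0e + 1x1o")               # a scalar and a vector per point
irreps_sh = o3.Irreps.spherical_harmonics(lmax=2)  # 0e + 1o + 2e
irreps_out = o3.Irreps("2x0e + 1x1o")

# Equivariant bilinear mixing of input features with spherical harmonics of the
# relative positions: the basic building block of an equivariant point convolution.
tp = o3.FullyConnectedTensorProduct(irreps_in, irreps_sh, irreps_out)

x = irreps_in.randn(10, -1)   # features on 10 points
rel_pos = torch.randn(10, 3)  # relative positions
sh = o3.spherical_harmonics(irreps_sh, rel_pos, normalize=True, normalization="component")

out = tp(x, sh)               # transforms as irreps_out under rotations/translations/inversion
```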

    A jamming transition from under- to over-parametrization affects loss landscape and generalization

    We argue that in fully-connected networks a phase transition delimits the over- and under-parametrized regimes where fitting can or cannot be achieved. Under some general conditions, we show that this transition is sharp for the hinge loss. In the whole over-parametrized regime, poor minima of the loss are not encountered during training since the number of constraints to satisfy is too small to hamper minimization. Our findings support a link between this transition and the generalization properties of the network: as we increase the number of parameters of a given model, starting from an under-parametrized network, we observe that the generalization error displays three phases: (i) initial decay, (ii) increase until the transition point, where it displays a cusp, and (iii) slow decay toward a constant for the rest of the over-parametrized regime. Thereby we identify the region where the classical phenomenon of over-fitting takes place, and the region where the model keeps improving, in line with previous empirical observations for modern neural networks.
    Comment: arXiv admin note: text overlap with arXiv:1809.0934
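    To make the notion of "constraints to satisfy" concrete, the sketch below trains a small fully-connected net with a quadratic hinge loss and counts the unsatisfied margin constraints after training; in the over-parametrized phase this count drops to zero. The widths, data, and hyperparameters are illustrative, not the paper's protocol.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 512, 16
X = torch.randn(n, d)
y = torch.sign(torch.randn(n))  # random +/-1 labels

def train_and_count_violations(h, steps=5000, lr=0.1):
    net = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        margins = y * net(X).squeeze(-1)
        loss = torch.relu(1.0 - margins).pow(2).mean()  # quadratic hinge loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Number of unsatisfied constraints N_Delta; zero means the data are fitted.
    return int((y * net(X).squeeze(-1) < 1.0).sum())

for h in [4, 16, 64, 256]:
    print(h, train_and_count_violations(h))
```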

    Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning

    Understanding when the noise in stochastic gradient descent (SGD) affects generalization of deep neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise $T$ affects performance as the size of the training set $P$ and the scale of initialization $\alpha$ are varied. For gradient descent, $\alpha$ is a key parameter that controls whether the network is 'lazy' ($\alpha\gg 1$) or instead learns features ($\alpha\ll 1$). For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the $(\alpha, T)$ plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. Moreover, although increasing $T$ or decreasing $\alpha$ both allow the net to escape the lazy regime, these changes can have opposite effects on performance. (ii) Most importantly, we find that the characteristic temperature $T_c$ where the noise of SGD starts affecting the trained model (and eventually performance) is a power law of $P$. We relate this finding to the observation that key dynamical quantities, such as the total variation of weights during training, depend on both $T$ and $P$ as power laws. These results indicate that a key effect of SGD noise occurs late in training by affecting the stopping process whereby all data are fitted. Indeed, we argue that due to SGD noise, nets must develop a stronger 'signal', i.e. larger informative weights, to fit the data, leading to a longer training time. A stronger signal and a longer training time are also required when the size of the training set $P$ increases. We confirm these views in the perceptron model, where signal and noise can be precisely measured. Interestingly, the exponents characterizing the effect of SGD depend on the density of data near the decision boundary, as we explain.
    Comment: 25 pages, 21 figures, added analysis in the feature-learning regime
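    The sketch below shows one common way to set up the two control parameters scanned here: an $\alpha$-scaled, centered predictor (large $\alpha$ lazy, small $\alpha$ feature learning) and an SGD noise scale $T$ taken proportional to learning rate over batch size. Both conventions are assumptions of this sketch and may differ in detail from the paper's definitions.

```python
import copy
import torch
import torch.nn as nn

d, h = 20, 128
net = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, 1))
net0 = copy.deepcopy(net)  # frozen copy of the network at initialization

def f_alpha(x, alpha):
    # Centered, alpha-scaled output: alpha >> 1 stays close to the linearized (lazy)
    # dynamics, alpha << 1 forces the weights to move and learn features.
    return alpha * (net(x) - net0(x).detach()).squeeze(-1)

lr, batch_size = 0.1, 32
T = lr / batch_size  # proxy for the SGD noise magnitude (assumed convention)

x = torch.randn(8, d)
print(f_alpha(x, alpha=10.0), T)
```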

    Relative stability toward diffeomorphisms indicates performance in deep nets

    Understanding why deep nets can classify data in large dimensions remains a challenge. It has been proposed that they do so by becoming stable to diffeomorphisms, yet existing empirical measurements suggest that this is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, which allows us to study typical diffeomorphisms of a given norm. We confirm that stability toward diffeomorphisms does not strongly correlate with performance on benchmark data sets of images. By contrast, we find that the stability toward diffeomorphisms relative to that of generic transformations, $R_f$, correlates remarkably with the test error $\epsilon_t$. It is of order unity at initialization but decreases by several decades during training for state-of-the-art architectures. For CIFAR10 and 15 known architectures, we find $\epsilon_t \approx 0.2\sqrt{R_f}$, suggesting that obtaining a small $R_f$ is important to achieve good performance. We study how $R_f$ depends on the size of the training set and compare it to a simple model of invariant learning.
    Comment: NeurIPS 2021 Conference
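    A simplified sketch of the relative-stability measurement: it compares a model's sensitivity to a small smooth deformation with its sensitivity to isotropic noise of the same average norm, and reports their ratio as $R_f$. The single-mode deformation below is only a stand-in for the paper's maximum-entropy diffeomorphisms, and the function names and arguments are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def smooth_deform(imgs, amplitude=0.03):
    # Displace pixels along a random direction by one low-frequency mode that
    # vanishes at the image border (a crude stand-in for a max-entropy diffeomorphism).
    n, c, hgt, wid = imgs.shape
    ys, xs = torch.linspace(-1, 1, hgt), torch.linspace(-1, 1, wid)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    base = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(n, hgt, wid, 2)
    bump = torch.sin(math.pi * gx) * torch.sin(math.pi * gy)
    direction = torch.randn(n, 1, 1, 2)
    direction = direction / direction.norm(dim=-1, keepdim=True)
    grid = base + amplitude * bump[None, :, :, None] * direction
    return F.grid_sample(imgs, grid, align_corners=True)

def relative_stability(f, imgs, amplitude=0.03):
    deformed = smooth_deform(imgs, amplitude)
    sigma = (deformed - imgs).pow(2).mean().sqrt()  # match the perturbation norm
    noisy = imgs + sigma * torch.randn_like(imgs)
    num = (f(deformed) - f(imgs)).pow(2).mean()     # sensitivity to the deformation
    den = (f(noisy) - f(imgs)).pow(2).mean()        # sensitivity to generic noise
    return (num / den).item()

# Usage: relative_stability(model, images), with model returning logits or features.
```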

    Disentangling feature and lazy training in deep neural networks

    Two distinct limits for deep learning have been derived as the network width $h \rightarrow \infty$, depending on how the weights of the last layer scale with $h$. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel $\Theta$. By contrast, in the Mean-Field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as $\alpha h^{-1/2}$ at initialization. By varying $\alpha$ and $h$, we probe the crossover between the two limits. We observe the previously identified regimes of lazy training and feature training. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time and learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that (i) the two regimes are separated by an $\alpha^*$ that scales as $h^{-1/2}$. (ii) Network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks perform generally better in the lazy-training regime, unlike convolutional networks. (iii) In both regimes, the fluctuations $\delta F$ induced on the learned function by initial conditions decay as $\delta F \sim 1/\sqrt{h}$, leading to a performance that increases with $h$. The same improvement can also be obtained at an intermediate width by ensemble-averaging several networks. (iv) In the feature-training regime we identify a time scale $t_1 \sim \sqrt{h}\,\alpha$, such that for $t \ll t_1$ the dynamics is linear.
    Comment: minor revision
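    One minimal way to realize the $\alpha h^{-1/2}$ scaling of the last layer, assuming a single hidden layer and $O(1)$ readout entries at initialization (the paper's parametrization may differ in details such as centering the output at initialization):

```python
import torch
import torch.nn as nn

class ScaledReadoutNet(nn.Module):
    """Width-h net whose effective last-layer weights scale as alpha * h**-0.5."""
    def __init__(self, d, h, alpha):
        super().__init__()
        self.hidden = nn.Linear(d, h)
        self.readout = nn.Parameter(torch.randn(h))  # O(1) entries at initialization
        self.prefactor = alpha * h ** -0.5           # effective weights ~ alpha h^{-1/2}

    def forward(self, x):
        return self.prefactor * (torch.relu(self.hidden(x)) @ self.readout)

net_lazy = ScaledReadoutNet(d=10, h=1024, alpha=100.0)    # large alpha: NTK nearly frozen
net_feature = ScaledReadoutNet(d=10, h=1024, alpha=0.01)  # small alpha: features evolve
```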
    • 

    corecore