    A General Theory of Equivariant CNNs on Homogeneous Spaces

    We present a general theory of Group equivariant Convolutional Neural Networks (G-CNNs) on homogeneous spaces such as Euclidean space and the sphere. Feature maps in these networks represent fields on a homogeneous base space, and layers are equivariant maps between spaces of fields. The theory enables a systematic classification of all existing G-CNNs in terms of their symmetry group, base space, and field type. We also consider a fundamental question: what is the most general kind of equivariant linear map between feature spaces (fields) of given types? Following Mackey, we show that such maps correspond one-to-one with convolutions using equivariant kernels, and characterize the space of such kernels.
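
As a minimal illustration of the correspondence between equivariant linear maps and convolutions, the sketch below takes the simplest case, which is an assumption made here and not taken from the abstract: the cyclic group of translations acting on a 1D periodic signal. The function name `circular_conv` and the sample values are hypothetical. The check confirms that shifting the input and then convolving gives the same result as convolving and then shifting.

```python
import numpy as np

def circular_conv(signal, kernel):
    """Circular convolution of a periodic 1D signal with a short kernel."""
    n = len(signal)
    out = np.zeros(n)
    for i in range(n):
        for j in range(len(kernel)):
            # Indices wrap around, so the base space is the cyclic group Z_n.
            out[i] += kernel[j] * signal[(i - j) % n]
    return out

signal = np.array([1.0, 4.0, 2.0, 0.0, 3.0, 5.0])
kernel = np.array([0.5, -1.0, 0.25])
shift = 2

# Equivariance: conv(shift(x)) equals shift(conv(x)).
shift_then_conv = circular_conv(np.roll(signal, shift), kernel)
conv_then_shift = np.roll(circular_conv(signal, kernel), shift)
is_equivariant = bool(np.allclose(shift_then_conv, conv_then_shift))
print(is_equivariant)
```

The general theory replaces the cyclic shifts by an arbitrary symmetry group acting on a homogeneous space and constrains the kernel accordingly; this toy case only covers scalar fields under translations.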

    Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm

    How many training data are needed to learn a supervised task? It is often observed that the generalization error decreases as $n^{-\beta}$, where $n$ is the number of training examples and $\beta$ is an exponent that depends on both data and algorithm. In this work we measure $\beta$ when applying kernel methods to real datasets. For MNIST we find $\beta\approx 0.4$ and for CIFAR10 $\beta\approx 0.1$, for both regression and classification tasks, and for Gaussian or Laplace kernels. To rationalize the existence of non-trivial exponents that can be independent of the specific kernel used, we study the Teacher-Student framework for kernels. In this scheme, a Teacher generates data according to a Gaussian random field, and a Student learns them via kernel regression. With a simplifying assumption, namely that the data are sampled from a regular lattice, we analytically derive $\beta$ for translation invariant kernels, using previous results from the kriging literature. Provided that the Student is not too sensitive to high frequencies, $\beta$ depends only on the smoothness and dimension of the training data. We confirm numerically that these predictions hold when the training points are sampled at random on a hypersphere. Overall, the test error is found to be controlled by the magnitude of the projection of the true function on the kernel eigenvectors whose rank is larger than $n$. Using this idea we relate the exponent $\beta$ to an exponent $a$ describing how the coefficients of the true function in the eigenbasis of the kernel decay with rank. We extract $a$ from real data by performing kernel PCA, leading to $\beta\approx0.36$ for MNIST and $\beta\approx0.07$ for CIFAR10, in good agreement with observations. We argue that these rather large exponents are possible due to the small effective dimension of the data. Comment: We added (i) the prediction of the exponent $\beta$ for real data using kernel PCA; (ii) the generalization of our results to non-Gaussian data from reference [11] (Bordelon et al., "Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks")
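
The exponent $\beta$ of a learning curve can be estimated from measured test errors by a straight-line fit in log-log space. The sketch below is one common way to do this and is not claimed to be the authors' procedure; the function name `fit_learning_curve_exponent` and the synthetic data (a noiseless power law with $\beta = 0.4$ and prefactor 2) are assumptions for illustration.

```python
import numpy as np

def fit_learning_curve_exponent(n_values, errors):
    """Fit error ~ C * n**(-beta) by least squares on log(error) vs log(n)."""
    log_n = np.log(np.asarray(n_values, dtype=float))
    log_err = np.log(np.asarray(errors, dtype=float))
    slope, intercept = np.polyfit(log_n, log_err, deg=1)
    # The slope of the log-log line is -beta; the intercept is log(C).
    return -slope, np.exp(intercept)

# Synthetic, noiseless learning curve (hypothetical values).
n = np.array([100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0])
err = 2.0 * n ** (-0.4)

beta, prefactor = fit_learning_curve_exponent(n, err)
print(f"beta = {beta:.3f}, prefactor = {prefactor:.3f}")
```

On real measurements the fit should be restricted to the range of $n$ where the curve is actually a power law, since small-$n$ transients and saturation at large $n$ bias the slope.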

    e3nn: Euclidean Neural Networks

    We present e3nn, a generalized framework for creating E(3) equivariant trainable functions, also known as Euclidean neural networks. e3nn naturally operates on geometry and geometric tensors that describe systems in 3D and transform predictably under a change of coordinate system. The core of e3nn consists of equivariant operations, such as the TensorProduct class and the spherical harmonics functions, that can be composed to create more complex modules such as convolutions and attention mechanisms. These core operations of e3nn can be used to efficiently articulate Tensor Field Networks, 3D Steerable CNNs, Clebsch-Gordan Networks, SE(3) Transformers and other E(3) equivariant networks. Comment: draft

    A jamming transition from under- to over-parametrization affects loss landscape and generalization

    We argue that in fully-connected networks a phase transition delimits the over- and under-parametrized regimes where fitting can or cannot be achieved. Under some general conditions, we show that this transition is sharp for the hinge loss. In the whole over-parametrized regime, poor minima of the loss are not encountered during training since the number of constraints to satisfy is too small to hamper minimization. Our findings support a link between this transition and the generalization properties of the network: as we increase the number of parameters of a given model, starting from an under-parametrized network, we observe that the generalization error displays three phases: (i) initial decay, (ii) increase until the transition point, where it displays a cusp, and (iii) slow decay toward a constant for the rest of the over-parametrized regime. Thereby we identify the region where the classical phenomenon of over-fitting takes place, and the region where the model keeps improving, in line with previous empirical observations for modern neural networks. Comment: arXiv admin note: text overlap with arXiv:1809.0934
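
To make the notion of "constraints to satisfy" concrete, the sketch below computes a hinge loss and counts the training points whose margin constraint is still violated, then compares that count with the number of parameters. The unit margin, the helper name `hinge_summary`, and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def hinge_summary(outputs, labels, n_params, margin=1.0):
    """Hinge loss and number of unsatisfied margin constraints.

    A training point is 'satisfied' when label * output >= margin
    (the unit margin is an assumption of this sketch).
    """
    violations = np.maximum(0.0, margin - labels * outputs)
    n_unsatisfied = int(np.count_nonzero(violations > 0))
    return {
        "loss": float(violations.mean()),
        "n_unsatisfied": n_unsatisfied,
        "constraints_per_param": n_unsatisfied / n_params,
    }

# Toy network outputs and +/-1 labels (hypothetical values).
outputs = np.array([2.0, 0.5, -1.0, 1.5])
labels = np.array([1.0, 1.0, 1.0, -1.0])
summary = hinge_summary(outputs, labels, n_params=10)
print(summary)
```

With the hinge loss, zero loss is equivalent to zero unsatisfied constraints, which is what makes the fitting transition easy to detect in practice.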

    Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning

    Understanding when the noise in stochastic gradient descent (SGD) affects generalization of deep neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise $T$ affects performance as the size of the training set $P$ and the scale of initialization $\alpha$ are varied. For gradient descent, $\alpha$ is a key parameter that controls if the network is `lazy' ($\alpha\gg1$) or instead learns features ($\alpha\ll1$). For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the $(\alpha,T)$ plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. Moreover, although increasing $T$ or decreasing $\alpha$ both allow the net to escape the lazy regime, these changes can have opposite effects on performance. (ii) Most importantly, we find that the characteristic temperature $T_c$ where the noise of SGD starts affecting the trained model (and eventually performance) is a power law of $P$. We relate this finding with the observation that key dynamical quantities, such as the total variation of weights during training, depend on both $T$ and $P$ as power laws. These results indicate that a key effect of SGD noise occurs late in training by affecting the stopping process whereby all data are fitted. Indeed, we argue that due to SGD noise, nets must develop a stronger `signal', i.e. larger informative weights, to fit the data, leading to a longer training time. A stronger signal and a longer training time are also required when the size of the training set $P$ increases. We confirm these views in the perceptron model, where signal and noise can be precisely measured. Interestingly, exponents characterizing the effect of SGD depend on the density of data near the decision boundary, as we explain. Comment: 25 pages, 21 figures, added analysis in feature-learning

    Relative stability toward diffeomorphisms indicates performance in deep nets

    Understanding why deep nets can classify data in large dimensions remains a challenge. It has been proposed that they do so by becoming stable to diffeomorphisms, yet existing empirical measurements suggest that this is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, which allows one to study typical diffeomorphisms of a given norm. We confirm that stability toward diffeomorphisms does not strongly correlate with performance on benchmark data sets of images. By contrast, we find that the stability toward diffeomorphisms relative to that of generic transformations, $R_f$, correlates remarkably with the test error $\epsilon_t$. It is of order unity at initialization but decreases by several decades during training for state-of-the-art architectures. For CIFAR10 and 15 known architectures, we find $\epsilon_t\approx 0.2\sqrt{R_f}$, suggesting that obtaining a small $R_f$ is important to achieve good performance. We study how $R_f$ depends on the size of the training set and compare it to a simple model of invariant learning. Comment: NeurIPS 2021 Conference

    Disentangling feature and lazy training in deep neural networks

    Two distinct limits for deep learning have been derived as the network width $h\rightarrow \infty$, depending on how the weights of the last layer scale with $h$. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel $\Theta$. By contrast, in the Mean-Field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, that follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as $\alpha h^{-1/2}$ at initialization. By varying $\alpha$ and $h$, we probe the crossover between the two limits. We observe the previously identified regimes of lazy training and feature training. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time, and learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that (i) The two regimes are separated by an $\alpha^*$ that scales as $h^{-1/2}$. (ii) Network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks perform generally better in the lazy-training regime, unlike convolutional networks. (iii) In both regimes, the fluctuations $\delta F$ induced on the learned function by initial conditions decay as $\delta F\sim 1/\sqrt{h}$, leading to a performance that increases with $h$. The same improvement can also be obtained at an intermediate width by ensemble-averaging several networks. (iv) In the feature-training regime we identify a time scale $t_1\sim\sqrt{h}\alpha$, such that for $t\ll t_1$ the dynamics is linear. Comment: minor revision
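
The $\alpha h^{-1/2}$ scaling of the last layer can be illustrated with a small numerical sketch. Everything specific here is an assumption for illustration and not taken from the paper: a two-layer ReLU network, standard Gaussian initialization, an input with squared norm equal to its dimension, and the helper name `init_outputs`. Under these assumptions the output at initialization has variance close to $\alpha^2/2$ regardless of the width $h$, so $\alpha$ alone sets the output scale.

```python
import numpy as np

def init_outputs(alpha, h, x, n_trials, rng):
    """Outputs at initialization of a two-layer ReLU net over many random inits.

    Last-layer weights are alpha * a / sqrt(h) with a ~ N(0, 1),
    i.e. they scale as alpha * h**(-1/2).
    """
    d = x.shape[0]
    first_layer = rng.standard_normal((n_trials, h, d))
    a = rng.standard_normal((n_trials, h))
    pre_activations = first_layer @ x / np.sqrt(d)   # shape (n_trials, h)
    hidden = np.maximum(pre_activations, 0.0)        # ReLU
    last_layer = alpha * a / np.sqrt(h)
    return np.sum(last_layer * hidden, axis=1)       # one output per init

rng = np.random.default_rng(0)
x = np.ones(10)   # squared norm equals the dimension, so pre-activations are ~ N(0, 1)
alpha = 0.5

variances = {
    h: float(np.var(init_outputs(alpha, h, x, n_trials=2000, rng=rng)))
    for h in (16, 128)
}
for h, v in variances.items():
    print(f"h = {h:4d}: output variance = {v:.4f} (alpha^2 / 2 = {alpha**2 / 2:.4f})")
```

Because the output scale at initialization does not depend on $h$ in this setup, varying $\alpha$ and $h$ independently is a natural way to probe the crossover between the two regimes.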
