23 research outputs found

    Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm

    How many training data are needed to learn a supervised task? It is often observed that the generalization error decreases as $n^{-\beta}$, where $n$ is the number of training examples and $\beta$ an exponent that depends on both data and algorithm. In this work we measure $\beta$ when applying kernel methods to real datasets. For MNIST we find $\beta\approx 0.4$ and for CIFAR10 $\beta\approx 0.1$, for both regression and classification tasks, and for Gaussian or Laplace kernels. To rationalize the existence of non-trivial exponents that can be independent of the specific kernel used, we study the Teacher-Student framework for kernels. In this scheme, a Teacher generates data according to a Gaussian random field, and a Student learns them via kernel regression. With a simplifying assumption (namely that the data are sampled from a regular lattice) we derive analytically $\beta$ for translation-invariant kernels, using previous results from the kriging literature. Provided that the Student is not too sensitive to high frequencies, $\beta$ depends only on the smoothness and dimension of the training data. We confirm numerically that these predictions hold when the training points are sampled at random on a hypersphere. Overall, the test error is found to be controlled by the magnitude of the projection of the true function on the kernel eigenvectors whose rank is larger than $n$. Using this idea we relate the exponent $\beta$ to an exponent $a$ describing how the coefficients of the true function in the eigenbasis of the kernel decay with rank. We extract $a$ from real data by performing kernel PCA, leading to $\beta\approx 0.36$ for MNIST and $\beta\approx 0.07$ for CIFAR10, in good agreement with observations. We argue that these rather large exponents are possible due to the small effective dimension of the data.
    Comment: We added (i) the prediction of the exponent $\beta$ for real data using kernel PCA; (ii) the generalization of our results to non-Gaussian data from reference [11] (Bordelon et al., "Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks")
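    The exponent $\beta$ in the abstract above is the slope of the learning curve on a log-log plot. A minimal sketch of how such an exponent could be estimated from measured test errors follows. The function name `fit_power_law_exponent` and the synthetic data are illustrative assumptions, not the authors' code or results.

```python
import math

def fit_power_law_exponent(ns, errors):
    """Estimate beta in error ~ C * n**(-beta) by least squares on log-log data."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    slope = cov / var          # slope of log(error) against log(n)
    return -slope              # beta is minus the slope

# Synthetic learning curve generated with beta = 0.4 (an assumption, not paper data).
ns = [100, 200, 400, 800, 1600]
errors = [2.0 * n ** -0.4 for n in ns]
beta_hat = fit_power_law_exponent(ns, errors)
```

    On real measurements the fit is only meaningful in the asymptotic regime, so small training-set sizes would usually be excluded before fitting.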

    An analytic theory of shallow networks dynamics for hinge loss classification

    Neural networks have been shown to perform incredibly well in classification tasks over structured high-dimensional datasets. However, the learning dynamics of such networks is still poorly understood. In this paper we study in detail the training dynamics of a simple type of neural network: a single hidden layer trained to perform a classification task. We show that in a suitable mean-field limit this case maps to a single-node learning problem with a time-dependent dataset determined self-consistently from the average nodes population. We specialize our theory to the prototypical case of a linearly separable dataset and a linear hinge loss, for which the dynamics can be explicitly solved. This allows us to address in a simple setting several phenomena appearing in modern networks such as slowing down of training dynamics, crossover between rich and lazy learning, and overfitting. Finally, we assess the limitations of mean-field theory by studying the case of a large but finite number of nodes and of training samples.
    Comment: 16 pages, 6 figures
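    To make the setting above concrete, the following is a minimal sketch of a single-hidden-layer network with mean-field (1/N) output scaling and a hinge loss. The ReLU activation, the unit margin, and all names (`network_output`, `hinge_loss`) are assumptions chosen for illustration; the abstract does not specify them.

```python
def relu(z):
    return max(0.0, z)

def network_output(x, weights, readouts):
    """One hidden layer, averaged over N nodes (mean-field 1/N scaling, assumed)."""
    n_nodes = len(weights)
    total = 0.0
    for w, a in zip(weights, readouts):
        pre_activation = sum(wi * xi for wi, xi in zip(w, x))
        total += a * relu(pre_activation)
    return total / n_nodes

def hinge_loss(y, f, margin=1.0):
    """Zero once the sample is classified with at least the given margin, linear otherwise."""
    return max(0.0, margin - y * f)

# Two hidden nodes and one training sample with label +1 (toy values).
weights = [[1.0, 0.0], [0.0, 1.0]]
readouts = [1.0, -1.0]
x, y = [2.0, 0.5], 1.0
f = network_output(x, weights, readouts)   # (2.0 - 0.5) / 2
loss = hinge_loss(y, f)
```

    Samples that already satisfy the margin contribute zero loss and zero gradient, which is one way a training set can effectively change over time during learning.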

    Physics-informed radial basis network (PIRBN): A local approximation neural network for solving nonlinear PDEs

    Our recent intensive study has found that physics-informed neural networks (PINN) tend to be local approximators after training. This observation leads to this novel physics-informed radial basis network (PIRBN), which can maintain the local property throughout the entire training process. Compared to deep neural networks, a PIRBN comprises only one hidden layer and a radial basis "activation" function. Under appropriate conditions, we demonstrated that the training of PIRBNs using gradient descent methods can converge to Gaussian processes. Besides, we studied the training dynamics of PIRBN via the neural tangent kernel (NTK) theory. In addition, comprehensive investigations regarding the initialisation strategies of PIRBN were conducted. Based on numerical examples, PIRBN has been demonstrated to be more effective and efficient than PINN in solving PDEs with high-frequency features and ill-posed computational domains. Moreover, the existing PINN numerical techniques, such as adaptive learning, decomposition and different types of loss functions, are applicable to PIRBN. The programs that can regenerate all numerical results can be found at https://github.com/JinshuaiBai/PIRBN.
    Comment: 48 pages, 26 figures
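    The local-approximation property mentioned above can be illustrated with a generic one-hidden-layer radial basis network. This is a sketch assuming Gaussian basis functions in one dimension; the function name and parameters are hypothetical and are not taken from the linked PIRBN repository.

```python
import math

def rbf_network(x, centers, widths, amplitudes):
    """Sum of Gaussian bumps: each hidden node only responds near its own center."""
    return sum(
        a * math.exp(-(b * (x - c)) ** 2)
        for c, b, a in zip(centers, widths, amplitudes)
    )

# Two hidden nodes with toy parameters.
centers = [0.0, 1.0]
widths = [2.0, 2.0]        # larger width parameter means a narrower bump
amplitudes = [1.0, 0.5]

u_near = rbf_network(0.0, centers, widths, amplitudes)   # dominated by the first node
u_far = rbf_network(10.0, centers, widths, amplitudes)   # far from both centers
```

    Because each basis function decays quickly away from its center, changing one node's amplitude affects the output only in that node's neighbourhood.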