On the Inductive Bias of Neural Tangent Kernels
State-of-the-art neural networks are heavily over-parameterized, making the
optimization algorithm a crucial ingredient for learning predictive models with
good generalization properties. A recent line of work has shown that in a
certain over-parameterized regime, the learning dynamics of gradient descent
are governed by a certain kernel obtained at initialization, called the neural
tangent kernel. We study the inductive bias of learning in such a regime by
analyzing this kernel and the corresponding function space (RKHS). In
particular, we study smoothness, approximation, and stability properties of
functions with finite norm, including stability to image deformations in the
case of convolutional networks, and compare to other known kernels for similar
architectures.
Comment: NeurIPS 2019
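The kernel being analysed can also be computed directly at finite width. The sketch below is an illustrative construction, not code from the paper: the tiny two-layer ReLU network and the finite-difference Jacobian are assumptions made here for clarity. It forms the empirical neural tangent kernel at initialization, K[i, j] = ⟨∇θ f(x_i), ∇θ f(x_j)⟩, and checks the Gram-matrix properties that make it a valid kernel with an associated RKHS:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer ReLU network f(x; theta), parameters flattened into theta.
n_in, n_hid = 3, 16
W1 = rng.normal(size=(n_hid, n_in)) / np.sqrt(n_in)
w2 = rng.normal(size=n_hid) / np.sqrt(n_hid)
theta0 = np.concatenate([W1.ravel(), w2])

def f(x, theta):
    W1 = theta[:n_hid * n_in].reshape(n_hid, n_in)
    w2 = theta[n_hid * n_in:]
    return w2 @ np.maximum(W1 @ x, 0.0)

def grad_theta(x, theta, eps=1e-5):
    # Central finite-difference gradient of the scalar output w.r.t. theta.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (f(x, tp) - f(x, tm)) / (2 * eps)
    return g

# Empirical NTK at initialization: K[i, j] = <grad f(x_i), grad f(x_j)>.
X = rng.normal(size=(5, n_in))
J = np.stack([grad_theta(x, theta0) for x in X])
K = J @ J.T

# K is a Gram matrix of parameter gradients, hence symmetric and PSD.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-8)
```

In the over-parameterized regime studied in the paper, this matrix stays essentially frozen during training, so learning reduces to kernel regression with K.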
A Convex Relaxation for Weakly Supervised Classifiers
This paper introduces a general multi-class approach to weakly supervised
classification. Inferring the labels and learning the parameters of the model
is usually done jointly through a block-coordinate descent algorithm such as
expectation-maximization (EM), which may lead to local minima. To avoid this
problem, we propose a cost function based on a convex relaxation of the
soft-max loss. We then propose an algorithm specifically designed to
efficiently solve the corresponding semidefinite program (SDP). Empirically,
our method compares favorably to standard ones on different datasets for
multiple instance learning and semi-supervised learning as well as on
clustering tasks.
Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods
We study stochastic Cubic Newton methods for solving general, possibly
non-convex minimization problems. We propose a new framework, which we call the
helper framework, that provides a unified view of stochastic and
variance-reduced second-order algorithms equipped with global complexity
guarantees. It can also be applied to learning with auxiliary information. Our
helper framework offers the algorithm designer high flexibility for
constructing and analyzing stochastic Cubic Newton methods, allowing batches of
arbitrary size and the use of noisy, possibly biased estimates of the gradients
and Hessians, and incorporating both variance reduction and lazy Hessian
updates. We recover the best-known complexities for the stochastic and
variance-reduced Cubic Newton methods under weak assumptions on the noise. A
direct consequence of our theory is a new lazy stochastic second-order method,
which significantly improves the arithmetic complexity for large-dimensional
problems. We also establish complexity bounds for the classes of
gradient-dominated objectives, which include convex and strongly convex
problems. For auxiliary learning, we show that using a helper (auxiliary
function) can outperform training alone if a given similarity measure is small.
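For readers unfamiliar with the base method, here is a minimal deterministic sketch of a cubic-regularized Newton iteration on a one-dimensional non-convex objective. This is an illustration assumed here, not the paper's stochastic algorithm, and the grid search is a deliberately naive way of solving the cubic-model subproblem:

```python
import numpy as np

# Toy non-convex objective with minima at x = +/-1 and a local max at x = 0.
f   = lambda x: x**4 / 4 - x**2 / 2
df  = lambda x: x**3 - x          # gradient
d2f = lambda x: 3 * x**2 - 1      # Hessian

M = 12.0  # Lipschitz constant of the Hessian on [-2, 2]
s_grid = np.linspace(-2.0, 2.0, 40001)  # brute-force subproblem solver

x = 2.0
for _ in range(30):
    g, h = df(x), d2f(x)
    # Cubic-regularized model: m(s) = g s + h s^2 / 2 + (M/6) |s|^3.
    model = g * s_grid + 0.5 * h * s_grid**2 + (M / 6) * np.abs(s_grid)**3
    x += s_grid[np.argmin(model)]

# Converges to the nearby minimum at x = 1.
assert abs(x - 1.0) < 1e-2
```

The stochastic variants analysed in the abstract replace the exact gradient g and Hessian h with noisy, possibly biased estimates (from mini-batches or an auxiliary "helper" function) while keeping the same cubic-model step.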
Disentangling feature and lazy training in deep neural networks
Two distinct limits for deep learning have been derived as the network width
h → ∞, depending on how the weights of the last layer scale with h. In the
Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights
and is described by a frozen kernel Θ. By contrast, in the Mean-Field limit,
the dynamics can be expressed in terms of the distribution of the parameters
associated with a neuron, which follows a partial differential equation. In
this work we consider deep networks where the weights in the last layer scale
as α/√h at initialization. By varying α and h, we probe the crossover between
the two limits. We observe the previously identified regimes of lazy training
and feature training. In the lazy-training regime, the dynamics is almost
linear and the NTK barely changes after initialization. The feature-training
regime includes the mean-field formulation as a limiting case and is
characterized by a kernel that evolves in time, and learns some features. We
perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and
consider various architectures. We find that (i) the two regimes are separated
by an α* that scales as 1/√h; (ii) network architecture and data structure
play an important role in determining which regime is better: in our tests,
fully-connected networks perform generally better in the lazy-training regime,
unlike convolutional networks; (iii) in both regimes, the fluctuations induced
on the learned function by initial conditions decay as 1/√h, leading to a
performance that increases with h (the same improvement can also be obtained
at an intermediate width by ensemble-averaging several networks); (iv) in the
feature-training regime we identify a time scale t₁ ∼ √h·α, such that for
t ≪ t₁ the dynamics is linear.
Comment: minor revision
Learning explanatory logical rules in non-linear domains: a neuro-symbolic approach
Deep neural networks, despite their capabilities, are constrained by the need for large-scale training data, and often fall short in generalisation and interpretability. Inductive logic programming (ILP) presents an intriguing solution with its data-efficient learning of first-order logic rules. However, ILP grapples with challenges, notably the handling of non-linearity in continuous domains. With the ascent of neuro-symbolic ILP, there's a drive to mitigate these challenges, synergising deep learning with relational ILP models to enhance interpretability and create logical decision boundaries. In this research, we introduce a neuro-symbolic ILP framework, grounded in differentiable Neural Logic networks, tailored for non-linear rule extraction in mixed discrete-continuous spaces. Our methodology consists of a neuro-symbolic approach, emphasising the extraction of non-linear functions from mixed domain data. Our preliminary findings showcase our architecture's capability to identify non-linear functions from continuous data, offering a new perspective in neural-symbolic research and underlining the adaptability of ILP-based frameworks for regression challenges in continuous scenarios.
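The differentiable-logic ingredient can be illustrated with a generic soft conjunction whose learnable membership gates decide which atoms participate in a rule. This is an illustrative construction assumed here, not the paper's architecture, and it uses boolean atoms for simplicity rather than the mixed discrete-continuous setting of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: binary atoms x1..x4; the hidden target rule is  y = x1 AND x3.
X = rng.integers(0, 2, size=(200, 4)).astype(float)
y = X[:, 0] * X[:, 2]

# Differentiable conjunction with membership gates m in (0, 1):
# out = prod_i (1 - m_i * (1 - x_i));  m_i -> 1 includes atom i in the rule.
def forward(a, X):
    m = 1 / (1 + np.exp(-a))           # sigmoid keeps gates in (0, 1)
    return np.prod(1 - m * (1 - X), axis=1), m

out0, _ = forward(np.zeros(4), X)
loss0 = np.mean((out0 - y) ** 2)       # loss before training

a = np.zeros(4)  # gate logits
lr = 0.5
for _ in range(2000):
    out, m = forward(a, X)
    resid = out - y
    # Gradient of the MSE through the soft-AND, by the chain rule.
    terms = 1 - m * (1 - X)
    grad_out_m = -(1 - X) * (out[:, None] / np.maximum(terms, 1e-9))
    grad_a = (2 / len(X)) * (resid[:, None] * grad_out_m
                             * (m * (1 - m))[None, :]).sum(axis=0)
    a -= lr * grad_a

out, m = forward(a, X)
loss = np.mean((out - y) ** 2)
# The gates should move toward [1, 0, 1, 0], recovering the rule x1 AND x3.
assert loss < loss0 * 0.5
```

The learned gate vector m is directly readable as a logical rule, which is the interpretability argument the abstract makes; the paper's contribution is extending this style of extraction to non-linear functions over continuous domains.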
On Lazy Training in Differentiable Programming
In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths. Our numerical experiments bring a critical note, as we observe that the performance of commonly used non-linear deep convolutional neural networks in computer vision degrades when trained in the lazy regime. This makes it unlikely that "lazy training" is behind the many successes of neural networks in difficult high-dimensional tasks.
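The scaling argument can be reproduced on a toy problem. In the sketch below (an illustrative construction assumed here, not the paper's experiments), a model h(w) = w1·w2 that is nonlinear in its parameters is rescaled as f_α(w) = α·(h(w) − h(w0)) and trained by gradient descent with step size lr/α². For large α the target is still fit, but the parameters barely move from their initialization, which is exactly the lazy regime:

```python
import numpy as np

# Toy model, nonlinear in its parameters: h(w) = w1 * w2.
h = lambda w: w[0] * w[1]
grad_h = lambda w: np.array([w[1], w[0]])

def train_lazy(alpha, steps=2000, lr=0.1):
    """Fit the scaled model f(w) = alpha * (h(w) - h(w0)) to target y = 1
    by gradient descent with the lazy-training step size lr / alpha**2."""
    w0 = np.array([1.0, 2.0])
    w = w0.copy()
    for _ in range(steps):
        resid = alpha * (h(w) - h(w0)) - 1.0   # f(w) - y
        w -= (lr / alpha**2) * resid * alpha * grad_h(w)
    final_resid = alpha * (h(w) - h(w0)) - 1.0
    return np.linalg.norm(w - w0), final_resid**2

move_small, loss_small = train_lazy(alpha=1.0)
move_large, loss_large = train_lazy(alpha=100.0)

# Both scalings reach (near-)zero loss, but the large-alpha run barely
# moves its parameters: the model stays close to its linearization.
assert loss_small < 1e-6 and loss_large < 1e-6
assert move_large < move_small / 10
```

Note the subtraction of h(w0): as the paper discusses, the scaling argument requires the model output at initialization to be controlled, otherwise the α-scaled initial output dominates the loss.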