212 research outputs found
HSIC Regularized LTSA
The Hilbert-Schmidt Independence Criterion (HSIC) measures the statistical independence of two random variables. Rather than comparing the two variables directly, HSIC first maps each of them into its own Reproducing Kernel Hilbert Space (RKHS) and then measures the dependence between the kernel-embedded variables using Hilbert-Schmidt (HS) operators between the two RKHSs. Since it was first proposed around 2005, HSIC has found wide application in machine learning. In this paper, an HSIC-regularized Local Tangent Space Alignment algorithm (HSIC-LTSA) is proposed. LTSA is a well-known dimensionality reduction algorithm that preserves local homeomorphism. In HSIC-LTSA, the HSIC between the high-dimensional data and the dimension-reduced data is added to the LTSA objective function as a regularization term. HSIC-LTSA makes two contributions. First, it preserves both local homeomorphism and global statistical correlation during dimensionality reduction. Second, it offers a new way to apply HSIC: as a regularization term added to other machine learning algorithms. The experimental results presented in this paper show that HSIC-LTSA can achieve better performance than the original LTSA.
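For readers unfamiliar with HSIC, the following is a minimal sketch of the commonly used biased empirical estimate, trace(KHLH)/(n-1)^2, where K and L are the Gram matrices of the two samples and H is the centring matrix. The Gaussian kernel, the bandwidth `sigma`, the toy data, and every function name here are assumptions made for illustration; the abstract does not say which kernels or which estimator HSIC-LTSA uses.

```python
import math
import random


def rbf_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel on a list of scalars."""
    return [[math.exp(-(a - b) ** 2 / (2.0 * sigma ** 2)) for b in xs] for a in xs]


def center(K):
    """Double-centre a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(rbf_gram(xs, sigma))  # H K H
    L = rbf_gram(ys, sigma)
    # trace(K H L H) = trace((H K H) L) by the cyclic property of the trace.
    tr = sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n))
    return tr / (n - 1) ** 2


# Toy data (hypothetical): y_dep depends on x nonlinearly, y_ind does not.
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(100)]
y_dep = [v ** 2 + 0.1 * rng.gauss(0.0, 1.0) for v in x]
y_ind = [rng.gauss(0.0, 1.0) for _ in range(100)]

hsic_dep = hsic(x, y_dep)
hsic_ind = hsic(x, y_ind)
print(hsic_dep, hsic_ind)
```

A larger value indicates stronger dependence, so the dependent pair should score higher than the independent one. Used as a regularizer, a term of this form would be computed between the original data and its low-dimensional embedding.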
Kernel Methods and their derivatives: Concept and perspectives for the Earth system sciences
Kernel methods are powerful machine learning techniques which implement
generic non-linear functions to solve complex tasks in a simple way. They have
a solid mathematical background and exhibit excellent performance in practice.
However, kernel machines are still considered black-box models, as the feature
mapping is not directly accessible and is difficult to interpret. The aim of this
work is to show that it is indeed possible to interpret the functions learned
by various kernel methods intuitively, despite their complexity. Specifically,
we show that derivatives of these functions have a simple mathematical
formulation, are easy to compute, and can be applied to many different
problems. We note that the derivative of the model function in kernel machines is
proportional to the kernel function derivative. We provide the explicit
analytic form of the first and second derivatives of the most common kernel
functions with regard to the inputs as well as generic formulas to compute
higher order derivatives. We use them to analyze the most used supervised and
unsupervised kernel learning methods: Gaussian Processes for regression,
Support Vector Machines for classification, Kernel Entropy Component Analysis
for density estimation, and the Hilbert-Schmidt Independence Criterion for
estimating the dependency between random variables. For all cases we express
the derivative of the learned function as a linear combination of the kernel
function derivative. Moreover, we provide intuitive explanations through
illustrative toy examples and show how to improve the interpretation of real
applications in the context of spatiotemporal Earth system data cubes. This
work reflects on the observation that function derivatives may play a crucial
role in kernel methods analysis and understanding. Comment: 21 pages, 10 figures, PLOS One Journa
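The claim that the derivative of a learned kernel function is a linear combination of kernel derivatives can be illustrated with a small sketch. It assumes a one-dimensional Gaussian (RBF) kernel and a model of the form f(x) = sum_i alpha_i k(x, x_i); the centres, weights, and function names are hypothetical and are not taken from the paper.

```python
import math


def rbf(x, xi, sigma=1.0):
    """Gaussian (RBF) kernel k(x, xi) for scalar inputs."""
    return math.exp(-(x - xi) ** 2 / (2.0 * sigma ** 2))


def rbf_dx(x, xi, sigma=1.0):
    """Derivative of the RBF kernel with respect to its first argument."""
    return -(x - xi) / sigma ** 2 * rbf(x, xi, sigma)


def f(x, centres, alphas, sigma=1.0):
    """Kernel expansion f(x) = sum_i alpha_i k(x, x_i)."""
    return sum(a * rbf(x, c, sigma) for a, c in zip(alphas, centres))


def df_dx(x, centres, alphas, sigma=1.0):
    """Same weights alpha_i, applied to the kernel derivatives instead."""
    return sum(a * rbf_dx(x, c, sigma) for a, c in zip(alphas, centres))


# Hypothetical training points and dual weights.
centres = [-1.0, 0.0, 2.0]
alphas = [0.5, -1.2, 0.8]

# Compare the analytic derivative with a central finite difference.
x0, h = 0.3, 1e-5
analytic = df_dx(x0, centres, alphas)
numeric = (f(x0 + h, centres, alphas) - f(x0 - h, centres, alphas)) / (2 * h)
print(analytic, numeric)
```

The two values should agree to within finite-difference error, which is a convenient sanity check when deriving kernel derivatives by hand.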
Deep networks training and generalization: insights from linearization
Despite being able to represent very complex functions, deep artificial neural networks are trained using variants of the basic gradient descent algorithm, which relies on a linearization of the loss at each training iteration. In this thesis, we argue that a promising way to build a comprehensive theory explaining generalization in deep networks is to exploit an analogy with linear models, by studying the first-order Taylor expansion that maps parameter-space updates to function-space progress.
This thesis by publication comprises 3 papers and a software library. The library NNGeometry (chapter 3) serves as a common thread for all projects, and introduces a simple Application Programming Interface (API) to study the linearized training dynamics of deep networks using recent methods and newly contributed algorithmic accelerations. In the EKFAC paper (chapter 4), we propose an approximation of the Fisher Information Matrix (FIM), which is used in the natural gradient optimization algorithm. In the Lazy vs Hasty paper (chapter 5), we compare the function obtained when training with linearized dynamics (e.g. in the infinite-width Neural Tangent Kernel (NTK) limit regime) to the one obtained in the actual training regime, using groups of examples ranked by different notions of difficulty. In the NTK alignment paper (chapter 6), we reveal an implicit regularization effect arising from the alignment of the NTK to the target kernel as training progresses.
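The first-order Taylor expansion mentioned above can be sketched on a toy model. The two-parameter function f(x; a, b) = a * tanh(b * x), the parameter values, and all names below are hypothetical choices for illustration; this sketch does not use the NNGeometry API. It compares the true output after a small parameter step with the linearized prediction, and computes a tangent-kernel entry as the inner product of parameter gradients.

```python
import math


def f(theta, x):
    """Toy model with parameters theta = (a, b)."""
    a, b = theta
    return a * math.tanh(b * x)


def grad_theta(theta, x):
    """Gradient of f with respect to (a, b) at input x."""
    a, b = theta
    t = math.tanh(b * x)
    return (t, a * x * (1.0 - t * t))


def f_lin(theta0, delta, x):
    """First-order Taylor expansion of f around theta0 for a step delta."""
    g = grad_theta(theta0, x)
    return f(theta0, x) + sum(gi * di for gi, di in zip(g, delta))


def tangent_kernel(theta, x1, x2):
    """Tangent-kernel entry: inner product of parameter gradients."""
    g1, g2 = grad_theta(theta, x1), grad_theta(theta, x2)
    return sum(u * v for u, v in zip(g1, g2))


theta0 = (1.5, 0.7)
delta = (1e-3, -2e-3)  # a small parameter-space update
theta1 = (theta0[0] + delta[0], theta0[1] + delta[1])

x = 0.9
lin_error = abs(f(theta1, x) - f_lin(theta0, delta, x))
k12 = tangent_kernel(theta0, 0.9, -0.4)
k21 = tangent_kernel(theta0, -0.4, 0.9)
print(lin_error, k12)
```

For a small step the linearization error is second order in the step size, which is why the linearized model tracks the true one closely near the expansion point.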
Non-Parametric Representation Learning with Kernels
Unsupervised and self-supervised representation learning has become popular
in recent years for learning useful features from unlabelled data.
Representation learning has been mostly developed in the neural network
literature, and other models for representation learning are surprisingly
unexplored. In this work, we introduce and analyze several kernel-based
representation learning approaches: Firstly, we define two kernel
Self-Supervised Learning (SSL) models using contrastive loss functions and
secondly, a Kernel Autoencoder (AE) model based on the idea of embedding and
reconstructing data. We argue that the classical representer theorems for
supervised kernel machines are not always applicable for (self-supervised)
representation learning, and present new representer theorems, which show that
the representations learned by our kernel models can be expressed in terms of
kernel matrices. We further derive generalisation error bounds for
representation learning with kernel SSL and AE, and empirically evaluate the
performance of these methods both in small data regimes and in
comparison with neural network based models.
Layer-wise Learning of Kernel Dependence Networks
Due to recent debate over the biological plausibility of backpropagation
(BP), finding an alternative network optimization strategy has become an active
area of interest. We design a new type of kernel network that is solved
greedily, to theoretically answer several questions of interest. First, if BP
is difficult to simulate in the brain, are there instead "trivial network
weights" (requiring minimum computation) that allow a greedily trained network
to classify any pattern? Perhaps a simple repetition of some basic rule can
yield a network as powerful as ones trained by BP with Stochastic Gradient
Descent (SGD). Second, can a greedily trained network converge to a kernel?
What kernel will it converge to? Third, is this trivial solution optimal? How
is the optimal solution related to generalization? Lastly, can we theoretically
identify the network width and depth without a grid search? We prove that the
kernel embedding is the trivial solution that compels the greedy procedure to
converge to a kernel with the universal property. Yet, this trivial solution is
not even optimal. Obtaining the optimal solution spectrally provides insight
into the generalization of the network while informing us of the network width
and depth.
Some phenomenological investigations in deep learning
The striking empirical success of deep neural networks in machine learning raises a number of theoretical puzzles. For example, why can they generalize to unseen data despite their capacity to fully memorize the training examples, even without explicit regularization? Such puzzles have been the subject of intense research efforts in the past few years, which combine rigorous analysis of simplified systems with empirical studies of phenomenological properties shown to correlate with generalization. The first two articles presented in this thesis contribute to this line of work. They highlight and discuss implicit bias mechanisms that allow large models to prioritize learning "simple" functions during training and to adapt their capacity to the complexity of the problem. The third article of this thesis addresses the long-standing problem of estimating mutual information in high dimension, by leveraging the expressiveness and scalability of deep neural networks. It introduces and studies a new class of estimators and presents several applications in unsupervised learning, especially to enhancing generative models.