
    Representation Learning: A Review and New Perspectives

    The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.

    Improving sampling, optimization and feature extraction in Boltzmann machines

    Despite the current widespread success of deep learning in training large-scale hierarchical models through supervised learning, unsupervised learning promises to play a crucial role on the path towards general Artificial Intelligence, where agents are expected to learn with little to no supervision. The work presented in this thesis tackles the problem of unsupervised feature learning and density estimation using a model family at the heart of the deep learning phenomenon: the Boltzmann Machine (BM). We present contributions in the areas of sampling, partition function estimation, optimization, and the more general topic of invariant feature learning.
    With regard to sampling, we present a novel adaptive parallel tempering method which dynamically adjusts the temperatures under simulation so as to maintain good mixing in the presence of complex multi-modal distributions. When used in the context of stochastic maximum likelihood (SML) training, the improved ergodicity of our sampler translates to increased robustness to the choice of learning rate and faster per-epoch convergence. Though our application is limited to BMs, the method is general and applicable to sampling from arbitrary probabilistic models with Markov Chain Monte Carlo (MCMC) techniques.
    While SML gradients can be estimated via sampling, computing data likelihoods requires an estimate of the partition function. Contrary to previous approaches, which treat the model as a black box, we provide an efficient algorithm that tracks the change in the log partition function incurred by each successive parameter update. This estimation problem is framed as one of inference, akin to Kalman filtering, performed over a two-dimensional lattice whose dimensions correspond to time and temperature.
    On the topic of optimization, the thesis presents a novel algorithm for applying the natural gradient to large-scale Boltzmann Machines with thousands of units. Until now, its adoption had been constrained by the computational and memory cost of the Fisher Information Matrix (FIM), whose size is quadratic in the number of parameters. The Metric-Free Natural Gradient (MFNG) algorithm avoids computing the FIM (and its inverse) altogether by combining a linear solver with an efficient matrix-vector operation. The method shows promise: the resulting updates yield faster per-epoch convergence, although the implementation remains slower in terms of wall-clock time.
    Finally, we explore how invariant features can be learnt through modifications to the BM energy function. We study the problem in the context of the spike-and-slab Restricted Boltzmann Machine (ssRBM), which we extend to handle binary and sparse input distributions. By associating each binary spike variable with a vector of continuous slab variables, the latent variables can be made invariant to a rich, high-dimensional subspace, yielding a more invariant learnt representation. When the expected model posterior is used as input to a classifier, this increased invariance translates to improved classification accuracy in the low-label-data regime.
    We conclude by showing a connection between invariance and the more powerful concept of disentangling factors of variation. While invariance can be achieved by pooling over subspaces, disentangling can be achieved by learning multiple complementary views of the same subspace. In particular, we show how this can be done with third-order BMs featuring multiplicative interactions between pairs of random variables.
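A rough sketch of the adaptive tempering idea described in this abstract is given below, in Python/NumPy. It is a hedged illustration rather than the thesis implementation: `log_prob(x)` (the model's unnormalized log-probability at inverse temperature 1), `mcmc_step(x, beta)` (one local transition targeting the tempered distribution), and the constants `target_swap` and `adapt_rate` are all assumed, illustrative names.

```python
import numpy as np

def adaptive_parallel_tempering(log_prob, mcmc_step, x0, n_temps=10,
                                n_steps=1000, target_swap=0.4,
                                adapt_rate=0.05, rng=None):
    """Toy adaptive parallel tempering: adjust the inverse-temperature ladder
    online so adjacent replicas keep a roughly constant swap rate."""
    rng = np.random.default_rng() if rng is None else rng
    # rho[i] parametrizes the geometric ratio between adjacent inverse
    # temperatures; adapting rho keeps the ladder ordered and positive.
    rho = np.zeros(n_temps - 1)
    chains = [np.array(x0, copy=True) for _ in range(n_temps)]
    for _ in range(n_steps):
        betas = np.concatenate(([1.0], np.cumprod(1.0 / (1.0 + np.exp(rho)))))
        # local MCMC moves, one per temperature
        chains = [mcmc_step(x, b) for x, b in zip(chains, betas)]
        # replica swaps between adjacent temperatures
        for i in range(n_temps - 1):
            delta = (betas[i] - betas[i + 1]) * (log_prob(chains[i + 1]) - log_prob(chains[i]))
            acc = 1.0 if delta >= 0 else float(np.exp(delta))
            if rng.random() < acc:
                chains[i], chains[i + 1] = chains[i + 1], chains[i]
            # Robbins-Monro style adaptation: widen the gap when swaps are
            # too easy, shrink it when they are rejected too often.
            rho[i] += adapt_rate * (acc - target_swap)
        yield chains[0], betas  # the beta = 1 chain is the one of interest

# Hypothetical usage:
# for negative_sample, betas in adaptive_parallel_tempering(log_prob, mcmc_step, x0):
#     ...use negative_sample in the SML gradient estimate...
```

In an SML loop, only the beta = 1 chain targets the model distribution and would supply negative-phase samples; the hotter replicas exist solely to keep that chain mixing across modes.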
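The metric-free natural gradient can be sketched in the same spirit: solve F d = g with conjugate gradient, touching the Fisher matrix only through matrix-vector products so that F is never formed explicitly. The fragment below is an assumption-laden illustration, not the MFNG algorithm itself: it approximates Fisher-vector products with the empirical Fisher built from per-sample score vectors (rows of an assumed `scores` matrix produced by a hypothetical `grad_log_p` routine) plus a damping term.

```python
import numpy as np

def fisher_vector_product(scores, v, damping=1e-3):
    """Empirical Fisher-vector product: F v ~= (1/N) * S^T (S v) + damping * v,
    where the rows of S are per-sample gradients of log p(x_n; theta).
    Only matrix-vector products are needed, so F is never materialized."""
    return scores.T @ (scores @ v) / scores.shape[0] + damping * v

def conjugate_gradient(fvp, g, n_iters=50, tol=1e-10):
    """Solve F d = g given only a function fvp(v) that computes F v."""
    d = np.zeros_like(g)
    r = g.copy()                 # residual of F d = g at d = 0
    p = r.copy()
    rs_old = float(r @ r)
    for _ in range(n_iters):
        Fp = fvp(p)
        alpha = rs_old / float(p @ Fp)
        d += alpha * p
        r -= alpha * Fp
        rs_new = float(r @ r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d

# Hypothetical usage inside a training step (grad_log_p is assumed):
# scores = np.stack([grad_log_p(theta, x) for x in minibatch])  # shape (N, P)
# g = scores.mean(axis=0)                                       # plain gradient estimate
# step = conjugate_gradient(lambda v: fisher_vector_product(scores, v), g)
# theta = theta + learning_rate * step
```

The point of the sketch is the structure: the solver's memory footprint stays linear in the number of parameters, which is what makes a natural-gradient update feasible for models whose full FIM, quadratic in the number of parameters, cannot be stored.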

    Deep Learning of Representations: Looking Forward

    Deep learning research aims at discovering learning algorithms that discover multiple levels of distributed representations, with higher levels representing more abstract concepts. Although the study of deep learning has already led to impressive theoretical results, learning algorithms and breakthrough experiments, several challenges lie ahead. This paper proposes to examine some of these challenges, centering on the questions of scaling deep learning algorithms to much larger models and datasets, reducing optimization difficulties due to ill-conditioning or local minima, designing more efficient and powerful inference and sampling procedures, and learning to disentangle the factors of variation underlying the observed data. It also proposes a few forward-looking research directions aimed at overcoming these challenges.