Representation Learning: A Review and New Perspectives
The success of machine learning algorithms generally depends on data
representation, and we hypothesize that this is because different
representations can entangle and hide more or less the different explanatory
factors of variation behind the data. Although specific domain knowledge can be
used to help design representations, learning with generic priors can also be
used, and the quest for AI is motivating the design of more powerful
representation-learning algorithms implementing such priors. This paper reviews
recent work in the area of unsupervised feature learning and deep learning,
covering advances in probabilistic models, auto-encoders, manifold learning,
and deep networks. This motivates longer-term unanswered questions about the
appropriate objectives for learning good representations, for computing
representations (i.e., inference), and the geometrical connections between
representation learning, density estimation and manifold learning.
Improving sampling, optimization and feature extraction in Boltzmann machines
Despite the current widespread success of deep learning in training large-scale hierarchical models through supervised learning, unsupervised learning promises to play a crucial role in solving general Artificial Intelligence, where agents are expected to learn with little to no supervision. The work presented in this thesis tackles the problem of unsupervised feature learning and density estimation, using a model family at the heart of the deep learning phenomenon: the Boltzmann Machine (BM). We present contributions in the areas of sampling, partition function estimation, optimization, and the more general topic of invariant feature learning.
With regards to sampling, we present a novel adaptive parallel tempering method which dynamically adjusts the temperatures under simulation to maintain good mixing in the presence of complex multi-modal distributions. When used in the context of stochastic maximum likelihood (SML) training, the improved ergodicity of our sampler translates to increased robustness to the choice of learning rate and faster per-epoch convergence. Though our application is limited to BMs, the method is general and applicable to sampling from arbitrary probabilistic models using Markov Chain Monte Carlo (MCMC) techniques. While SML gradients can be estimated via sampling, computing data likelihoods requires an estimate of the partition function. Contrary to previous approaches, which treat the model as a black box, we provide an efficient algorithm which instead tracks the change in the log partition function incurred by successive parameter updates. Our algorithm frames this estimation problem as one of inference, akin to Kalman filtering, performed over a 2D lattice with one dimension representing time and the other temperature.
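To make the sampling idea concrete, the following is a minimal sketch of parallel tempering with a crude adaptive temperature rule, run on a toy bimodal target. It illustrates the general technique named in the abstract, not the thesis's algorithm: the target log_prob, the proposal scale, and the swap-rate-matching adaptation rule are all assumptions made for this example.

import numpy as np

rng = np.random.default_rng(0)

def log_prob(x):
    # Toy bimodal target: equal mixture of N(-4, 1) and N(4, 1), up to a constant.
    return np.logaddexp(-0.5 * (x - 4.0) ** 2, -0.5 * (x + 4.0) ** 2)

def parallel_tempering(n_steps=5000, betas=(1.0, 0.5, 0.25, 0.1),
                       target_swap_rate=0.4, adapt_rate=0.01):
    betas = np.array(betas, dtype=float)   # inverse temperatures; betas[0] = 1 samples the target
    x = rng.normal(size=len(betas))        # one chain per temperature
    samples = []
    for _ in range(n_steps):
        # Metropolis step within every tempered chain.
        prop = x + rng.normal(scale=1.0, size=len(betas))
        accept = np.log(rng.random(len(betas))) < betas * (log_prob(prop) - log_prob(x))
        x = np.where(accept, prop, x)

        # Propose swapping states between a random pair of adjacent temperatures.
        i = int(rng.integers(len(betas) - 1))
        log_alpha = (betas[i] - betas[i + 1]) * (log_prob(x[i + 1]) - log_prob(x[i]))
        swapped = np.log(rng.random()) < log_alpha
        if swapped:
            x[i], x[i + 1] = x[i + 1], x[i]

        # Crude adaptation: widen the gap to the next temperature when a swap is
        # accepted, shrink it otherwise, so the swap rate drifts toward the target.
        gap = (betas[i] - betas[i + 1]) * np.exp(adapt_rate * (float(swapped) - target_swap_rate))
        lower = betas[i + 2] + 1e-3 if i + 2 < len(betas) else 1e-3
        betas[i + 1] = float(np.clip(betas[i] - gap, lower, betas[i] - 1e-3))

        samples.append(x[0])               # the beta = 1 chain follows the true target
    return np.array(samples), betas

samples, adapted_betas = parallel_tempering()
print("fraction of samples in each mode:", np.mean(samples > 0), np.mean(samples < 0))
print("adapted inverse temperatures:", np.round(adapted_betas, 3))

Without the high-temperature chains and swaps, a single Metropolis chain at beta = 1 would tend to stay stuck in one of the two modes; the tempered ladder is what restores mixing.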
On the topic of optimization, our thesis presents a novel algorithm for applying the natural gradient to large-scale Boltzmann Machines with thousands of units. Up until now, its application had been constrained by the computational and memory requirements of computing the Fisher Information Matrix (FIM), whose size is quadratic in the number of parameters. The Metric-Free Natural Gradient algorithm (MFNG) avoids computing the FIM altogether by combining a linear solver with an efficient matrix-vector product. The method shows promise in that the resulting updates yield faster per-epoch convergence, despite being slower in terms of wall-clock time.
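The matrix-free idea can be sketched as follows: solve F d = g with conjugate gradient, where F is only ever accessed through Fisher-vector products built from per-sample gradients. This is a generic illustration, not the MFNG algorithm itself; the empirical-Fisher form, the damping term, and the toy inputs are assumptions made for the example.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)

def natural_gradient_direction(per_sample_scores, grad, damping=1e-2, cg_iters=20):
    # per_sample_scores: (S, n) matrix G of per-sample gradients of log p(x; theta),
    # so the empirical Fisher is F = G.T @ G / S (an n x n matrix, never formed here).
    G = per_sample_scores
    S, n = G.shape

    def fisher_vector_product(v):
        # (F + damping * I) v, computed in O(S * n) without materialising F.
        return G.T @ (G @ v) / S + damping * v

    F_op = LinearOperator((n, n), matvec=fisher_vector_product, dtype=G.dtype)
    direction, _ = cg(F_op, grad, maxiter=cg_iters)   # approximately solves F d = grad
    return direction

# Toy call with random data, purely to show the shapes involved.
S, n = 256, 2000
G = rng.normal(size=(S, n)) / np.sqrt(n)
grad = rng.normal(size=n)
step = natural_gradient_direction(G, grad)
print("norm of plain gradient:", np.linalg.norm(grad))
print("norm of natural-gradient direction:", np.linalg.norm(step))

Storing the full FIM here would require an n x n array; the linear-operator formulation only ever needs the per-sample gradients and one length-n work vector per CG iteration.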
Finally, we explore how invariant features can be learnt through modifications to the BM energy function. We study the problem in the context of the spike & slab Restricted Boltzmann Machine (ssRBM), which we extend to handle both binary and sparse input distributions. By associating each spike with several slab variables, latent variables can be made invariant to a rich, high-dimensional subspace, resulting in increased invariance in the learnt representation. When using the expected model posterior as input to a classifier, increased invariance translates to improved classification accuracy in the low-label data regime. We conclude by showing a connection between invariance and the more powerful concept of disentangling factors of variation. While invariance can be achieved by pooling over subspaces, disentangling can be achieved by learning multiple complementary views of the same subspace. In particular, we show how this can be achieved using third-order BMs featuring multiplicative interactions between pairs of random variables.
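The pooling mechanism behind this invariance can be illustrated with the spike posterior of a heavily simplified spike-and-slab RBM, in which each binary spike sums a quadratic form over its own subspace of slab filters. The conditional used below omits the visible-precision, slab-mean, and regularization terms of the full model and its exact parameterization; it is a sketch of the pooling idea, not the thesis's model.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spike_posterior(v, W, alpha, b):
    # Simplified ssRBM spike activation:
    #   P(h_i = 1 | v) = sigmoid( b_i + 0.5 * sum_m (W[i, m] . v)^2 / alpha[i, m] )
    # Each spike h_i pools a quadratic form over its own subspace of slab filters,
    # so its activation depends on v only through the responses of that subspace.
    proj = W @ v                                     # (n_spikes, n_slabs) filter responses
    pooled = 0.5 * np.sum(proj ** 2 / alpha, axis=1)
    return sigmoid(pooled + b)

rng = np.random.default_rng(0)
n_spikes, n_slabs, n_visible = 4, 3, 16
W = rng.normal(scale=0.3, size=(n_spikes, n_slabs, n_visible))   # slab filters per spike
alpha = np.ones((n_spikes, n_slabs))                             # slab precisions
b = -1.0 * np.ones(n_spikes)                                     # spike biases
v = rng.normal(size=n_visible)                                   # a visible configuration

print("spike posteriors:     ", np.round(spike_posterior(v, W, alpha, b), 3))
# Flipping the sign of the input leaves every spike activation unchanged:
# a simple instance of the invariance created by quadratic pooling.
print("same posteriors for -v:", np.round(spike_posterior(-v, W, alpha, b), 3))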
Deep Learning of Representations: Looking Forward
Deep learning research aims at discovering learning algorithms that discover
multiple levels of distributed representations, with higher levels representing
more abstract concepts. Although the study of deep learning has already led to
impressive theoretical results, learning algorithms and breakthrough
experiments, several challenges lie ahead. This paper proposes to examine some
of these challenges, centering on the questions of scaling deep learning
algorithms to much larger models and datasets, reducing optimization
difficulties due to ill-conditioning or local minima, designing more efficient
and powerful inference and sampling procedures, and learning to disentangle the
factors of variation underlying the observed data. It also proposes a few
forward-looking research directions aimed at overcoming these challenges.