Denoising OCT Images Using Steered Mixture of Experts with Multi-Model Inference
In Optical Coherence Tomography (OCT), speckle noise significantly hampers
image quality, affecting diagnostic accuracy. Current methods, including
traditional filtering and deep learning techniques, have limitations in noise
reduction and detail preservation. Addressing these challenges, this study
introduces a novel denoising algorithm, Block-Matching Steered-Mixture of
Experts with Multi-Model Inference and Autoencoder (BM-SMoE-AE). This method
combines block-matched implementation of the SMoE algorithm with an enhanced
autoencoder architecture, offering efficient speckle noise reduction while
retaining critical image details. Our method stands out by providing improved
edge definition and reduced processing time. Comparative analysis with existing
denoising techniques demonstrates the superior performance of BM-SMoE-AE in
maintaining image integrity and enhancing OCT image usability for medical
diagnostics.
Comment: This submission contains 10 pages and 4 figures. It was presented at the 2024 SPIE Photonics West conference, held in San Francisco. The paper details advancements in photonics applications related to healthcare and includes supplementary material with additional datasets for review.
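The abstract does not give implementation details, but the block-matching step it builds on is standard: patches similar to a reference patch are grouped so that each group can be modeled and denoised jointly. Below is a minimal, hedged sketch of that grouping step; the patch size, search window, group size, and SSD similarity measure are illustrative assumptions, not values from the paper.

```python
# Minimal block-matching sketch (assumed preprocessing for BM-SMoE-AE):
# group the patches most similar to a reference patch, searched in a
# local window, ranked by sum of squared differences (SSD).
import numpy as np

def block_match(img, ref_yx, patch=8, search=16, max_group=16):
    """Return a (k, patch, patch) stack of patches most similar to the
    reference patch at ref_yx, searched within a local window."""
    y0, x0 = ref_yx
    ref = img[y0:y0 + patch, x0:x0 + patch]
    h, w = img.shape
    candidates = []
    for y in range(max(0, y0 - search), min(h - patch, y0 + search) + 1):
        for x in range(max(0, x0 - search), min(w - patch, x0 + search) + 1):
            blk = img[y:y + patch, x:x + patch]
            candidates.append((np.sum((blk - ref) ** 2), blk))
    candidates.sort(key=lambda t: t[0])            # most similar first
    return np.stack([blk for _, blk in candidates[:max_group]])

noisy = np.random.rand(64, 64).astype(np.float32)  # stand-in OCT B-scan
group = block_match(noisy, (20, 20))
print(group.shape)                                 # (16, 8, 8)
```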
Steered mixture-of-experts for light field images and video: representation and coding
Research in light field (LF) processing has increased heavily over the last decade. This is largely driven by the desire to achieve the same level of immersion and navigational freedom for camera-captured scenes as is currently available for CGI content. Standardization organizations such as MPEG and JPEG continue to follow conventional coding paradigms in which viewpoints are discretely represented on 2-D regular grids. These grids are then further decorrelated through hybrid DPCM/transform techniques. However, such 2-D regular grids are less suited for high-dimensional data such as LFs. We propose a novel coding framework for higher-dimensional image modalities, called Steered Mixture-of-Experts (SMoE). Coherent areas in the higher-dimensional space are represented by single higher-dimensional entities, called kernels. These kernels hold spatially localized information about light rays at any angle arriving at a certain region. The global model thus consists of a set of kernels which define a continuous approximation of the underlying plenoptic function. We introduce the theory of SMoE and illustrate its application to 2-D images, 4-D LF images, and 5-D LF video. We also propose an efficient coding strategy to convert the model parameters into a bitstream. Even without provisions for high-frequency information, the proposed method performs comparably to the state of the art for low-to-mid range bitrates with respect to subjective visual quality of 4-D LF images. In the case of 5-D LF video, we observe superior decorrelation and coding performance, with coding gains of a factor of 4x in bitrate for the same quality. At least equally important is the fact that our method inherently provides functionality for LF rendering which is lacking in other state-of-the-art techniques: (1) full zero-delay random access, (2) light-weight pixel-parallel view reconstruction, and (3) intrinsic view interpolation and super-resolution.
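As a concrete illustration of the model class described here, the following is a minimal sketch of SMoE reconstruction for a 2-D image: Gaussian kernels act as soft gates over pixel coordinates, and each kernel contributes a locally valid (here affine, i.e., steered) intensity model. The kernel count and parameter values are illustrative assumptions; the paper's parameter estimation and bitstream coding are omitted.

```python
# Minimal SMoE reconstruction sketch: normalized Gaussian kernels gate
# the plane, and each kernel's affine expert models local intensity.
import numpy as np

def smoe_reconstruct(coords, centers, covs, offsets, slopes):
    """coords: (N, 2) pixel positions; centers (K, 2), covs (K, 2, 2):
    kernel geometry; offsets (K,), slopes (K, 2): affine experts."""
    K = centers.shape[0]
    logits = np.empty((coords.shape[0], K))
    for k in range(K):
        d = coords - centers[k]
        inv = np.linalg.inv(covs[k])
        logits[:, k] = -0.5 * np.einsum('ni,ij,nj->n', d, inv, d)
    gates = np.exp(logits - logits.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)       # soft partition of the plane
    d_all = coords[:, None, :] - centers[None, :, :]
    experts = offsets[None, :] + np.einsum('nki,ki->nk', d_all, slopes)
    return (gates * experts).sum(axis=1)            # continuous regression

yy, xx = np.mgrid[0:32, 0:32]
coords = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)
rng = np.random.default_rng(0)
K = 4                                               # illustrative kernel count
centers = rng.uniform(0, 32, size=(K, 2))
covs = np.repeat(np.eye(2)[None] * 40.0, K, axis=0)
offsets = rng.uniform(0, 1, K)
slopes = rng.normal(scale=0.01, size=(K, 2))
img = smoe_reconstruct(coords, centers, covs, offsets, slopes).reshape(32, 32)
print(img.shape)                                    # (32, 32)
```

The same construction extends directly to 4-D and 5-D coordinates, which is what makes the kernel representation attractive for light fields.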
Adapting Computer Vision Models To Limitations On Input Dimensionality And Model Complexity
When considering instances of distributed systems where visual sensors communicate with remote predictive models, data traffic is limited to the capacity of the communication channels, and hardware limits the processing of collected data prior to transmission. We study novel methods of adapting visual inference to limitations on complexity and data availability at test time, wherever the aforementioned limitations exist. Our contributions detailed in this thesis consider both task-specific and task-generic approaches to reducing the data requirement for inference, and we evaluate our proposed methods on a wide range of computer vision tasks. This thesis makes four distinct contributions: (i) We investigate multi-class action classification via two-stream convolutional neural networks that directly ingest information extracted from compressed video bitstreams. We show that selective access to macroblock motion vector information provides a good low-dimensional approximation of the underlying optical flow in visual sequences. (ii) We devise a bitstream cropping method by which AVC/H.264 and H.265 bitstreams are reduced to the minimum set of elements necessary for optical flow extraction, while maintaining compliance with codec standards. We additionally study the effect of codec rate-quality control on the sparsity and noise incurred on optical flow derived from the resulting bitstreams, and do so for multiple coding standards. (iii) We demonstrate degrees of variability in the amount of data required for action classification, and leverage this to reduce the dimensionality of input volumes by inferring the required temporal extent for accurate classification prior to processing via learnable machines. (iv) We extend the Mixtures-of-Experts (MoE) paradigm to adapt the data cost of inference for any set of constituent experts. We postulate that the minimum acceptable data cost of inference varies for different input space partitions, and consider mixtures where each expert is designed to meet a different set of constraints on input dimensionality. To take advantage of the flexibility of such mixtures in processing different input representations and modalities, we train biased gating functions such that experts requiring less information to make their inferences are favoured over others. We finally note that our proposed data utility optimization solutions include a learnable component which considers specified priorities on the amount of information to be used prior to inference, and can be realized for any combination of tasks, modalities, and constraints on available data.
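A minimal sketch of the biased gating idea in contribution (iv), under stated assumptions: each expert has a relative data cost, and the gate's logits are penalized by that cost so cheaper experts are favoured unless a costlier one is clearly more confident. The cost values and penalty weight lambda are illustrative, not taken from the thesis.

```python
# Cost-biased softmax gate: experts that need less input data win ties.
import numpy as np

def biased_gate(scores, data_cost, lam=0.5):
    """scores: (K,) raw gating logits; data_cost: (K,) relative input
    cost per expert; lam: strength of the bias toward cheap experts."""
    z = scores - lam * data_cost            # penalize data-hungry experts
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([1.0, 1.3, 1.4])          # gate confidence per expert
cost = np.array([0.1, 1.0, 4.0])            # e.g. motion vectors < flow < RGB volume
print(biased_gate(scores, cost))            # the low-cost expert dominates
```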
A Machine Learning Approach to Indoor Localization Data Mining
Indoor positioning systems are increasingly commonplace in various environments and produce large quantities of data. They are used in industrial applications, robotics, and asset and employee tracking, to name a few use cases. The growing amount of data and the accelerating progress of machine learning open up many new possibilities for analyzing this data in ways that were not conceivable or relevant before. This thesis introduces connected concepts and implementations to answer the question of how this data can be utilized. The data gathered in this thesis originates from an indoor positioning system deployed in a retail environment, but the discussed methods can be applied more generally.
The issue is approached by first introducing the concepts of machine learning and, more generally, artificial intelligence, and how they work at a general level. A deeper dive is then taken into the subfields and algorithms relevant to the data mining task at hand. Indoor positioning system basics are also briefly discussed to establish a baseline understanding of the realistic capabilities and constraints of these kinds of systems.
These methods and prior knowledge from the literature are put to the test with the freshly gathered data. An algorithm based on an existing example from the literature was tested and improved upon with the new data. A novel method to cluster and classify movement patterns is introduced, utilizing deep learning to create embedded representations of the trajectories within a more complex learning pipeline. This type of learning is often referred to as deep clustering.
The results are promising: both methods produce useful high-level representations of the complex dataset that can help a human operator discern the relevant patterns in raw data and that can serve as input for subsequent supervised and unsupervised learning steps. Several factors related to optimizing the learning pipeline, such as regularization, were also researched, and the results are presented as visualizations.
The research found that a pipeline consisting of a CNN autoencoder followed by a classic clustering algorithm such as DBSCAN produces useful results in the form of trajectory clusters. Regularization, such as an L1 penalty, further improves this performance.
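A minimal sketch of such a pipeline, assuming fixed-length (x, y) trajectories: a 1-D convolutional autoencoder learns embeddings, an L1 penalty regularizes them, and DBSCAN clusters the result. The architecture sizes, eps, and penalty weight are illustrative assumptions, not the thesis's settings.

```python
# CNN-autoencoder embeddings + DBSCAN: a deep-clustering sketch.
import torch
import torch.nn as nn
from sklearn.cluster import DBSCAN

class TrajAE(nn.Module):
    def __init__(self, length=64, emb=16):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(2, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * (length // 4), emb))
        self.dec = nn.Sequential(
            nn.Linear(emb, 2 * length), nn.Unflatten(1, (2, length)))
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

trajs = torch.randn(200, 2, 64)              # stand-in localization tracks
model = TrajAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):                          # reconstruction + L1 on embeddings
    recon, z = model(trajs)
    loss = nn.functional.mse_loss(recon, trajs) + 1e-4 * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    _, z = model(trajs)
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(z.numpy())
print(labels[:10])                           # -1 marks noise trajectories
```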
The research presented in this thesis provides useful algorithms for processing raw, noisy localization data from indoor environments, which can serve as the basis for further implementations in both industrial applications and academia.
Feedforward deep architectures for classification and synthesis
This thesis by articles makes several contributions to the field of deep learning, with applications to both classification and synthesis of natural images. Specifically, we introduce several new techniques for the construction and training of deep feedforward networks, and present an empirical investigation into dropout, one of the most popular regularization strategies of the last several years.
In the first article, we present a novel piece-wise linear parameterization of neural networks, maxout, which allows each hidden unit of a neural network to effectively learn its own convex activation function. We demonstrate improvements on several object recognition benchmarks, and empirically investigate the source of these improvements, including an improved synergy with the recently proposed dropout regularization method.
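A minimal sketch of a maxout unit as described in the article: each hidden unit outputs the maximum over k learned affine pieces, so the unit effectively learns its own convex, piecewise-linear activation. Layer sizes and the number of pieces are illustrative.

```python
# Maxout layer: max over k affine "pieces" per output unit.
import torch
import torch.nn as nn

class Maxout(nn.Module):
    def __init__(self, d_in, d_out, pieces=4):
        super().__init__()
        self.pieces = pieces
        self.lin = nn.Linear(d_in, d_out * pieces)
    def forward(self, x):
        z = self.lin(x)                          # (..., d_out * pieces)
        z = z.view(*x.shape[:-1], -1, self.pieces)
        return z.max(dim=-1).values              # convex, piecewise-linear unit

h = Maxout(784, 256)(torch.randn(32, 784))
print(h.shape)                                   # torch.Size([32, 256])
```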
In the second article, we further interrogate the dropout algorithm in particular. Focusing on networks of the popular rectified linear units (ReLU), we empirically examine several questions regarding dropout's remarkable effectiveness as a regularizer, including questions surrounding the fast test-time rescaling trick and the geometric mean it approximates, interpretations as an ensemble as compared with traditional ensembles, and the importance of using a bagging-like criterion for optimization.
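For concreteness, a toy illustration of the fast test-time rescaling trick the article examines: units are dropped with probability p during training, and at test time the deterministic activations are scaled by the keep probability (1 - p), approximating a geometric mean over the exponentially many dropped-out subnetworks.

```python
# Dropout with the classic test-time rescaling (not "inverted" dropout).
import torch

def dropout_layer(x, p=0.5, train=True):
    if train:
        mask = (torch.rand_like(x) > p).float()
        return x * mask                          # sample a random subnetwork
    return x * (1.0 - p)                         # rescaled deterministic pass

x = torch.ones(4)
print(dropout_layer(x, train=True))              # stochastic mask
print(dropout_layer(x, train=False))             # tensor([0.5, 0.5, 0.5, 0.5])
```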
In the third article, we address a practical problem in the industrial-scale application of deep networks for multi-label object recognition, namely improving an existing model's ability to discriminate between frequently confused classes. We accomplish this by using the network's own predictions to inform a partitioning of the label space, and augment the network with dedicated discriminative capacity addressing each of the partitions.
Finally, in the fourth article, we tackle the problem of fitting implicit generative models of open-domain collections of natural images using the recently introduced Generative Adversarial Networks (GAN) paradigm. We introduce an augmented training procedure which employs a denoising autoencoder, trained in a high-level feature space learned by the discriminator, to guide the generator towards feature encodings which more closely resemble the data. We quantitatively evaluate our findings using the recently proposed Inception score.
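A hedged sketch of the guidance term this describes: a denoising autoencoder r, trained on discriminator features of real data, supplies targets that pull the features of generated samples toward the data manifold. The shapes, the DAE architecture, and the exact loss form are illustrative assumptions, not the article's specification.

```python
# Denoiser-guided generator loss: move generated-sample features toward
# the DAE's reconstruction, which points toward the real-data manifold.
import torch
import torch.nn.functional as F

def generator_guidance(feat_fake, dae):
    """feat_fake: discriminator features of generated samples (B, D);
    dae: denoising autoencoder trained on features of real data."""
    with torch.no_grad():
        target = dae(feat_fake)                  # fixed target per step
    return F.mse_loss(feat_fake, target)         # added to the usual GAN loss

dae = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 128))
loss = generator_guidance(torch.randn(16, 128), dae)
print(loss.item())
```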
Learning continuous models for estimating intrinsic component images
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (leaves 137-144). By Marshall Friend Tappen.
The goal of computer vision is to use an image to recover the characteristics of a scene, such as its shape or illumination. This is difficult because an image is the mixture of multiple characteristics. For example, an edge in an image could be caused by either an edge on a surface or a change in the surface's color. Distinguishing the effects of different scene characteristics is an important step towards high-level analysis of an image. This thesis describes how to use machine learning to build a system that recovers different characteristics of a scene from a single, gray-scale image of the scene. The goal of the system is to use the observed image to recover images, referred to as Intrinsic Component Images, that represent the scene's characteristics. The development of the system is focused on estimating two important characteristics of a scene, its shading and reflectance, from a single image. From the observed image, the system estimates a shading image, which captures the interaction of the illumination and shape of the scene pictured, and an albedo image, which represents how the surfaces in the image reflect light. Measured both qualitatively and quantitatively, this system produces state-of-the-art estimates of shading and albedo images. This system is also flexible enough to be used for the separate problem of removing noise from an image. Building this system requires algorithms for continuous regression and for learning the parameters of a Conditionally Gaussian Markov Random Field. Unlike previous work, this system is trained using real-world surfaces with ground-truth shading and albedo images. The learning algorithms are designed to accommodate the large amount of data in this training set.
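The decomposition the system targets is multiplicative: the observed image is the product of shading and albedo. The toy sketch below illustrates only that decomposition, splitting the log image into a smooth part (shading) and a residual (albedo); it is not the thesis's learned Conditionally Gaussian MRF method.

```python
# Toy intrinsic decomposition: image = shading * albedo, worked in the
# log domain where the product becomes a sum.
import numpy as np
from scipy.ndimage import gaussian_filter

img = np.clip(np.random.rand(64, 64), 1e-3, 1.0)   # stand-in grayscale image
log_shading = gaussian_filter(np.log(img), sigma=8)  # smooth part ~ shading
shading = np.exp(log_shading)
albedo = img / shading                               # residual ~ reflectance
print(np.allclose(shading * albedo, img))            # True by construction
```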
Improving sampling, optimization and feature extraction in Boltzmann machines
Despite the current widescale success of deep learning in training large-scale hierarchical models through supervised learning, unsupervised learning promises to play a crucial role towards solving general Artificial Intelligence, where agents are expected to learn with little to no supervision. The work presented in this thesis tackles the problem of unsupervised feature learning and density estimation, using a model family at the heart of the deep learning phenomenon: the Boltzmann Machine (BM). We present contributions in the areas of sampling, partition function estimation, optimization, and the more general topic of invariant feature learning.
With regards to sampling, we present a novel adaptive parallel tempering method which dynamically adjusts the temperatures under simulation to maintain good mixing in the presence of complex multi-modal distributions. When used in the context of stochastic maximum likelihood (SML) training, the improved ergodicity of our sampler translates to increased robustness to learning rates and faster per-epoch convergence. Though our application is limited to BMs, our method is general and is applicable to sampling from arbitrary probabilistic models using Markov Chain Monte Carlo (MCMC) techniques. While SML gradients can be estimated via sampling, computing data likelihoods requires an estimate of the partition function. Contrary to previous approaches, which consider the model as a black box, we provide an efficient algorithm which instead tracks the change in the log partition function incurred by successive parameter updates. Our algorithm frames this estimation problem as one of filtering performed over a 2D lattice, with one dimension representing time and the other temperature.
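A minimal sketch of the parallel tempering mechanism this sampling contribution builds on: replicas run at different temperatures, and adjacent chains swap states under a Metropolis rule. The energies come from a toy bimodal density rather than a Boltzmann machine, and the adaptive temperature update itself is indicated only as a comment.

```python
# Parallel tempering on a toy two-mode energy: hot chains cross barriers,
# swaps propagate their states down to the cold chain.
import numpy as np

rng = np.random.default_rng(0)
energy = lambda x: -np.logaddexp(-(x - 3) ** 2, -(x + 3) ** 2)  # two modes
betas = np.array([1.0, 0.5, 0.25])           # inverse temperatures, cold to hot
states = rng.normal(size=betas.size)

for step in range(1000):
    for i, b in enumerate(betas):            # per-chain Metropolis move
        prop = states[i] + rng.normal()
        if np.log(rng.random()) < b * (energy(states[i]) - energy(prop)):
            states[i] = prop
    i = rng.integers(betas.size - 1)         # propose swapping neighbours
    d = (betas[i] - betas[i + 1]) * (energy(states[i]) - energy(states[i + 1]))
    if np.log(rng.random()) < d:
        states[[i, i + 1]] = states[[i + 1, i]]
    # (the thesis's adaptive method would adjust `betas` here to keep
    #  swap acceptance rates, and hence mixing, high during learning)
print(states)
```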
On the topic of optimization, our thesis presents a novel algorithm for applying the natural gradient to large scale Boltzmann Machines. Up until now, its application had been constrained by the computational and memory requirements of computing the Fisher Information Matrix (FIM), which is square in the number of parameters. The Metric-Free Natural Gradient algorithm (MFNG) avoids computing the FIM altogether by combining a linear solver with an efficient matrix-vector operation. The method shows promise in that the resulting updates yield faster per-epoch convergence, despite being slower in terms of wall clock time.
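A minimal sketch of the metric-free principle behind MFNG: solve F dx = g with conjugate gradients, touching the Fisher Information Matrix only through matrix-vector products, so F is never formed or inverted. The Fisher-vector product is a black-box callable here; in the thesis it would come from the Boltzmann machine's statistics.

```python
# Conjugate gradients for F x = g given only the map x -> F @ x.
import numpy as np

def cg_solve(fvp, g, iters=50, tol=1e-10):
    x = np.zeros_like(g)
    r = g - fvp(x)                   # residual
    p, rs = r.copy(), r @ r
    for _ in range(iters):
        Fp = fvp(p)
        a = rs / (p @ Fp)
        x += a * p
        r -= a * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# toy check against an explicit SPD stand-in for the Fisher matrix
A = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
print(cg_solve(lambda v: A @ v, g), np.linalg.solve(A, g))  # should agree
```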
Finally, we explore how invariant features can be learnt through modifications to the BM energy function. We study the problem in the context of the spike & slab Restricted Boltzmann Machine (ssRBM), which we extend to handle both binary and sparse input distributions. By associating each spike with several slab variables, latent variables can be made invariant to a rich, high-dimensional subspace, resulting in increased invariance in the learnt representation. When using the expected model posterior as input to a classifier, increased invariance translates to improved classification accuracy in the low-label data regime. We conclude by showing a connection between invariance and the more powerful concept of disentangling factors of variation. While invariance can be achieved by pooling over subspaces, disentangling can be achieved by learning multiple complementary views of the same subspace. In particular, we show how this can be achieved using third-order BMs featuring multiplicative interactions between pairs of random variables.
Deep learning of representations and its application to computer vision
The goal of this thesis is to present a few small steps along the road to solving general artificial intelligence. This is a thesis by articles containing four articles. Each of these articles presents a new method for performing perceptual inference using machine learning and deep architectures, and each demonstrates the utility of the proposed method in the context of a computer vision task. The methods are more generally applicable and in some cases have been applied to other kinds of tasks, but this thesis does not explore such applications. In the first article, we present two fast new variational inference algorithms for a generative model of images known as spike-and-slab sparse coding (S3C). These faster inference algorithms allow us to scale spike-and-slab sparse coding to unprecedented problem sizes and show that it is a superior feature extractor for object recognition tasks when very few labeled examples are available. We then build a new deep architecture, the partially-directed deep Boltzmann machine (PD-DBM), on top of the S3C model. This model was designed to simplify the training procedure for deep Boltzmann machines, which previously required a greedy layer-wise pretraining procedure. This model partially succeeds at solving this problem, but the cost of inference in the new model is high enough that it makes scaling the model to serious applications difficult. In the second article, we revisit the problem of jointly training deep Boltzmann machines. This time, rather than changing the model family, we present a new training criterion, resulting in multi-prediction deep Boltzmann machines (MP-DBMs). MP-DBMs may be trained in a single stage and obtain better classification accuracy than traditional DBMs. They are also able to classify well using standard variational inference techniques, rather than requiring a separate, specialized, discriminatively trained classifier to obtain good classification performance. However, this comes at the cost of the model not being able to generate good samples. The classification performance of deep Boltzmann machines is no longer especially interesting following recent advances in supervised learning, but the MP-DBM remains interesting because it can perform tasks that purely supervised models cannot, such as classification in the presence of missing inputs and imputation of missing inputs. The general zeitgeist of deep learning research changed dramatically during the midst of the work on this thesis with the introduction of Geoffrey Hinton's dropout algorithm. Dropout permits purely supervised training of feedforward architectures with little overfitting. The third paper in this thesis presents a new activation function for feedforward neural networks which was explicitly designed to work well with dropout. This activation function, called maxout, makes it possible to learn architectures that leverage the benefits of cross-channel pooling in a purely supervised manner.
We demonstrate improvements on several object recognition tasks using this activation function. Finally, we solve a real-world task: transcription of photos of multi-digit house numbers for geo-coding. Using maxout units and a new kind of output layer for convolutional neural networks, we demonstrate human-level accuracy (with limited coverage) on a challenging real-world dataset. This system has been deployed at Google and successfully used to transcribe nearly 100 million house numbers.
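A hedged sketch of the kind of output layer this describes for multi-digit transcription: shared convolutional features feed one softmax over the sequence length and one softmax per digit position, so the whole number is predicted in a single forward pass. The feature dimension and maximum length are illustrative assumptions.

```python
# Multi-digit output head: one length classifier + per-position digit
# classifiers on top of shared conv-net features.
import torch
import torch.nn as nn

class MultiDigitHead(nn.Module):
    def __init__(self, feat_dim=512, max_digits=5):
        super().__init__()
        self.length = nn.Linear(feat_dim, max_digits + 1)   # 0..max digits
        self.digits = nn.ModuleList(
            nn.Linear(feat_dim, 10) for _ in range(max_digits))
    def forward(self, feats):
        return self.length(feats), [d(feats) for d in self.digits]

feats = torch.randn(8, 512)                    # stand-in conv-net features
length_logits, digit_logits = MultiDigitHead()(feats)
print(length_logits.shape, digit_logits[0].shape)  # (8, 6) and (8, 10)
```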