
    Denoising OCT Images Using Steered Mixture of Experts with Multi-Model Inference

    Full text link
    In Optical Coherence Tomography (OCT), speckle noise significantly hampers image quality, affecting diagnostic accuracy. Current methods, including traditional filtering and deep learning techniques, have limitations in noise reduction and detail preservation. Addressing these challenges, this study introduces a novel denoising algorithm, Block-Matching Steered-Mixture of Experts with Multi-Model Inference and Autoencoder (BM-SMoE-AE). This method combines a block-matched implementation of the SMoE algorithm with an enhanced autoencoder architecture, offering efficient speckle noise reduction while retaining critical image details. Our method stands out by providing improved edge definition and reduced processing time. Comparative analysis with existing denoising techniques demonstrates the superior performance of BM-SMoE-AE in maintaining image integrity and enhancing OCT image usability for medical diagnostics. Comment: This submission contains 10 pages and 4 figures. It was presented at the 2024 SPIE Photonics West, held in San Francisco. The paper details advancements in photonics applications related to healthcare and includes supplementary material with additional datasets for review.
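    For readers unfamiliar with the SMoE representation underlying this method, the sketch below illustrates how a block of pixels can be reconstructed from a handful of steered Gaussian kernels with soft gating. It is a minimal illustration only: the kernel parameters, the constant per-kernel experts, and the block size are assumptions, and the block-matching and autoencoder components of BM-SMoE-AE are not shown.

        # Minimal sketch of steered mixture-of-experts (SMoE) reconstruction for one
        # image block, assuming kernel parameters are already fitted (e.g. by EM);
        # all parameter values below are illustrative.
        import numpy as np

        def smoe_reconstruct(coords, centers, covariances, expert_values):
            """Reconstruct pixel values at `coords` (N, 2) from K steered Gaussian
            kernels with per-kernel constant experts `expert_values` (K,)."""
            K = centers.shape[0]
            responses = np.empty((coords.shape[0], K))
            for k in range(K):
                diff = coords - centers[k]                  # (N, 2)
                inv_cov = np.linalg.inv(covariances[k])     # steering via full covariance
                responses[:, k] = np.exp(-0.5 * np.sum(diff @ inv_cov * diff, axis=1))
            gates = responses / np.maximum(responses.sum(axis=1, keepdims=True), 1e-12)
            return gates @ expert_values                    # soft gating over experts

        # Toy usage on an 8x8 block with two kernels steered along different directions.
        ys, xs = np.mgrid[0:8, 0:8]
        coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
        centers = np.array([[2.0, 2.0], [6.0, 5.0]])
        covariances = np.array([[[4.0, 1.5], [1.5, 1.0]],
                                [[1.0, -0.5], [-0.5, 3.0]]])
        expert_values = np.array([0.2, 0.9])
        block = smoe_reconstruct(coords, centers, covariances, expert_values).reshape(8, 8)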

    Steered mixture-of-experts for light field images and video : representation and coding

    Get PDF
    Research in light field (LF) processing has increased considerably over the last decade. This is largely driven by the desire to achieve the same level of immersion and navigational freedom for camera-captured scenes as is currently available for CGI content. Standardization organizations such as MPEG and JPEG continue to follow conventional coding paradigms in which viewpoints are discretely represented on 2-D regular grids. These grids are then further decorrelated through hybrid DPCM/transform techniques. However, these 2-D regular grids are less suited for high-dimensional data, such as LFs. We propose a novel coding framework for higher-dimensional image modalities, called Steered Mixture-of-Experts (SMoE). Coherent areas in the higher-dimensional space are represented by single higher-dimensional entities, called kernels. These kernels hold spatially localized information about light rays at any angle arriving at a certain region. The global model thus consists of a set of kernels which define a continuous approximation of the underlying plenoptic function. We introduce the theory of SMoE and illustrate its application for 2-D images, 4-D LF images, and 5-D LF video. We also propose an efficient coding strategy to convert the model parameters into a bitstream. Even without provisions for high-frequency information, the proposed method performs comparably to the state of the art for low-to-mid range bitrates with respect to the subjective visual quality of 4-D LF images. In the case of 5-D LF video, we observe superior decorrelation and coding performance, with coding gains of a factor of 4 in bitrate at the same quality. At least equally important is the fact that our method inherently provides functionality desirable for LF rendering that is lacking in other state-of-the-art techniques: (1) full zero-delay random access, (2) light-weight pixel-parallel view reconstruction, and (3) intrinsic view interpolation and super-resolution.
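    As an illustration of the general idea of converting model parameters into a bitstream, the sketch below uniformly quantizes kernel parameters and packs them as length-prefixed integer arrays. The step size, layout, and parameter names are illustrative assumptions, not the coding strategy proposed in the paper.

        # Minimal sketch: serialize SMoE kernel parameters as a byte string via
        # uniform quantization, and read them back. Step size and layout are assumed.
        import struct
        import numpy as np

        def quantize(params, step):
            """Uniformly quantize a float array to little-endian signed 16-bit indices."""
            q = np.clip(np.round(np.asarray(params, dtype=np.float64) / step), -32768, 32767)
            return q.astype("<i2")

        def pack_model(centers, covariances, expert_values, step=0.01):
            """Serialize quantized parameter groups as length-prefixed int16 arrays."""
            payload = bytearray()
            for arr in (centers, covariances, expert_values):
                q = quantize(np.ravel(arr), step)
                payload += struct.pack("<I", int(q.size)) + q.tobytes()
            return bytes(payload)

        def unpack_array(buf, offset, step):
            """Read one length-prefixed int16 array back and dequantize it."""
            (n,) = struct.unpack_from("<I", buf, offset)
            offset += 4
            q = np.frombuffer(buf, dtype="<i2", count=n, offset=offset)
            return q.astype(np.float64) * step, offset + 2 * n

        # Toy usage: a real 4-D LF kernel would also carry angular coordinates;
        # a single 2-D kernel keeps the example short.
        bits = pack_model([[2.0, 2.0]], [[[4.0, 1.5], [1.5, 1.0]]], [0.5])
        decoded_centers, next_offset = unpack_array(bits, 0, 0.01)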

    Adapting Computer Vision Models To Limitations On Input Dimensionality And Model Complexity

    Get PDF
    In distributed systems where visual sensors communicate with remote predictive models, data traffic is limited by the capacity of the communication channels, and hardware limits how much collected data can be processed prior to transmission. We study novel methods of adapting visual inference to such limitations on complexity and data availability at test time. The contributions detailed in this thesis consider both task-specific and task-generic approaches to reducing the data requirement for inference, and evaluate our proposed methods on a wide range of computer vision tasks. This thesis makes four distinct contributions: (i) We investigate multi-class action classification via two-stream convolutional neural networks that directly ingest information extracted from compressed video bitstreams. We show that selective access to macroblock motion vector information provides a good low-dimensional approximation of the underlying optical flow in visual sequences. (ii) We devise a bitstream cropping method by which AVC/H.264 and H.265 bitstreams are reduced to the minimum amount of elements necessary for optical flow extraction, while maintaining compliance with codec standards. We additionally study the effect of codec rate-quality control on the sparsity and noise incurred on optical flow derived from the resulting bitstreams, and do so for multiple coding standards. (iii) We demonstrate degrees of variability in the amount of data required for action classification, and leverage this to reduce the dimensionality of input volumes by inferring the required temporal extent for accurate classification prior to processing via learnable machines. (iv) We extend the Mixtures-of-Experts (MoE) paradigm to adapt the data cost of inference for any set of constituent experts. We postulate that the minimum acceptable data cost of inference varies for different input space partitions, and consider mixtures where each expert is designed to meet a different set of constraints on input dimensionality. To take advantage of the flexibility of such mixtures in processing different input representations and modalities, we train biased gating functions such that experts requiring less information to make their inferences are favoured over others. Finally, we note that our proposed data utility optimization solutions include a learnable component which considers specified priorities on the amount of information to be used prior to inference, and can be realized for any combination of tasks, modalities, and constraints on available data.
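    The sketch below illustrates contribution (iv): a gating function whose logits are penalized by each expert's data cost, so that experts requiring less information are favoured unless a costlier expert is clearly more confident. The cost values, the penalty weight lam, and the expert outputs are illustrative assumptions rather than the trained gating networks described in the thesis.

        # Minimal sketch of a cost-biased mixture-of-experts gate; all numbers are toy values.
        import numpy as np

        def biased_gate(gate_logits, expert_costs, lam=1.0):
            """Softmax over gating logits penalised by each expert's data cost."""
            logits = gate_logits - lam * np.asarray(expert_costs)
            logits -= logits.max()                     # numerical stability
            w = np.exp(logits)
            return w / w.sum()

        # Two experts: one ingests a 16-frame clip, one a cheap 4-frame summary.
        expert_costs = [16.0, 4.0]                     # e.g. number of frames each expert needs
        gate_logits = np.array([2.1, 1.8])             # raw confidence in each expert
        weights = biased_gate(gate_logits, expert_costs, lam=0.1)
        # With the cost bias, the cheaper expert dominates unless the expensive
        # one is markedly more confident.
        prediction = weights @ np.array([0.7, 0.4])    # gate-weighted expert outputs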

    A Machine Learning Approach to Indoor Localization Data Mining

    Get PDF
    Indoor positioning systems are increasingly commonplace in various environments and produce large quantities of data. They are used in industrial applications, robotics, and asset and employee tracking, to name just a few use cases. The growing amount of data and the accelerating progress of machine learning open up many new possibilities for analyzing this data in ways that were not conceivable or relevant before. This thesis introduces connected concepts and implementations to answer the question of how this data can be utilized. The data gathered in this thesis originates from an indoor positioning system deployed in a retail environment, but the discussed methods can be applied generally. The issue is approached by first introducing the concepts of machine learning and, more generally, artificial intelligence, and how they work at a general level. A deeper dive is then taken into the subfields and algorithms that are relevant to the data mining task at hand. Indoor positioning system basics are also briefly discussed to establish a baseline understanding of the realistic capabilities and constraints of these kinds of systems. These methods and prior knowledge from the literature are put to the test with the freshly gathered data. An algorithm based on an existing example from the literature was tested and improved upon with the new data. A novel method to cluster and classify movement patterns was introduced, utilizing deep learning to create embedded representations of the trajectories in a more complex learning pipeline. This type of learning is often referred to as deep clustering. The results are promising: both methods produce useful high-level representations of the complex dataset that can help a human operator discern relevant patterns from raw data and that can serve as input for subsequent supervised and unsupervised learning steps. Several factors related to optimizing the learning pipeline, such as regularization, were also researched and the results presented as visualizations. The research found that a pipeline consisting of a CNN autoencoder followed by a classic clustering algorithm such as DBSCAN produces useful results in the form of trajectory clusters, and that regularization such as an L1 penalty improves this performance. The research done in this thesis presents useful algorithms for processing raw, noisy localization data from indoor environments that can be used for further implementations in both industrial applications and academia.
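    A minimal sketch of the pipeline described above, assuming fixed-length (x, y) trajectories: a small 1-D convolutional autoencoder (PyTorch) produces embeddings, an L1 penalty regularizes the latent code, and DBSCAN (scikit-learn) clusters the embeddings. Layer sizes, the penalty weight, and the synthetic data are assumptions.

        # Sketch of a deep-clustering pipeline: CNN autoencoder embeddings + DBSCAN.
        import torch
        import torch.nn as nn
        from sklearn.cluster import DBSCAN

        class TrajectoryAutoencoder(nn.Module):
            def __init__(self, length=64, latent=8):
                super().__init__()
                self.encoder = nn.Sequential(
                    nn.Conv1d(2, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                    nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
                    nn.Flatten(),
                    nn.Linear(32 * (length // 4), latent),
                )
                self.decoder = nn.Sequential(
                    nn.Linear(latent, 2 * length),
                    nn.Unflatten(1, (2, length)),
                )

            def forward(self, x):
                z = self.encoder(x)
                return self.decoder(z), z

        # Toy training loop with an L1 penalty on the latent code; real indoor
        # positioning trajectories would replace the random tensor below.
        traj = torch.randn(256, 2, 64)
        model = TrajectoryAutoencoder()
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        for _ in range(50):
            recon, z = model(traj)
            loss = nn.functional.mse_loss(recon, traj) + 1e-3 * z.abs().mean()
            opt.zero_grad()
            loss.backward()
            opt.step()

        with torch.no_grad():
            _, embeddings = model(traj)
        labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embeddings.numpy())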

    Feedforward deep architectures for classification and synthesis

    Full text link
    Cette thĂšse par articles prĂ©sente plusieurs contributions au domaine de l'apprentissage de reprĂ©sentations profondes, avec des applications aux problĂšmes de classification et de synthĂšse d'images naturelles. Plus spĂ©cifiquement, cette thĂšse prĂ©sente plusieurs nouvelles techniques pour la construction et l'entraĂźnement de rĂ©seaux neuronaux profonds, ainsi qu'une Ă©tude empirique de la technique de «dropout», une des approches de rĂ©gularisation les plus populaires des derniĂšres annĂ©es. Le premier article prĂ©sente une nouvelle fonction d'activation linĂ©aire par morceaux, appelĂ©e «maxout», qui permet Ă  chaque unitĂ© cachĂ©e d'un rĂ©seau de neurones d'apprendre sa propre fonction d'activation convexe. Nous dĂ©montrons une performance amĂ©liorĂ©e sur plusieurs tĂąches d'Ă©valuation du domaine de la reconnaissance d'objets, et nous examinons empiriquement les sources de cette amĂ©lioration, y compris une meilleure synergie avec la mĂ©thode de rĂ©gularisation «dropout» rĂ©cemment proposĂ©e. Le second article poursuit l'examen de la technique «dropout». Nous nous concentrons sur les rĂ©seaux avec fonctions d'activation rectifiĂ©es linĂ©aires (ReLU) et rĂ©pondons empiriquement Ă  plusieurs questions concernant l'efficacitĂ© remarquable de «dropout» en tant que rĂ©gularisateur, incluant les questions portant sur la mĂ©thode rapide de rĂ©Ă©chelonnement au temps de l'Ă©valuation et la moyenne gĂ©omĂ©trique que cette mĂ©thode approxime, l'interprĂ©tation d'ensemble comparĂ©e aux ensembles traditionnels, et l'importance d'employer des critĂšres similaires au «bagging» pour l'optimisation. Le troisiĂšme article s'intĂ©resse Ă  un problĂšme pratique de l'application Ă  l'Ă©chelle industrielle de rĂ©seaux neuronaux profonds au problĂšme de reconnaissance d'objets avec plusieurs Ă©tiquettes, nommĂ©ment l'amĂ©lioration de la capacitĂ© d'un modĂšle Ă  discriminer entre des Ă©tiquettes frĂ©quemment confondues. Nous rĂ©solvons le problĂšme en employant les prĂ©dictions du rĂ©seau pour partitionner l'espace des Ă©tiquettes, puis en augmentant le rĂ©seau de sous-composantes dĂ©diĂ©es Ă  chaque sous-ensemble de la partition. Finalement, le quatriĂšme article s'attaque au problĂšme de l'entraĂźnement des modĂšles gĂ©nĂ©ratifs adversariaux (GAN) rĂ©cemment proposĂ©s. Nous prĂ©sentons une procĂ©dure d'entraĂźnement amĂ©liorĂ©e employant un auto-encodeur dĂ©bruitant, entraĂźnĂ© dans un espace de caractĂ©ristiques abstrait appris par le discriminateur, pour guider le gĂ©nĂ©rateur Ă  apprendre un encodage qui s'aligne de plus prĂšs sur les donnĂ©es. Nous Ă©valuons le modĂšle avec le score «Inception» rĂ©cemment proposĂ©.
    This thesis by articles makes several contributions to the field of deep learning, with applications to both classification and synthesis of natural images. Specifically, we introduce several new techniques for the construction and training of deep feedforward networks, and present an empirical investigation into dropout, one of the most popular regularization strategies of the last several years. In the first article, we present a novel piece-wise linear parameterization of neural networks, maxout, which allows each hidden unit of a neural network to effectively learn its own convex activation function. We demonstrate improvements on several object recognition benchmarks, and empirically investigate the source of these improvements, including an improved synergy with the recently proposed dropout regularization method. In the second article, we further interrogate the dropout algorithm in particular. Focusing on networks using the popular rectified linear unit (ReLU) activation, we empirically examine several questions regarding dropout's remarkable effectiveness as a regularizer, including questions surrounding the fast test-time rescaling trick and the geometric mean it approximates, interpretations as an ensemble as compared with traditional ensembles, and the importance of using a bagging-like criterion for optimization. In the third article, we address a practical problem in the industrial-scale application of deep networks for multi-label object recognition, namely improving an existing model's ability to discriminate between frequently confused classes. We accomplish this by using the network's own predictions to inform a partitioning of the label space, and augment the network with dedicated discriminative capacity addressing each of the partitions. Finally, in the fourth article, we tackle the problem of fitting implicit generative models of open-domain collections of natural images using the recently introduced Generative Adversarial Networks (GAN) paradigm. We introduce an augmented training procedure which employs a denoising autoencoder, trained in a high-level feature space learned by the discriminator, to guide the generator towards feature encodings which more closely resemble the data. We quantitatively evaluate our findings using the recently proposed Inception score.
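    The sketch below illustrates the two ingredients discussed above: a maxout layer, which takes a max over groups of linear pieces so that each unit learns a convex piecewise-linear activation, and the dropout test-time rescaling trick, which scales activations by the keep probability to approximate the geometric mean over the ensemble of sub-networks. Layer sizes and the dropout rate are illustrative assumptions.

        # Minimal sketch of maxout and of dropout's train/test behaviour.
        import numpy as np

        rng = np.random.default_rng(0)

        def maxout(x, W, b, num_pieces):
            """Affine map to (out_dim * num_pieces) units, then a max over each
            group of `num_pieces` linear pieces."""
            z = x @ W + b                                   # (batch, out_dim * pieces)
            z = z.reshape(x.shape[0], -1, num_pieces)
            return z.max(axis=2)

        def dropout_train(h, p=0.5):
            """Training-time dropout: zero each unit independently with probability p."""
            mask = rng.random(h.shape) >= p
            return h * mask

        def dropout_test(h, p=0.5):
            """Test-time weight-scaling approximation: multiply by the keep probability."""
            return h * (1.0 - p)

        x = rng.standard_normal((4, 10))
        W = rng.standard_normal((10, 3 * 2)) * 0.1
        b = np.zeros(3 * 2)
        h = maxout(x, W, b, num_pieces=2)
        h_train = dropout_train(h)
        h_test = dropout_test(h)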

    Learning continuous models for estimating intrinsic component images

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Also issued in pages. MIT Rotch Library copy: issued in pages. Includes bibliographical references (leaves 137-144).
    The goal of computer vision is to use an image to recover the characteristics of a scene, such as its shape or illumination. This is difficult because an image is the mixture of multiple characteristics. For example, an edge in an image could be caused by either an edge on a surface or a change in the surface's color. Distinguishing the effects of different scene characteristics is an important step towards high-level analysis of an image. This thesis describes how to use machine learning to build a system that recovers different characteristics of the scene from a single, gray-scale image of the scene. The goal of the system is to use the observed image to recover images, referred to as Intrinsic Component Images, that represent the scene's characteristics. The development of the system is focused on estimating two important characteristics of a scene, its shading and reflectance, from a single image. From the observed image, the system estimates a shading image, which captures the interaction of the illumination and shape of the scene pictured, and an albedo image, which represents how the surfaces in the image reflect light. Measured both qualitatively and quantitatively, this system produces state-of-the-art estimates of shading and albedo images. This system is also flexible enough to be used for the separate problem of removing noise from an image. Building this system requires algorithms for continuous regression and learning the parameters of a Conditionally Gaussian Markov Random Field. Unlike previous work, this system is trained using real-world surfaces with ground-truth shading and albedo images. The learning algorithms are designed to accommodate the large amount of data in this training set. By Marshall Friend Tappen. Ph.D.
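    The decomposition targeted by this thesis can be made concrete with a toy example: an image factors into shading times albedo, so the two components add in the log domain. The sketch below uses a simple Retinex-style gradient threshold to split the components; it is an illustrative stand-in under stated assumptions, not the learned Conditionally Gaussian MRF system the thesis describes.

        # Toy intrinsic-image split: image = shading * albedo, handled in the log domain.
        import numpy as np

        def decompose_log_gradients(log_image, threshold=0.3):
            """Assign each horizontal log-gradient to albedo (large changes) or
            shading (small, smooth changes) and reintegrate along rows."""
            grad = np.diff(log_image, axis=1)
            albedo_grad = np.where(np.abs(grad) > threshold, grad, 0.0)
            log_albedo = np.concatenate(
                [np.zeros((log_image.shape[0], 1)), np.cumsum(albedo_grad, axis=1)], axis=1)
            log_shading = log_image - log_albedo      # enforces image = shading * albedo
            return np.exp(log_shading), np.exp(log_albedo)

        # Toy image: a smooth shading ramp times a piecewise-constant albedo pattern.
        shading = np.linspace(0.2, 1.0, 64)[None, :].repeat(64, axis=0)
        albedo = np.where(np.arange(64)[None, :] < 32, 0.4, 0.9).repeat(64, axis=0)
        image = shading * albedo
        est_shading, est_albedo = decompose_log_gradients(np.log(image))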

    Improving sampling, optimization and feature extraction in Boltzmann machines

    Full text link
    L'apprentissage supervisĂ© de rĂ©seaux hiĂ©rarchiques Ă  grande Ă©chelle connaĂźt prĂ©sentement un succĂšs fulgurant. MalgrĂ© cette effervescence, l'apprentissage non supervisĂ© reprĂ©sente toujours, selon plusieurs chercheurs, un Ă©lĂ©ment clĂ© de l'Intelligence Artificielle, oĂč les agents doivent apprendre Ă  partir d'un nombre potentiellement limitĂ© de donnĂ©es. Cette thĂšse s'inscrit dans cette pensĂ©e et aborde divers sujets de recherche liĂ©s au problĂšme d'estimation de densitĂ© par l'entremise des machines de Boltzmann (BM), modĂšles graphiques probabilistes au coeur de l'apprentissage profond. Nos contributions touchent les domaines de l'Ă©chantillonnage, l'estimation de fonctions de partition, l'optimisation ainsi que l'apprentissage de reprĂ©sentations invariantes. Cette thĂšse dĂ©bute par l'exposition d'un nouvel algorithme d'Ă©chantillonnage adaptatif, qui ajuste (de façon automatique) la tempĂ©rature des chaĂźnes de Markov sous simulation, afin de maintenir une vitesse de convergence Ă©levĂ©e tout au long de l'apprentissage. Lorsqu'utilisĂ© dans le contexte de l'apprentissage par maximum de vraisemblance stochastique (SML), notre algorithme engendre une robustesse accrue face Ă  la sĂ©lection du taux d'apprentissage, ainsi qu'une meilleure vitesse de convergence. Nos rĂ©sultats sont prĂ©sentĂ©s dans le domaine des BM, mais la mĂ©thode est gĂ©nĂ©rale et applicable Ă  l'apprentissage de tout modĂšle probabiliste exploitant l'Ă©chantillonnage par chaĂźnes de Markov. Tandis que le gradient du maximum de vraisemblance peut ĂȘtre approximĂ© par Ă©chantillonnage, l'Ă©valuation de la log-vraisemblance nĂ©cessite un estimĂ© de la fonction de partition. Contrairement aux approches traditionnelles qui considĂšrent un modĂšle donnĂ© comme une boĂźte noire, nous proposons plutĂŽt d'exploiter la dynamique de l'apprentissage en estimant les changements successifs de log-partition encourus Ă  chaque mise Ă  jour des paramĂštres. Le problĂšme d'estimation est reformulĂ© comme un problĂšme d'infĂ©rence similaire au filtre de Kalman, mais sur un graphe bidimensionnel, oĂč les dimensions correspondent aux axes du temps et au paramĂštre de tempĂ©rature. Sur le thĂšme de l'optimisation, nous prĂ©sentons Ă©galement un algorithme permettant d'appliquer, de maniĂšre efficace, le gradient naturel Ă  des machines de Boltzmann comportant des milliers d'unitĂ©s. Jusqu'Ă  prĂ©sent, son adoption Ă©tait limitĂ©e par son haut coĂ»t computationnel ainsi que sa demande en mĂ©moire. Notre algorithme, Metric-Free Natural Gradient (MFNG), permet d'Ă©viter le calcul explicite de la matrice d'information de Fisher (et son inverse) en exploitant un solveur linĂ©aire combinĂ© Ă  un produit matrice-vecteur efficace. L'algorithme est prometteur: en termes du nombre d'Ă©valuations de fonctions, MFNG converge plus rapidement que SML. Son implĂ©mentation demeure malheureusement inefficace en temps de calcul. Ces travaux explorent Ă©galement les mĂ©canismes sous-jacents Ă  l'apprentissage de reprĂ©sentations invariantes. À cette fin, nous utilisons la famille de machines de Boltzmann restreintes «spike & slab» (ssRBM), que nous modifions afin de pouvoir modĂ©liser des distributions binaires et parcimonieuses. Les variables latentes binaires de la ssRBM peuvent ĂȘtre rendues invariantes Ă  un sous-espace vectoriel, en associant Ă  chacune d'elles un vecteur de variables latentes continues (dĂ©nommĂ©es «slabs»). Ceci se traduit par une invariance accrue au niveau de la reprĂ©sentation et un meilleur taux de classification lorsque peu de donnĂ©es Ă©tiquetĂ©es sont disponibles. Nous terminons cette thĂšse sur un sujet ambitieux: l'apprentissage de reprĂ©sentations pouvant sĂ©parer les facteurs de variation prĂ©sents dans le signal d'entrĂ©e. Nous proposons une solution Ă  base de ssRBM bilinĂ©aire (avec deux groupes de facteurs latents) et formulons le problĂšme comme l'un de «pooling» dans des sous-espaces vectoriels complĂ©mentaires.
    Despite the current widescale success of deep learning in training large-scale hierarchical models through supervised learning, unsupervised learning promises to play a crucial role towards solving general Artificial Intelligence, where agents are expected to learn with little to no supervision. The work presented in this thesis tackles the problem of unsupervised feature learning and density estimation, using a model family at the heart of the deep learning phenomenon: the Boltzmann Machine (BM). We present contributions in the areas of sampling, partition function estimation, optimization and the more general topic of invariant feature learning. With regards to sampling, we present a novel adaptive parallel tempering method which dynamically adjusts the temperatures under simulation to maintain good mixing in the presence of complex multi-modal distributions. When used in the context of stochastic maximum likelihood (SML) training, the improved ergodicity of our sampler translates to increased robustness to learning rates and faster per-epoch convergence. Though our application is limited to BMs, our method is general and is applicable to sampling from arbitrary probabilistic models using Markov Chain Monte Carlo (MCMC) techniques. While SML gradients can be estimated via sampling, computing data likelihoods requires an estimate of the partition function. Contrary to previous approaches which consider the model as a black box, we provide an efficient algorithm which instead tracks the change in the log partition function incurred by successive parameter updates. Our algorithm frames this estimation problem as one of filtering performed over a 2D lattice, with one dimension representing time and the other temperature. On the topic of optimization, our thesis presents a novel algorithm for applying the natural gradient to large-scale Boltzmann Machines. Up until now, its application had been constrained by the computational and memory requirements of computing the Fisher Information Matrix (FIM), whose size is quadratic in the number of parameters. The Metric-Free Natural Gradient algorithm (MFNG) avoids computing the FIM altogether by combining a linear solver with an efficient matrix-vector operation. The method shows promise in that the resulting updates yield faster per-epoch convergence, despite being slower in terms of wall-clock time. Finally, we explore how invariant features can be learnt through modifications to the BM energy function. We study the problem in the context of the spike & slab Restricted Boltzmann Machine (ssRBM), which we extend to handle both binary and sparse input distributions. By associating each spike with several slab variables, latent variables can be made invariant to a rich, high-dimensional subspace, resulting in increased invariance in the learnt representation. When using the expected model posterior as input to a classifier, increased invariance translates to improved classification accuracy in the low-label data regime. We conclude by showing a connection between invariance and the more powerful concept of disentangling factors of variation. While invariance can be achieved by pooling over subspaces, disentangling can be achieved by learning multiple complementary views of the same subspace. In particular, we show how this can be achieved using third-order BMs featuring multiplicative interactions between pairs of random variables.
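    The metric-free natural gradient idea can be illustrated with a matrix-free linear solve: rather than forming the Fisher Information Matrix, Fisher-vector products are estimated from samples and fed to a conjugate-gradient solver. In the sketch below the per-sample statistics are random stand-ins rather than MCMC samples from a Boltzmann machine, and the damping value is an assumption.

        # Minimal sketch of a metric-free natural-gradient step: solve F x = g with
        # conjugate gradient, where F-vector products come from samples only.
        import numpy as np
        from scipy.sparse.linalg import LinearOperator, cg

        rng = np.random.default_rng(0)
        n_params, n_samples = 50, 200
        # Per-sample sufficient statistics (for a BM these would come from MCMC samples).
        S = rng.standard_normal((n_samples, n_params))
        S -= S.mean(axis=0)
        grad = rng.standard_normal(n_params)       # stochastic gradient of the loss

        def fisher_vec(v, damping=1e-3):
            """Estimate F v = E[s s^T] v from samples without materialising F."""
            return S.T @ (S @ v) / n_samples + damping * v

        F_op = LinearOperator((n_params, n_params), matvec=fisher_vec)
        natural_grad, info = cg(F_op, grad, maxiter=100)   # info == 0 means converged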

    Deep learning of representations and its application to computer vision

    Get PDF
    L'objectif de cette thĂšse par articles est de prĂ©senter modestement quelques Ă©tapes du parcours qui mĂšnera (on l'espĂšre) Ă  une solution gĂ©nĂ©rale du problĂšme de l'intelligence artificielle. Cette thĂšse contient quatre articles qui prĂ©sentent chacun une nouvelle mĂ©thode d'infĂ©rence perceptive utilisant l'apprentissage machine et, plus particuliĂšrement, les rĂ©seaux neuronaux profonds. Chacun de ces documents met en Ă©vidence l'utilitĂ© de sa mĂ©thode proposĂ©e dans le cadre d'une tĂąche de vision par ordinateur. Ces mĂ©thodes sont applicables dans un contexte plus gĂ©nĂ©ral, et dans certains cas elles ont Ă©tĂ© appliquĂ©es ailleurs, mais ceci ne sera pas abordĂ© dans le contexte de cette thĂšse. Dans le premier article, nous prĂ©sentons deux nouveaux algorithmes d'infĂ©rence variationnelle pour le modĂšle gĂ©nĂ©ratif d'images appelĂ© codage parcimonieux «spike-and-slab» (CPSS). Ces mĂ©thodes d'infĂ©rence plus rapides nous permettent d'utiliser des modĂšles CPSS de tailles beaucoup plus grandes qu'auparavant. Nous dĂ©montrons qu'elles sont meilleures pour extraire des dĂ©tecteurs de caractĂ©ristiques quand trĂšs peu d'exemples Ă©tiquetĂ©s sont disponibles pour l'entraĂźnement. Partant d'un modĂšle CPSS, nous construisons ensuite une architecture profonde, la machine de Boltzmann profonde partiellement dirigĂ©e (MBP-PD). Ce modĂšle a Ă©tĂ© conçu de maniĂšre Ă  simplifier l'entraĂźnement des machines de Boltzmann profondes, qui nĂ©cessitent normalement une phase de prĂ©-entraĂźnement glouton pour chaque couche. Ce problĂšme est rĂ©glĂ© dans une certaine mesure, mais le coĂ»t d'infĂ©rence dans le nouveau modĂšle demeure trop Ă©levĂ© pour permettre de l'utiliser de maniĂšre pratique. Dans le deuxiĂšme article, nous revenons au problĂšme d'entraĂźnement joint de machines de Boltzmann profondes. Cette fois, au lieu de changer de famille de modĂšles, nous introduisons un nouveau critĂšre d'entraĂźnement qui donne naissance aux machines de Boltzmann profondes Ă  multiples prĂ©dictions (MBP-MP). Les MBP-MP sont entraĂźnables en une seule Ă©tape et ont un meilleur taux de succĂšs en classification que les MBP classiques. Elles s'entraĂźnent aussi avec des mĂ©thodes variationnelles standard au lieu de nĂ©cessiter un classificateur discriminant pour obtenir un bon taux de succĂšs en classification. Par contre, un des inconvĂ©nients de tels modĂšles est leur incapacitĂ© Ă  gĂ©nĂ©rer des Ă©chantillons, mais ceci n'est pas trop grave puisque la performance en classification des machines de Boltzmann profondes n'est plus une prioritĂ© Ă©tant donnĂ© les derniĂšres avancĂ©es en apprentissage supervisĂ©. MalgrĂ© cela, les MBP-MP demeurent intĂ©ressantes parce qu'elles sont capables d'accomplir certaines tĂąches que des modĂšles purement supervisĂ©s ne peuvent pas faire, telles que classifier des donnĂ©es incomplĂštes ou encore combler intelligemment l'information manquante dans ces donnĂ©es incomplĂštes. Le travail prĂ©sentĂ© dans cette thĂšse s'est dĂ©roulĂ© au milieu d'une pĂ©riode de transformations importantes du domaine de l'apprentissage Ă  rĂ©seaux neuronaux profonds, dĂ©clenchĂ©e par la dĂ©couverte de l'algorithme de «dropout» par Geoffrey Hinton. Dropout rend possible un entraĂźnement purement supervisĂ© d'architectures de propagation unidirectionnelle sans ĂȘtre exposĂ© au danger de surentraĂźnement. Le troisiĂšme article prĂ©sentĂ© dans cette thĂšse introduit une nouvelle fonction d'activation spĂ©cialement conçue pour aller de pair avec l'algorithme de dropout. Cette fonction d'activation, appelĂ©e maxout, permet l'utilisation de l'agrĂ©gation multi-canal dans un contexte d'apprentissage purement supervisĂ©. Nous dĂ©montrons comment plusieurs tĂąches de reconnaissance d'objets sont mieux accomplies par l'utilisation de maxout. Pour terminer, nous prĂ©sentons un vrai cas d'utilisation dans l'industrie pour la transcription d'adresses de maisons Ă  plusieurs chiffres. En combinant maxout avec une nouvelle sorte de couche de sortie pour des rĂ©seaux neuronaux de convolution, nous dĂ©montrons qu'il est possible d'atteindre un taux de succĂšs comparable Ă  celui des humains sur un ensemble de donnĂ©es coriace constituĂ© de photos prises par les voitures de Google. Ce systĂšme a Ă©tĂ© dĂ©ployĂ© avec succĂšs chez Google pour lire environ cent millions d'adresses de maisons.
    The goal of this thesis is to present a few small steps along the road to solving general artificial intelligence. This is a thesis by articles containing four articles. Each of these articles presents a new method for performing perceptual inference using machine learning and deep architectures. Each of these papers demonstrates the utility of the proposed method in the context of a computer vision task. The methods are more generally applicable and in some cases have been applied to other kinds of tasks, but this thesis does not explore such applications. In the first article, we present two fast new variational inference algorithms for a generative model of images known as spike-and-slab sparse coding (S3C). These faster inference algorithms allow us to scale spike-and-slab sparse coding to unprecedented problem sizes and show that it is a superior feature extractor for object recognition tasks when very few labeled examples are available. We then build a new deep architecture, the partially-directed deep Boltzmann machine (PD-DBM), on top of the S3C model. This model was designed to simplify the training procedure for deep Boltzmann machines, which previously required a greedy layer-wise pretraining procedure. This model partially succeeds at solving this problem, but the cost of inference in the new model is high enough that it makes scaling the model to serious applications difficult. In the second article, we revisit the problem of jointly training deep Boltzmann machines. This time, rather than changing the model family, we present a new training criterion, resulting in multi-prediction deep Boltzmann machines (MP-DBMs). MP-DBMs may be trained in a single stage and obtain better classification accuracy than traditional DBMs. They are also able to classify well using standard variational inference techniques, rather than requiring a separate, specialized, discriminatively trained classifier to obtain good classification performance. However, this comes at the cost of the model not being able to generate good samples. The classification performance of deep Boltzmann machines is no longer especially interesting following recent advances in supervised learning, but the MP-DBM remains interesting because it can perform tasks that purely supervised models cannot, such as classification in the presence of missing inputs and imputation of missing inputs. The general zeitgeist of deep learning research changed dramatically during the course of the work on this thesis with the introduction of Geoffrey Hinton's dropout algorithm. Dropout permits purely supervised training of feedforward architectures with little overfitting. The third paper in this thesis presents a new activation function for feedforward neural networks which was explicitly designed to work well with dropout. This activation function, called maxout, makes it possible to learn architectures that leverage the benefits of cross-channel pooling in a purely supervised manner. We demonstrate improvements on several object recognition tasks using this activation function. Finally, we solve a real-world task: transcription of photos of multi-digit house numbers for geo-coding. Using maxout units and a new kind of output layer for convolutional neural networks, we demonstrate human-level accuracy (with limited coverage) on a challenging real-world dataset. This system has been deployed at Google and successfully used to transcribe nearly 100 million house numbers.
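    The kind of output layer described above for multi-digit transcription can be sketched as a set of classifier heads sharing one convolutional feature vector: one head predicts the sequence length and one head per position predicts that digit. The sketch below (PyTorch) uses illustrative sizes and omits the maxout convolutional trunk and the training loop.

        # Minimal sketch of a multi-digit classification head for house-number transcription.
        import torch
        import torch.nn as nn

        class MultiDigitHead(nn.Module):
            def __init__(self, feat_dim=256, max_digits=5, num_classes=10):
                super().__init__()
                self.length_head = nn.Linear(feat_dim, max_digits + 1)   # lengths 0..max_digits
                self.digit_heads = nn.ModuleList(
                    [nn.Linear(feat_dim, num_classes) for _ in range(max_digits)])

            def forward(self, features):
                length_logits = self.length_head(features)
                digit_logits = [head(features) for head in self.digit_heads]
                return length_logits, digit_logits

        # Toy usage: `features` would come from a (maxout) convolutional trunk.
        features = torch.randn(8, 256)
        head = MultiDigitHead()
        length_logits, digit_logits = head(features)
        # Training would sum cross-entropy over the length head and the first
        # `length` digit heads; at test time the most probable length and digits are read out.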