404 research outputs found

    Discovery of Visual Semantics by Unsupervised and Self-Supervised Representation Learning

    The success of deep learning in computer vision is rooted in the ability of deep networks to scale up model complexity as demanded by challenging visual tasks. As complexity is increased, so is the need for large amounts of labeled data to train the model. This is associated with a costly human annotation effort. To address this concern, with the long-term goal of leveraging the abundance of cheap unlabeled data, we explore methods of unsupervised "pre-training." In particular, we propose to use self-supervised automatic image colorization. We show that traditional methods for unsupervised learning, such as layer-wise clustering or autoencoders, remain inferior to supervised pre-training. In search of an alternative, we develop a fully automatic image colorization method. Our method sets a new state of the art in revitalizing old black-and-white photography, without requiring human effort or expertise. Additionally, it gives us a method for self-supervised representation learning. In order for the model to appropriately re-color a grayscale object, it must first be able to identify it. This ability, learned entirely self-supervised, can be used to improve other visual tasks, such as classification and semantic segmentation. As a future direction for self-supervision, we investigate whether multiple proxy tasks can be combined to improve generalization. This turns out to be a challenging open problem. We hope that our contributions to this endeavor will provide a foundation for future efforts in making self-supervision compete with supervised pre-training. Comment: Ph.D. thesis
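    The self-supervised setup described in this abstract can be illustrated with a minimal sketch: the grayscale version of an image serves as the input and the original colors serve as the training target, so no human labels are needed. The function names and the Rec. 601 luma weights below are assumptions chosen for illustration; the thesis itself may use a different color space and target encoding.

```python
# Minimal sketch (assumed helper names): build a self-supervised
# (input, target) pair for colorization from an unlabeled RGB image.

def to_gray(rgb):
    """Convert one (r, g, b) pixel to a luminance value (Rec. 601 weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def make_colorization_pair(image):
    """Return (gray_input, color_target) for an image given as rows of RGB tuples.

    The target is the image itself, so the supervision signal comes for free.
    """
    gray = [[to_gray(px) for px in row] for row in image]
    return gray, image


if __name__ == "__main__":
    img = [[(255, 0, 0), (0, 0, 255)]]
    gray_input, color_target = make_colorization_pair(img)
    print(gray_input, color_target)
```

    A model trained to map the first element of the pair to the second has to recognize what it is looking at in order to pick plausible colors, which is the intuition the abstract gives for why the learned features transfer.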

    Online Multi-Stage Deep Architectures for Feature Extraction and Object Recognition

    Multi-stage visual architectures have recently found success in achieving high classification accuracies over image datasets with large variations in pose, lighting, and scale. Inspired by techniques currently at the forefront of deep learning, such architectures are typically composed of one or more layers of preprocessing, feature encoding, and pooling to extract features from raw images. Training these components traditionally relies on large sets of patches that are extracted from a potentially large image dataset. In this context, high-dimensional feature space representations are often helpful for obtaining the best classification performances and providing a higher degree of invariance to object transformations. Large datasets with high-dimensional features complicate the implementation of visual architectures in memory constrained environments. This dissertation constructs online learning replacements for the components within a multi-stage architecture and demonstrates that the proposed replacements (namely fuzzy competitive clustering, an incremental covariance estimator, and multi-layer neural network) can offer performance competitive with their offline batch counterparts while providing a reduced memory footprint. The online nature of this solution allows for the development of a method for adjusting parameters within the architecture via stochastic gradient descent. Testing over multiple datasets shows the potential benefits of this methodology when appropriate priors on the initial parameters are unknown. Alternatives to batch based decompositions for a whitening preprocessing stage which take advantage of natural image statistics and allow simple dictionary learners to work well in the problem domain are also explored. Expansions of the architecture using additional pooling statistics and multiple layers are presented and indicate that larger codebook sizes are not the only step forward to higher classification accuracies. 
    Experimental results from these expansions further indicate the important role of sparsity and appropriate encodings within multi-stage visual feature extraction architectures.
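    The abstract above mentions an incremental covariance estimator as one of the online replacements for batch components. The dissertation's exact estimator is not given here, so the following is only a sketch of one common approach, a Welford-style update that processes one sample at a time and never stores the dataset. The class name `OnlineCovariance` is hypothetical.

```python
# Sketch of a Welford-style incremental covariance estimator.
# This is an assumed stand-in, not the dissertation's own algorithm.

class OnlineCovariance:
    def __init__(self, dim):
        self.dim = dim
        self.n = 0
        self.mean = [0.0] * dim
        # Running sum of outer products of deviations.
        self.m2 = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        """Fold one sample (a sequence of length dim) into the estimate."""
        self.n += 1
        # Deviation from the mean before this sample is included.
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + di / self.n for mi, di in zip(self.mean, delta_old)]
        # Deviation from the mean after this sample is included.
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]
        for i in range(self.dim):
            for j in range(self.dim):
                self.m2[i][j] += delta_old[i] * delta_new[j]

    def covariance(self):
        """Return the unbiased sample covariance matrix."""
        if self.n < 2:
            raise ValueError("need at least two samples")
        return [[v / (self.n - 1) for v in row] for row in self.m2]


if __name__ == "__main__":
    est = OnlineCovariance(2)
    for sample in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]:
        est.update(sample)
    print(est.mean, est.covariance())
```

    Memory use depends only on the feature dimension, not on the number of patches seen, which is the property that matters in the memory-constrained setting the abstract describes.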

    Learning feature hierarchies for musical audio signals


    Feature regularization and learning for human activity recognition.

    Doctoral Degree. University of KwaZulu-Natal, Durban. Feature extraction is an essential component in the design of a human activity recognition model. However, relying on extracted features alone for learning often makes the model suboptimal. This research therefore addresses that problem by investigating feature regularization, which is used to encapsulate the discriminative patterns needed for better and more efficient model learning. Firstly, a within-class subspace regularization approach is proposed for eigenfeature extraction and regularization in human activity recognition. In this approach, the within-class subspace is modelled using more eigenvalues from the reliable subspace to obtain a four-parameter modelling scheme. This model enables a more accurate estimation of the eigenvalues that are distorted by the small sample size effect. The regularization is done in one piece, thereby avoiding the undue complexity of modelling the eigenspectrum differently. The whole eigenspace is used for performance evaluation because feature extraction and dimensionality reduction are done at a later stage of the evaluation process. Results show that the proposed approach has better discriminative capacity than several other subspace approaches for human activity recognition. Secondly, with the use of likelihood prior probability, a new regularization scheme that improves the loss function of a deep convolutional neural network is proposed. The results obtained from this work demonstrate that a well-regularized feature yields better class discrimination in human activity recognition. The major contribution of the thesis is the development of feature extraction strategies for determining the discriminative patterns needed for efficient model learning.

    Masked Conditional Neural Networks for Sound Recognition

    Sound recognition has been studied for decades to grant machines the human hearing ability. The advances in this field help in a range of applications, from industrial ones such as fault detection in machines and noise monitoring to household applications such as surveillance and hearing aids. The problem of sound recognition like any pattern recognition task involves the reliability of the extracted features and the recognition model. The problem has been approached through decades of crafted features used collaboratively with models based on neural networks or statistical models such as Gaussian Mixtures and Hidden Markov models. Neural networks are currently being considered as a method to automate the feature extraction stage together with the already incorporated role of recognition. The performance of such models is approaching handcrafted features. Current neural network based models are not primarily designed for the nature of the sound signal, which may not optimally harness distinctive properties of the signal. This thesis proposes neural network models that exploit the nature of the time-frequency representation of the sound signal. We propose the ConditionaL Neural Network (CLNN) and the Masked ConditionaL Neural Network (MCLNN). The CLNN is designed to account for the temporal dimension of a signal and behaves as the framework for the MCLNN. The MCLNN allows a filterbank-like behaviour to be embedded within the network using a specially designed binary mask. The masking subdivides the frequency range of a signal into bands and allows concurrent consideration of different feature combinations analogous to the manual handcrafting of the optimum set of features for a recognition task. The proposed models have been evaluated through an extensive set of experiments using a range of publicly available datasets of music genres and environmental sounds, where they surpass state-of-the-art Convolutional Neural Networks and several hand-crafted attempts

    Deep learning of representations and its application to computer vision

    The goal of this thesis is to present a few small steps along the road to solving general artificial intelligence. This is a thesis by articles containing four articles. Each of these articles presents a new method for performing perceptual inference using machine learning and deep architectures. Each of these papers demonstrates the utility of the proposed method in the context of a computer vision task. The methods are more generally applicable and in some cases have been applied to other kinds of tasks, but this thesis does not explore such applications. In the first article, we present two fast new variational inference algorithms for a generative model of images known as spike-and-slab sparse coding (S3C). These faster inference algorithms allow us to scale spike-and-slab sparse coding to unprecedented problem sizes and show that it is a superior feature extractor for object recognition tasks when very few labeled examples are available. We then build a new deep architecture, the partially-directed deep Boltzmann machine (PD-DBM), on top of the S3C model. This model was designed to simplify the training procedure for deep Boltzmann machines, which previously required a greedy layer-wise pretraining procedure. This model partially succeeds at solving this problem, but the cost of inference in the new model is high enough that it makes scaling the model to serious applications difficult. In the second article, we revisit the problem of jointly training deep Boltzmann machines. This time, rather than changing the model family, we present a new training criterion, resulting in multi-prediction deep Boltzmann machines (MP-DBMs). MP-DBMs may be trained in a single stage and obtain better classification accuracy than traditional DBMs. They are also able to classify well using standard variational inference techniques, rather than requiring a separate, specialized, discriminatively trained classifier to obtain good classification performance. However, this comes at the cost of the model not being able to generate good samples. The classification performance of deep Boltzmann machines is no longer especially interesting following recent advances in supervised learning, but the MP-DBM remains interesting because it can perform tasks that purely supervised models cannot, such as classification in the presence of missing inputs and imputation of missing inputs. The general zeitgeist of deep learning research changed dramatically in the midst of the work on this thesis with the introduction of Geoffrey Hinton's dropout algorithm. Dropout permits purely supervised training of feedforward architectures with little overfitting. The third paper in this thesis presents a new activation function for feedforward neural networks which was explicitly designed to work well with dropout. This activation function, called maxout, makes it possible to learn architectures that leverage the benefits of cross-channel pooling in a purely supervised manner. We demonstrate improvements on several object recognition tasks using this activation function. Finally, we solve a real-world task: transcription of photos of multi-digit house numbers for geo-coding. Using maxout units and a new kind of output layer for convolutional neural networks, we demonstrate human-level accuracy (with limited coverage) on a challenging real-world dataset. This system has been deployed at Google and successfully used to transcribe nearly 100 million house numbers.
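    The maxout activation mentioned in this abstract takes the maximum over several learned linear functions of the input. The sketch below shows that idea for a single unit and a small layer in plain Python; the function names and the list-based parameter layout are assumptions for illustration, and a practical implementation would use a tensor library and be trained with dropout.

```python
# Sketch of a maxout unit: h(x) = max_i (w_i . x + b_i) over k linear pieces.
# Names and data layout are illustrative assumptions.

def maxout_unit(x, weights, biases):
    """Evaluate one maxout unit.

    x: input vector; weights: k weight vectors; biases: k offsets.
    """
    pieces = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    return max(pieces)


def maxout_layer(x, layer):
    """Evaluate a layer given as a list of (weights, biases) pairs, one per unit."""
    return [maxout_unit(x, w, b) for w, b in layer]


if __name__ == "__main__":
    # Two pieces with weights +1 and -1 on the first input give |x[0]|.
    abs_unit = ([[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0])
    # Two pieces, one identity and one constant zero, give a ReLU on x[0].
    relu_unit = ([[1.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
    print(maxout_layer([-3.0, 5.0], [abs_unit, relu_unit]))
```

    With suitable weights a single unit can reproduce the absolute value or a rectifier, as in the usage example, which is why maxout is often described as a learned piecewise-linear activation.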

    Learning Actions That Reduce Variation in Objects

    The variation in the data that a robot in the real world receives from its sensory inputs (i.e. its sensory data) will come from many sources. Much of this variation is the result of ground truths about the world, such as what class an object belongs to, its shape, its condition, and so on. Robots would like to infer this information so they can use it to reason. A considerable amount of additional variation in the data, however, arises as a result of the robot’s relative configuration compared to an object; that is, its relative position, orientation, focal depth, etc. Fortunately, a robot has direct control over this configural variation: it can perform actions such as tilting its head or shifting its gaze. The task of inferring ground truth from data is difficult, and is made much more difficult when data is affected by configural variation. This thesis explores an approach in which the robot learns to perform actions that minimize the amount of configural variation in its sensory data, making the task of inferring information about objects considerably easier. The value of this approach is demonstrated by classifying digits from the MNIST and USPS datasets that have been transformed in various ways so that they include various kinds of configural variation
