    Information Theory and Machine Learning

    The recent successes of machine learning, especially of systems based on deep neural networks, have encouraged further research activity and raised a new set of challenges in understanding and designing complex machine learning algorithms. New applications require learning algorithms to be distributed, to produce transferable results, to use computational resources efficiently, to converge quickly in online settings, to have performance guarantees, to satisfy fairness or privacy constraints, and to incorporate domain knowledge on model structure. A new wave of developments in statistical learning theory and information theory has set out to address these challenges. This Special Issue, "Machine Learning and Information Theory", aims to collect recent results in this direction, reflecting a diverse spectrum of visions and efforts to extend conventional theories and to develop analysis tools for these complex machine learning systems.

    Learning reliable representations when proxy objectives fail

    Representation learning involves using an objective to learn a mapping from data space to a representation space. When the downstream task for which the mapping must be learned is unknown, or is too costly to cast as an objective, we must rely on proxy objectives for learning. In this thesis I focus on representation learning for images and address three cases in which proxy objectives fail to produce a mapping that performs well on the downstream task.

    When learning neural-network mappings from image space to a discrete hash space for fast content-based image retrieval, the proxy objective must capture the requirement that relevant responses lie nearer to the hash of any query than irrelevant ones. At the same time, image hashes must be distributed evenly across the whole hash space for efficient information use and high discrimination. Proxy objectives fail when they do not meet these requirements. I propose composing hash codes in two parts. First, a standard classifier predicts class labels that are converted to a binary representation, giving state-of-the-art performance on the image retrieval task. Second, a binary deep decision tree layer (DDTL) models further intra-class differences and produces approximately evenly distributed hash codes. The DDTL requires no discretisation during learning and produces hash codes that discriminate better between data in the same class than previous methods, while remaining robust to real-world augmentations in the data space.

    In the scenario where a neural network must partition the data into clusters that correspond well with ground-truth labels, a proxy objective is needed to define how these clusters are formed. One such proxy objective maximises the mutual information between cluster assignments made by a neural network from multiple views, where views are different augmentations of the same image and the cluster assignments are the representations computed by the network (a minimal sketch of this objective is given after this abstract). I demonstrate that this proxy objective produces sub-optimal parameters for the neural network, in that a better set of parameters can be found using the same objective and a different training method. I introduce deep hierarchical object grouping (DHOG), a method that learns a hierarchy (in the sense of easy-to-hard orderings, not structure) of solutions to the proxy objective, and show how this improves performance on the downstream task.

    When the training data contain features from which class predictions are easy to compute (e.g., background colour) alongside features from which class predictions are relatively difficult to compute (e.g., digit type), standard classification objectives (e.g., cross-entropy) fail to produce robust classifiers: a model that learns to rely on the 'easy' features will ignore the 'complex' features (easy versus complex being purely relative here). I introduce latent adversarial debiasing (LAD) to decouple the easy features from the class labels. LAD first models the underlying structure of the training data as a latent representation using a vector-quantised variational autoencoder, then uses a gradient-based procedure to adjust the features in this representation so as to confuse the predictions of a constrained classifier trained to predict class labels from the same representation. The adjusted representations of the data are then decoded to produce an augmented training dataset that can be used for training in the standard manner.

    In each of these scenarios I show that proxy objectives can fail, and demonstrate that alternative approaches can mitigate the associated failures. I suggest analysing the limits of proxy objectives for each use case, so that the data or the objectives can be adjusted to ensure good performance on downstream tasks.
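    The multi-view mutual-information proxy objective described above can be stated compactly. The following is a minimal, illustrative sketch of that objective, not code from the thesis: the function name and the NumPy formulation are assumptions, and the estimate is formed from the empirical joint distribution of cluster assignments over a batch.

```python
import numpy as np

def mutual_info_objective(p1, p2, eps=1e-12):
    """Mutual information between soft cluster assignments of two views.

    p1, p2: (batch, k) arrays of per-view cluster probabilities
    (e.g. softmax outputs of the same network on two augmentations
    of the same images). Returns an estimate of I(z1; z2) computed
    from the batch-level joint distribution over cluster pairs.
    """
    # Empirical joint over cluster pairs, symmetrised and normalised.
    joint = p1.T @ p2 / p1.shape[0]               # (k, k)
    joint = (joint + joint.T) / 2.0
    joint = joint / joint.sum()
    # Marginals of the joint.
    pi = joint.sum(axis=1, keepdims=True)         # (k, 1)
    pj = joint.sum(axis=0, keepdims=True)         # (1, k)
    return float(np.sum(joint * (np.log(joint + eps)
                                 - np.log(pi + eps)
                                 - np.log(pj + eps))))
```

    In practice this quantity would be maximised with respect to the network parameters using an automatic-differentiation framework; the NumPy version above only shows the computation being optimised.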

    Video traffic: characterization, modelling and transmission

    EThOS - Electronic Theses Online Service, United Kingdom

    Information Bottleneck

    The celebrated information bottleneck (IB) principle of Tishby et al. has recently enjoyed renewed attention due to its application in the area of deep learning. This collection investigates the IB principle in this new context. The individual chapters in this collection:
    • provide novel insights into the functional properties of the IB;
    • discuss the IB principle (and its derivatives) as an objective for training multi-layer machine learning structures such as neural networks and decision trees; and
    • offer a new perspective on neural network learning via the lens of the IB framework.
    Our collection thus contributes to a better understanding of the IB principle, specifically for deep learning and, more generally, of information-theoretic cost functions in machine learning. This paves the way toward explainable artificial intelligence.
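    For reference, the IB principle mentioned above can be written as a variational problem over stochastic encoders. The formulation below is the standard one due to Tishby et al., with the trade-off parameter written here as β:

```latex
% Information bottleneck: compress X into a representation T while
% preserving information about the relevance variable Y. The encoder
% p(t|x) is chosen to minimise the IB Lagrangian, where beta trades
% off compression I(X;T) against prediction I(T;Y):
\min_{p(t \mid x)} \; \mathcal{L}_{\mathrm{IB}}
    \;=\; I(X;T) \;-\; \beta \, I(T;Y)
```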

    Representation of speech in the primary auditory cortex and its implications for robust speech processing

    Speech has evolved as a primary form of communication between humans. This most-used means of communication has been the subject of intense study for years, but there is still much that we do not know about it. It is an oft-repeated fact that even the best speech processing algorithms still lag far behind the average human, and it seems inescapable that unless we learn more about how the brain performs this task, our machines cannot go much further. This thesis focuses on the question of speech representation in the brain, from both a physiological and a technological perspective. We explore the representation of speech through the encoding of its smallest elements - phonemic features - in the primary auditory cortex. We report on how populations of neurons with diverse tuning properties respond discriminately to phonemes, resulting in explicit encoding of their parameters. Next, we show that this sparse encoding of phonemic features is a simple consequence of the linear spectro-temporal properties of auditory cortical neurons, and that a spectro-temporal receptive field (STRF) model can predict similar patterns of activation. This is an important step toward the realization of systems that operate on the same principles as the cortex. Using an inverse method of reconstruction, we also explore the extent to which phonemic features are preserved in the cortical representation of noisy speech. The results suggest that cortical responses are robust to noise and that the important features of phonemes are preserved in the cortical representation even in noise. Finally, we explain how a model of this cortical representation can be used in speech processing and enhancement applications to improve their robustness and performance.
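    As an illustration of the linear STRF model referred to above, the sketch below predicts a neuron's response as the inner product of its receptive field with the recent stimulus spectrogram. This is an assumed, minimal formulation for illustration only: the array names, the lag convention (last STRF column is lag zero), and the output rectification are not taken from the thesis.

```python
import numpy as np

def strf_predict(strf, spectrogram):
    """Predicted neural response under a linear STRF model.

    strf:        (n_freq, n_lags) receptive field weights; the last
                 column is taken to be lag 0 (the most recent frame).
    spectrogram: (n_freq, n_time) stimulus power in matching bands.
    Returns a (n_time,) predicted rate: at each time t the response is
    the inner product of the STRF with the preceding stimulus window.
    """
    n_freq, n_lags = strf.shape
    n_time = spectrogram.shape[1]
    rate = np.zeros(n_time)
    for t in range(n_time):
        lo = max(0, t - n_lags + 1)
        window = spectrogram[:, lo:t + 1]            # recent history
        rate[t] = np.sum(strf[:, -window.shape[1]:] * window)
    # Optional rectification, since firing rates are non-negative.
    return np.maximum(rate, 0.0)
```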

    Acta Cybernetica: Volume 14, Number 2.


    Improving sampling, optimization and feature extraction in Boltzmann machines

    Despite the current wide-scale success of deep learning in training large-scale hierarchical models through supervised learning, unsupervised learning promises to play a crucial role towards solving general Artificial Intelligence, where agents are expected to learn with little to no supervision. The work presented in this thesis tackles the problem of unsupervised feature learning and density estimation, using a model family at the heart of the deep learning phenomenon: the Boltzmann Machine (BM). We present contributions in the areas of sampling, partition function estimation, optimization, and the more general topic of invariant feature learning.

    With regard to sampling, we present a novel adaptive parallel tempering method which dynamically adjusts the temperatures under simulation to maintain good mixing in the presence of complex multi-modal distributions. When used in the context of stochastic maximum likelihood (SML) training, the improved ergodicity of our sampler translates to increased robustness to learning rates and faster per-epoch convergence. Though our application is limited to BMs, our method is general and applicable to sampling from arbitrary probabilistic models using Markov Chain Monte Carlo (MCMC) techniques.

    While SML gradients can be estimated via sampling, computing data likelihoods requires an estimate of the partition function. Contrary to previous approaches, which treat the model as a black box, we provide an efficient algorithm which instead tracks the change in the log partition function incurred by successive parameter updates. Our algorithm frames this estimation problem as one of filtering performed over a two-dimensional lattice, with one dimension representing time and the other temperature.

    On the topic of optimization, the thesis presents a novel algorithm for applying the natural gradient to large-scale Boltzmann Machines. Until now, its application had been constrained by the computational and memory requirements of the Fisher Information Matrix (FIM), which is square in the number of parameters. The Metric-Free Natural Gradient (MFNG) algorithm avoids computing the FIM altogether by combining a linear solver with an efficient matrix-vector operation (a sketch of this matrix-free idea follows the abstract). The method shows promise in that the resulting updates yield faster per-epoch convergence, despite being slower in wall-clock time.

    Finally, we explore how invariant features can be learnt through modifications to the BM energy function. We study the problem in the context of the spike & slab Restricted Boltzmann Machine (ssRBM), which we extend to handle both binary and sparse input distributions. By associating each spike with several slab variables, latent variables can be made invariant to a rich, high-dimensional subspace, resulting in increased invariance in the learnt representation. When using the expected model posterior as input to a classifier, this increased invariance translates to improved classification accuracy in the low-label data regime.

    We conclude by showing a connection between invariance and the more powerful concept of disentangling factors of variation. While invariance can be achieved by pooling over subspaces, disentangling can be achieved by learning multiple complementary views of the same subspace. In particular, we show how this can be achieved using third-order BMs featuring multiplicative interactions between pairs of random variables.
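    The following is an illustrative reconstruction of the matrix-free idea behind MFNG, not the thesis implementation: it assumes an empirical Fisher built from per-sample score vectors and solves the natural-gradient linear system with conjugate gradient, forming only Fisher-vector products and never the Fisher matrix itself.

```python
import numpy as np

def natural_gradient_step(scores, grad, n_iters=50, damping=1e-4):
    """Approximate natural-gradient direction via conjugate gradient.

    scores: (n_samples, n_params) per-sample score vectors, so that the
            empirical Fisher is F = scores.T @ scores / n_samples.
    grad:   (n_params,) plain gradient.
    Solves (F + damping * I) x = grad without ever forming F explicitly.
    """
    n = scores.shape[0]

    def fisher_vec(v):
        # Matrix-free Fisher-vector product: two thin matmuls, O(n * p)
        # time and O(n + p) extra memory instead of O(p^2) for F itself.
        return scores.T @ (scores @ v) / n + damping * v

    # Standard conjugate gradient on the symmetric PSD Fisher system.
    x = np.zeros_like(grad)
    r = grad - fisher_vec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(n_iters):
        Fp = fisher_vec(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if np.sqrt(rs_new) < 1e-8:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x  # approximately F^{-1} @ grad
```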

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments, which means that images may be affected by different sources of perturbation (e.g., sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification, and this challenge has attracted wide interest within the computer vision community. We propose a transformation step that aims to enhance the generalization ability of CNN models in the presence of noise unseen during training. Concretely, the delineation maps of given images are computed using the CORF push-pull inhibition operator; this operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results comparable to those of a conventional AlexNet classifier on noise-free images, while consistently achieving significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
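    A minimal sketch of the shape of this pipeline is given below. It is an assumed illustration only: `corf_delineation` is a hypothetical placeholder for an implementation of the CORF push-pull inhibition operator (the operator itself is not reproduced here), and the normalisation and noise-perturbation helpers are illustrative, not taken from the paper.

```python
import numpy as np

def corf_preprocess(image, corf_delineation):
    """Map an input image to its delineation map before the CNN.

    `corf_delineation` stands in for a CORF push-pull operator: it
    should take a 2-D grayscale array and return a contour-strength
    map of the same shape.
    """
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    contours = corf_delineation(gray)
    # Normalise so the downstream classifier sees a consistent range.
    span = contours.max() - contours.min()
    return (contours - contours.min()) / (span + 1e-8)

def perturb_gaussian(image, sigma, rng=None):
    """Gaussian test-time perturbation as in the evaluation protocol:
    train on clean delineation maps, test on maps of noisy images."""
    rng = np.random.default_rng(0) if rng is None else rng
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```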