9 research outputs found

    Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation

    This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC) -- a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data. We also extend LHUC to a speaker adaptive training (SAT) framework that leads to a more adaptable DNN acoustic model, working both in a speaker-dependent and a speaker-independent manner, without the need to maintain auxiliary speaker-dependent feature extractors or to introduce significant speaker-dependent changes to the DNN structure. Through a series of experiments on four different speech recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora4) comprising 270 test speakers, we show that LHUC in both its test-only and SAT variants results in consistent word error rate reductions ranging from 5% to 23% relative, depending on the task and the degree of mismatch between training and test data. In addition, we have investigated the effect of the amount of adaptation data per speaker, the quality of unsupervised adaptation targets, the complementarity to other adaptation techniques, one-shot adaptation, and an extension to adapting DNNs trained in a sequence discriminative manner. (Comment: 14 pages, 9 Tables, 11 Figures; in IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 24, Num. 8, 201)
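The adaptation rule itself is compact enough to sketch. Below is a minimal numpy illustration of test-only LHUC, assuming the commonly used re-parametrisation of the speaker-dependent amplitude as 2·sigmoid(r); the function and variable names are mine, not from the paper:

```python
import numpy as np

def lhuc_amplitude(r):
    # Speaker-dependent amplitude, re-parametrised as 2*sigmoid(r) so it
    # stays in (0, 2); r = 0 gives amplitude 1, i.e. the unadapted model.
    return 2.0 / (1.0 + np.exp(-r))

def adapt_hidden_layer(h, r):
    # LHUC linearly re-scales each hidden unit's activation with its own
    # learned amplitude; only the vector r is estimated per speaker.
    return lhuc_amplitude(r) * h
```

In the test-only variant, r would be estimated by back-propagating through the otherwise frozen network using unsupervised first-pass transcripts; the SAT variant additionally learns per-speaker amplitudes during training.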

    Learning representations for speech recognition using artificial neural networks

    Learning representations is a central challenge in machine learning. For speech recognition, we are interested in learning robust representations that are stable across different acoustic environments, recording equipment, and irrelevant inter- and intra-speaker variabilities. This thesis is concerned with representation learning for acoustic model adaptation to speakers and environments, construction of acoustic models in low-resource settings, and learning representations from multiple acoustic channels. The investigations are primarily focused on the hybrid approach to acoustic modelling based on hidden Markov models and artificial neural networks (ANNs). The first contribution concerns acoustic model adaptation. This comprises two new adaptation transforms operating in the ANN parameter space. Both operate at the level of activation functions and treat a trained ANN acoustic model as a canonical set of fixed basis functions, from which one can later derive variants tailored to the specific distribution present in adaptation data. The first technique, termed Learning Hidden Unit Contributions (LHUC), depends on learning distribution-dependent linear combination coefficients for hidden units. This technique is then extended to altering groups of hidden units with parametric and differentiable pooling operators. We find that the proposed adaptation techniques possess many desirable properties: they are relatively low-dimensional, do not overfit, and can work in both a supervised and an unsupervised manner. For LHUC we also present extensions to speaker adaptive training and environment factorisation. On average, depending on the characteristics of the test set, 5-25% relative word error rate (WERR) reductions are obtained in an unsupervised two-pass adaptation setting. The second contribution concerns building acoustic models in low-resource data scenarios.
In particular, we are concerned with insufficient amounts of transcribed acoustic material for estimating acoustic models in the target language, assuming that resources such as lexicons or texts for estimating language models are available. First, we propose an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states. We also propose to perform multi-task adaptation with a structured output layer. We obtain consistent WERR reductions of up to 6.4% in low-resource speaker-independent acoustic modelling. Adapting those models in a multi-task manner with LHUC yields an additional 13.6% reduction, compared to 12.7% for non-multi-task LHUC. We then demonstrate that one can build better acoustic models with unsupervised multi- and cross-lingual initialisation, and find that pre-training is largely language-independent. Up to 14.4% WERR reductions are observed, depending on the amount of transcribed acoustic data available in the target language. The third contribution concerns building acoustic models from multi-channel acoustic data. For this purpose we investigate various ways of integrating and learning multi-channel representations. In particular, we investigate channel concatenation and the applicability of convolutional layers for this purpose. We propose a multi-channel convolutional layer with cross-channel pooling, which can be seen as a data-driven non-parametric auditory attention mechanism. We find that for unconstrained microphone arrays, our approach is able to match the performance of comparable models trained on beamform-enhanced signals.
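The multi-channel convolutional layer with cross-channel pooling admits a compact sketch. The numpy toy below is my own simplification (1-D signals, a shared filter bank, max-pooling across channels), not the thesis code:

```python
import numpy as np

def conv1d_valid(x, w):
    # valid-mode 1-D correlation of signal x with filter w
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def multichannel_conv_maxpool(channels, filters):
    # Shared filters are applied to every microphone channel; taking the
    # max across channels acts as a data-driven attention mechanism that
    # picks, per filter and time step, the strongest-responding channel.
    out = []
    for w in filters:
        responses = np.stack([conv1d_valid(x, w) for x in channels])
        out.append(responses.max(axis=0))
    return np.stack(out)
```

Because the pooling is over channels rather than time, the layer needs no microphone geometry, which fits the unconstrained-array setting described above.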

    A Study on Deep Learning: Training, Models and Applications

    In the past few years, deep learning has become an important research field that has attracted substantial research interest. Owing to the development of computational hardware such as high-performance GPUs, training deep models, such as fully-connected deep neural networks (DNNs) and convolutional neural networks (CNNs), from scratch has become practical, and using well-trained deep models to deal with real-world large-scale problems has also become possible. This dissertation mainly focuses on three important problems in deep learning, i.e., training algorithms, computational models and applications, and provides several methods to improve the performance of different deep learning methods. The first method is a DNN training algorithm called Annealed Gradient Descent (AGD). This dissertation presents a theoretical analysis of the convergence properties and learning speed of AGD to show its benefits. Experimental results have shown that AGD yields performance comparable to SGD but can significantly expedite the training of DNNs on big data sets. Secondly, this dissertation proposes to apply a novel model, namely Hybrid Orthogonal Projection and Estimation (HOPE), to CNNs. HOPE can be viewed as a hybrid model that combines feature extraction with mixture models. The experimental results have shown that HOPE layers can significantly improve the performance of CNNs in image classification tasks. The third proposed method applies CNNs to image saliency detection. In this approach, a gradient descent method is used to iteratively modify the input images based on pixel-wise gradients to reduce a pre-defined cost function. Moreover, SLIC superpixels and low-level saliency features are applied to smooth and refine the saliency maps. Experimental results have shown that the proposed methods can generate high-quality saliency maps. The last method is also for image saliency detection. However, this method is based on the Generative Adversarial Network (GAN).
Different from a GAN, the proposed method uses fully supervised learning to learn the G-Network and D-Network; it is therefore called a Supervised Adversarial Network (SAN). Moreover, SAN introduces a different G-Network and conv-comparison layers to further improve saliency performance. Experimental results show that the SAN model can also generate state-of-the-art saliency maps for complicated images.
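The gradient-based saliency idea in the third method can be sketched in a few lines. In the toy below, a fixed linear unit stands in for the trained CNN (the thesis uses a real network); the input is moved along the pixel-wise gradient of the cost, and saliency is read off the accumulated change:

```python
import numpy as np

def saliency_by_input_descent(x0, cost_grad, lr=0.1, steps=100):
    # Iteratively modify the input to reduce a pre-defined cost; the
    # pixels that must change most to alter the cost are deemed salient.
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * cost_grad(x)
    return np.abs(x - x0)

# stand-in cost: 0.5 * (w . x)^2 for a fixed linear unit w
w = np.array([1.0, 0.0, 0.5, 0.0])
grad = lambda x: w * (w @ x)
saliency = saliency_by_input_descent(np.ones(4), grad)
```

Pixels with zero gradient never move and get zero saliency; in the full method, SLIC superpixels and low-level features would then smooth such raw per-pixel maps.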

    Advances in deep learning methods for speech recognition and understanding

    This work presents several studies in the areas of speech recognition and understanding. Semantic speech understanding is an important sub-domain of the broader field of artificial intelligence.
Speech processing has long been of interest to researchers, because language is one of the defining characteristics of a human being. With the development of neural networks, the domain has seen rapid progress both in terms of accuracy and human perception. Another important milestone was achieved with the development of end-to-end approaches. Such approaches allow co-adaptation of all the parts of the model, thus increasing performance as well as simplifying the training procedure. End-to-end models became feasible with the increasing amount of available data, computational resources, and, most importantly, with many novel architectural developments. Nevertheless, traditional, non-end-to-end approaches are still relevant for speech processing due to challenging data in noisy environments, accented speech, and a high variety of dialects. In the first work, we explore hybrid speech recognition in noisy environments. We propose to treat recognition in unseen noise conditions as a domain adaptation task. For this, we use the then-novel technique of adversarial domain adaptation. In a nutshell, this prior work proposed to train features in such a way that they are discriminative for the primary task, but non-discriminative for the secondary task, which is constructed to be the domain recognition task. Thus, the trained features are invariant towards the domain at hand. In our work, we adopt this technique and modify it for the task of noisy speech recognition. In the second work, we develop a general method for regularizing generative recurrent networks. It is known that recurrent networks frequently have difficulties staying on the same track when generating long outputs. While it is possible to use bi-directional networks for better sequence aggregation in feature learning, this is not applicable to the generative case.
We develop a way to improve the consistency of generating long sequences with recurrent networks. We propose a way to construct a model similar to a bi-directional network. The key insight is to use a soft L2 loss between the forward and the backward generative recurrent networks. We provide experimental evaluation on a multitude of tasks and datasets, including speech recognition, image captioning, and language modeling. In the third paper, we investigate the possibility of developing an end-to-end intent recognizer for spoken language understanding. Semantic spoken language understanding is an important step towards developing a human-like artificial intelligence. We have seen that end-to-end approaches show high performance on tasks including machine translation and speech recognition. We draw inspiration from prior work to develop an end-to-end system for intent recognition.
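The forward-backward regulariser can be made concrete. In the simplified numpy version below, the penalty is the L2 distance between the forward network's state at each position and the backward network's state for the same position; published variants of this idea typically pass the forward state through a learned affine map first, and all names here are mine:

```python
import numpy as np

def rnn_states(x_seq, W, U, h0):
    # plain tanh RNN, returning the hidden state at every time step
    hs, h = [], h0
    for x in x_seq:
        h = np.tanh(W @ h + U @ x)
        hs.append(h)
    return np.stack(hs)

def twin_l2_loss(forward_states, backward_states):
    # Align the backward pass (run over the reversed sequence) with the
    # forward pass and penalise their mean squared state difference.
    matched = backward_states[::-1]
    return np.mean(np.sum((forward_states - matched) ** 2, axis=-1))
```

At generation time only the forward network is used, so the extra cost is paid purely during training.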

    Regularization and Compression of Deep Neural Networks

    Deep neural networks (DNNs) are state-of-the-art machine learning models, outperforming traditional machine learning methods in a number of domains, from vision and speech to natural language understanding and autonomous control. With large amounts of data becoming available, the task performance of DNNs in these domains predictably scales with the size of the DNNs. However, in data-scarce scenarios, large DNNs overfit to the training dataset, resulting in inferior performance. Additionally, in scenarios where enormous amounts of data are available, large DNNs incur large inference latencies and memory costs. Thus, while imperative for achieving state-of-the-art performance, large DNNs require large amounts of data for training and large computational resources during inference. These two problems can be mitigated by sparsely training large DNNs. Imposing sparsity constraints during training limits the capacity of the model to overfit to the training set while still obtaining good generalization. Sparse DNNs have most of their weights close to zero after training, so most of the weights can be removed, resulting in smaller inference costs. To effectively train sparse DNNs, this thesis proposes two new sparse stochastic regularization techniques called Bridgeout and Sparseout. Furthermore, Bridgeout is used to prune convolutional neural networks for low-cost inference. Bridgeout randomly perturbs the weights of a parametric model such as a DNN. It is theoretically shown that Bridgeout constrains the weights of linear models to a sparse subspace. Empirically, Bridgeout has been shown to perform better than state-of-the-art DNNs on image classification tasks when data is limited. Sparseout is an activations counterpart of Bridgeout, operating on the outputs of the neurons instead of their weights. Theoretically, Sparseout has been shown to be a general case of the commonly used Dropout regularization method.
Empirical evidence suggests that Sparseout is capable of controlling the level of activation sparsity in neural networks. This flexibility allows Sparseout to perform better than Dropout on image classification and language modelling tasks. Furthermore, using Sparseout, it is found that activation sparsity is beneficial to recurrent neural networks for language modelling, whereas densification of activations favours convolutional neural networks for image classification. To address the problem of high computational cost during inference, this thesis evaluates Bridgeout for pruning convolutional neural networks (CNNs). It is shown that recent CNN architectures such as VGG, ResNet and Wide-ResNet trained with Bridgeout are more robust to one-shot filter pruning than those trained with non-sparse stochastic regularization.
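One-shot filter pruning, the evaluation setting used above, is easy to make concrete. The sketch below uses a generic magnitude criterion (the thesis's exact ranking rule may differ): keep the filters with the largest L1 norms and drop the rest in a single step, with no retraining in between:

```python
import numpy as np

def one_shot_filter_prune(filters, keep_ratio=0.5):
    # Rank filters by L1 norm and keep the top fraction in one shot.
    # Sparsity-inducing training concentrates weight mass in few filters,
    # so the pruned network tends to lose less accuracy.
    norms = np.abs(filters.reshape(len(filters), -1)).sum(axis=1)
    k = max(1, int(round(len(filters) * keep_ratio)))
    keep = np.sort(np.argsort(norms)[-k:])  # preserve original order
    return filters[keep]
```

A network trained with a sparsity-inducing regulariser such as Bridgeout would, by the argument above, be more robust to this kind of one-shot removal than one trained with plain Dropout.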

    Hydrocarbon quantification using neural networks and deep learning based hyperspectral unmixing

    Hydrocarbon (HC) spills are a global issue which can seriously impact human life and the environment, so early identification and remedial measures are important. Current research efforts therefore aim at remotely quantifying incipient quantities of HC mixed with soils. The increased spectral and spatial resolution of hyperspectral sensors has opened ground-breaking perspectives in many industries, including remote inspection of large areas and the environment. The use of subpixel detection algorithms, and in particular of mixture models, has been identified as a future advance that needs to be incorporated in remote sensing. However, there are some challenging tasks, since the spectral signatures of the targets of interest may not be immediately available. Moreover, real-time processing and analysis is required to support fast decision-making. Progressing in this direction, this thesis pioneers and researches novel methodologies for HC quantification capable of exceeding the limitations of existing systems in terms of reduced cost and processing time with improved accuracy. The goal of this research is to develop, implement and test different methods for improving HC detection and quantification using spectral unmixing and machine learning. An efficient hybrid switch method employing neural networks and hyperspectral unmixing is proposed and investigated. This robust method switches between state-of-the-art linear and nonlinear hyperspectral unmixing models. This procedure is well suited for the quantification of small quantities of substances within a pixel with high accuracy, as the most appropriate model is employed. Central to the proposed approach is a novel method for extracting parameters to characterise the non-linearity of the data. These parameters are fed into a feedforward neural network which decides, pixel by pixel, which model is more suitable.
The quantification process is fully automated by applying further classification techniques to the acquired hyperspectral images. A deep learning neural network model is designed for the quantification of HC quantities mixed with soils. A three-term backpropagation algorithm with dropout is proposed to avoid overfitting and reduce the computational complexity of the model. The above methods have been evaluated using classical repository datasets from the literature and a laboratory-controlled dataset. For the latter, an experimental procedure has been designed to produce a labelled dataset: different soil types were mixed and homogenized with HC substances, and the reflectance was measured with a hyperspectral sensor. Findings from the research study reveal that the two proposed models have high performance, are suitable for the detection and quantification of HC mixed with soils, and surpass existing methods. Improvements in sensitivity, accuracy and computational time are achieved. Thus, the proposed approaches can be used to detect HC spills at an early stage in order to mitigate significant pollution from the spill areas.
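The linear branch of the switch can be illustrated with a small abundance-estimation sketch. Under the linear mixing model a pixel spectrum is a convex combination of endmember spectra; the numpy code below is a generic projected-gradient solver (not the thesis's implementation) that recovers non-negative, sum-to-one abundances:

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto {a : a >= 0, sum(a) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def unmix_linear(pixel, endmembers, steps=500, lr=0.5):
    # Projected gradient descent for pixel ~= endmembers @ a, with the
    # abundance vector a constrained to the probability simplex.
    # lr must stay below 2 / (largest eigenvalue of endmembers.T @ endmembers).
    a = np.full(endmembers.shape[1], 1.0 / endmembers.shape[1])
    for _ in range(steps):
        grad = endmembers.T @ (endmembers @ a - pixel)
        a = project_simplex(a - lr * grad)
    return a
```

The nonlinear branch replaces the mixing model (for example with bilinear interaction terms), and the switch network decides per pixel which model fits better.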

    Manifold Learning Approaches to Compressing Latent Spaces of Unsupervised Feature Hierarchies

    Field robots encounter dynamic unstructured environments containing a vast array of unique objects. In order to make sense of the world in which they are placed, they collect large quantities of unlabelled data with a variety of sensors. Producing robust and reliable applications depends entirely on the ability of the robot to understand the unlabelled data it obtains. Deep Learning techniques have had a high level of success in learning powerful unsupervised representations for a variety of discriminative and generative models. Applying these techniques to problems encountered in field robotics remains a challenging endeavour. Modern Deep Learning methods are typically trained with a substantial labelled dataset, while datasets produced in a field robotics context contain limited labelled training data. The primary motivation for this thesis stems from the problem of applying large scale Deep Learning models to field robotics datasets that are label poor. While the lack of labelled ground truth data drives the desire for unsupervised methods, the need for improving the model scaling is driven by two factors: performance and computational requirements. When utilising unsupervised layer outputs as representations for classification, the classification performance increases with layer size. Scaling up models with multiple large layers of features is problematic, as the size of each subsequent hidden layer scales with the size of the previous layer. This quadratic scaling, and the associated time required to train such networks, has prevented adoption of large Deep Learning models beyond cluster computing. The contributions in this thesis are developed from the observation that parameters or filter elements learnt in Deep Learning systems are typically highly structured, and contain related elements. Firstly, the structure of unsupervised filters is utilised to construct a mapping from the high dimensional filter space to a low dimensional manifold.
This creates a significantly smaller representation for subsequent feature learning. This mapping, and its effect on the resulting encodings, highlights the need for the ability to learn highly overcomplete sets of convolutional features. Driven by this need, the unsupervised pretraining of Deep Convolutional Networks is developed to include a number of modern training and regularisation methods. These pretrained models are then used to provide initialisations for supervised convolutional models trained on low quantities of labelled data. By utilising pretraining, a significant increase in classification performance on a number of publicly available datasets is achieved. In order to apply these techniques to outdoor 3D Laser Illuminated Detection And Ranging data, we develop a set of resampling techniques to provide uniform input to Deep Learning models. The features learnt in these systems outperform the high-effort hand-engineered features developed specifically for 3D data. The representation of a given signal is then reinterpreted as a combination of modes that exist on the learnt low dimensional filter manifold. From this, we develop an encoding technique that allows the high dimensional layer output to be represented as a combination of low dimensional components. This allows the growth of subsequent layers to depend only on the intrinsic dimensionality of the filter manifold and not on the number of elements contained in the previous layer. Finally, the resulting unsupervised convolutional model, the encoding frameworks and the embedding methodology are used to produce a new unsupervised learning strategy that is able to encode images in terms of overcomplete filter spaces, without producing an explosion in the size of the intermediate parameter spaces.
This model produces classification results on par with state-of-the-art models, yet requires significantly less computational resources and is suitable for use in the constrained computation environment of a field robot.

    Continual deep learning via progressive learning

    Machine learning is one of several approaches to artificial intelligence. It allows us to build machines that can learn from experience as opposed to being explicitly programmed. Current machine learning formulations are mostly designed for learning and performing a particular task from a tabula rasa using data available for that task. For machine learning to converge to artificial intelligence, in addition to other desiderata, it must be in a state of continual learning, i.e., have the ability to be in a continuous learning process, such that when a new task is presented, the system can leverage prior knowledge from prior tasks in learning and performing this new task, and augment the prior knowledge with the newly acquired knowledge without a significant adverse effect on the prior knowledge. Continual learning is key to advancing machine learning and artificial intelligence. Deep learning is a powerful general-purpose approach to machine learning that is able to solve numerous and various tasks with minimal modification. Deep learning extends machine learning, and especially neural networks, to learn multiple levels of distributed representations together with the required mapping function into a single composite function. The emergence of deep learning and neural networks as a generic approach to machine learning, coupled with their ability to learn versatile hierarchical representations, has paved the way for continual learning. The main aim of this thesis is the study and development of a structured approach to continual learning, leveraging the success of deep learning and neural networks. This thesis studies the application of deep learning to a number of supervised learning tasks, and in particular, classification tasks in machine perception, e.g., image recognition, automatic speech recognition, and speech emotion recognition.
The relation between the systems developed for these tasks is investigated to illuminate the layer-wise relevance of features in deep networks trained for these tasks via transfer learning, and these independent systems are unified into continual learning systems. The main contribution of this thesis is the construction and formulation of a deep learning framework, denoted progressive learning, that allows a holistic and systematic approach to continual learning. Progressive learning comprises a number of procedures that address the continual learning desiderata. It is shown that, when tasks are related, progressive learning leads to faster learning that converges to better generalization performance using smaller amounts of data and a smaller number of dedicated parameters, for the tasks studied in this thesis, by accumulating and leveraging knowledge learned across tasks in a continuous manner. It is envisioned that progressive learning is a step towards a fully general continual learning framework.

    Connectionist multivariate density-estimation and its application to speech synthesis

    Autoregressive models factorize a multivariate joint probability distribution into a product of one-dimensional conditional distributions. The variables are assigned an ordering, and the conditional distribution of each variable is modelled using all variables preceding it in that ordering as predictors. Calculating normalized probabilities and sampling both have polynomial computational complexity under autoregressive models. Moreover, binary autoregressive models based on neural networks obtain statistical performance similar to that of some intractable models, like restricted Boltzmann machines, on several datasets. The use of autoregressive probability density estimators based on neural networks to model real-valued data, while proposed before, has never been properly investigated and reported. In this thesis we extend the formulation of neural autoregressive distribution estimators (NADE) to real-valued data; we call this model the real-valued neural autoregressive density estimator (RNADE). Its statistical performance on several datasets, including visual and auditory data, is reported and compared to that of other models. RNADE obtained higher test likelihoods than other tractable models, while retaining all the attractive computational properties of autoregressive models. However, autoregressive models are limited by the ordering of the variables inherent to their formulation. Marginalization and imputation tasks can only be solved analytically if the missing variables are at the end of the ordering. We present a new training technique that obtains a set of parameters that can be used for any ordering of the variables. By choosing a model with a convenient ordering of the dimensions at test time, it is possible to solve any marginalization and imputation task analytically. The same training procedure also makes it practical to train NADEs and RNADEs with several hidden layers.
The resulting deep and tractable models display higher test likelihoods than the equivalent one-hidden-layer models for all the datasets tested. Ensembles of NADEs or RNADEs can be created inexpensively by combining models that share their parameters but differ in the ordering of the variables. These ensembles of autoregressive models obtain state-of-the-art statistical performance on several datasets. Finally, we demonstrate the application of RNADE to speech synthesis, and confirm that capturing the phone-conditional dependencies of acoustic features improves the quality of synthetic speech. Our model generates synthetic speech that was judged by naive listeners as being of higher quality than that generated by mixture density networks, which are considered a state-of-the-art synthesis technique.
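The autoregressive factorisation at the heart of NADE/RNADE is easy to state concretely. The numpy sketch below uses plain linear-Gaussian conditionals as a stand-in for RNADE's neural-network conditionals, and evaluates a normalised joint density as a product of one-dimensional conditionals:

```python
import numpy as np

def autoregressive_gaussian_logpdf(x, cond_mean, sigmas):
    # log p(x) = sum_d log N(x_d | mu_d(x_<d), sigma_d^2): dimension d is
    # a 1-D Gaussian whose mean depends only on the preceding dimensions.
    logp = 0.0
    for d in range(len(x)):
        mu = cond_mean(d, x[:d])
        logp += (-0.5 * np.log(2.0 * np.pi * sigmas[d] ** 2)
                 - 0.5 * ((x[d] - mu) / sigmas[d]) ** 2)
    return logp

# toy conditional: predict each dimension as the mean of its predecessors
cond_mean = lambda d, prefix: prefix.mean() if d > 0 else 0.0
```

Sampling works the same way, ancestrally, one dimension at a time in the chosen ordering; RNADE replaces the toy conditional with a neural network that outputs mixture-of-Gaussians parameters for each dimension.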