
    Transfer Learning for Speech and Language Processing

    Transfer learning is a vital technique that generalizes models trained for one setting or task to other settings or tasks. For example, in speech recognition, an acoustic model trained for one language can be used to recognize speech in another language, with little or no re-training data. Transfer learning is closely related to multi-task learning (cross-lingual vs. multilingual), and has traditionally been studied under the name 'model adaptation'. Recent advances in deep learning show that transfer learning becomes much easier and more effective with high-level abstract features learned by deep models, and the 'transfer' can be conducted not only between data distributions and data types, but also between model structures (e.g., shallow nets and deep nets) or even model types (e.g., Bayesian models and neural models). This review paper summarizes some recent prominent research in this direction, particularly for speech and language processing. We also report some results from our group and highlight the potential of this very interesting research field.
    Comment: 13 pages, APSIPA 201
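
    A minimal sketch of the cross-lingual transfer recipe this abstract describes, in PyTorch: reuse the source-language acoustic model, swap its output layer for the target-language phone set, and optionally freeze the shared feature layers. The attribute names `hidden` and `out` are illustrative assumptions, not from the paper.

        import torch

        def transfer_acoustic_model(pretrained, n_target_phones, freeze_shared=True):
            # Cross-lingual transfer: keep the source-language feature layers,
            # replace only the output layer for the target-language phone set.
            # `pretrained.hidden` and `pretrained.out` are assumed names.
            if freeze_shared:
                for p in pretrained.hidden.parameters():
                    p.requires_grad = False  # language-independent features stay fixed
            pretrained.out = torch.nn.Linear(pretrained.out.in_features, n_target_phones)
            return pretrained  # fine-tune on the (small) target-language dataset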

    Deep Multi-View Learning for Visual Understanding

    PhD Thesis. Multi-view data is the result of an entity being perceived or represented from multiple perspectives. Many applications in visual understanding involve multi-view data. For example, the face images for training a recognition system are usually captured by different devices from multiple angles. This thesis focuses on cross-view visual recognition problems, e.g., identifying face images of the same person across different cameras. Several representative multi-view settings, from supervised multi-view learning to the more challenging unsupervised domain adaptive (UDA) multi-view learning, are investigated, and novel multi-view learning algorithms are proposed for each. More specifically, the proposed methods build on advanced deep neural network (DNN) architectures to better handle visual data. However, directly combining multi-view learning objectives with a DNN can raise issues, e.g., with scalability, and limit the application scenarios and model performance; corresponding novelties in the DNN methods are thus required. This thesis is organised into three parts, each focusing on a multi-view learning setting with novel solutions, as follows.

    Chapter 3: A supervised multi-view learning setting with two different views is studied. To recognise data samples across views, one strategy is to align them in a common feature space via correlation maximisation, also known as canonical correlation analysis (CCA). Deep CCA has been proposed for better performance through non-linear projection via deep neural networks. Existing deep CCA models typically decorrelate the deep feature dimensions of each view before their Euclidean distances are minimised in the common space. This feature decorrelation is achieved by enforcing an exact decorrelation constraint, which is computationally expensive due to the matrix inversion or SVD operations; existing deep CCA models are therefore inefficient and have scalability issues. Furthermore, the exact decorrelation is incompatible with gradient-based deep model training and results in a sub-optimal solution. To overcome these issues, a novel deep CCA model, Soft CCA, is introduced in this thesis. Specifically, the exact decorrelation is replaced by soft decorrelation via a mini-batch based Stochastic Decorrelation Loss (SDL), which can be jointly optimised with the other training objectives (a minimal sketch of such a loss follows this abstract). In addition, the SDL loss can be applied to other deep models beyond multi-view learning.

    Chapter 4: A supervised multi-view learning setting with more than two views is studied in this chapter. Recently developed deep multi-view learning algorithms either learn a latent visual representation based on a single semantic level and/or require laborious human annotation of these factors as attributes. A novel deep neural network architecture, called Multi-Level Factorisation Net (MLFN), is proposed to automatically factorise the visual appearance into latent discriminative factors at multiple semantic levels without manual annotation. The main idea is to force different views to share the same latent factors so that they can be aligned at all layers. Specifically, MLFN is composed of multiple stacked blocks. Each block contains multiple factor modules to model latent factors at a specific level, and factor selection modules that dynamically select the factor modules to interpret the content of each input image. The outputs of the factor selection modules also provide a compact latent factor descriptor that is complementary to the conventional deeply learned feature, and the two can be fused efficiently. The effectiveness of the proposed MLFN is demonstrated not only on large-scale cross-view recognition problems but also on general object categorisation tasks.

    Chapter 5: The last problem is a special unsupervised domain adaptation setting called unsupervised domain adaptive (UDA) multi-view learning. It involves a fully annotated dataset as the source domain and an unlabelled dataset with a relevant task as the target domain. The goal is to improve performance on the unlabelled dataset using the annotated data from the other dataset. More importantly, this setting further requires that both the source and target domains be multi-view datasets with relevant tasks; the assumption of an aligned label space across domains is therefore inappropriate in UDA multi-view learning. For example, person re-identification (Re-ID) datasets built on different surveillance scenarios contain images of different people and should be given disjoint person identity labels. Existing methods for UDA multi-view learning align the domains either in the raw image space or in a feature embedding space. In this thesis, a different framework, multi-task learning, is adopted, with domain-specific objectives for learning a common space that enables knowledge transfer. Conventional supervised losses can be used for the labelled source data, while unsupervised objectives for the target domain play the key role in domain adaptation. Two novel unsupervised objectives are introduced for UDA multi-view learning, resulting in the two models below. The first model, termed the common factorised space model (CFSM), is built on the assumption that semantic latent attributes are shared between the source and target domains, since they are relevant multi-view learning tasks. Different from existing methods based on domain alignment, CFSM emphasises transferring information across domains by discovering discriminative latent factors in the proposed common space. However, the multi-view data from the target domain is unlabelled, so an unsupervised factorisation loss is derived and applied to the common space for latent factor discovery across domains. The second model also learns a shared embedding space with multi-view data from both domains, but under a different assumption: it attempts to discover the latent correspondence of multi-view data in the unsupervised target data. The target data's contribution comes from a clustering process, with each cluster revealing the underlying cross-view correspondences across multiple views in the target domain. To this end, a novel Stochastic Inference for Deep Clustering (SIDC) method is proposed. It reduces the self-reinforcing errors that lead to premature convergence to a sub-optimal solution by changing the conventional deterministic cluster assignment to a stochastic one.
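
    Where exact decorrelation in standard deep CCA requires a matrix inversion or SVD per step, the soft variant only penalises off-diagonal covariance entries within a mini-batch. A minimal PyTorch sketch of such a soft decorrelation loss; the thesis's actual SDL additionally maintains a running covariance estimate across mini-batches, which is omitted here.

        import torch

        def soft_decorrelation_loss(feats):
            # feats: (batch, dim) activations of one view in the common space.
            # Penalise off-diagonal entries of the mini-batch covariance so the
            # feature dimensions decorrelate "softly" during SGD, instead of
            # enforcing exact decorrelation with expensive inversion/SVD.
            feats = feats - feats.mean(dim=0, keepdim=True)
            cov = feats.t() @ feats / (feats.shape[0] - 1)
            off_diag = cov - torch.diag(torch.diagonal(cov))
            return off_diag.abs().sum() / feats.shape[1]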

    Robust learning of acoustic representations from diverse speech data

    Automatic speech recognition is increasingly applied to new domains. A key challenge is to robustly learn, update and maintain representations that cope with transient acoustic conditions. A typical example is broadcast media, in which speakers and environments may change rapidly and the available supervision may be poor. The concern of this thesis is to build and investigate methods for acoustic modelling that are robust to the characteristics and transient conditions embodied by such media.

    The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio with approximate labels, but training methods can be sensitive to label errors, so their use is not trivial. State-of-the-art semi-supervised training makes effective use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid overfitting to poor supervision, but does not make use of the transcriptions. Existing approaches that do aim to use the transcriptions typically employ an algorithm to filter or combine the transcriptions with the recognition output from a seed model, but the final result does not encode uncertainty. We propose a method to combine the lattice output from a biased recognition pass with the transcripts, crucially preserving uncertainty in the lattice where appropriate. This substantially reduces the word error rate on a broadcast task.

    The second contribution is a method to factorise representations of speakers and environments so that they may be combined in novel combinations. In realistic scenarios, the speaker or environment transform at test time might be unknown, or there may be insufficient data to learn a joint transform. We show that in such cases factorised, or independent, representations are required to avoid deteriorating performance. Using i-vectors, we factorise speaker and environment information via multi-condition training with neural networks. Specifically, we extract bottleneck features from networks trained to classify either speakers or environments. The resulting factorised representations prove beneficial when one factor is missing at test time, or when all factors are seen, but not in the desired combination.

    The third contribution is an investigation of model adaptation in a longitudinal setting, in which we repeatedly adapt a model to new data under the constraint that previous data becomes unavailable. We first demonstrate the effect of this constraint and show that using a cyclical learning rate may help. We then observe that the successive models lend themselves well to ensembling. Finally, we show that the impact of this constraint in an active learning setting may be detrimental to performance, and suggest combining active learning with semi-supervised training to avoid biasing the model.

    The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature extractor known as SincNet. In contrast to traditional techniques that warp the filterbank frequencies in standard feature extraction, adapting SincNet parameters is more flexible and more readily optimised, whilst maintaining interpretability. On a task adapting from adult to child speech, we show that this layer is well suited for adaptation and is very effective relative to the small number of adapted parameters.
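
    The parameter efficiency of the fourth contribution comes from SincNet's filter parameterisation: each band-pass filter is defined by just two learnable cutoffs, so adapting this layer touches only two parameters per filter. A minimal PyTorch sketch under simplifying assumptions (the window function and cutoff-ordering constraints of the original SincNet are omitted):

        import torch

        class SincFilters(torch.nn.Module):
            # Each filter is a band-pass built as the difference of two ideal
            # low-pass sinc filters, parameterised by a low cutoff and a
            # bandwidth in Hz. Only 2 * n_filters parameters are learnable.
            def __init__(self, n_filters=40, kernel_size=101, sample_rate=16000):
                super().__init__()
                self.kernel_size, self.sample_rate = kernel_size, sample_rate
                self.low_hz = torch.nn.Parameter(torch.linspace(30, 4000, n_filters))
                self.band_hz = torch.nn.Parameter(torch.full((n_filters,), 400.0))

            def forward(self, wav):  # wav: (batch, 1, time)
                n = torch.arange(self.kernel_size, device=wav.device) - (self.kernel_size - 1) / 2
                f1 = self.low_hz.abs() / self.sample_rate
                f2 = (self.low_hz.abs() + self.band_hz.abs()) / self.sample_rate

                def lowpass(f):  # ideal low-pass with normalised cutoff f
                    return 2 * f[:, None] * torch.sinc(2 * f[:, None] * n[None, :])

                kernels = (lowpass(f2) - lowpass(f1)).unsqueeze(1)  # (filters, 1, kernel)
                return torch.nn.functional.conv1d(wav, kernels, padding=self.kernel_size // 2)

    Adapting to a new condition (e.g. child speech) then amounts to fine-tuning only low_hz and band_hz while the rest of the network stays frozen.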

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
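
    The two degradation types the overview distinguishes can be simulated directly when building multi-condition training data: convolutional degradation by filtering with a room impulse response, and additive degradation by mixing in noise at a chosen SNR. A minimal NumPy sketch; the function and fixed-SNR mixing are illustrative assumptions, not a method from the paper.

        import numpy as np

        def degrade(speech, noise, rir, snr_db=10.0):
            # All inputs are 1-D float arrays at the same sampling rate.
            reverberant = np.convolve(speech, rir)[: len(speech)]  # convolutional
            noise = np.resize(noise, len(reverberant))
            # Scale the noise so the mixture hits the requested SNR.
            p_sig = np.mean(reverberant ** 2)
            p_noise = np.mean(noise ** 2) + 1e-12
            gain = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
            return reverberant + gain * noise                      # additive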

    Deep representation learning for speech recognition

    Representation learning is a fundamental ingredient of deep learning. However, learning a good representation is a challenging task. For speech recognition, such a representation should contain the information needed to perform well in this task. A robust representation should also be reusable, hence it should capture the structure of the data. Interpretability is another desired characteristic. In this thesis we strive to learn an optimal deep representation for speech recognition using feed-forward Neural Networks (NNs) with different connectivity patterns. First and foremost, we aim to improve the robustness of the acoustic models. We use attribute-aware and adaptive training strategies to model the underlying factors of variation related to the speakers and the acoustic conditions. We focus on low-latency and real-time decoding scenarios. We explore different utterance summaries (referred to as utterance embeddings) capturing various sources of speech variability, and we seek to optimise speaker adaptive training (SAT) with control networks acting on the embeddings. We also propose a multi-scale CNN layer to learn factorised representations; the multi-scale approach also addresses computational and memory efficiency. We then present a number of different approaches in an attempt to better understand the learned representations. First, with a controlled design, we assess the role of individual components of deep CNN acoustic models. Next, with saliency maps, we evaluate the importance of each input feature with respect to the classification criterion. Then, we propose to evaluate layer-wise and model-wise learned representations in different diagnostic verification tasks (speaker and acoustic condition verification), proposing a deep CNN model as the embedding extractor that merges the information learned at different layers in the network. Similarly, we perform these analyses for the embeddings used in SAT-DNNs to gain more insight. For the multi-scale models, we also show how to compare learned representations (and assess their robustness) with a metric invariant to affine transformations.
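
    One plausible reading of a multi-scale CNN layer is sketched below: parallel convolutions with different receptive fields over the time-frequency input, concatenated channel-wise. The exact design in the thesis may differ; the kernel sizes and channel split here are illustrative assumptions.

        import torch

        class MultiScaleConv(torch.nn.Module):
            # Parallel convolution branches with different kernel sizes see the
            # input at different scales; their outputs are concatenated, giving
            # a factorised representation at modest parameter cost.
            def __init__(self, in_ch, out_ch, scales=(3, 5, 9)):
                super().__init__()
                branch_ch = out_ch // len(scales)  # out_ch assumed divisible
                self.branches = torch.nn.ModuleList(
                    torch.nn.Conv2d(in_ch, branch_ch, k, padding=k // 2) for k in scales
                )

            def forward(self, x):  # x: (batch, in_ch, freq, time)
                return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)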

    Learning Hidden Unit Contributions for Unsupervised Acoustic Model Adaptation

    This work presents a broad study on the adaptation of neural network acoustic models by means of learning hidden unit contributions (LHUC) -- a method that linearly re-combines hidden units in a speaker- or environment-dependent manner using small amounts of unsupervised adaptation data. We also extend LHUC to a speaker adaptive training (SAT) framework, leading to a more adaptable DNN acoustic model that works in both a speaker-dependent and a speaker-independent manner, without the need to maintain auxiliary speaker-dependent feature extractors or to introduce significant speaker-dependent changes to the DNN structure. Through a series of experiments on four different speech recognition benchmarks (TED talks, Switchboard, AMI meetings, and Aurora4) comprising 270 test speakers, we show that LHUC in both its test-only and SAT variants results in consistent word error rate reductions ranging from 5% to 23% relative, depending on the task and the degree of mismatch between training and test data. In addition, we investigate the effect of the amount of adaptation data per speaker, the quality of unsupervised adaptation targets, the complementarity to other adaptation techniques, one-shot adaptation, and an extension to adapting DNNs trained in a sequence discriminative manner.
    Comment: 14 pages, 9 Tables, 11 Figures in IEEE/ACM Transactions on Audio, Speech and Language Processing, Vol. 24, Num. 8, 201
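
    LHUC reduces to a small element-wise re-scaling of hidden activations with per-speaker parameters. A minimal PyTorch sketch; the 2*sigmoid re-parameterisation constraining the scales to (0, 2) follows the paper, while the module structure is illustrative.

        import torch

        class LHUC(torch.nn.Module):
            # Learning Hidden Unit Contributions: a per-speaker vector r
            # re-scales each hidden unit's activation. For unsupervised
            # test-only adaptation, freeze all other weights and update only
            # r on the adaptation data (targets from a first decoding pass).
            def __init__(self, n_units):
                super().__init__()
                self.r = torch.nn.Parameter(torch.zeros(n_units))
            def forward(self, h):  # h: (batch, n_units) hidden activations
                return 2.0 * torch.sigmoid(self.r) * h  # scales in (0, 2)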

    Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview

    We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature.
    Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures
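
    Of the three families the overview names, embedding-based adaptation is the simplest to sketch: an utterance- or speaker-level embedding (e.g. an i-vector) is appended to every acoustic frame so the network can normalise for the speaker. A minimal illustration; the dimensions and layer sizes are arbitrary assumptions.

        import torch

        class EmbeddingAdaptedModel(torch.nn.Module):
            # Embedding-based adaptation: concatenate a fixed speaker embedding
            # to every acoustic frame before the network input.
            def __init__(self, feat_dim=40, emb_dim=100, n_out=4000):
                super().__init__()
                self.net = torch.nn.Sequential(
                    torch.nn.Linear(feat_dim + emb_dim, 512), torch.nn.ReLU(),
                    torch.nn.Linear(512, n_out),
                )
            def forward(self, frames, spk_emb):  # frames: (T, feat), spk_emb: (emb,)
                emb = spk_emb.expand(frames.shape[0], -1)  # repeat for each frame
                return self.net(torch.cat([frames, emb], dim=-1))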

    Learning Hidden Unit Contributions for Unsupervised Speaker Adaptation of Neural Network Acoustic Models
