740 research outputs found

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. On the application side, the thesis also includes research on non-native and code-switching speech.

    Using Resources from a Closely-related Language to Develop ASR for a Very Under-resourced Language: A Case Study for Iban

    This paper presents our strategies for developing an automatic speech recognition system for Iban, an under-resourced language. We faced several challenges, such as the absence of a pronunciation dictionary and a lack of training material for building acoustic models. To overcome these problems, we proposed approaches which exploit resources from a closely-related language (Malay). We developed a semi-supervised method for building the pronunciation dictionary and applied cross-lingual strategies for improving acoustic models trained with very limited training data. Both approaches displayed very encouraging results, which show that data from a closely-related language, if available, can be exploited to build ASR for a new language. In the final part of the paper, we present a zero-shot ASR system using Malay resources that can be used as an alternative method for transcribing Iban speech.

    Temporally Varying Weight Regression for Speech Recognition

    Survey of deep representation learning for speech emotion recognition

    Traditionally, speech emotion recognition (SER) research has relied on manually handcrafted acoustic features using feature engineering. However, the design of handcrafted features for complex SER tasks requires significant manual effort, which impedes generalisability and slows the pace of innovation. This has motivated the adoption of representation learning techniques that can automatically learn an intermediate representation of the input signal without any manual feature engineering. Representation learning has led to improved SER performance and enabled rapid innovation. Its effectiveness has further increased with advances in deep learning (DL), which has facilitated deep representation learning, where hierarchical representations are automatically learned in a data-driven manner. This paper presents the first comprehensive survey on the important topic of deep representation learning for SER. We highlight various techniques and related challenges, and identify important future areas of research. Our survey bridges the gap in the literature, since existing surveys focus either on SER with hand-engineered features or on representation learning in the general setting without focusing on SER.

    Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis

    In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios, and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions, including the need for better evaluation metrics.

    Learning representations for speech recognition using artificial neural networks

    Learning representations is a central challenge in machine learning. For speech recognition, we are interested in learning robust representations that are stable across different acoustic environments, recording equipment and irrelevant inter- and intra-speaker variabilities. This thesis is concerned with representation learning for acoustic model adaptation to speakers and environments, construction of acoustic models in low-resource settings, and learning representations from multiple acoustic channels. The investigations are primarily focused on the hybrid approach to acoustic modelling based on hidden Markov models and artificial neural networks (ANNs). The first contribution concerns acoustic model adaptation. This comprises two new adaptation transforms operating in ANN parameter space. Both operate at the level of activation functions and treat a trained ANN acoustic model as a canonical set of fixed-basis functions, from which one can later derive variants tailored to the specific distribution present in adaptation data. The first technique, termed Learning Hidden Unit Contributions (LHUC), depends on learning distribution-dependent linear combination coefficients for hidden units. This technique is then extended to altering groups of hidden units with parametric and differentiable pooling operators. We find that the proposed adaptation techniques possess many desirable properties: they are relatively low-dimensional, do not overfit and can work in both a supervised and an unsupervised manner. For LHUC we also present extensions to speaker adaptive training and environment factorisation. On average, depending on the characteristics of the test set, 5-25% relative word error rate (WERR) reductions are obtained in an unsupervised two-pass adaptation setting. The second contribution concerns building acoustic models in low-resource data scenarios. 
In particular, we are concerned with insufficient amounts of transcribed acoustic material for estimating acoustic models in the target language, and thus assume that resources such as lexicons or texts for estimating language models are available. First, we propose an ANN with a structured output layer which models both context-dependent and context-independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states. We also propose to perform multi-task adaptation with a structured output layer. We obtain consistent WERR reductions up to 6.4% in low-resource speaker-independent acoustic modelling. Adapting those models in a multi-task manner with LHUC decreases WERRs by an additional 13.6%, compared to 12.7% for non-multi-task LHUC. We then demonstrate that one can build better acoustic models with unsupervised multi- and cross-lingual initialisation and find that pre-training is largely language-independent. Up to 14.4% WERR reductions are observed, depending on the amount of the available transcribed acoustic data in the target language. The third contribution concerns building acoustic models from multi-channel acoustic data. For this purpose we investigate various ways of integrating and learning multi-channel representations. In particular, we investigate channel concatenation and the applicability of convolutional layers for this purpose. We propose a multi-channel convolutional layer with cross-channel pooling, which can be seen as a data-driven non-parametric auditory attention mechanism. We find that for unconstrained microphone arrays, our approach is able to match the performance of the comparable models trained on beamform-enhanced signals.
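
The LHUC idea described in the abstract above (learning per-hidden-unit combination coefficients while the trained network stays fixed) can be illustrated with a minimal sketch. The abstract does not give the exact parameterisation, so everything here is an assumption made for illustration: the function name `lhuc_layer`, the sigmoid hidden activation, and the amplitude form `2 * sigmoid(r)`, which constrains each coefficient to the range (0, 2).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_layer(x, W, b, r):
    """One hidden layer whose unit outputs are rescaled by LHUC-style amplitudes.

    W and b belong to the trained (fixed) model. r holds one adaptation
    parameter per hidden unit and is the only thing that would be learned
    from adaptation data. The 2 * sigmoid(r) form is an assumption.
    """
    hidden = sigmoid(W @ x + b)      # fixed basis functions of the trained model
    amplitude = 2.0 * sigmoid(r)     # per-unit coefficient in (0, 2)
    return amplitude * hidden

# Toy weights and input (illustrative values only).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = np.zeros(4)
x = rng.normal(size=3)

unadapted = lhuc_layer(x, W, b, np.zeros(4))                      # amplitude 1.0 everywhere
adapted = lhuc_layer(x, W, b, np.array([2.0, -2.0, 0.0, 0.0]))    # boost unit 0, damp unit 1
```

With `r` set to zero every amplitude equals 1, so the layer reproduces the unadapted model. Only one parameter per hidden unit is added, which is consistent with the abstract's description of the transform as relatively low-dimensional.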

    Speech recognition with probabilistic transcriptions and end-to-end systems using deep learning

    In this thesis, we develop deep learning models in automatic speech recognition (ASR) for two contrasting tasks characterized by the amounts of labeled data available for training. In the first half, we deal with scenarios where there are limited or no labeled data for training ASR systems. This situation is common in languages which are under-resourced. In the second half, by contrast, we train ASR systems with large amounts of labeled data in English. Our objective is to improve modern end-to-end (E2E) ASR using attention modeling. Thus, the two primary contributions of this thesis are the following: Cross-Lingual Speech Recognition in Under-Resourced Scenarios: A well-resourced language is a language with an abundance of resources to support the development of speech technology. Those resources are usually defined in terms of 100+ hours of speech data, corresponding transcriptions, pronunciation dictionaries, and language models. In contrast, an under-resourced language lacks one or more of these resources. The most expensive and time-consuming resource to acquire is transcriptions, due to the difficulty in finding native transcribers. The first part of the thesis proposes methods by which deep neural networks (DNNs) can be trained when there are limited or no transcribed data in the target language. Two key components of this proposition are Transfer Learning and Crowdsourcing. Through these methods, we demonstrate that it is possible to borrow statistical knowledge of acoustics from a variety of other well-resourced languages to learn the parameters of the DNN in the target under-resourced language. In particular, we use well-resourced languages as cross-entropy regularizers to improve the generalization capacity of the target language. 
A key accomplishment of this study is that it is the first to train DNNs using noisy labels in the target language transcribed by non-native speakers available in online marketplaces. End-to-End Large Vocabulary Automatic Speech Recognition: Recent advances in ASR have been mostly due to the advent of deep learning models. Such models have the ability to discover complex non-linear relationships between attributes that are usually found in real-world tasks. Despite these advances, building a conventional ASR system is a cumbersome procedure since it involves optimizing several components separately in a disjoint fashion. To alleviate this problem, modern ASR systems have adopted a new approach of directly transducing speech signals to text. Such systems are known as E2E systems, and one such approach is Connectionist Temporal Classification (CTC). However, one drawback of CTC is the hard alignment problem, as it relies only on the current input to generate the current output. In reality, the output at the current time is influenced not only by the current input but also by inputs in the past and the future. Thus, the second part of the thesis proposes advancing state-of-the-art E2E speech recognition for large corpora by directly incorporating attention modeling within the CTC framework. In attention modeling, inputs in the current, past, and future are distinctively weighted depending on the degree of influence they exert on the current output. We accomplish this by deriving new context vectors using time convolution features to model attention as part of the CTC network. To further improve attention modeling, we extract more reliable content information from a network representing an implicit language model. Finally, we use vector-based attention weights that are applied on context vectors across both time and their individual components. A key accomplishment of this study is that it is the first to incorporate attention directly within the CTC network. 
Furthermore, we show that our proposed attention-based CTC model, even in the absence of an explicit language model, is able to achieve lower word error rates than a well-trained conventional ASR system equipped with a strong external language model.
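
To make the per-frame nature of CTC mentioned in the abstract above concrete, here is a minimal sketch of standard greedy (best-path) CTC decoding: each frame picks its own best label, repeated labels are merged, and blanks are dropped. This is plain CTC decoding, not the attention-based model the thesis proposes. The blank index `0`, the function name, and the toy scores are assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol

def ctc_greedy_decode(frame_scores):
    """Best-path CTC decoding over a list of per-frame score vectors.

    Each frame is decided independently from its own scores, which is the
    behaviour the abstract describes as relying only on the current input.
    """
    # 1. Pick the highest-scoring label for every frame.
    best_path = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    # 2. Merge consecutive repeats of the same label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Remove blanks; a blank between two equal labels keeps both of them.
    return [label for label in collapsed if label != BLANK]

# Toy scores for 7 frames over 3 symbols (blank, 1, 2); values are illustrative.
scores = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
    [0.20, 0.10, 0.70],
    [0.80, 0.10, 0.10],
]
decoded = ctc_greedy_decode(scores)  # best path 1,1,0,1,2,2,0 collapses to [1, 1, 2]
```

The blank at the third frame is what lets the decoder emit label 1 twice in a row. Without it, the two runs of 1 would merge into a single symbol.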

    Context-Dependent Acoustic Modelling for Speech Recognition

    Automatic Speech Recognition and Translation of a Swiss German Dialect: Walliserdeutsch

    Walliserdeutsch is a Swiss German dialect spoken in the southwest of Switzerland. To investigate the potential of automatic speech processing of Walliserdeutsch, a small database was collected based mainly on broadcast news from a local radio station. Experiments suggest that automatic speech recognition is feasible: use of another (Swiss German) database shows that the small data size lends itself to bootstrapping from other data; use of Kullback-Leibler HMM suggests that phoneme mapping techniques can compensate for a grapheme-based dictionary. Experiments also indicate that statistical machine translation is feasible; the difficulty of small data size is offset by the close proximity to (High) German.
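
The abstract above mentions Kullback-Leibler HMMs without describing them. As a rough sketch of the general idea, each HMM state can be represented by a categorical distribution over phoneme classes, and a frame's phoneme posterior vector is scored against each state with a KL divergence. The direction of the divergence, the function names, and the toy distributions below are assumptions for illustration, not details taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def best_state(posterior, state_distributions):
    """Return the state whose categorical distribution is closest to the frame posterior."""
    return min(state_distributions,
               key=lambda name: kl_divergence(state_distributions[name], posterior))

# Hypothetical grapheme states, each a distribution over three phoneme classes.
states = {
    "a": [0.8, 0.1, 0.1],
    "b": [0.1, 0.8, 0.1],
}
frame_posterior = [0.7, 0.2, 0.1]  # phoneme posteriors for one frame (toy values)
chosen = best_state(frame_posterior, states)
```

Because the states are distributions over phoneme classes, a state attached to a grapheme can still learn which phonemes it tends to produce, which is one way to read the abstract's remark about compensating for a grapheme-based dictionary.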