32 research outputs found

    Multilingual Deep Neural Network based Acoustic Modeling For Rapid Language Adaptation

    Get PDF
    This paper presents a study on multilingual deep neural network (DNN) based acoustic modeling and its application to new languages. We investigate the effect of phone merging on multilingual DNN in context of rapid language adaptation. Moreover, the combination of multilingual DNNs with Kullback--Leibler divergence based acoustic modeling (KL-HMM) is explored. Using ten different languages from the Globalphone database, our studies reveal that crosslingual acoustic model transfer through multilingual DNNs is superior to unsupervised RBM pre-training and greedy layer-wise supervised training. We also found that KL-HMM based decoding consistently outperforms conventional hybrid decoding, especially in low-resource scenarios. Furthermore, the experiments indicate that multilingual DNN training equally benefits from simple phoneset concatenation and manually derived universal phonesets

    GREC: Multi-domain Speech Recognition for the Greek Language

    Get PDF
    Μία από τις κορυφαίες προκλήσεις στην Αυτόματη Αναγνώριση Ομιλίας είναι η ανάπτυξη ικανών συστημάτων που μπορούν να έχουν ισχυρή απόδοση μέσα από διαφορετικές συνθήκες ηχογράφησης. Στο παρόν έργο κατασκευάζουμε και αναλύουμε το GREC, μία μεγάλη πολυτομεακή συλλογή δεδομένων για αυτόματη αναγνώριση ομιλίας στην ελληνική γλώσσα. Το GREC αποτελείται από τρεις βάσεις δεδομένων στους θεματικούς τομείς των «εκπομπών ειδήσεων», «ομιλίας από δωρισμένες εγγραφές φωνής», «ηχητικών βιβλίων» και μιας νέας συλλογής δεδομένων στον τομέα των «πολιτικών ομιλιών». Για τη δημιουργία του τελευταίου, συγκεντρώνουμε δεδομένα ομιλίας από ηχογραφήσεις των επίσημων συνεδριάσεων της Βουλής των Ελλήνων, αποδίδοντας ένα σύνολο δεδομένων που αποτελείται από 120 ώρες ομιλίας πολιτικού περιεχομένου. Περιγράφουμε με λεπτομέρεια την καινούρια συλλογή δεδομένων, την προεπεξεργασία και την ευθυγράμμιση ομιλίας, τα οποία βασίζονται στο εργαλείο ανοιχτού λογισμικού Kaldi. Επιπλέον, αξιολογούμε την απόδοση των μοντέλων Gaussian Mixture (GMM) - Hidden Markov (HMM) και Deep Neural Network (DNN) - HMM όταν εφαρμόζονται σε δεδομένα από διαφορετικούς τομείς. Τέλος, προσθέτουμε τη δυνατότητα αυτόματης δεικτοδότησης ομιλητών στο Kaldi-gRPC-Server, ενός εργαλείου γραμμένο σε Python που βασίζεται στο PyKaldi και στο gRPC για βελτιωμένη ανάπτυξη μοντέλων αυτόματης αναγνώρισης ομιλίας.One of the leading challenges in Automatic Speech Recognition (ASR) is the development of robust systems that can perform well under multiple settings. In this work we construct and analyze GREC, a large, multi-domain corpus for automatic speech recognition for the Greek language. GREC is a collection of three available subcorpora over the domains of “news casts”, “crowd-sourced speech”, “audiobooks”, and a new corpus in the domain of “public speeches”. For the creation of the latter, HParl, we collect speech data from recordings of the official proceedings of the Hellenic Parliament, yielding, a dataset which consists of 120 hours of political speech segments. We describe our data collection, pre-processing and alignment setup, which are based on Kaldi toolkit. Furthermore, we perform extensive ablations on the recognition performance of Gaussian Mixture (GMM) - Hidden Markov (HMM) models and Deep Neural Network (DNN) - HMM models over the different domains. Finally, we integrate speaker diarization features to Kaldi-gRPC-Server, a modern, pythonic tool based on PyKaldi and gRPC for streamlined deployment of Kaldi based speech recognition

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    Get PDF
    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages, which lack resources for speech and language processing. We focus on finding approaches which allow using data from multiple languages to improve the performance for those languages on different levels, such as feature extraction, acoustic modeling and language modeling. Under application aspects, this thesis also includes research work on non-native and Code-Switching speech

    Low latency modeling of temporal contexts for speech recognition

    Get PDF
    This thesis focuses on the development of neural network acoustic models for large vocabulary continuous speech recognition (LVCSR) to satisfy the design goals of low latency and low computational complexity. Low latency enables online speech recognition; and low computational complexity helps reduce the computational cost both during training and inference. Long span sequential dependencies and sequential distortions in the input vector sequence are a major challenge in acoustic modeling. Recurrent neural networks have been shown to effectively model these dependencies. Specifically, bidirectional long short term memory (BLSTM) networks, provide state-of-the-art performance across several LVCSR tasks. However the deployment of bidirectional models for online LVCSR is non-trivial due to their large latency; and unidirectional LSTM models are typically preferred. In this thesis we explore the use of hierarchical temporal convolution to model long span temporal dependencies. We propose a sub-sampled variant of these temporal convolution neural networks, termed time-delay neural networks (TDNNs). These sub-sampled TDNNs reduce the computation complexity by ~5x, compared to TDNNs, during frame randomized pre-training. These models are shown to be effective in modeling long-span temporal contexts, however there is a performance gap compared to (B)LSTMs. As recent advancements in acoustic model training have eliminated the need for frame randomized pre-training we modify the TDNN architecture to use higher sampling rates, as the increased computation can be amortized over the sequence. These variants of sub- sampled TDNNs provide performance superior to unidirectional LSTM networks, while also affording a lower real time factor (RTF) during inference. However we show that the BLSTM models outperform both the TDNN and LSTM models. We propose a hybrid architecture interleaving temporal convolution and LSTM layers which is shown to outperform the BLSTM models. Further we improve these BLSTM models by using higher frame rates at lower layers and show that the proposed TDNN- LSTM model performs similar to these superior BLSTM models, while reducing the overall latency to 200 ms. Finally we describe an online system for reverberation robust ASR, using the above described models in conjunction with other data augmentation techniques like reverberation simulation, which simulates far-field environments, and volume perturbation, which helps tackle volume variation even without gain normalization

    X-VECTORS: ROBUST NEURAL EMBEDDINGS FOR SPEAKER RECOGNITION

    Get PDF
    Speaker recognition is the task of identifying speakers based on their speech signal. Typically, this involves comparing speech from a known speaker, with recordings from unknown speakers, and making same-or-different speaker decisions. If the lexical contents of the recordings are fixed to some phrase, the task is considered text-dependent, otherwise it is text-independent. This dissertation is primarily concerned with this second, less constrained problem. Since speech data lives in a complex, high-dimensional space, it is difficult to directly compare speakers. Comparisons are facilitated by embeddings: mappings from complex input patterns to low-dimensional Euclidean spaces where notions of distance or similarity are defined in natural ways. For almost ten years, systems based on i-vectors--a type of embedding extracted from a traditional generative model--have been the dominant paradigm in this field. However, in other areas of applied machine learning, such as text or vision, embeddings extracted from discriminatively trained neural networks are the state-of-the-art. Recently, this line of research has become very active in speaker recognition as well. Neural networks are a natural choice for this purpose, as they are capable of learning extremely complex mappings, and when training data resources are abundant, tend to outperform traditional methods. In this dissertation, we develop a next-generation neural embedding--denoted by x-vector--for speaker recognition. These neural embeddings are demonstrated to substantially improve upon the state-of-the-art on a number of benchmark datasets

    Spoken command recognition for robotics

    Get PDF
    In this thesis, I investigate spoken command recognition technology for robotics. While high robustness is expected, the distant and noisy conditions in which the system has to operate make the task very challenging. Unlike commercial systems which all rely on a "wake-up" word to initiate the interaction, the pipeline proposed here directly detect and recognizes commands from the continuous audio stream. In order to keep the task manageable despite low-resource conditions, I propose to focus on a limited set of commands, thus trading off flexibility of the system against robustness. Domain and speaker adaptation strategies based on a multi-task regularization paradigm are first explored. More precisely, two different methods are proposed which rely on a tied loss function which penalizes the distance between the output of several networks. The first method considers each speaker or domain as a task. A canonical task-independent network is jointly trained with task-dependent models, allowing both types of networks to improve by learning from one another. While an improvement of 3.2% on the frame error rate (FER) of the task-independent network is obtained, this only partially carried over to the phone error rate (PER), with 1.5% of improvement. Similarly, a second method explored the parallel training of the canonical network with a privileged model having access to i-vectors. This method proved less effective with only 1.2% of improvement on the FER. In order to make the developed technology more accessible, I also investigated the use of a sequence-to-sequence (S2S) architecture for command classification. The use of an attention-based encoder-decoder model reduced the classification error by 40% relative to a strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing the relevance of S2S architectures in such context. In order to improve the flexibility of the trained system, I also explored strategies for few-shot learning, which allow to extend the set of commands with minimum requirements in terms of data. Retraining a model on the combination of original and new commands, I managed to achieve 40.5% of accuracy on the new commands with only 10 examples for each of them. This scores goes up to 81.5% of accuracy with a larger set of 100 examples per new command. An alternative strategy, based on model adaptation achieved even better scores, with 68.8% and 88.4% of accuracy with 10 and 100 examples respectively, while being faster to train. This high performance is obtained at the expense of the original categories though, on which the accuracy deteriorated. Those results are very promising as the methods allow to easily extend an existing S2S model with minimal resources. Finally, a full spoken command recognition system (named iCubrec) has been developed for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to propose a fully hand-free experience. By segmenting only regions that are likely to contain commands, the VAD module also allows to reduce greatly the computational cost of the pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM command recognition system for transcription. The VoCub dataset has been specifically gathered to train a DNN-based acoustic model for our task. Through multi-condition training with the CHiME4 dataset, an accuracy of 94.5% is reached on VoCub test set. A filler model, complemented by a rejection mechanism based on a confidence score, is finally added to the system to reject non-command speech in a live demonstration of the system

    Neural approaches to spoken content embedding

    Full text link
    Comparing spoken segments is a central operation to speech processing. Traditional approaches in this area have favored frame-level dynamic programming algorithms, such as dynamic time warping, because they require no supervision, but they are limited in performance and efficiency. As an alternative, acoustic word embeddings -- fixed-dimensional vector representations of variable-length spoken word segments -- have begun to be considered for such tasks as well. However, the current space of such discriminative embedding models, training approaches, and their application to real-world downstream tasks is limited. We start by considering ``single-view" training losses where the goal is to learn an acoustic word embedding model that separates same-word and different-word spoken segment pairs. Then, we consider ``multi-view" contrastive losses. In this setting, acoustic word embeddings are learned jointly with embeddings of character sequences to generate acoustically grounded embeddings of written words, or acoustically grounded word embeddings. In this thesis, we contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs). We improve model training in terms of both efficiency and performance. We take these developments beyond English to several low-resource languages and show that multilingual training improves performance when labeled data is limited. We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition. Finally, we show how our embedding approaches compare with and complement more recent self-supervised speech models.Comment: PhD thesi
    corecore