    Transformer-based Acoustic Modeling for Hybrid Speech Recognition

    We propose and evaluate transformer-based acoustic models (AMs) for hybrid speech recognition. Several modeling choices are discussed in this work, including various positional embedding methods and an iterated loss to enable training deep transformers. We also present a preliminary study of using limited right context in transformer models, which makes streaming applications possible. We demonstrate that on the widely used Librispeech benchmark, our transformer-based AM outperforms the best published hybrid result by 19% to 26% relative when the standard n-gram language model (LM) is used. Combined with a neural network LM for rescoring, our proposed approach achieves state-of-the-art results on Librispeech. Our findings are also confirmed on a much larger internal dataset.

    Articulatory feature based continuous speech recognition using probabilistic lexical modeling

    Phonological studies suggest that the typical subword units such as phones or phonemes used in automatic speech recognition systems can be decomposed into a set of features based on the articulators used to produce the sound. Most current approaches to integrating articulatory feature (AF) representations into an automatic speech recognition (ASR) system are based on a deterministic knowledge-based phoneme-to-AF relationship. In this paper, we propose a novel two-stage approach in the framework of probabilistic lexical modeling to integrate AF representations into an ASR system. In the first stage, the relationship between acoustic feature observations and various AFs is modeled. In the second stage, a probabilistic relationship between subword units and AFs is learned using transcribed speech data. Our studies on a continuous speech recognition task show that the proposed approach effectively integrates AFs into an ASR system. Furthermore, the studies show that either phonemes or graphemes can be used as subword units. Analysis of the probabilistic relationship captured by the parameters shows that the approach is capable of adapting the knowledge-based phoneme-to-AF representations using speech data, and that it allows different AFs to evolve asynchronously.
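
The two stages described above can be illustrated with a small sketch. It is not the paper's implementation: the AF inventory, the probability values, and the use of a KL divergence as the local score are all assumptions chosen for illustration. Stage one is represented by per-frame posteriors over AF values, and stage two by a learned distribution over AF values for each subword unit.

```python
import math

# Stage 1 output (hypothetical values): for one acoustic frame, posterior
# probabilities over the values of each articulatory feature (AF).
frame_posteriors = {
    "manner": {"vowel": 0.1, "stop": 0.8, "fricative": 0.1},
    "voicing": {"voiced": 0.3, "unvoiced": 0.7},
}

# Stage 2 parameters (hypothetical values): a probabilistic, rather than
# deterministic, relationship between each subword unit and the AF values.
unit_to_af = {
    "p": {
        "manner": {"vowel": 0.05, "stop": 0.9, "fricative": 0.05},
        "voicing": {"voiced": 0.1, "unvoiced": 0.9},
    },
    "a": {
        "manner": {"vowel": 0.9, "stop": 0.05, "fricative": 0.05},
        "voicing": {"voiced": 0.95, "unvoiced": 0.05},
    },
}


def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two categorical distributions over the same keys."""
    return sum(p[v] * math.log((p[v] + eps) / (q[v] + eps)) for v in p)


def local_score(unit, posteriors):
    """Sum of per-AF divergences; lower means the unit fits the frame better.

    Scoring each AF separately is an illustrative assumption.
    """
    return sum(
        kl_divergence(unit_to_af[unit][af], posteriors[af]) for af in posteriors
    )


best_unit = min(unit_to_af, key=lambda u: local_score(u, frame_posteriors))
```

Because the unit-to-AF table holds probabilities rather than fixed 0/1 entries, it could in principle be initialised from phonological knowledge and then re-estimated from transcribed speech, which is the kind of adaptation the abstract reports.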

    Fast speaker independent large vocabulary continuous speech recognition

    Learning to adapt: meta-learning approaches for speaker adaptation

    The performance of automatic speech recognition systems degrades rapidly when there is a mismatch between training and testing conditions. One way to compensate for this mismatch is to adapt an acoustic model to test conditions, for example by performing speaker adaptation. In this thesis we focus on the discriminative model-based speaker adaptation approach. The success of this approach relies on having a robust speaker adaptation procedure: we need to specify which parameters should be adapted and how they should be adapted. Unfortunately, tuning the speaker adaptation procedure requires considerable manual effort. In this thesis we propose to formulate speaker adaptation as a meta-learning task. In meta-learning, learning occurs on two levels: a learner learns a task-specific model and a meta-learner learns how to train these task-specific models. In our case, the learner is a speaker-dependent model and the meta-learner learns to adapt a speaker-independent model into the speaker-dependent model. By using this formulation, we can automatically learn robust speaker adaptation procedures using gradient descent. In the experiments, we demonstrate that the meta-learning approach learns competitive adaptation schedules compared to adaptation procedures with handcrafted hyperparameters. Subsequently, we show that speaker adaptive training can be formulated as a meta-learning task as well. In contrast to the traditional approach, which maintains and optimises a copy of speaker-dependent parameters for each speaker during training, we embed the gradient-based adaptation directly into the training of the acoustic model. We hypothesise that this formulation should steer the training of the acoustic model into finding parameters better suited for test-time speaker adaptation. We experimentally compare our approach with test-only adaptation of a standard baseline model and with SAT-LHUC, which represents a traditional speaker adaptive training method.
We show that the meta-learning speaker-adaptive training approach achieves comparable results with SAT-LHUC. However, neither the meta-learning approach nor SAT-LHUC outperforms the baseline approach after adaptation. Consequently, we run a series of experimental ablations to determine why SAT-LHUC does not yield any improvements compared to the baseline approach. In these experiments we explore multiple factors, such as various neural network architectures, normalisation techniques, activation functions and optimisers. We find that SAT-LHUC interferes with batch normalisation, and that it benefits from an increased hidden layer width and an increased model size. However, the baseline model benefits from increased capacity too; therefore, to obtain the best model it is still favourable to train a speaker-independent model with batch normalisation. As such, an effective way of training state-of-the-art SAT-LHUC models remains an open question. Finally, we show that the performance of unsupervised speaker adaptation can be further improved by using discriminative adaptation with lattices obtained from a first-pass decoding as supervision, instead of the traditionally used one-best path transcriptions. We find that this proposed approach enables many more parameters to be adapted without overfitting being observed, and is successful even when the initial transcription has a WER in excess of 50%.
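
The two-level structure described in this abstract (a learner that adapts to a speaker, and a meta-learner that tunes how the learner adapts) can be sketched on a toy problem. Everything here is an assumption made for illustration, not the thesis's setup: each speaker is reduced to a quadratic loss over one speaker-dependent parameter, the learner takes a single gradient step, and the only thing the meta-learner learns is the adaptation learning rate.

```python
# Toy setup (all values hypothetical): each speaker has a quadratic adaptation
# loss L(s) = 0.5 * c * (s - t)^2 over a single speaker-dependent parameter s.
speakers = [(1.0, 1.0), (-0.5, 2.0), (2.0, 4.0)]  # (target t, curvature c)
S0 = 0.0  # speaker-independent starting value shared by all speakers


def adapt(lr, t, c):
    """Learner: one gradient step on the speaker's own loss, starting at S0."""
    return S0 - lr * c * (S0 - t)


def post_adaptation_loss(lr):
    """Average loss over speakers after each has taken one adaptation step."""
    losses = [0.5 * c * (adapt(lr, t, c) - t) ** 2 for t, c in speakers]
    return sum(losses) / len(losses)


def meta_gradient(lr):
    """Derivative of the post-adaptation loss w.r.t. lr, via the chain rule."""
    total = 0.0
    for t, c in speakers:
        s_new = adapt(lr, t, c)
        dloss_ds = c * (s_new - t)   # d loss / d adapted parameter
        ds_dlr = -c * (S0 - t)       # d adapted parameter / d lr
        total += dloss_ds * ds_dlr
    return total / len(speakers)


def meta_learn(lr=0.05, meta_lr=0.01, steps=200):
    """Meta-learner: gradient descent on the adaptation learning rate itself."""
    for _ in range(steps):
        lr -= meta_lr * meta_gradient(lr)
    return lr


initial_lr = 0.05
learned_lr = meta_learn(lr=initial_lr)
```

Comparing `post_adaptation_loss(learned_lr)` with `post_adaptation_loss(initial_lr)` shows the effect of tuning the adaptation procedure by gradient descent rather than by hand. A real system would differentiate through the adaptation of a neural acoustic model instead of a scalar.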

    Developing Sparse Representations for Anchor-Based Voice Conversion

    Voice conversion is the task of transforming speech from one speaker to sound as if it were produced by another speaker, changing the identity while retaining the linguistic content. There are many methods for performing voice conversion, but oftentimes these methods have onerous training requirements or fail in instances where one speaker has a nonnative accent. To address these issues, this dissertation presents and evaluates a novel “anchor-based” representation of speech that separates speaker content from speaker identity by modeling how speakers form English phonemes. We call the proposed method Sparse, Anchor-Based Representation of Speech (SABR), and explore methods for optimizing the parameters of this model in native-to-native and native-to-nonnative voice conversion contexts. We begin the dissertation by demonstrating, in objective and subjective tests, how sparse coding in combination with a compact, phoneme-based dictionary can be used to separate speaker identity from content. The formulation of the representation then presents several research questions. First, we propose a method for improving the synthesis quality by using the sparse coding residual in combination with a frequency warping algorithm to convert the residual from the source to the target speaker’s space, and add it to the target speaker’s estimated spectrum. Experimentally, we find that synthesis quality is significantly improved via this transform. Second, we propose and evaluate two methods for selecting and optimizing SABR anchors in native-to-native and native-to-nonnative voice conversion. We find that the proposed methods significantly improve synthesis quality over baseline algorithms, especially in native-to-nonnative voice conversion. In a detailed analysis of the algorithms, we find they focus on phonemes that are difficult for nonnative speakers of English or naturally have multiple acoustic states.
Following this, we examine methods for adding temporal constraints to SABR via the Fused Lasso. The proposed method significantly reduces the inter-frame variance in the sparse codes compared with other methods that incorporate temporal features into sparse coding representations. Finally, in a case study, we examine the use of the SABR methods and optimizations in the context of a computer-aided pronunciation training system for building “Golden Speakers”, or ideal models for nonnative speakers of a second language to learn correct pronunciation. Under the hypothesis that the optimal “Golden Speaker” was the learner’s voice, synthesized with a native accent, we used SABR to build voice models for nonnative speakers and evaluated the resulting synthesis in terms of quality, identity, and accentedness. We found that even when deployed in the field, the SABR method generated synthesis with low accentedness and similar acoustic identity to the target speaker, validating the use of the method for building “Golden Speakers”.
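
The core idea of an anchor-based sparse representation can be sketched as follows. This is a toy illustration under stated assumptions, not the SABR implementation: the anchors are made-up 2-D vectors (one per phoneme, per speaker), the sparse codes come from a generic non-negative lasso solved by proximal gradient descent, and the residual warping, anchor optimization, and Fused Lasso steps from the abstract are omitted. A source frame is encoded as sparse weights over the source speaker's anchors, and the same weights are then applied to the target speaker's anchors.

```python
# Hypothetical anchors: one 2-D "spectral" vector per phoneme, per speaker.
# Index j in both lists refers to the same phoneme.
source_anchors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
target_anchors = [[0.8, 0.3], [0.2, 1.1], [0.6, 0.9]]


def sparse_code(frame, anchors, l1=0.01, step=0.1, iters=2000):
    """Non-negative lasso weights via proximal gradient descent.

    Minimises 0.5 * ||frame - sum_j w_j * anchors[j]||^2 + l1 * sum_j w_j
    subject to w_j >= 0. The solver and its settings are illustrative.
    """
    dims = range(len(frame))
    weights = [0.0] * len(anchors)
    for _ in range(iters):
        recon = [sum(w * a[d] for w, a in zip(weights, anchors)) for d in dims]
        resid = [recon[d] - frame[d] for d in dims]
        grads = [sum(resid[d] * a[d] for d in dims) for a in anchors]
        # Gradient step, L1 shrinkage, then projection onto w >= 0.
        weights = [max(0.0, w - step * (g + l1)) for w, g in zip(weights, grads)]
    return weights


def convert(frame):
    """Encode with the source anchors, decode with the target anchors."""
    weights = sparse_code(frame, source_anchors)
    dims = range(len(target_anchors[0]))
    return [sum(w * a[d] for w, a in zip(weights, target_anchors)) for d in dims]


# A source frame that coincides with the source speaker's first anchor.
weights = sparse_code([1.0, 0.0], source_anchors)
converted = convert([1.0, 0.0])
```

In this example the frame loads almost entirely on the first anchor, so the converted frame lands close to the target speaker's first anchor. The weights carry the phonetic content while the anchors carry the speaker identity.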


    Learning disentangled speech representations

    A variety of informational factors are contained within the speech signal, and a single short recording of speech reveals much more than the spoken words. The best method to extract and represent informational factors from the speech signal ultimately depends on which informational factors are desired and how they will be used. In addition, methods will sometimes capture more than one informational factor at the same time, such as speaker identity, spoken content, and speaker prosody. The goal of this dissertation is to explore different ways to deconstruct the speech signal into abstract representations that can be learned and later reused in various speech technology tasks. This task of deconstructing, also known as disentanglement, is a form of distributed representation learning. As a general approach to disentanglement, there are some guiding principles that elaborate what a learned representation should contain as well as how it should function. In particular, learned representations should contain all of the requisite information in a more compact manner, be interpretable, remove nuisance factors of irrelevant information, be useful in downstream tasks, and be independent of the task at hand. The learned representations should also be able to answer counter-factual questions. In some cases, learned speech representations can be re-assembled in different ways according to the requirements of downstream applications. For example, in a voice conversion task, the speech content is retained while the speaker identity is changed. In a content-privacy task, some targeted content may be concealed without affecting how surrounding words sound. While there is no single best method to disentangle all types of factors, some end-to-end approaches demonstrate a promising degree of generalization to diverse speech tasks.
This thesis explores a variety of use-cases for disentangled representations including phone recognition, speaker diarization, linguistic code-switching, voice conversion, and content-based privacy masking. Speech representations can also be utilised for automatically assessing the quality and authenticity of speech, such as automatic MOS ratings or detecting deep fakes. The meaning of the term "disentanglement" is not well defined in previous work, and it has acquired several meanings depending on the domain (e.g. image vs. speech). Sometimes the term "disentanglement" is used interchangeably with the term "factorization". This thesis proposes that disentanglement of speech is distinct, and offers a viewpoint of disentanglement that can be considered both theoretically and practically.