18 research outputs found
Learning representations for speech recognition using artificial neural networks
Learning representations is a central challenge in machine learning. For speech
recognition, we are interested in learning robust representations that are stable
across different acoustic environments, recording equipment and irrelevant inter–
and intra– speaker variabilities. This thesis is concerned with representation
learning for acoustic model adaptation to speakers and environments, construction
of acoustic models in low-resource settings, and learning representations from
multiple acoustic channels. The investigations are primarily focused on the hybrid
approach to acoustic modelling based on hidden Markov models and artificial
neural networks (ANN).
The first contribution concerns acoustic model adaptation. This comprises
two new adaptation transforms operating in ANN parameters space. Both operate
at the level of activation functions and treat a trained ANN acoustic model as
a canonical set of fixed-basis functions, from which one can later derive variants
tailored to the specific distribution present in adaptation data. The first technique,
termed Learning Hidden Unit Contributions (LHUC), depends on learning
distribution-dependent linear combination coefficients for hidden units. This
technique is then extended to altering groups of hidden units with parametric and
differentiable pooling operators. We found the proposed adaptation techniques
pose many desirable properties: they are relatively low-dimensional, do not overfit
and can work in both a supervised and an unsupervised manner. For LHUC we
also present extensions to speaker adaptive training and environment factorisation.
On average, depending on the characteristics of the test set, 5-25% relative
word error rate (WERR) reductions are obtained in an unsupervised two-pass
adaptation setting.
The second contribution concerns building acoustic models in low-resource
data scenarios. In particular, we are concerned with insufficient amounts of
transcribed acoustic material for estimating acoustic models in the target language
– thus assuming resources like lexicons or texts to estimate language models
are available. First we proposed an ANN with a structured output layer
which models both context–dependent and context–independent speech units,
with the context-independent predictions used at runtime to aid the prediction
of context-dependent states. We also propose to perform multi-task adaptation
with a structured output layer. We obtain consistent WERR reductions up to
6.4% in low-resource speaker-independent acoustic modelling. Adapting those
models in a multi-task manner with LHUC decreases WERRs by an additional
13.6%, compared to 12.7% for non multi-task LHUC. We then demonstrate that
one can build better acoustic models with unsupervised multi– and cross– lingual
initialisation and find that pre-training is a largely language-independent. Up to
14.4% WERR reductions are observed, depending on the amount of the available
transcribed acoustic data in the target language.
The third contribution concerns building acoustic models from multi-channel
acoustic data. For this purpose we investigate various ways of integrating and
learning multi-channel representations. In particular, we investigate channel concatenation
and the applicability of convolutional layers for this purpose. We
propose a multi-channel convolutional layer with cross-channel pooling, which
can be seen as a data-driven non-parametric auditory attention mechanism. We
find that for unconstrained microphone arrays, our approach is able to match the
performance of the comparable models trained on beamform-enhanced signals
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature.Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27
figure
The subspace Gaussian mixture model—A structured model for speech recognition
We describe a new approach to speech recognition, in which all Hidden Markov Model (HMM) states share the same Gaussian Mixture Model (GMM) structure with the same number of Gaussians in each state. The model is defined by vectors associated with each state with a dimension of, say, 50, together with a global mapping from this vector space to the space of parameters of the GMM. This model appears to give better results than a conventional model, and the extra structure offers many new opportunities for modeling innovations while maintaining compatibility with most standard techniques
Subspace Gaussian mixture models for automatic speech recognition
In most of state-of-the-art speech recognition systems, Gaussian mixture models (GMMs)
are used to model the density of the emitting states in the hidden Markov models
(HMMs). In a conventional system, the model parameters of each GMM are estimated
directly and independently given the alignment. This results a large number of
model parameters to be estimated, and consequently, a large amount of training data
is required to fit the model. In addition, different sources of acoustic variability that
impact the accuracy of a recogniser such as pronunciation variation, accent, speaker
factor and environmental noise are only weakly modelled and factorized by adaptation
techniques such as maximum likelihood linear regression (MLLR), maximum a posteriori
adaptation (MAP) and vocal tract length normalisation (VTLN). In this thesis,
we will discuss an alternative acoustic modelling approach — the subspace Gaussian
mixture model (SGMM), which is expected to deal with these two issues better. In an
SGMM, the model parameters are derived from low-dimensional model and speaker
subspaces that can capture phonetic and speaker correlations. Given these subspaces,
only a small number of state-dependent parameters are required to derive the corresponding
GMMs. Hence, the total number of model parameters can be reduced, which
allows acoustic modelling with a limited amount of training data. In addition, the
SGMM-based acoustic model factorizes the phonetic and speaker factors and within
this framework, other source of acoustic variability may also be explored.
In this thesis, we propose a regularised model estimation for SGMMs, which avoids
overtraining in case that the training data is sparse. We will also take advantage of
the structure of SGMMs to explore cross-lingual acoustic modelling for low-resource
speech recognition. Here, the model subspace is estimated from out-domain data and
ported to the target language system. In this case, only the state-dependent parameters
need to be estimated which relaxes the requirement of the amount of training data. To
improve the robustness of SGMMs against environmental noise, we propose to apply
the joint uncertainty decoding (JUD) technique that is shown to be efficient and effective.
We will report experimental results on the Wall Street Journal (WSJ) database
and GlobalPhone corpora to evaluate the regularisation and cross-lingual modelling of
SGMMs. Noise compensation using JUD for SGMM acoustic models is evaluated on
the Aurora 4 database
Temporally Varying Weight Regression for Speech Recognition
Ph.DDOCTOR OF PHILOSOPH
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
Deep representation learning for speech recognition
Representation learning is a fundamental ingredient of deep learning. However, learning a good representation is a challenging task. For speech recognition, such a representation should contain the information needed to perform well in this task. A robust representation should also be reusable, hence it should capture the structure of the data. Interpretability is another desired characteristic. In this thesis we strive to learn an optimal deep representation for speech recognition using feed-forward Neural Networks (NNs) with different connectivity patterns.
First and foremost, we aim to improve the robustness of the acoustic models. We use attribute-aware and adaptive training strategies to model the underlying factors of variation related to the speakers and the acoustic conditions. We focus on low-latency and real-time decoding scenarios. We explore different utterance summaries (referred to as utterance embeddings), capturing various sources of speech variability, and we seek to optimise speaker adaptive training (SAT) with control networks acting on the embeddings. We also propose a multi-scale CNN layer, to learn factorised representations. The proposed multi-scale approach also tackles the computational and memory efficiency.
We also present a number of different approaches as an attempt to better understand learned representations. First, with a controlled design, we aim to assess the role of individual components of deep CNN acoustic models. Next, with saliency maps, we evaluate the importance of each input feature with respect to the classification criterion. Then, we propose to evaluate layer-wise and model-wise learned representations in different diagnostic verification tasks (speaker and acoustic condition verification). We propose a deep CNN model as the embedding extractor, merging the information learned at different layers in the network. Similarly, we perform the analyses for the embeddings used in SAT-DNNs to gain more insight. For the multi-scale models, we also show how to compare learned representations (and assess their robustness) with a metric invariant to affine transformations
Speech recognition with probabilistic transcriptions and end-to-end systems using deep learning
In this thesis, we develop deep learning models in automatic speech recognition (ASR) for two contrasting tasks characterized by the amounts of labeled data available for training. In the first half, we deal with scenarios when there are limited or no labeled data for training ASR systems. This situation is commonly prevalent in languages which are under-resourced. However, in the second half, we train ASR systems with large amounts of labeled data in English. Our objective is to improve modern end-to-end (E2E) ASR using attention modeling. Thus, the two primary contributions of this thesis are the following:
Cross-Lingual Speech Recognition in Under-Resourced Scenarios:
A well-resourced language is a language with an abundance of resources to support the development of speech technology. Those resources are usually defined in terms of 100+ hours of speech data, corresponding transcriptions, pronunciation dictionaries, and language models. In contrast, an under-resourced language lacks one or more of these resources. The most expensive and time-consuming resource is the acquisition of transcriptions due to the difficulty in finding native transcribers. The first part of the thesis proposes methods by which deep neural networks (DNNs) can be trained when there are limited or no transcribed data in the target language. Such scenarios are common for languages which are under-resourced.
Two key components of this proposition are Transfer Learning and Crowdsourcing. Through these methods, we demonstrate that it is possible to borrow statistical knowledge of acoustics from a variety of other well-resourced languages to learn the parameters of a the DNN in the target under-resourced language. In particular, we use well-resourced languages as cross-entropy regularizers to improve the generalization capacity of the target language. A key accomplishment of this study is that it is the first to train DNNs using noisy labels in the target language transcribed by non-native speakers available in online marketplaces.
End-to-End Large Vocabulary Automatic Speech Recognition:
Recent advances in ASR have been mostly due to the advent of deep learning models. Such models have the ability to discover complex non-linear relationships between attributes that are usually found in real-world tasks. Despite these advances, building a conventional ASR system is a cumbersome procedure since it involves optimizing several components separately in a disjoint fashion. To alleviate this problem, modern ASR systems have adopted a new approach of directly transducing speech signals to text. Such systems are known as E2E systems and one such system is the Connectionist Temporal Classification (CTC). However, one drawback of CTC is the hard alignment problem as it relies only on the current input to generate the current output. In reality, the output at the current time is influenced not only by the current input but also by inputs in the past and the future.
Thus, the second part of the thesis proposes advancing state-of-the-art E2E speech recognition for large corpora by directly incorporating attention modeling within the CTC framework. In attention modeling, inputs in the current, past, and future are distinctively weighted depending on the degree of influence they exert on the current output. We accomplish this by deriving new context vectors using time convolution features to model attention as part of the CTC network. To further improve attention modeling, we extract more reliable content information from a network representing an implicit language model. Finally, we used vector based attention weights that are applied on context vectors across both time and their individual components. A key accomplishment of this study is that it is the first to incorporate attention directly within the CTC network. Furthermore, we show that our proposed attention-based CTC model, even in the absence of an explicit language model, is able to achieve lower word error rates than a well-trained conventional ASR system equipped with a strong external language model
Discriminative and adaptive training for robust speech recognition and understanding
Robust automatic speech recognition (ASR) and understanding (ASU) under various conditions remains to be a challenging problem even with the advances of deep learning. To achieve robust ASU, two discriminative training objectives are proposed for keyword spotting and topic classification: (1) To accurately recognize the semantically important keywords, the non-uniform error cost minimum classification error training of deep neural network (DNN) and bi-directional long short-term memory (BLSTM) acoustic models is proposed to minimize the recognition errors of only the keywords. (2) To compensate for the mismatched objectives of speech recognition and understanding, minimum semantic error cost training of the BLSTM acoustic model is proposed to generate semantically accurate lattices for topic classification. Further, to expand the application of the ASU system to various conditions, four adaptive training approaches are proposed to improve the robustness of the ASR under different conditions: (1) To suppress the effect of inter-speaker variability on speaker-independent DNN acoustic model, speaker-invariant training is proposed to learn a deep representation in the DNN that is both senone-discriminative and speaker-invariant through adversarial multi-task training (2) To achieve condition-robust unsupervised adaptation with parallel data, adversarial teacher-student learning is proposed to suppress multiple factors of condition variability in the procedure of knowledge transfer from a well-trained source domain LSTM acoustic model to the target domain. (3) To further improve the adversarial learning for unsupervised adaptation with unparallel data, domain separation networks are used to enhance the domain-invariance of the senone-discriminative deep representation by explicitly modeling the private component that is unique to each domain. (4) To achieve robust far-field ASR, an LSTM adaptive beamforming network is proposed to estimate the real-time beamforming filter coefficients to cope with non-stationary environmental noise and dynamic nature of source and microphones positions.Ph.D