Empirical Evaluation of Speaker Adaptation on DNN based Acoustic Model
Speaker adaptation aims to estimate a speaker-specific acoustic model from a
speaker-independent one to minimize the mismatch between the training and
testing conditions arising from speaker variability. A variety of neural
network adaptation methods have been proposed since deep learning models
became mainstream. However, an experimental comparison between different
methods is still lacking, especially now that DNN-based acoustic models have
advanced greatly. In this paper, we aim to close this gap by providing an
empirical evaluation of three typical speaker adaptation methods: LIN, LHUC and
KLD. Adaptation experiments, with different amounts of adaptation data, are
conducted on a strong TDNN-LSTM acoustic model. As a more challenging setting,
the source and target models we consider are a standard Mandarin speaker model
and an accented Mandarin speaker model. We compare the performance of the
different methods and their combinations. Speaker adaptation performance is also
examined by the speaker's degree of accent. Comment: Interspeech 201
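As a rough illustration of one of the three methods named above, KLD adaptation is commonly described as training the adapted model against targets that interpolate the hard labels with the speaker-independent (SI) model's posteriors. The sketch below assumes that formulation; the function names, the weight `rho`, and the toy posteriors are hypothetical and are not taken from the paper.

```python
import numpy as np

def kld_targets(one_hot, si_posteriors, rho):
    """Interpolate hard labels with SI posteriors (rho = regularisation weight)."""
    return (1.0 - rho) * one_hot + rho * si_posteriors

def kld_regularised_loss(adapted_posteriors, one_hot, si_posteriors, rho, eps=1e-12):
    """Cross-entropy of the adapted model against the interpolated targets."""
    targets = kld_targets(one_hot, si_posteriors, rho)
    return float(-np.mean(np.sum(targets * np.log(adapted_posteriors + eps), axis=1)))

# Toy example: two frames, three output classes (made-up numbers).
one_hot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
si_post = np.array([[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]])
targets = kld_targets(one_hot, si_post, rho=0.5)
```

With `rho = 0` this reduces to plain cross-entropy fine-tuning, and with `rho = 1` the adapted model is pulled entirely towards the SI model's outputs.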
Learning representations for speech recognition using artificial neural networks
Learning representations is a central challenge in machine learning. For speech
recognition, we are interested in learning robust representations that are stable
across different acoustic environments, recording equipment and irrelevant inter-
and intra-speaker variabilities. This thesis is concerned with representation
learning for acoustic model adaptation to speakers and environments, construction
of acoustic models in low-resource settings, and learning representations from
multiple acoustic channels. The investigations are primarily focused on the hybrid
approach to acoustic modelling based on hidden Markov models and artificial
neural networks (ANN).
The first contribution concerns acoustic model adaptation. This comprises
two new adaptation transforms operating in ANN parameter space. Both operate
at the level of activation functions and treat a trained ANN acoustic model as
a canonical set of fixed-basis functions, from which one can later derive variants
tailored to the specific distribution present in adaptation data. The first technique,
termed Learning Hidden Unit Contributions (LHUC), depends on learning
distribution-dependent linear combination coefficients for hidden units. This
technique is then extended to altering groups of hidden units with parametric and
differentiable pooling operators. We found that the proposed adaptation techniques
possess many desirable properties: they are relatively low-dimensional, do not overfit
and can work in both a supervised and an unsupervised manner. For LHUC we
also present extensions to speaker adaptive training and environment factorisation.
On average, depending on the characteristics of the test set, 5-25% relative
word error rate (WERR) reductions are obtained in an unsupervised two-pass
adaptation setting.
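The LHUC idea described above can be sketched as a per-unit rescaling of hidden activations while the trained weights stay fixed. The sketch assumes the commonly used re-parametrisation in which each amplitude is 2 * sigmoid(r), so that r = 0 leaves the speaker-independent model unchanged; the layer shapes, names, and random data are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lhuc_scale(r):
    """Per-unit amplitude in (0, 2); r = 0 gives a scale of exactly 1."""
    return 2.0 * sigmoid(r)

def lhuc_layer(x, W, b, r):
    """Hidden layer whose outputs are rescaled by speaker-dependent amplitudes.

    W and b belong to the fixed speaker-independent model; only r
    (one value per hidden unit) would be learned from adaptation data.
    """
    h = sigmoid(x @ W + b)
    return lhuc_scale(r) * h

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5))   # 4 frames, 5 input features (toy sizes)
W = rng.normal(size=(5, 3))   # 3 hidden units
b = np.zeros(3)
r = np.zeros(3)               # unadapted: identical to the SI layer
h_si = lhuc_layer(x, W, b, r)
```

Because only one parameter per hidden unit is adapted, the transform stays low-dimensional, which is consistent with the properties listed in the abstract.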
The second contribution concerns building acoustic models in low-resource
data scenarios. In particular, we are concerned with insufficient amounts of
transcribed acoustic material for estimating acoustic models in the target language,
while assuming that resources such as lexicons or texts for estimating language models
are available. First, we propose an ANN with a structured output layer
which models both context-dependent and context-independent speech units,
with the context-independent predictions used at runtime to aid the prediction
of context-dependent states. We also propose to perform multi-task adaptation
with a structured output layer. We obtain consistent WERR reductions up to
6.4% in low-resource speaker-independent acoustic modelling. Adapting those
models in a multi-task manner with LHUC decreases WERRs by an additional
13.6%, compared to 12.7% for non multi-task LHUC. We then demonstrate that
one can build better acoustic models with unsupervised multi- and cross-lingual
initialisation and find that pre-training is largely language-independent. Up to
14.4% WERR reductions are observed, depending on the amount of the available
transcribed acoustic data in the target language.
The third contribution concerns building acoustic models from multi-channel
acoustic data. For this purpose we investigate various ways of integrating and
learning multi-channel representations. In particular, we investigate channel concatenation
and the applicability of convolutional layers for this purpose. We
propose a multi-channel convolutional layer with cross-channel pooling, which
can be seen as a data-driven non-parametric auditory attention mechanism. We
find that for unconstrained microphone arrays, our approach is able to match the
performance of comparable models trained on beamform-enhanced signals.
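A minimal sketch of cross-channel pooling follows, under several simplifying assumptions that are not stated in the abstract: the same filters are shared across microphone channels, the convolution is one-dimensional, and the pooling operator is an element-wise maximum over channels. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def cross_channel_max_pool(signals, filters):
    """Apply shared filters to every channel, then keep the max over channels.

    signals: array of shape (channels, length)
    filters: array of shape (n_filters, width)
    returns: array of shape (n_filters, length - width + 1)
    """
    n_ch, length = signals.shape
    n_f, width = filters.shape
    out = np.empty((n_ch, n_f, length - width + 1))
    for c in range(n_ch):
        for f in range(n_f):
            # 'valid' cross-correlation of channel c with filter f
            out[c, f] = np.correlate(signals[c], filters[f], mode="valid")
    # For each filter and position, pick the strongest-responding channel.
    return out.max(axis=0)

signals = np.array([[1.0, 2.0, 3.0, 4.0],
                    [4.0, 3.0, 2.0, 1.0]])
filters = np.array([[1.0, 1.0]])
pooled = cross_channel_max_pool(signals, filters)
```

Selecting the strongest channel per filter and position is one way to read the description of the layer as a data-driven attention mechanism over microphones.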
A combined evaluation of established and new approaches for speech recognition in varied reverberation conditions
Robustness to reverberation is a key concern for distant-microphone ASR. Various approaches have been proposed, including single-channel or multichannel dereverberation, robust feature extraction, alternative acoustic models, and acoustic model adaptation. However, to the best of our knowledge, a detailed study of these techniques in varied reverberation conditions is still missing in the literature. In this paper, we conduct a series of experiments to assess the impact of various dereverberation and acoustic model adaptation approaches on ASR performance in the range of reverberation conditions found in real domestic environments. We consider both established approaches such as weighted prediction error (WPE) dereverberation and newer approaches such as learning hidden unit contribution (LHUC) adaptation, whose performance has not been reported before in this context, and we employ them in combination. Our results indicate that performing WPE dereverberation on a reverberated test speech utterance and decoding using a deep neural network (DNN) acoustic model trained on multi-condition reverberated speech with feature-space maximum likelihood linear regression (fMLLR) transformed features outperforms more recent approaches and helps significantly reduce the word error rate (WER).
Robust learning of acoustic representations from diverse speech data
Automatic speech recognition is increasingly applied to new domains. A key challenge is
to robustly learn, update and maintain representations to cope with transient acoustic
conditions. A typical example is broadcast media, for which speakers and environments
may change rapidly, and available supervision may be poor. The concern of this
thesis is to build and investigate methods for acoustic modelling that are robust to the
characteristics and transient conditions as embodied by such media.
The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio
with approximate labels, but training methods can be sensitive to label errors, and their
use is therefore not trivial. State-of-the-art semi-supervised training makes effective
use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid
overfitting to poor supervision, but does not make use of the transcriptions. Existing
approaches that do aim to make use of the transcriptions typically employ an algorithm
to filter or combine the transcriptions with the recognition output from a seed model,
but the final result does not encode uncertainty. We propose a method to combine the
lattice output from a biased recognition pass with the transcripts, crucially preserving
uncertainty in the lattice where appropriate. This substantially reduces the word error
rate on a broadcast task.
The second contribution is a method to factorise representations for speakers and
environments so that they may be used in novel combinations. In realistic scenarios,
the speaker or environment transform at test time might be unknown, or there may be
insufficient data to learn a joint transform. We show that in such cases, factorised, or
independent, representations are required to avoid deteriorating performance. Using
i-vectors, we factorise speaker or environment information using multi-condition training
with neural networks. Specifically, we extract bottleneck features from networks trained
to classify either speakers or environments. The resulting factorised representations
prove beneficial when one factor is missing at test time, or when all factors are seen,
but not in the desired combination.
The third contribution is an investigation of model adaptation in a longitudinal
setting. In this scenario, we repeatedly adapt a model to new data, with the constraint
that previous data becomes unavailable. We first demonstrate the effect of such a
constraint, and show that using a cyclical learning rate may help. We then observe
that these successive models lend themselves well to ensembling. Finally, we show
that the impact of this constraint in an active learning setting may be detrimental to
performance, and suggest combining active learning with semi-supervised training to
avoid biasing the model.
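The abstract does not say which cyclical learning rate schedule is used. As one common possibility, a triangular schedule can be sketched as follows; the function name and all parameter values are assumptions for illustration.

```python
def triangular_lr(step, base_lr, max_lr, half_cycle):
    """Triangular cyclical learning rate.

    The rate rises linearly from base_lr to max_lr over half_cycle steps,
    falls back to base_lr over the next half_cycle steps, then repeats.
    """
    cycle_pos = step % (2 * half_cycle)
    if cycle_pos <= half_cycle:
        frac = cycle_pos / half_cycle
    else:
        frac = 2.0 - cycle_pos / half_cycle
    return base_lr + (max_lr - base_lr) * frac

# Learning rates over one full cycle of 8 steps (hypothetical values).
schedule = [triangular_lr(s, base_lr=0.001, max_lr=0.01, half_cycle=4) for s in range(9)]
```

In a longitudinal setting, one option would be to restart the cycle each time a new batch of adaptation data arrives, so that every adaptation round begins and ends at the low rate.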
The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature
extractor, known as SincNet. In contrast to traditional techniques that warp the
filterbank frequencies in standard feature extraction, adapting SincNet parameters is
more flexible and more readily optimised, whilst maintaining interpretability. On a task
adapting from adult to child speech, we show that this layer is well suited for adaptation
and is very effective relative to the small number of adapted parameters.
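To show why adapting such a layer involves few parameters, the sketch below builds a single SincNet-style band-pass filter from just two values, a low and a high cut-off frequency, as the difference of two windowed sinc low-pass filters. The frequencies are given in cycles per sample, and the filter width and cut-off values are illustrative assumptions.

```python
import numpy as np

def sinc_bandpass(f_low, f_high, width=101):
    """Band-pass FIR filter defined only by its two cut-off frequencies.

    f_low, f_high: cut-offs in cycles per sample, with 0 < f_low < f_high < 0.5.
    """
    n = np.arange(width) - (width - 1) / 2          # sample index centred on 0
    # np.sinc(x) is sin(pi x) / (pi x), so each term is an ideal low-pass filter.
    high = 2.0 * f_high * np.sinc(2.0 * f_high * n)
    low = 2.0 * f_low * np.sinc(2.0 * f_low * n)
    return (high - low) * np.hamming(width)         # window to reduce ripple

h = sinc_bandpass(0.1, 0.2)
# "Adapting" the filter here means moving only the two cut-offs.
h_shifted = sinc_bandpass(0.12, 0.25)
response = np.abs(np.fft.rfft(h, 1024))             # magnitude response
```

Each filter stays interpretable as a frequency band whatever values the two cut-offs take, which matches the interpretability argument made in the abstract.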
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature. Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27
figure
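For reference, a relative error rate reduction of the kind used in such a meta-analysis can be computed as below. This is a generic sketch; the function name and the example numbers are hypothetical and do not come from any of the works listed here.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent: 100 * (baseline - adapted) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer

# Hypothetical example: a baseline WER of 20.0% reduced to 18.0% after adaptation.
reduction = relative_wer_reduction(20.0, 18.0)
```

Reporting reductions relative to the baseline makes results from systems with very different absolute error rates comparable.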