Empirical Evaluation of Speaker Adaptation on DNN based Acoustic Model
Speaker adaptation aims to estimate a speaker-specific acoustic model from a
speaker-independent one, minimizing the mismatch between training and
testing conditions that arises from speaker variability. A variety of neural
network adaptation methods have been proposed since deep learning models
became mainstream. However, an experimental comparison of these methods is
still lacking, especially now that DNN-based acoustic models have advanced
greatly. In this paper, we aim to close this gap by providing an
empirical evaluation of three typical speaker adaptation methods: LIN, LHUC and
KLD. Adaptation experiments, with different amounts of adaptation data, are
conducted on a strong TDNN-LSTM acoustic model. More challengingly, the
source and target models considered here are a standard Mandarin speaker model
and an accented Mandarin speaker model. We compare the performance of the
different methods and their combinations. Speaker adaptation performance is
also examined by the speaker's degree of accent.
Comment: Interspeech 201
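The abstract names LIN, LHUC and KLD without defining them. As a rough orientation only, the sketch below shows the core operation each method is commonly described as using: LIN applies an affine transform to the input features, LHUC rescales each hidden unit by a speaker-dependent amplitude, and KLD interpolates the hard training label with the speaker-independent model's posterior. The function names, the `2 * sigmoid(r)` amplitude, and the interpolation weight `rho` are assumptions taken from the usual textbook formulations, not from this paper's configuration.

```python
import math


def lin_transform(features, weight, bias):
    """LIN sketch: affine transform of the input features.

    With an identity weight matrix and zero bias the features pass through
    unchanged, which is the usual starting point before adaptation.
    """
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]


def lhuc_scale(hidden, r):
    """LHUC sketch: scale each hidden activation by 2 * sigmoid(r_i).

    r holds one speaker-dependent parameter per hidden unit (assumed form);
    r_i = 0 gives an amplitude of exactly 1, i.e. the unadapted network.
    """
    return [h * 2.0 / (1.0 + math.exp(-ri)) for h, ri in zip(hidden, r)]


def kld_target(one_hot, si_posterior, rho):
    """KLD sketch: interpolate the hard label with the SI model posterior.

    rho = 0 reduces to plain cross-entropy training on the adaptation data;
    rho = 1 keeps the speaker-independent model's output distribution.
    """
    return [(1.0 - rho) * y + rho * p for y, p in zip(one_hot, si_posterior)]
```

In all three cases only a small set of parameters (the LIN transform, the LHUC vector `r`) or a regularised target is learned from the adaptation data, which is why such methods are typically compared across different amounts of that data.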
Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model
Multilingual models for Automatic Speech Recognition (ASR) are attractive as
they have been shown to benefit from more training data, and better lend
themselves to adaptation to under-resourced languages. However, initialisation
from monolingual context-dependent models leads to an explosion of
context-dependent states. Connectionist Temporal Classification (CTC) is a
potential solution to this as it performs well with monophone labels.
We investigate multilingual CTC in the context of adaptation and
regularisation techniques that have been shown to be beneficial in more
conventional contexts. The multilingual model is trained to model a universal
International Phonetic Alphabet (IPA)-based phone set using the CTC loss
function. Learning Hidden Unit Contribution (LHUC) is investigated to perform
language adaptive training. In addition, dropout during cross-lingual
adaptation is also studied and tested in order to mitigate the overfitting
problem.
Experiments show that the performance of the universal phoneme-based CTC
system can be improved by applying LHUC and it is extensible to new phonemes
during cross-lingual adaptation. Updating all the parameters shows consistent
improvement on limited data. Applying dropout during adaptation can further
improve the system and achieve performance competitive with Deep Neural Network
/ Hidden Markov Model (DNN/HMM) systems on limited data.
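For readers unfamiliar with CTC, the following minimal sketch illustrates how a frame-level output path maps to a phone sequence: repeated labels are merged first, then blank symbols are removed. The function name, the `"_"` blank symbol, and the example labels are illustrative assumptions and are not taken from the paper.

```python
from itertools import groupby


def ctc_collapse(path, blank="_"):
    """Collapse a frame-level CTC path into a label sequence.

    Step 1 merges consecutive repeats; step 2 drops the blank symbol.
    A blank between two identical labels keeps them as separate outputs.
    """
    merged = [label for label, _ in groupby(path)]
    return [label for label in merged if label != blank]
```

Because each label needs only one output unit plus the shared blank, a monophone label set can be extended with new phones without the growth in context-dependent states that the abstract describes.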
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.