Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates
Audio-visual speech recognition is a promising approach to tackling the problem of reduced recognition rates under adverse acoustic conditions. However, finding an optimal mechanism for combining multi-modal information remains a challenging task. Various methods are applicable for integrating acoustic and visual information in Gaussian-mixture-model-based speech recognition, e.g., via dynamic stream weighting. The recent advances of deep neural network (DNN)-based speech recognition promise improved performance when using audio-visual information. However, the question of how to optimally integrate acoustic and visual information remains. In this paper, we propose a state-based integration scheme that uses dynamic stream weights in DNN-based audio-visual speech recognition. The dynamic weights are obtained from a time-variant reliability estimate that is derived from the audio signal. We show that this state-based integration is superior to early integration of multi-modal features, even if early integration also includes the proposed reliability estimate. Furthermore, the proposed adaptive mechanism is able to outperform a fixed weighting approach that exploits oracle knowledge of the true signal-to-noise ratio.
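The abstract above does not give the fusion rule itself, so the following is only a minimal sketch of one common form of dynamic stream weighting: a per-frame log-linear combination of the state log-posteriors from an audio model and a video model. The logistic mapping from reliability to weight, and the names `stream_weight`, `fuse_frame`, `slope`, and `offset`, are assumptions made for illustration and are not taken from the paper.

```python
import math


def stream_weight(reliability, slope=1.0, offset=0.0):
    """Map a scalar reliability estimate to an audio weight in (0, 1).

    The logistic mapping and its parameters are assumed for illustration.
    """
    return 1.0 / (1.0 + math.exp(-(slope * reliability + offset)))


def fuse_frame(log_post_audio, log_post_video, weight):
    """Combine per-state log-posteriors of one frame log-linearly.

    weight is the audio stream weight; the video stream gets 1 - weight.
    The result is renormalised so the fused posteriors sum to one.
    """
    scores = [
        weight * a + (1.0 - weight) * v
        for a, v in zip(log_post_audio, log_post_video)
    ]
    # Log-sum-exp for a numerically stable normalisation constant.
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


# Hypothetical three-state posteriors for a single frame.
audio = [math.log(p) for p in (0.7, 0.2, 0.1)]
video = [math.log(p) for p in (0.2, 0.5, 0.3)]
fused = fuse_frame(audio, video, stream_weight(reliability=2.0))
```

Because the weight is recomputed for every frame, a drop in the audio reliability estimate shifts the decision towards the video stream for that frame only.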
Leveraging native language information for improved accented speech recognition
Recognition of accented speech is a long-standing challenge for automatic
speech recognition (ASR) systems, given the increasing worldwide population of
bilingual speakers with English as their second language. If we consider
foreign-accented speech as an interpolation of the native language (L1) and
English (L2), using a model that can simultaneously address both languages
would perform better at the acoustic level for accented speech. In this study,
we explore how an end-to-end recurrent neural network (RNN) trained system with
English and native languages (Spanish and Indian languages) could leverage data
of native languages to improve performance for accented English speech. To this
end, we examine pre-training with native languages, as well as multi-task
learning (MTL) in which the main task is trained with native English and the
secondary task is trained with Spanish or Indian Languages. We show that the
proposed MTL model performs better than the pre-training approach and
outperforms a baseline model trained simply with English data. We suggest a new
setting for MTL in which the secondary task is trained with both English and
the native language, using the same output set. This proposed scenario yields
better performance with +11.95% and +17.55% character error rate gains over
baseline for Hispanic and Indian accents, respectively.
Comment: Accepted at Interspeech 201
Speaker- and Age-Invariant Training for Child Acoustic Modeling Using Adversarial Multi-Task Learning
One of the major challenges in acoustic modelling of child speech is the
rapid change in children's articulators as they grow, together with their
differing growth rates and the resulting high variability within the same age
group. These high acoustic variations along with the scarcity of child speech
corpora have impeded the development of a reliable speech recognition system
for children. In this paper, a speaker- and age-invariant training approach
based on adversarial multi-task learning is proposed. The system consists of
one generator shared network that learns to generate speaker- and age-invariant
features connected to three discrimination networks, for phoneme, age, and
speaker. The generator network is trained to minimize the
phoneme-discrimination loss and maximize the speaker- and age-discrimination
losses in an adversarial multi-task learning fashion. The generator network is
a Time Delay Neural Network (TDNN) architecture while the three discriminators
are feed-forward networks. The system was applied to the OGI speech corpora and
achieved a 13% reduction in the WER of the ASR.
Comment: Submitted to ICASSP202
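The adversarial objective described in the abstract above (minimize the phoneme loss, maximize the speaker and age losses) can be sketched as a single scalar that the generator minimizes. This is a sketch under stated assumptions, not the authors' implementation: the trade-off factor `adv_weight`, the function names, and the example probabilities are all hypothetical.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability that a discriminator assigns to the true class."""
    return -math.log(probs[target_index])


def generator_objective(phoneme_loss, speaker_loss, age_loss, adv_weight=1.0):
    """Scalar the generator minimizes.

    Subtracting the speaker and age losses means that lowering this value
    raises those losses, which pushes the shared features towards speaker
    and age invariance. adv_weight is an assumed trade-off factor.
    """
    return phoneme_loss - adv_weight * (speaker_loss + age_loss)


# Hypothetical discriminator outputs for one frame.
phoneme_loss = cross_entropy([0.1, 0.8, 0.1], target_index=1)
speaker_loss = cross_entropy([0.5, 0.5], target_index=0)
age_loss = cross_entropy([0.25, 0.25, 0.25, 0.25], target_index=2)
objective = generator_objective(phoneme_loss, speaker_loss, age_loss, adv_weight=0.5)
```

Each discriminator would still be trained to minimize its own loss; only the generator sees the sign-flipped speaker and age terms. In a neural-network framework this is often realized with a gradient reversal layer, which is a design choice assumed here and not stated in the abstract.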