Environmentally robust ASR front-end for deep neural network acoustic models
This paper examines the individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems in distant talking situations, where acoustic environmental distortion degrades the recognition performance. Training of a DNN-based acoustic model consists of generation of state alignments followed by learning the network parameters. This paper first shows that the network parameters are more sensitive to the speech quality than the alignments and thus this stage requires improvement. Then, various front-end robustness approaches to addressing this problem are categorised based on functionality. The degree to which each class of approaches impacts the performance of DNN-based acoustic models is examined experimentally. Based on the results, a front-end processing pipeline is proposed for efficiently combining different classes of approaches. Using this front-end, the combined effects of different classes of approaches are further evaluated in a single distant microphone-based meeting transcription task with both speaker independent (SI) and speaker adaptive training (SAT) set-ups. By combining multiple speech enhancement results, multiple types of features, and feature transformation, the front-end shows relative performance gains of 7.24% and 9.83% in the SI and SAT scenarios, respectively, over competitive DNN-based systems using log mel-filter bank features. This is the final version of the article. It first appeared from Elsevier via http://dx.doi.org/10.1016/j.csl.2014.11.00
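As a reference point for the log mel-filter bank baseline features mentioned in this abstract, a minimal NumPy sketch might look as follows. This is an illustrative implementation only (the function name and all parameter defaults are assumptions, not the authors' code), and it omits toolkit details such as pre-emphasis and dithering:

```python
import numpy as np

def log_mel_filterbank(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Compute log mel-filter bank features, a common DNN front-end input.

    Minimal sketch; production toolkits add pre-emphasis, dithering and
    other framing details omitted here.
    """
    # Frame the signal with a Hann window and take the power spectrum.
    win = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i*hop:i*hop+n_fft] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Build triangular mel filters between 0 Hz and the Nyquist frequency.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return np.log(power @ fbank.T + 1e-10)   # shape: (n_frames, n_mels)

feats = log_mel_filterbank(np.random.randn(16000))   # 1 s of 16 kHz audio
print(feats.shape)   # (97, 40)
```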
Two-Staged Acoustic Modeling Adaption for Robust Speech Recognition by the Example of German Oral History Interviews
In automatic speech recognition, little training data is often available for specific challenging tasks, yet training of state-of-the-art automatic speech recognition systems requires large amounts of annotated speech. To address this issue, we propose a two-staged approach to acoustic modeling that combines noise and reverberation data augmentation with transfer learning to robustly address challenges such as difficult acoustic recording conditions, spontaneous speech, and speech of elderly people. We evaluate our approach using the example of German oral history interviews, where a relative average reduction of the word error rate by 19.3% is achieved. Comment: Accepted for IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, July 201
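The noise-and-reverberation data augmentation this abstract describes can be sketched as below. This is a hedged illustration under assumed conventions, not the authors' pipeline: the `augment` helper, the decaying-exponential room impulse response (RIR), and the white-noise source are all synthetic stand-ins.

```python
import numpy as np

def augment(clean, noise, rir, snr_db):
    """Simulate a degraded recording: reverberate the clean speech with a
    room impulse response (RIR), then add noise at a target SNR.
    Illustrative sketch; real augmentation would draw measured RIRs and
    recorded noise from a corpus.
    """
    # Reverberation: convolve the clean speech with the RIR.
    reverb = np.convolve(clean, rir)[:len(clean)]

    # Scale the noise so the mixture has the requested SNR in dB.
    speech_pow = np.mean(reverb ** 2)
    noise_pow = np.mean(noise[:len(reverb)] ** 2)
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return reverb + gain * noise[:len(reverb)]

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                               # stand-in speech
rir = np.exp(-np.arange(800) / 100.0) * rng.standard_normal(800) # toy decaying RIR
noisy = augment(clean, rng.standard_normal(16000), rir, snr_db=10)
print(noisy.shape)   # (16000,)
```

Transfer learning then forms the second stage: a model pre-trained on the augmented large corpus is fine-tuned on the small in-domain data.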
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.
Auditory processing-based features for improving speech recognition in adverse acoustic conditions
n/a
Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition
We investigate the use of generative adversarial networks (GANs) for speech dereverberation in robust speech recognition. GANs have recently been studied for speech enhancement to remove additive noise, but their ability to perform speech dereverberation has not yet been examined, and the advantages of using GANs have not been fully established. In this paper, we provide a deep investigation of GAN-based dereverberation front-ends for ASR. First, we study the effectiveness of different dereverberation networks (the generator in the GAN) and find that an LSTM yields a significant improvement over feed-forward DNNs and CNNs on our dataset. Second, adding residual connections to the deep LSTMs boosts performance further. Finally, we find that, for the GAN to succeed, it is important to update the generator and the discriminator using the same mini-batch of data during training. Moreover, using the reverberant spectrogram as a condition for the discriminator, as suggested in previous studies, may degrade performance. In summary, our GAN-based dereverberation front-end achieves a 14%-19% relative CER reduction over the baseline DNN dereverberation network when tested with a strong multi-condition trained acoustic model. Comment: Interspeech 201