97 research outputs found
Robust excitation-based features for Automatic Speech Recognition
In this paper we investigate the use of noise-robust features characterizing the speech excitation signal as complementary to the commonly used vocal-tract-based features for automatic speech recognition (ASR). The features are tested in a state-of-the-art Deep Neural Network (DNN) based hybrid acoustic model for speech recognition. The suggested excitation features expand the set of excitation features previously considered for ASR, with the expectation that they help better discriminate the broad phonetic classes (e.g., fricatives, nasals, vowels). Relative improvements in the word error rate are observed in the AMI meeting transcription system, with greater gains (about 5%) when PLP features are combined with the suggested excitation features. For Aurora 4, significant improvements are observed as well. Combining the suggested excitation features with filter banks, a word error rate of 9.96% is achieved. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ICASSP.2015.717885
RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain
Despite recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech data and the associated background noise in SAR scenarios make it difficult to deploy robust speech recognition systems. To address this issue, we have created and made publicly available a German speech dataset called RescueSpeech. This dataset includes real speech recordings from simulated rescue exercises. Additionally, we have released competitive training recipes and pre-trained models. Our study highlights that the performance attained by state-of-the-art methods in this challenging scenario is still far from reaching an acceptable level.
An Investigation into Speaker Informed DNN Front-end for LVCSR
Deep Neural Networks (DNNs) have become a standard method in many ASR tasks. Recently there has been considerable interest in "informed training" of DNNs, where the DNN input is augmented with auxiliary codes such as i-vectors, speaker codes, and speaker separation bottleneck (SSBN) features. This paper compares different speaker-informed DNN training methods in an LVCSR task. We discuss the mathematical equivalence between speaker-informed DNN training and "bias adaptation", which uses speaker-dependent biases, and give a detailed analysis of influential factors such as the dimension, discrimination, and stability of auxiliary codes. The analysis is supported by experiments on a meeting recognition task using a bottleneck-feature-based system. Results show that i-vector based adaptation is also effective in bottleneck-feature-based systems (not just hybrid systems). However, all tested methods show poor generalisation to unseen speakers. We introduce a system based on speaker classification followed by speaker adaptation of biases, which yields performance equivalent to an i-vector based system, with a 10.4% relative improvement over the baseline on seen speakers. The new approach can serve as a fast alternative, especially for short utterances.
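The equivalence between input augmentation and bias adaptation noted in this abstract follows from linearity of the first layer: concatenating an auxiliary code onto the input is the same as adding a speaker-dependent bias formed from that code. A minimal NumPy sketch (all dimensions and variable names here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_c, d_h = 5, 3, 4          # acoustic dim, auxiliary-code dim, hidden dim
x = rng.normal(size=d_x)         # acoustic feature vector
c = rng.normal(size=d_c)         # auxiliary speaker code (e.g., an i-vector)

W = rng.normal(size=(d_h, d_x + d_c))  # first-layer weights over [x; c]
b = rng.normal(size=d_h)               # shared first-layer bias

# Informed-training view: feed the concatenated input through the layer.
h_informed = W @ np.concatenate([x, c]) + b

# Bias-adaptation view: split W and fold W_c @ c into a speaker-dependent bias.
W_x, W_c = W[:, :d_x], W[:, d_x:]
speaker_bias = W_c @ c
h_bias = W_x @ x + (b + speaker_bias)

print(np.allclose(h_informed, h_bias))  # True
```

The two pre-activations agree, which is why adapting only per-speaker biases can stand in for feeding auxiliary codes at the input.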
- …