Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition
Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed to improve the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques that reduce the mismatch between noisy speech features and clean-trained acoustic models. Nevertheless, a correlation can be expected between speech quality improvement and the increase in recognition accuracy. This paper proposes a novel approach to this problem that treats SS and the speech recognizer not as two independent entities cascaded together, but as two interconnected components of a single system sharing the common goal of improved speech recognition accuracy. The approach feeds important information from the statistical models of the recognition engine back into the tuning of the SS parameters. By using this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method can achieve significant improvement of recognition rates across a wide range of signal-to-noise ratios.
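For readers unfamiliar with spectral subtraction, the basic single-band operation can be sketched as follows. This is only an illustration of the generic technique, not the multiband, likelihood-maximizing method the abstract proposes. The function name and the parameters `alpha` (over-subtraction factor) and `beta` (spectral floor) are assumptions chosen for the example, and the noise spectrum is assumed to have been estimated elsewhere, for instance from speech-free frames.

```python
def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, beta=0.01):
    """Power spectral subtraction for one frame of magnitude spectra.

    noisy_mag: magnitude spectrum of the noisy frame (one value per bin)
    noise_mag: estimated noise magnitude spectrum (same length)
    alpha:     over-subtraction factor (hypothetical default)
    beta:      spectral floor as a fraction of the noisy power (hypothetical default)
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    cleaned = []
    for y, n in zip(noisy_mag, noise_mag):
        # Subtract the scaled noise power from the noisy power.
        power = y * y - alpha * n * n
        # Floor the result so the estimated power never becomes negative.
        floor = beta * y * y
        cleaned.append(max(power, floor) ** 0.5)
    return cleaned


frame = [1.0, 2.0, 0.5]   # toy noisy magnitudes
noise = [0.6, 0.6, 0.6]   # toy noise estimate
enhanced = spectral_subtract(frame, noise)
print(enhanced)
```

In the third bin the noise estimate exceeds the observed magnitude, so the floor takes over. Parameters such as `alpha` and `beta` are the kind of SS settings that a recognizer-driven scheme could tune.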
Increasing Speech Recognition Noise Robustness with HMM2
The purpose of this paper is to investigate the behavior of HMM2 models for the recognition of noisy speech. It has previously been shown that HMM2 is able to model dynamically important structural information inherent in the speech signal, often corresponding to formant positions/tracks. As formant regions are known to be robust in adverse conditions, HMM2 seems particularly promising for improving speech recognition robustness. Here, we review different variants of the HMM2 approach with respect to their application to noise-robust automatic speech recognition. It is shown that HMM2 has the potential to tackle the problem of mismatch between training and testing conditions, and that a multi-stream combination of (already noise-robust) cepstral features and formant-like features (extracted by HMM2) improves the noise robustness of a state-of-the-art automatic speech recognition system
Multiple Timescale Feature Combination towards Robust Speech Recognition
While a lot of progress has been made in recent years in the field of automatic speech recognition (ASR), one of the main remaining problems is that of robustness. Typically, state-of-the-art ASR systems work very efficiently in well-defined environments, e.g. for clean speech or known noise conditions. However, their performance degrades drastically under different conditions. Many approaches have been developed to circumvent this problem, ranging from noise cancellation to system adaptation techniques. This paper investigates how using additional information from relatively long timescales influences noise robustness. The multiple timescale feature combination approach is introduced. Experiments show that, while maintaining recognition performance for clean speech, robustness could be improved in noisy conditions.
Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts
This paper presents a novel algorithm for building an automatic speech
recognition (ASR) model with imperfect training data. Imperfectly transcribed
speech is a prevalent issue in human-annotated speech corpora, which degrades
the performance of ASR models. To address this problem, we propose Bypass
Temporal Classification (BTC) as an expansion of the Connectionist Temporal
Classification (CTC) criterion. BTC explicitly encodes the uncertainties
associated with transcripts during training. This is accomplished by enhancing
the flexibility of the training graph, which is implemented as a weighted
finite-state transducer (WFST) composition. The proposed algorithm improves the
robustness and accuracy of ASR systems, particularly when working with
imprecisely transcribed speech corpora. Our implementation will be
open-sourced
MFCC AND CMN BASED SPEAKER RECOGNITION IN NOISY ENVIRONMENT
The performance of an automatic speaker recognition (ASR) system degrades drastically in the presence of noise and other distortions, especially when there is a noise level mismatch between the training and testing environments. This paper explores the problem of speaker recognition in noisy conditions, assuming that speech signals are corrupted by noise. A major problem of most speaker recognition systems is their unsatisfactory performance in noisy environments. In this experimental research, we have studied a combination of Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and Cepstral Mean Normalization (CMN) techniques for speech enhancement. Our system uses a Gaussian Mixture Model (GMM) classifier and is implemented in the MATLAB® 7 programming environment. The process involves the use of speaker data for both training and testing. The data used for testing is matched against a speaker model, which is trained on the training data using GMM modeling. Finally, experiments are carried out to test the new model for ASR given limited training data and differing levels and types of realistic background noise. The results have demonstrated the robustness of the new system.
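Cepstral mean normalization, mentioned in the abstract above, is commonly implemented by subtracting the per-coefficient mean taken over all frames of an utterance. The sketch below assumes the cepstral features (for example MFCCs) are already computed and stored as a list of equal-length frames. The function name is hypothetical, and this is not the authors' MATLAB implementation.

```python
def cepstral_mean_normalize(frames):
    """Subtract the utterance-level mean from every cepstral coefficient.

    frames: list of frames, each a list of cepstral coefficients
            (assumed to share the same dimension).
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    # Mean of each coefficient across all frames of the utterance.
    means = [sum(frame[i] for frame in frames) / n_frames for i in range(dim)]
    # Remove the mean, which cancels stationary convolutional (channel) effects.
    return [[frame[i] - means[i] for i in range(dim)] for frame in frames]


features = [[1.0, 4.0], [3.0, 8.0]]   # toy 2-frame, 2-coefficient input
normalized = cepstral_mean_normalize(features)
print(normalized)
```

After normalization each coefficient has zero mean over the utterance, which is what makes the features less sensitive to a fixed channel or microphone response.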
Code-Switched Urdu ASR for Noisy Telephonic Environment using Data Centric Approach with Hybrid HMM and CNN-TDNN
Call centers hold huge amounts of audio data that can yield valuable business
insights, but transcribing phone calls manually is a tedious task. An effective
automatic speech recognition (ASR) system can accurately transcribe these
calls, making it easy to search call history for specific context and content
and enabling automatic call monitoring and improved QoS through keyword search
and sentiment analysis. ASR for call centers requires more robustness because
telephonic environments are generally noisy. Moreover, many low-resourced
languages are on the verge of extinction and could be preserved with the help
of ASR technology. Urdu is one of the most widely spoken languages in the
world, with 231,295,440 speakers worldwide, yet it remains a
resource-constrained language in ASR. Regional call-center conversations take
place in the local language mixed with English numbers and technical terms,
which generally causes a "code-switching" problem. Hence, this paper describes
an implementation framework for a resource-efficient ASR (speech-to-text)
system for code-switched Urdu in a noisy call-center environment, using a chain
hybrid HMM and CNN-TDNN. The hybrid HMM-DNN approach allowed us to exploit the
advantages of neural networks with less labelled data. Adding a CNN to the TDNN
has been shown to work better in noisy environments because the CNN's
additional frequency dimension captures extra information from noisy speech,
thus improving accuracy. We collected data from various open sources and
labelled some of the unlabelled data after analysing its general context and
content, covering Urdu as well as commonly used words from other languages,
primarily English. We achieved a WER of 5.2% in noisy as well as clean
environments, on isolated words or numbers as well as on continuous spontaneous
speech.
Comment: 32 pages, 19 figures, 2 tables, preprint
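The word error rate (WER) quoted above is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch under that standard definition. The function name and the toy sentences are illustrative assumptions, and the scoring tool actually used in the paper is not stated here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate("call the help desk now", "call help desk no")
print(f"WER = {wer:.2f}")
```

In this toy case one word is deleted and one is substituted out of five reference words, so the WER is 0.40. Because insertions count as errors, WER can exceed 1.0 for very poor hypotheses.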