61 research outputs found
Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition
Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. The major challenge today is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed to improve the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques that reduce the mismatch between noisy speech features and acoustic models trained on clean speech. Nevertheless, a correlation can be expected between speech quality improvement and the increase in recognition accuracy. This paper proposes a novel approach to this problem that treats SS and the speech recognizer not as two independent entities cascaded together, but as two interconnected components of a single system that share the common goal of improved speech recognition accuracy. The approach feeds information from the statistical models of the recognition engine back to tune the SS parameters. With this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method achieves a significant improvement in recognition rates across a wide range of signal-to-noise ratios.
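The idea of tuning spectral subtraction from the recognizer's side can be sketched as follows. This is a minimal illustration only, not the paper's algorithm: it assumes magnitude-domain subtraction with a spectral floor, equal-width frequency bands, and an exhaustive grid search over per-band over-subtraction factors. The names `multiband_spectral_subtraction`, `tune_alphas`, `alphas`, `floor` and the `log_likelihood` callback (a stand-in for the recognizer's acoustic score) are all hypothetical.

```python
from itertools import product


def multiband_spectral_subtraction(noisy_mag, noise_mag, alphas, floor=0.01):
    """Subtract a scaled noise estimate from each frame, band by band.

    noisy_mag: list of frames, each a list of magnitude-spectrum bins
    noise_mag: list of per-bin noise magnitude estimates
    alphas:    one over-subtraction factor per band (assumed parameterisation)
    floor:     spectral floor, as a fraction of the noisy magnitude
    """
    n_bins = len(noise_mag)
    n_bands = len(alphas)
    # Band edges split the bins into n_bands contiguous, near-equal bands.
    edges = [b * n_bins // n_bands for b in range(n_bands + 1)]
    cleaned = []
    for frame in noisy_mag:
        out = []
        for band, alpha in enumerate(alphas):
            for k in range(edges[band], edges[band + 1]):
                residual = frame[k] - alpha * noise_mag[k]
                # The floor keeps magnitudes positive after over-subtraction.
                out.append(max(residual, floor * frame[k]))
        cleaned.append(out)
    return cleaned


def tune_alphas(noisy_mag, noise_mag, candidates, n_bands, log_likelihood):
    """Choose the per-band factors whose output scores highest.

    log_likelihood is a hypothetical callback standing in for the
    recognizer's acoustic-model score of the enhanced spectrogram.
    Exhaustive search is used here purely for clarity.
    """
    def score(alphas):
        enhanced = multiband_spectral_subtraction(noisy_mag, noise_mag, alphas)
        return log_likelihood(enhanced)

    return list(max(product(candidates, repeat=n_bands), key=score))
```

The search cost grows exponentially with the number of bands, so a practical system would need a more efficient optimizer than this grid search.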
Hidden Markov model-based speech enhancement
This work proposes a method of model-based speech enhancement that uses a network of
HMMs first to decode noisy speech and then to synthesise a set of features that enables
a speech production model to reconstruct clean speech. The motivation is to remove the
distortion and the residual and musical noise associated with conventional filtering-based
methods of speech enhancement.
STRAIGHT forms the speech production model for speech reconstruction and requires
a time-frequency spectral surface, aperiodicity and a fundamental frequency contour.
HMM-based synthesis is used to create the estimates of the time-frequency
surface and aperiodicity after the model and state sequences are obtained from
HMM decoding of the input noisy speech. Fundamental frequency was found to be best
estimated using the PEFAC method rather than by synthesis from the HMMs.
For robust HMM decoding in noisy conditions, the HMMs must
model noisy speech. Noise adaptation is therefore investigated to achieve this,
and its effect on the reconstructed speech is measured. Even with such noise
adaptation to match the HMMs to the noisy conditions, decoding errors arise, both
as incorrect decoding and as time alignment errors. Confidence measures are
developed to identify such errors, and compensation methods are then developed to conceal
these errors in the enhanced speech signal.
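The decoding step described above, which recovers a state sequence from noisy observations, is commonly done with the Viterbi algorithm. The sketch below is a generic log-domain Viterbi decoder for a single small HMM. It is an illustration under that assumption, not the network of HMMs, noise adaptation or confidence measures used in this work, and the function and argument names are hypothetical.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state sequence for one observation sequence.

    log_init[s]     : log prior of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[t][s]  : log likelihood of the frame at time t under state s
    """
    n_states = len(log_init)
    # Best log score of any path ending in each state at time 0.
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        prev = score
        score, ptr = [], []
        for s in range(n_states):
            # Best predecessor state for arriving in state s at time t.
            best_r = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best_r)
            score.append(prev[best_r] + log_trans[best_r][s] + log_emit[t][s])
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

A per-frame comparison of the best path score against competing paths is one simple basis for a confidence measure, although the measures developed in this work are not specified here.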
Speech quality and intelligibility are first analysed in terms of PESQ and NCM,
showing the superiority of the proposed method over conventional methods at low
SNRs. A three-way subjective MOS listening test then shows that the
proposed method overwhelmingly outperforms the conventional methods across all noise conditions,
and a subjective word recognition test shows that the proposed
method has an intelligibility advantage over the conventional methods at low SNRs.
- …