Combining pulse-based features for rejecting far-field speech in a HMM-based Voice Activity Detector

Abstract

1.-Introduction

The advantages of Automatic Speech Recognition are obvious for several types of applications. Speech recognition becomes difficult when the main speaker is in a noisy environment, for example a bar, where many far-field speakers are talking almost all the time. This reduces the speech recognizer's success rate and can lead to an unsatisfactory experience for the user. If there are too many recognition mistakes, the user is forced to correct the system, which takes too long and is a nuisance, and the user will eventually reject the system.

To address this problem, a robust Voice Activity Detector (VAD) is proposed in this work. The VAD selects speech frames and discards noise frames. This frame information is sent to the speech recognizer so that only speech frames are processed; in this way the VAD tries to avoid recognizer mistakes caused by noisy frames. If the VAD works well, the speech recognizer does too.

In summary, in mobile phone scenarios it is very common for the target speaker to be in an open environment surrounded by far-field interfering speech from other speakers. In this ambiguous case, VAD systems can detect far-field speech as coming from the user, increasing the speech recognition error rate. Detection errors caused by background voices mainly increase word insertions and substitutions, leading to significant dialogue misunderstandings. This work addresses speech-based applications in which far-field speech can be wrongly taken for main speaker speech. In [1], a spectrum sensing scheme is proposed to detect the presence of the primary user in cognitive radio systems (very similar to the VAD proposed in this paper), being able to distinguish between main speaker speech and far-field speech.
In several previous works, measurements similar to those considered in this work have been used for dereverberation techniques.

    Similar works