On dynamic stream weighting for Audio-Visual Speech Recognition
The integration of audio and visual information improves speech recognition performance, especially in the presence of noise. In these circumstances it is necessary to introduce audio and visual weights to control the contribution of each modality to the recognition task. We present a method to set the value of the weights associated with each stream according to their reliability for speech recognition, allowing them to change with time and adapt to different noise and working conditions. Our dynamic weights are derived from several measures of stream reliability, some specific to speech processing and others inherent to any classification task, and take into account the special role of silence detection in the definition of the audio and visual weights. In this paper we propose a new confidence measure, compare it to existing ones, and point out the importance of the correct detection of silence utterances in the definition of the weighting system. Experimental results support our main contribution: the inclusion of a voice activity detector in the weighting scheme improves speech recognition across different system architectures and confidence measures, leading to a gain in performance larger than any difference between the proposed confidence measures.
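The stream weighting described in this abstract is commonly formulated as a weighted combination of per-stream class log-likelihoods, with the weight varying over time according to stream reliability. The following is a minimal sketch of that standard formulation; the function name, weight values, and toy likelihoods are illustrative, not taken from the paper:

```python
import numpy as np

def combined_log_likelihood(log_p_audio, log_p_visual, audio_weight):
    """Fuse per-stream class log-likelihoods with a dynamic stream weight.

    Standard multi-stream scoring: w * log P(o_a | c) + (1 - w) * log P(o_v | c),
    where w in [0, 1] is set from a reliability (confidence) measure and may
    change frame by frame.
    """
    w = np.clip(audio_weight, 0.0, 1.0)
    return w * log_p_audio + (1.0 - w) * log_p_visual

# Toy per-class log-likelihoods for one frame (3 classes).
log_p_audio = np.array([-2.0, -5.0, -1.0])   # audio stream favours class 2
log_p_visual = np.array([-1.0, -4.0, -3.0])  # visual stream favours class 0
# In clean conditions a reliability measure would assign most weight to audio;
# in noise it would shift weight towards the visual stream.
clean = combined_log_likelihood(log_p_audio, log_p_visual, 0.9)
noisy = combined_log_likelihood(log_p_audio, log_p_visual, 0.2)
print(np.argmax(clean), np.argmax(noisy))  # prints: 2 0
```

The example shows the practical effect the abstract describes: the same frame is classified differently depending on which stream the reliability measure trusts.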
Audio-visual speaker separation
Communication using speech is often an audio-visual experience. Listeners hear what is being uttered by speakers and also see the corresponding facial movements and other gestures. This thesis is an attempt to exploit this bimodal (audio-visual) nature of speech for speaker separation. In addition to the audio speech features, visual speech features are used to achieve the task of speaker separation. An analysis of the correlation between audio and visual speech features is carried out first. This correlation between audio and visual features is then used in the estimation of clean audio features from visual features using Gaussian Mixture Models (GMMs) and Maximum a Posteriori (MAP) estimation.
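A common way to realise this GMM/MAP estimation is to model the joint distribution of stacked audio and visual feature vectors with a GMM, then estimate the audio part conditioned on an observed visual vector. The sketch below takes the conditional mean under the most probable mixture component (a MAP-style point estimate); the function signature and parameters are assumptions for illustration, not the thesis's exact implementation:

```python
import numpy as np

def estimate_audio_from_visual(v, weights, means, covs, n_audio):
    """Estimate clean audio features from visual features via a joint GMM.

    Each component has a mean/covariance over the stacked [audio; visual]
    vector. For an observed visual vector v, pick the component with the
    highest posterior p(k | v) and return its conditional mean of the audio
    part: mu_a + S_av S_vv^{-1} (v - mu_v).
    """
    best_logp, best_est = -np.inf, None
    for w, mu, cov in zip(weights, means, covs):
        mu_a, mu_v = mu[:n_audio], mu[n_audio:]
        S_av = cov[:n_audio, n_audio:]
        S_vv = cov[n_audio:, n_audio:]
        diff = v - mu_v
        # Component posterior is proportional to w * N(v; mu_v, S_vv).
        logp = (np.log(w)
                - 0.5 * np.log(np.linalg.det(2 * np.pi * S_vv))
                - 0.5 * diff @ np.linalg.solve(S_vv, diff))
        if logp > best_logp:
            best_logp = logp
            best_est = mu_a + S_av @ np.linalg.solve(S_vv, diff)
    return best_est

# Toy joint GMM: one audio and one visual dimension, two components.
est = estimate_audio_from_visual(
    v=np.array([5.0]),
    weights=[0.5, 0.5],
    means=[np.array([0.0, 0.0]), np.array([5.0, 5.0])],
    covs=[np.eye(2), np.eye(2)],
    n_audio=1,
)
```

An alternative is the MMSE estimate (a posterior-weighted sum of all components' conditional means); the hard MAP choice above is the simpler variant.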
For speaker separation, three methods are proposed that use the estimated clean audio features. Firstly, the estimated clean audio features are used to construct a Wiener filter to separate the mixed speech, at various signal-to-noise ratios (SNRs), into target and competing speakers. The Wiener filter gains are modified in several ways in search of improvements in the quality and intelligibility of the extracted speech. Secondly, the estimated clean audio features are used to develop a visually-derived binary masking method for speaker separation. The estimated audio features are used to compute time-frequency binary masks that identify the regions where the target speaker dominates. These regions are retained and form the estimate of the target speaker's speech. Experimental results compare the visually-derived binary masks with ideal binary masks and show a useful level of accuracy. The effectiveness of the visually-derived binary mask for speaker separation is then evaluated through estimates of speech quality and speech intelligibility, showing substantial gains over the original mixture. Thirdly, the estimated clean audio features and the visually-derived Wiener filtering are used to modify the operation of an effective audio-only method of speaker separation, namely the soft mask method, to allow visual speech information to improve the separation task. Experimental results are presented that compare the proposed audio-visual speaker separation with the audio-only method using both speech quality and intelligibility metrics. Finally, a detailed comparison is made of the proposed and existing methods of speaker separation using objective and subjective measures.
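The binary masking idea in the second method can be sketched in a few lines: keep each time-frequency cell where the estimated target magnitude dominates the estimated competing-speaker magnitude, and zero the rest. In the thesis these magnitude estimates come from the visually-derived audio features; the function below is a generic sketch with illustrative names and a hypothetical `threshold_db` parameter:

```python
import numpy as np

def binary_mask_separation(mixture_stft, target_est_mag, interferer_est_mag,
                           threshold_db=0.0):
    """Separate a target speaker from a mixture with a time-frequency binary mask.

    A cell is assigned to the target when its estimated local SNR (target vs.
    competing speaker, in dB) exceeds threshold_db; masked cells of the mixture
    STFT form the estimate of the target speaker's speech.
    """
    ratio_db = 20 * np.log10((target_est_mag + 1e-12) /
                             (interferer_est_mag + 1e-12))
    mask = (ratio_db > threshold_db).astype(float)
    return mask * mixture_stft, mask

# Toy 2x2 "STFT": the target is estimated to dominate cells (0,0) and (1,1).
mixture = np.array([[1 + 0j, 2 + 0j],
                    [3 + 0j, 4 + 0j]])
target_mag = np.array([[2.0, 0.1],
                       [0.1, 5.0]])
interferer_mag = np.ones((2, 2))
separated, mask = binary_mask_separation(mixture, target_mag, interferer_mag)
```

With ground-truth magnitudes instead of visually-derived estimates this mask becomes the ideal binary mask, which is exactly the comparison the abstract reports.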